toki pona has a few constraints, without which the game of toki pona is rendered silly. Foremost is that there are only about 125 morphemes (fairly non-bound). This isn’t much of a problem, and I think tp community proposals can stick to it.
The next idea is that numbers, dates and so on are lacking, as if this were the language of an ancient tribe (despite missing a fully formed system of naming plants, animals and extended family relations). This is problematic for working with data on computers. Numbers and dates are basic types; without them, certain computer experiments are harder than necessary.
I’m writing a parser and I need to make a few modifications to make tp easily parsable. My parser does a two-pass parse: in the first phase I normalize the text and make a best effort to add punctuation as described below. That step is difficult and error-prone. If I didn’t have to do this normalization, the parser would parse more text and get better glosses on the first try.
1) Phrasal compounds are joined with dashes. jan-pona. jan-pi-sijelo-pona.
2) Prefix numbers with #, e.g. #wan. If it is a two-word number, it is hyphenated, e.g. #wan-tu
3) Direct quotes are in << >>. e.g. jan li toki e << toki! >> (everyone else appears to use English convention of single or double quotes– but I need an escape character, see below)
4) A prepositional phrase must start with a comma, e.g. mi li, lon ma ni. jan li moku, kepeken ilo.
5) Non toki pona text is escaped with double quotes. mi toki kepeken toki “English”
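The five conventions above can be sketched as a single regular-expression tokenizer. This is only an illustration, not my actual parser: the token class names (`quote`, `foreign`, `number`, `comma`, `word`, `compound`, `punct`) and the `tokenize` function are made up for the example, and it assumes plain ASCII toki pona words.

```python
import re

# Hypothetical token classes for the five conventions; names are illustrative.
TOKEN_RE = re.compile(
    r"""
    (?P<quote><<.*?>>|«.*?»)                  # 3) direct quote
  | (?P<foreign>"[^"]*"|“[^”]*”)              # 5) escaped non-toki-pona text
  | (?P<number>\#[A-Za-z]+(?:-[A-Za-z]+)*)    # 2) number, possibly hyphenated
  | (?P<comma>,)                              # 4) comma opening a prepositional phrase
  | (?P<word>[A-Za-z]+(?:-[A-Za-z]+)*)        # plain word or 1) hyphenated compound
  | (?P<punct>[.!?:])                         # sentence punctuation
    """,
    re.VERBOSE | re.DOTALL,
)


def tokenize(text):
    """Return a list of (kind, value) pairs; whitespace is skipped."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup
        value = match.group()
        if kind == "word" and "-" in value:
            kind = "compound"  # hyphenated words are phrasal compounds
        tokens.append((kind, value))
    return tokens


print(tokenize("jan li toki e << toki! >>"))
print(tokenize("jan-pi-sijelo-pona li moku, kepeken ilo."))
```

Because quotes and escaped foreign text are tried first, a comma or a # inside them never gets mistaken for punctuation or a number.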
We have compound words. We pretend we don’t, but we do. These are lexemes, phrasal compound words. Compound words are joined by hyphens.
jan-pona = friend.
jan-pi-sijelo-pona = doctor.
Why? Because you can’t accurately machine gloss jan pona to friend. Why should we pretend that jan-pona is anything but a phrasal compound and gloss it as good person, healthy person, friend, etc.? Without hyphens, I have to gloss using a list of alternatives. With hyphens, I can dispense with the list of alternatives and home in on a single gloss.
jan li ike li tawa jan pi sijelo pona li kama jan pona.
jan li ike li tawa jan-pi-sijelo-pona li kama jan pona.
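Here is a rough sketch of the difference the hyphen makes to a glosser. The two dictionaries and the `gloss` function are hypothetical; the entries are just the examples from this post.

```python
# Hypothetical glossing tables built from the examples in this post.
COMPOUND_GLOSSES = {
    "jan-pona": "friend",
    "jan-pi-sijelo-pona": "doctor",
}
PHRASE_ALTERNATIVES = {
    "jan pona": ["good person", "healthy person", "friend"],
}


def gloss(phrase):
    """Return every candidate gloss for a phrase (possibly an empty list)."""
    if phrase in COMPOUND_GLOSSES:
        # A hyphenated compound is a lexeme, so it has exactly one gloss.
        return [COMPOUND_GLOSSES[phrase]]
    # An unhyphenated phrase can only be glossed as a list of alternatives.
    return PHRASE_ALTERNATIVES.get(phrase, [])


print(gloss("jan-pona"))
print(gloss("jan pona"))
```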
We have “rovers”/syntactical infix. I don’t know what these are really called.
jan-mute-pi-sijelo-pona = doctors.
jan-pi-sijelo-pona-mute = doctors.
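One way a parser might cope with a rover is to try dropping each part of the hyphenated word and check whether what remains is a known compound. This is a sketch under that assumption only; `KNOWN_COMPOUNDS` and `split_rover` are invented names, and it handles a single rover, not several.

```python
# Hypothetical lexicon of known phrasal compounds.
KNOWN_COMPOUNDS = {"jan-pona", "jan-pi-sijelo-pona"}


def split_rover(token):
    """Return (base_compound, rover), (token, None) if no rover, or None."""
    if token in KNOWN_COMPOUNDS:
        return token, None
    parts = token.split("-")
    for i, part in enumerate(parts):
        # Remove one part and see whether the rest is a known compound.
        candidate = "-".join(parts[:i] + parts[i + 1:])
        if candidate in KNOWN_COMPOUNDS:
            return candidate, part
    return None  # not a known compound, with or without one rover


print(split_rover("jan-mute-pi-sijelo-pona"))
print(split_rover("jan-pi-sijelo-pona-mute"))
```

Both spellings reduce to the same base compound plus the rover mute, so they can share one gloss.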
We need numbers. They shall be words prefixed by #.
I will have to look up 3,4,6,7,8,9 from the forum. I know there are many proposals, I’ll look for community ones and then I plan to implement the ones that are base 10, don’t introduce new words, positional and reasonably efficient, e.g. no worse than English in expressing large numbers.
Some numbers are legacy numbers with some degree of officialness and will have to be supported.
#wan-tu-tu = 4
#luka-luka = 10
#MMLW = 20+20+5+1
But I don’t recommend using legacy numbers if you are trying to communicate.
Watch this space!
We need direct quotes. They shall be wrapped in << >> (or « » if you can find those keys on the keyboard).
jan li toki e << mi jo e soweli! >>
He said, “I have a dog.”
I hope I don’t regret this choice because < and > mean something in HTML and might cause problems in some content management systems. Oh well.
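If the text does end up in HTML, the angle brackets can be escaped on the way out and restored on the way in. A small sketch using Python's standard `html` module, assuming the content management system doesn't do its own escaping:

```python
import html

quoted = "jan li toki e << mi jo e soweli! >>"

# Escape < and > (and &) before embedding the sentence in an HTML page.
escaped = html.escape(quoted)
print(escaped)

# Unescaping gives back the original sentence for the parser.
restored = html.unescape(escaped)
print(restored == quoted)
```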
Anything in direct quotes markers is syntactically a content word.
We need commas.
People currently add commas before or after la, but actually we don’t need them there. I have no opinion about what people do there. Also I have no opinion about commas in pi-phrases.
mi pali, kepeken ilo sona, lon tomo, pali tawa mani.
I work with computers in the office for money.
When there is nothing to distinguish a preposition from a content word, it is valid to parse every word after pali as a string of adverbs:
mi pali(kepeken ilo sona lon tomo pali tawa mani).
Humans can realize that is unlikely, but a machine can’t. Humans can parse invalid toki pona and realize that someone is mixing Russian and English and toki pona rules and, with some effort, realize the intended correct toki pona. This sort of parsing is a huge effort to implement. On the other hand, commas make parsing mechanically effortless.
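To show how little work the commas leave for the machine, here is a sketch that splits a sentence into its core clause and its prepositional phrases. `split_prepositional_phrases` is a hypothetical name, the example sentence is a shortened version of the one above, and the sketch assumes no commas appear inside quotes or escaped foreign text.

```python
def split_prepositional_phrases(sentence):
    """Split on commas: the first chunk is the core clause, the rest are
    prepositional phrases given as (preposition, [object words])."""
    chunks = [chunk.strip() for chunk in sentence.rstrip(".!?").split(",")]
    core = chunks[0]
    phrases = []
    for chunk in chunks[1:]:
        words = chunk.split()
        if not words:
            continue  # ignore empty chunks such as a stray trailing comma
        # By convention the word right after the comma is the preposition.
        phrases.append((words[0], words[1:]))
    return core, phrases


print(split_prepositional_phrases("mi pali, kepeken ilo sona, lon tomo."))
```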
We need an escape character
The corpus texts are full of mixed-language material, from accidents in transliteration to people just trying to communicate. After transliterating to toki pona, the original is normally unrecognizable; it might as well be a completely new word. So toki pona texts that interact with the real world will need to contain foreign text. And that text should be in double quotes.
nimi mi li “Matthew Martin” li jan Mato.
Anything in double quotes syntactically is a content word.
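Since both direct quotes and escaped foreign text behave as a single content word, one option is to swap each of them for a placeholder before the real parse. This is a sketch of that idea only; the `@0`, `@1`, ... placeholder format and the `mask_escaped_spans` name are invented for the example.

```python
import re

# Direct quotes (<< >> or « ») and double-quoted foreign text, straight or curly.
ESCAPED_SPAN_RE = re.compile(r'<<.*?>>|«.*?»|"[^"]*"|“[^”]*”', re.DOTALL)


def mask_escaped_spans(text):
    """Replace each escaped span with @N and return (masked_text, spans)."""
    spans = []

    def stash(match):
        spans.append(match.group())
        return f"@{len(spans) - 1}"  # placeholder acts as one content word

    return ESCAPED_SPAN_RE.sub(stash, text), spans


masked, spans = mask_escaped_spans("nimi mi li “Matthew Martin” li jan Mato.")
print(masked)
print(spans)
```

The parser then sees one opaque word where the foreign text was, and the original can be looked up by index when glossing.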
The current date system is something like
tenpo suno wan, mun wan, sike suno wan = 1/1/1
You can find some variant of this on the wikia for toki pona. It uses legacy numbers and is too cumbersome for anyone to want to use it.
I’m going to recommend this format: y-m-d
Because it will be easier to sort.
Also, for this to work, numbers have to be reasonably efficient and be able to cope with numbers from 1 to 2015.
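The sorting argument can be sketched like this. Arabic digits stand in for whatever number words end up being chosen, and `ymd_key` is a made-up helper; the point is only that with the largest unit first, comparing the fields left to right gives chronological order.

```python
def ymd_key(date_text):
    """Turn 'y-m-d' into a (year, month, day) tuple for sorting."""
    year, month, day = (int(part) for part in date_text.split("-"))
    return (year, month, day)


dates = ["2015-10-1", "2014-12-31", "2015-2-14"]

# Tuples compare field by field, so year decides first, then month, then day.
sorted_dates = sorted(dates, key=ymd_key)
print(sorted_dates)
```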
Watch this space!