Punctuating toki pona - Community Proposal

toki pona has a few constraints, without which the game of toki pona is rendered silly. Foremost is that there are only about 125 morphemes (fairly non-bound). This isn’t much of a problem, and I think tp community proposals can stick to it.

The next constraint is that numbers, dates and so on are lacking, as if this were the language of an ancient tribe (despite missing a fully formed system for naming plants, animals and extended family relations). This is a problem for working with data on computers. Numbers and dates are basic types; without them, certain computer experiments are harder than necessary.

I’m writing a parser, and I need a few modifications to make tp easily parsable. My parser works in two passes. In the first pass I normalize the text and make a best effort to add punctuation as described below. This is difficult and error prone. If I didn’t have to do this normalization, the parser would parse more text on the first try and produce better glosses on the first try.

Summary
1) Phrasal compounds are joined with hyphens, e.g. jan-pona, jan-pi-sijelo-pona.
2) Numbers are prefixed with #, e.g. #wan. A number of two or more words is hyphenated, e.g. #wan-tu.
3) Direct quotes go in << >>, e.g. jan li toki e << toki! >> (Everyone else appears to use the English convention of single or double quotes, but I need double quotes as an escape character; see below.)
4) A prepositional phrase must start with a comma, e.g. mi li, lon ma ni. jan li moku, kepeken ilo.
5) Non-toki pona text is escaped with double quotes, e.g. mi toki kepeken toki “English”.

We have compound words. We pretend we don’t, but we do. These are lexemes: phrasal compound words. Compound words are joined by hyphens.

jan-pona = friend.
jan-pi-sijelo-pona = doctor.

Why? Because you can’t accurately machine gloss jan pona as friend. Why should we pretend that jan-pona is anything but a phrasal compound, and gloss it as good person, healthy person, friend, etc.? Without hyphens, I have to gloss using a list of alternatives. With hyphens, I can dispense with the list and home in on a single gloss.

Unhyphenated.
jan li ike li tawa jan pi sijelo pona li kama jan pona.

Hyphenated.
jan li ike li tawa jan-pi-sijelo-pona li kama jan pona.
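To make the difference concrete, here is a rough sketch of how a glosser could treat hyphenated and unhyphenated tokens. The tables and function names are hypothetical and only cover the examples in this post; this is not the actual parser.

```python
# Hypothetical gloss tables, filled in only with examples from this post.
COMPOUND_GLOSSES = {
    "jan-pona": "friend",
    "jan-pi-sijelo-pona": "doctor",
}
WORD_GLOSSES = {
    "jan": ["person"],
    "pona": ["good", "healthy"],
}


def gloss_token(token):
    """Return the list of candidate glosses for one token."""
    if "-" in token:
        # A hyphenated compound is looked up as a single lexeme,
        # so a known compound yields exactly one gloss.
        if token in COMPOUND_GLOSSES:
            return [COMPOUND_GLOSSES[token]]
        return [token]  # unknown compound: pass it through unglossed
    # A bare word keeps its whole list of alternatives.
    return WORD_GLOSSES.get(token, [token])


def gloss(sentence):
    """Gloss a sentence token by token (whitespace split, final period dropped)."""
    return [gloss_token(token) for token in sentence.rstrip(".").split()]
```

With these tables, gloss_token("jan-pona") gives a single gloss, while the bare word pona still comes back with both alternatives and the parser has to guess.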

We have “rovers”, or syntactic infixes. I don’t know what these are really called.

jan-mute-pi-sijelo-pona = doctors.
jan-pi-sijelo-pona-mute = doctors.

We need numbers. They shall be words prefixed by #.
#ala
#wan
#tu

#luka

I will have to look up 3, 4, 6, 7, 8 and 9 from the forum. I know there are many proposals. I’ll look for community ones, and then I plan to implement the ones that are base 10, positional, reasonably efficient (e.g. no worse than English at expressing large numbers) and don’t introduce new words.

Some numbers are legacy numbers with some degree of officialness and will have to be supported.

#tu-tu = 4
#luka-luka = 10
#MMLW = 20+20+5+1

But I don’t recommend using legacy numbers if you are trying to communicate.
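Supporting the legacy numbers is mostly a matter of adding up word values. The sketch below assumes the additive reading implied by the examples above (wan = 1, tu = 2, luka = 5, mute = 20); the value 100 for ale/ali and the function name are my assumptions, and the single-letter form only knows the three letters that appear in #MMLW.

```python
# Word values for the legacy additive system. The 100 for ale/ali is an
# assumption; the other values follow from the examples in this post.
LEGACY_VALUES = {"ala": 0, "wan": 1, "tu": 2, "luka": 5, "mute": 20, "ale": 100, "ali": 100}

# Single-letter abbreviations, limited to the letters seen in #MMLW.
ABBREVIATIONS = {"M": "mute", "L": "luka", "W": "wan"}


def parse_legacy_number(token):
    """Convert a token such as '#luka-luka' or '#MMLW' to an integer."""
    if not token.startswith("#"):
        raise ValueError(f"not a number token: {token!r}")
    body = token[1:]
    if body and all(ch in ABBREVIATIONS for ch in body):
        # Abbreviated form: one capital letter per number word.
        words = [ABBREVIATIONS[ch] for ch in body]
    else:
        # Spelled-out form: number words joined by hyphens.
        words = body.split("-")
    total = 0
    for word in words:
        if word not in LEGACY_VALUES:
            raise ValueError(f"unknown number word {word!r} in {token!r}")
        total += LEGACY_VALUES[word]
    return total


print(parse_legacy_number("#luka-luka"))  # 10
print(parse_legacy_number("#MMLW"))       # 46
```

Because the system is purely additive, word order does not matter to the parser, which is part of why these numbers are cheap to support but inefficient for large values.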

Watch this space!

We need direct quotes. They shall be wrapped in << >> (or « » if you can find those keys on the keyboard).

jan li toki e << mi jo e soweli! >>
He said, “I have a dog.”

I hope I don’t regret this choice because < and > mean something in HTML and might cause problems in some content management systems. Oh well.

Anything inside direct quote markers is syntactically a content word.
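One way a normalizer could act on that rule is to swap each quotation for a placeholder token before the main parse and keep the quoted text on the side. This is only a sketch: the placeholder naming (QUOTE0, QUOTE1, ...) is made up for illustration, and nested quotes are not handled.

```python
import re

# Matches << ... >> or « ... », capturing the text inside without the padding.
QUOTE_RE = re.compile(r"<<\s*(.*?)\s*>>|«\s*(.*?)\s*»")


def mask_direct_quotes(text):
    """Replace each direct quote with a placeholder; return (masked_text, quotes)."""
    quotes = []

    def _swap(match):
        inner = match.group(1) if match.group(1) is not None else match.group(2)
        quotes.append(inner)
        # The placeholder stands in for the quote as a single content word.
        return f"QUOTE{len(quotes) - 1}"

    return QUOTE_RE.sub(_swap, text), quotes


masked, quotes = mask_direct_quotes("jan li toki e << mi jo e soweli! >>")
print(masked)  # jan li toki e QUOTE0
print(quotes)  # ['mi jo e soweli!']
```

The stored quotes can then be parsed separately, or left alone if they turn out not to be good toki pona.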

We need commas.
People currently add commas before or after la, but actually we don’t need them there. I have no opinion about what people do there. Also I have no opinion about commas in pi-phrases.

mi pali, kepeken ilo sona, lon tomo pali, tawa mani.
I work with computers in the office for money.

When there is nothing to distinguish a preposition from a content word, it is valid to parse every word after pali as a string of adverbs:

mi pali(kepeken ilo sona lon tomo pali tawa mani).

Humans can realize that is unlikely, but a machine can’t. Humans can parse invalid toki pona and realize that someone is mixing Russian and English and toki pona rules and, with some effort, realize the intended correct toki pona. This sort of parsing is a huge effort to implement. On the other hand, commas make parsing mechanically effortless.
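As a sketch of how little work the comma convention leaves for the machine, the function below splits a sentence at commas and reads the first word of each chunk as the preposition. It assumes commas are used only to introduce prepositional phrases (so no commas around la), and the preposition list and function name are assumptions for illustration.

```python
# Assumed preposition list; kepeken, lon and tawa appear in this post.
PREPOSITIONS = {"kepeken", "lon", "tawa", "tan", "sama"}


def split_prepositional_phrases(sentence):
    """Split 'core, prep words, prep words.' into (core, [(prep, words), ...])."""
    chunks = [chunk.strip() for chunk in sentence.rstrip(".").split(",")]
    core = chunks[0]
    phrases = []
    for chunk in chunks[1:]:
        words = chunk.split()
        if not words or words[0] not in PREPOSITIONS:
            raise ValueError(f"chunk does not start with a preposition: {chunk!r}")
        # The first word is the preposition, the rest is its object.
        phrases.append((words[0], words[1:]))
    return core, phrases


print(split_prepositional_phrases("jan li moku, kepeken ilo."))
# ('jan li moku', [('kepeken', ['ilo'])])
```

Given a sentence with no commas, the same function returns everything as the core and an empty phrase list, which is exactly the string-of-adverbs reading.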

We need an escape character
The corpus texts are full of mixed-language material, from accidents in transliteration to people just trying to communicate. After transliteration into toki pona, the original is normally unrecognizable; it might as well be a completely new word. So toki pona texts that interact with the real world will need to contain foreign text, and that text should be in double quotes.

nimi mi li “Matthew Martin” li jan Mato.

Anything in double quotes syntactically is a content word.
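A tokenizer can honor this by keeping each double-quoted span together as one token. The sketch below accepts both straight and curly double quotes; the token labels FOREIGN and WORD are hypothetical names, not something the parser necessarily uses.

```python
import re

# A curly-quoted span, a straight-quoted span, or a run of non-space characters.
TOKEN_RE = re.compile(r'“[^”]*”|"[^"]*"|\S+')


def tokenize(text):
    """Split text into (kind, value) pairs, keeping quoted foreign text whole."""
    tokens = []
    for raw in TOKEN_RE.findall(text):
        if len(raw) >= 2 and raw[0] in '“"' and raw[-1] in '”"':
            # Escaped foreign text: one content word, quotes stripped.
            tokens.append(("FOREIGN", raw[1:-1]))
        else:
            # Ordinary word; drop a sentence-final period if present.
            tokens.append(("WORD", raw.rstrip(".")))
    return tokens


for token in tokenize("nimi mi li “Matthew Martin” li jan Mato."):
    print(token)
```

The two English words come out as a single FOREIGN token, so the rest of the parser never has to look inside them.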

Dates
The current date system is something like

tenpo suno wan, mun wan, sike suno wan = 1/1/1

You can find some variant of this on the wikia for toki pona. It uses legacy numbers and is too cumbersome for anyone to want to use it.

I’m going to recommend this format: y-m-d
S1-M1-T1
Because it will be easier to sort.
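A small sketch of the sorting argument, assuming the three fields are written with ordinary digits and that S, M and T stand for sike suno (year), mun (month) and tenpo suno (day); the function name is made up. Without zero padding a plain string sort puts month 10 before month 2, so the sketch compares the fields as integers.

```python
import re

# Year, month, day, in that order, e.g. S2015-M2-T14.
DATE_RE = re.compile(r"^S(\d+)-M(\d+)-T(\d+)$")


def date_key(date):
    """Turn 'S2015-M2-T14' into the tuple (2015, 2, 14) for sorting."""
    match = DATE_RE.match(date)
    if match is None:
        raise ValueError(f"not a y-m-d date: {date!r}")
    return tuple(int(group) for group in match.groups())


dates = ["S2015-M10-T2", "S2015-M2-T14", "S1-M1-T1"]
print(sorted(dates, key=date_key))
# ['S1-M1-T1', 'S2015-M2-T14', 'S2015-M10-T2']
```

Putting the year first means the tuple comparison goes from the most significant field to the least, which a d-m-y layout would not give.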

Also, for this to work, numbers have to be reasonably efficient and be able to cope with numbers from 1 to 2015.

Watch this space!


2 Responses to Punctuating toki pona - Community Proposal

  1. John E. Clifford says:

    Well, I’m not sure how community generated these proposals are (but you are in the community, so …)
    Dash compounds. This is tempting, especially since it usually marks a difference between a generated compound and one that can be viewed as transformational, very useful. I would have said doctors were jan pi pona sijelo and that jan pi sijelo pona were just fitness nuts, but then I noticed that ‘jan pi pona sijelo’ came out to be fitness nuts, too, parallel to ‘jan pi pona lukin’ for beautiful/handsome people. I know, Context, context, context!
    The ‘mute’, etc. infix seems to be a clarification transformation or a shift for the position of Quantity modifiers. It needs more work but dash doesn’t seem to help.
    The use of pound/hashtag/number to mark numbers only arises when we have dual use words like ‘mute’ and ‘ali’ and just maybe ‘luka’ and ‘ala’. If we had nice clean numbers (even if, like ‘wan’ and ‘tu’, they occasionally had other uses in very different slots), there would be no problems. I suspect getting those will be the last change to happen in tp.
    Direct quotes. We already have a mechanism (and a growing habit) to deal with them, but I like the use of angles for some purpose (as distinct from metalanguage quotes and foreign quotes and scare quotes and whatever else is going on). I would restrict angles to quotes of putatively good tp and parse it on the side, though.
    Commas before PP are very handy; further ones to sort out overlong ‘pi’ phrases and to help group iterated ‘la’ phrases are also possible. I suspect there are even more possible problems where commas (or some other marks) would help and that building parsers will help find them.
    Escape character (foreign quotes), yes! Strictly with ‘nimi/kalama’, but that is regularly dropped in writing. (The tp part of your sample is wrong, of course, but, alas, grammatical.)
    Dates are usually given with ordinals, not cardinals (which can get mixed with other notions) but, if we had real numbers we could simplify somehow. The problems with addresses of all sorts and identification numbers of all sorts also remain to be solved when we get real numbers.
    Watch this space indeed.

    • matthewdeanmartin says:

      Good points. I think everything I’ve ever called a “community proposal” was a proposal by a single person, such as the many number and writing systems. A few people proposed something and then incorporated feedback. There is nothing like a “toki pona language board” yet, so there are no community proposals in the sense of a bunch of people collectively suggesting something.

      Direct quotes are a parser writer’s devil. And foreign text. And elisions. So far, adding more punctuation & more homogeneous uses of punctuation seems to help machine parsing. (by homogeneous uses I mean, double quotes just for foreign text, not direct quotes, “so-called” callouts, and foreign text, and so on)