Corpus linguistics and little sins

Soooo, this toki pona project of mine. I’m parsing community-generated texts. I’ve got incompatible goals. On one hand, I want a parser that will work. So if you type English, it should blow up. But if you type toki pona, it should parse it if it is beautiful, conservative toki pona. On the other hand, if it is sloppy toki pona, I don’t really want to make a big deal of it. So you forgot to capitalize, forgot a period, forgot a closing quote, forgot the li, added a period instead of a comma before a li, and so on. I’m not going to add new rules to try to deal with these. So at the moment, I normalize them. I just fix them.
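
As a rough illustration of that "just fix them" step, here is a minimal Python sketch. It is not the project's actual code: the function name `normalize` and the choice of Python are assumptions, and it only covers three of the slips listed above (a period before li, a missing closing quote, a missing final period). Capitalization and a missing li are left out, since fixing those needs more than string patching.

```python
import re


def normalize(sentence: str) -> str:
    """Patch a few common slips so a strict parser can accept the sentence.

    Hypothetical helper: the rules below are guesses at what "normalizing"
    might mean, not the project's real implementation.
    """
    s = sentence.strip()

    # A period directly before "li" is treated as a slip for a comma.
    # \b keeps words such as "lili" or "lipu" from matching.
    s = re.sub(r"\.\s+li\b", ", li", s)

    # An odd number of straight double quotes means one was left open.
    if s.count('"') % 2 == 1:
        s += '"'

    # Make sure the sentence ends with terminal punctuation.
    if s and s[-1] not in ".!?":
        s += "."

    return s


print(normalize("jan li moku. li lape"))
print(normalize('jan li toki e ni: "mi pona'))
```

The rules run in a fixed order (comma fix, then quote, then period) so that a repaired closing quote still gets its final period.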

Then there is this:

meli li tawa en tan lon palisa.

Ignoring what it might mean, it’s a compound prepositional phrase, just like the English “No smoking in or around the school.” It could have been written:

meli li tawa lon palisa li tan lon palisa kin.

But that would just sound pedantic. Some stuff is sort of borderline.

How do you deal with noises?

jan li owi. => The guy said ouch!

I could fix it to:

jan li mu owi. => The guy made a noise like ouch.

But that looks pedantic and I don’t like the word mu all that much. It sounds too much like a cow and you have to add another word to indicate the actual sound.

How do you deal with defective names?

nimi mi li nimi ‘jan Laowi’

Fixing them would make parsing easier, but would get in the way of communication, especially if a name is already well known.
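
For what it is worth, spotting a defective name is the easy part. The sketch below is one way it might be done in Python, under loudly stated assumptions: the post does not say what makes ‘Laowi’ defective, so this assumes the problem is the adjacent vowels, and it uses the usual description of toki pona syllables (an optional consonant only at the start of the word, then consonant + vowel + optional n, with ji, ti, wo, wu excluded and no n before m or n). The names `NAME` and `is_well_formed_name` are made up for illustration.

```python
import re

# Assumed phonotactics, not taken from the post:
#   consonants j k l m n p s t w, vowels a e i o u
#   the syllables ji, ti, wo, wu are not allowed
#   a syllable-final n may not be followed by m or n
ONSET = r"(?!ji|ti|wo|wu)[jklmnpstw]"
VOWEL = r"[aeiou]"
CODA = r"(?:n(?![mn]))?"

# Only the first syllable may start with a vowel.
NAME = re.compile(
    rf"(?:{ONSET})?{VOWEL}{CODA}(?:{ONSET}{VOWEL}{CODA})*"
)


def is_well_formed_name(name: str) -> bool:
    """Return True if the name fits the assumed syllable rules."""
    return NAME.fullmatch(name.lower()) is not None


for candidate in ("Sonja", "Laowi", "Lawi"):
    print(candidate, is_well_formed_name(candidate))
```

A check like this only flags the name. Whether to rewrite it or let it through is still the judgment call described above.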


2 Responses to Corpus linguistics and little sins

  1. John E. Clifford says:

    Missing a ‘li’ IS a big deal; the others, mainly about punctuation, are less so (though some of the punctuations seem odd to me: commas before ‘li’, say, but I am a big fan of caps and quotes (see later comments)).
    The expansions of Prep in PP is an ongoing concern, but this generally looks like a normal case of Collapse (though there are compound prepositions, I think). So, I wouldn’t want to generate it, but for some sorts of grammars the Prep en Prep form ought to turn out OK. What kind(s) are you working on? Of course, the fact that it is in a verb slot probably complicates matters slightly, by forcing it to be compound rather than collapsed.
    ‘mu’, strictly for animal sounds, is probably wrong in your case and it should be ‘a’ and then quoted with “foreign” quotes (I use double quotes here, but that is still in dispute): ‘jan li a “owi”’, then, now becoming standard, just ‘jan li “owi”’, I expect. But the quotes are important in writing (and maybe the ‘a’ in speaking?).
    ‘nimi mi li nimi ‘jan Laowi’’ has two problems; which are you out to fix? The ‘jan’ can just be dropped, with the note that we don’t address the person by ‘jan jan Laowi o’. The second requires some more information about how the name is actually meant to be pronounced.
    I guess the overall point is that the parse ought to fail in all these cases (except the compound Prep), but the response to each is different in some way.

  2. Robert V Martin says:


    I have an interest in toki pona.

    I tried to join the Big Forum. I can’t register because my viewer cuts out the toki pona phrase that I’m supposed to translate and also that bit about human beings.


    I registered on your smaller Forum before I read that adding the prefix “jan” would speed my confirmation.

    I used the Handle “Sakon Wiolen”, a toki pona’ed version of “Saxon Violence”, one of my Pseudonyms. Perhaps I should have gone with “unpa utala” but that loses the pun aspect of the first name.

    Anyway, my name is on the member list, but I have no posting privileges and can’t PM you there of course.

    I seem to have a penchant for joining Fossil Forums. Please tell me, “Not Again.”

    While I’ve already gone far beyond the scope of this particular Blog Article:

    A few concepts that seem to be missing and hard to state in toki pona:

    “Right and Left” (One would be sufficient…) Hmmm…I just now thought: “Clockwise and Counter-Clockwise” would also be pretty handy.

    There seems little way to express concepts like “Sin” “Shame” and “Cowardice”.

    Translating the “Hagakure” into toki pona would be an exercise in frustration.

    I also wonder—just supposing some Utopian Group decided to use toki pona exclusively—just for the fun of Supposing…

    It is very hard to imagine training an apprentice carpenter or blacksmith. I imagine if the group stuck to their “toki pona only”:

    A.} Many non-verbal qualifiers would come into play often without conscious intent;


    B.} The children would all grow up to be “Right-Brained Idiots.”

    Excuse—I really really dislike Right-Brainers…

    One last Question.

    One cannot “olin” inanimate objects. Isn’t this building someone’s prejudices into the language? I would argue very strongly that I “Love” my Truck. When my van of 16 years gave up the ghost, I stood and shed many tears as they towed him away…

    And as a qualified Animist, I firmly believe that my van and my truck reciprocate.

    Well anyway, if you could help me get into the Forums, I’ll present my myriad questions and opinions there.

    Thank you.


    Just in case you wonder:

    mi sona toki pona pona lilli. mi sona toki pona pona mute ala. mi sona li toki pona lipu (PDF) luka luka tenpo suno.