Paragraphs and machine-assisted conlang work

When you teach a machine to do something with a language, you run into a surprising set of challenges that a traditional reference grammar rarely mentions.

In texts that people post on the internet, line breaks are unreliable indicators of paragraphs. Whitespace can appear for a variety of reasons.

e.g. Line break in the middle of a sentence: exceedingly common.
jan ilo li wile

e.g. Double space in the middle of a sentence: common enough to make a double space an unlikely paragraph mark.
jan ilo li wile


e.g. Line break is actually HTML. Sometimes HTML p tags aren’t really paragraphs.
jan ilo li wile
<br>

e.g. The tab starting a paragraph is actually a few spaces. Sometimes those spaces disappear, and sometimes they are just stray spaces that mark nothing.
   jan ilo li wile pali.
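
The noise in the examples above can be flattened before any paragraph detection happens. Here is a minimal sketch in Python; the function name `normalize_whitespace` and the specific rules are my own assumptions, not a settled design. It treats a `<br>` tag and a single line break as soft breaks, collapses runs of spaces, and keeps only blank lines as candidate paragraph breaks.

```python
import re

# Matches <br>, < br >, <br/> and similar, plus any whitespace around the tag.
# (Assumption: a lone <br> is a soft line break, not a paragraph mark.)
BR_TAG = re.compile(r"\s*<\s*br\s*/?\s*>\s*", re.IGNORECASE)


def normalize_whitespace(text: str) -> str:
    """Flatten whitespace that should not count as a paragraph break."""
    # Unify Windows and old Mac line endings.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Turn each <br> tag (and the whitespace around it) into one line break.
    text = BR_TAG.sub("\n", text)
    # A run of one or more blank lines separates candidate paragraphs.
    blocks = re.split(r"\n\s*\n", text)
    # Inside a block, single line breaks, tabs and double spaces
    # all collapse to one space; leading indentation is dropped.
    cleaned = [" ".join(block.split()) for block in blocks]
    return "\n\n".join(block for block in cleaned if block)


print(normalize_whitespace("jan ilo\nli wile"))
print(normalize_whitespace("jan ilo li wile\n< br >\n"))
```

Note that this throws away leading indentation, so a parser that wants to treat a tab as a paragraph mark would have to look at the raw lines before this step.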

There are several ways to deal with this:

Explicit paragraph marker. Require a visible marker, for example four dashes centered on the page, like the divider some novels put between “scenes”.
Assume a double space is a paragraph break. This is wrong a lot of the time.
Synthetic paragraphs. Apply rules such as these: any sentence ending in ni: is in the same paragraph as the following sentence; any vocative followed by a sentence is in the same paragraph; quoted text starts a new paragraph. However, I suspect this would be a lot of work and would fail, producing synthetic paragraphs that ‘consume’ the entire text.
Ignore the problem and turn everything into a series of sentences, or one huge paragraph.
Two parsing modes, Strict and Loosey-Goosey. In Strict mode, paragraphs are started by tabs. In Loosey-Goosey mode, both tabs and blank lines are assumed to be paragraph breaks, and it is simply accepted that this will be wrong a lot of the time.
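
The two-mode idea is simple enough to sketch. The Python below is only an illustration under my own assumptions: the function name `split_paragraphs` and the `strict` flag are hypothetical, and a "tab" means a literal tab character at the start of a line, so indents that have turned into spaces are not detected.

```python
from typing import List


def split_paragraphs(text: str, strict: bool = True) -> List[str]:
    """Split text into paragraphs in Strict or Loosey-Goosey mode.

    strict=True:  only a line starting with a tab opens a new paragraph.
    strict=False: a leading tab or a preceding blank line opens one.
    """
    paragraphs: List[List[str]] = []
    pending_break = True  # the first non-blank line always opens a paragraph
    for raw in text.splitlines():
        if not raw.strip():
            # A blank line counts as a break only in Loosey-Goosey mode.
            if not strict:
                pending_break = True
            continue
        if raw.startswith("\t") or pending_break:
            paragraphs.append([])
            pending_break = False
        # Collapse inner whitespace so stray double spaces do not survive.
        paragraphs[-1].append(" ".join(raw.split()))
    return [" ".join(lines) for lines in paragraphs]


sample = "\tjan ilo li wile\npali.\n\njan li moku.\n\tjan li lape."
print(split_paragraphs(sample, strict=True))
print(split_paragraphs(sample, strict=False))
```

On this sample, Strict mode finds two paragraphs and Loosey-Goosey mode finds three, because only the latter treats the blank line as a break. Neither answer is guaranteed to match what the writer intended.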

