Bag of Words Syntax

The CountVectorizer is the most audacious model of human language. It counts the words and converts the entire sentence into a vector of word counts: one count for each distinct word in the sentence (or text) and, by absence, zeros for all other words in the dictionary. When the number of words in the dictionary is small, the fact that one word was left out holds more information. This is important for small languages.
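To make the counting concrete, here is a minimal sketch using only the Python standard library. It imitates the idea behind a count vectorizer rather than calling any real library, and the five-word dictionary is made up for illustration.

```python
from collections import Counter

def count_vector(text, vocabulary):
    """Return word counts for `text`, one slot per vocabulary word."""
    counts = Counter(text.lower().split())
    # Words missing from the text get a zero by absence.
    return [counts[word] for word in vocabulary]

# Hypothetical five-word dictionary, chosen only for illustration.
vocabulary = ["girls", "little", "octopi", "pretty", "school"]

vector = count_vector("pretty little girls school", vocabulary)
print(vector)  # [1, 1, 0, 1, 1]
```

With a dictionary this small, the single zero (no octopi) is a noticeable share of the whole vector, which is the point about small languages.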

Toki pona’s adjective chains and pi chains are sort of like this. We clearly can project a chain-of-things or graph-of-things model with head words and successive groupings, but sometimes there are so many possible graphs or groupings that you might as well view the structure as a bag of words. A “pretty little girls school” has plausible interpretations for all possible graphs and groupings, so we might as well look at it as a bag of words that is about “pretty”, “little”, “girls” and “school” and not about octopi.
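As a rough illustration of how fast the groupings pile up, the sketch below lists only the binary bracketings of a chain with the word order held fixed. That restriction is my simplifying assumption; allowing arbitrary graphs or reordering would give even more readings.

```python
def bracketings(words):
    """List every way to group an ordered chain of words into nested pairs."""
    if len(words) == 1:
        return [words[0]]
    results = []
    # Split the chain at every position and combine the groupings of each half.
    for split in range(1, len(words)):
        for left in bracketings(words[:split]):
            for right in bracketings(words[split:]):
                results.append((left, right))
    return results

chain = ["pretty", "little", "girls", "school"]
print(len(bracketings(chain)))  # 5
```

Four words already give five groupings, and the count grows quickly with each extra word in the chain.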

So a hypothetical small language that uses bag of words explicitly might work like this:

The basic unit is a word. Compound words are created per morphology and are explicitly joined. If we didn’t, we’d have the jan pona problem: pairs of words that smell like lexemes but look like two words joined only by syntax.

The basic unit of syntax is a bag of words. One would need to see where a bag of words starts or ends, so you’d need audible parentheses, at least on one side and at sentence start and finish. These markers should be consistently left or right branching but not both. If you didn’t, you’d get the toki pona parsing problems with sentence starts (no token, presumably uses English-like inflection, or in writing, inaudible punctuation) and la-phrase parsing (branches the wrong way).

The separators for each bag of words can act as syntactic markers for whatever the metaphysical obsessions are, e.g. obsessing on who did it and who was on the receiving end (syntactic alignment), obsessing on who is experiencing it (an alternative to subject-object), or on what is changing (polysynthetic languages where the verb takes on all duties).

Bags of bags of words are unordered. This means we’d need a lot of bag markers.

The syntax explicitly allows parsers to project any structure you’d like over words (i.e. you can force English adjective order over what are semantically adjectives), but two bags of words with the same words are strongly equal. Any pattern detectable in one person or another’s bag of words is purely an implementation detail. If this rule isn’t explicit, our brains will always create sentences with excess structure and find structure where there is none.
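A minimal sketch of "strongly equal", assuming a bag is modelled as a multiset and a sentence as a bag of (marker, bag) pairs. The markers "AGENT" and "PATIENT" are hypothetical names invented for this example, not part of any proposal above.

```python
from collections import Counter

def bag(*words):
    """A bag of words as a hashable multiset: order is discarded, repeats are kept."""
    return frozenset(Counter(words).items())

def sentence(*marked_bags):
    """A bag of (marker, bag) pairs; the markers carry the roles, not the order."""
    return Counter(marked_bags)

# "AGENT" and "PATIENT" are hypothetical bag markers, invented for this example.
s1 = sentence(("AGENT", bag("little", "girls")), ("PATIENT", bag("pretty", "school")))
s2 = sentence(("PATIENT", bag("school", "pretty")), ("AGENT", bag("girls", "little")))
s3 = sentence(("AGENT", bag("pretty", "school")), ("PATIENT", bag("little", "girls")))

print(s1 == s2)  # True: same bags under the same markers, different order
print(s1 == s3)  # False: the markers are attached to different bags
```

Because order is thrown away at both levels, the only thing distinguishing s1 from s3 is which marker sits on which bag, which is why such a language would need a lot of bag markers.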

Obviation is another audacious mechanism. When the way larger structures are collected is left undefined, a marker is made up to indicate that two of them reference the same thing. Anyhow, it just occurred to me as about the sloppiest way to coordinate a sentence.

x = BoW
BoW, BoW, x BoW
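One possible reading of the notation above, sketched in Python: a made-up marker is bound to a bag once, and later uses of the marker point back at that same bag. The table, the marker name "x" and the sample words are all assumptions for illustration.

```python
from collections import Counter

# Hypothetical obviation table: marker -> the bag it stands for.
obviation = {"x": Counter(["little", "girls"])}

# A discourse as a list of bags; a string entry is an obviation marker.
discourse = [Counter(["pretty", "school"]), "x", Counter(["octopi"]), "x"]

def resolve(item):
    """Replace an obviation marker with the bag it was bound to."""
    return obviation[item] if isinstance(item, str) else item

resolved = [resolve(item) for item in discourse]
print(resolved[1] is resolved[3])  # True: both uses of x are the same bag
```

Nothing here says how the bags relate to each other; the marker only records that two spots in the discourse are about the same thing.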

So a whole language can be made with:

x words, say 100
y bag markers, say 50-100

Likely usage pattern
Beginners would have large bags of words, and a discourse would have few bags. Experts would have many small bags of words and extensive obviation markers. Over time, the community would accidentally project more familiar structures over these bags of words and we’d have (deep) tree-based syntaxes all over again.
