So I’m trying to create a database schema for dictionary work with toki pona. I’d like it to be reusable for other languages, so I’m thinking about which concepts in lexicography are useful and universal.
Lemmatization. Picking the dictionary form from all possible forms. In the case of toki pona, “tomo pi telo nasa” is a common set phrase for “bar.” Depending on who you ask, a large bar could be “tomo suli pi telo nasa” or “tomo pi telo nasa suli.” Likewise, “telo nasa wawa” is equivalent to “telo wawa nasa.” I’m not sure which is the better form for a dictionary.
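One way to sidestep the "which form is better" question in a schema is to store every attested form as a variant and mark one of them as the headword. Here is a minimal sketch using Python's built-in sqlite3. The table names (lemma, variant), the column names, and the choice of “tomo suli pi telo nasa” as the headword are all my own assumptions for illustration, not a settled design.

```python
import sqlite3


def build_lemma_db():
    """Create an in-memory database with hypothetical lemma/variant tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE lemma (
            id       INTEGER PRIMARY KEY,
            language TEXT NOT NULL,
            headword TEXT NOT NULL,   -- the form chosen as the dictionary form
            gloss    TEXT,
            UNIQUE (language, headword)
        );
        CREATE TABLE variant (
            id       INTEGER PRIMARY KEY,
            lemma_id INTEGER NOT NULL REFERENCES lemma(id),
            form     TEXT NOT NULL,   -- any attested form, headword included
            UNIQUE (lemma_id, form)
        );
        """
    )
    return conn


def add_lemma(conn, language, headword, gloss, variants=()):
    """Insert a lemma and register the headword plus its variants as forms."""
    cur = conn.execute(
        "INSERT INTO lemma (language, headword, gloss) VALUES (?, ?, ?)",
        (language, headword, gloss),
    )
    lemma_id = cur.lastrowid
    for form in (headword, *variants):
        conn.execute(
            "INSERT INTO variant (lemma_id, form) VALUES (?, ?)",
            (lemma_id, form),
        )
    return lemma_id


def lookup(conn, language, form):
    """Return (headword, gloss) for any known form, or None if unknown."""
    return conn.execute(
        """
        SELECT lemma.headword, lemma.gloss
        FROM variant
        JOIN lemma ON lemma.id = variant.lemma_id
        WHERE lemma.language = ? AND variant.form = ?
        """,
        (language, form),
    ).fetchone()


conn = build_lemma_db()
# Which form is the headword is an arbitrary choice here.
add_lemma(
    conn,
    "toki pona",
    "tomo suli pi telo nasa",
    "large bar",
    variants=["tomo pi telo nasa suli"],
)
print(lookup(conn, "toki pona", "tomo pi telo nasa suli"))
```

With this layout, changing my mind about the dictionary form later only means updating one headword column; every variant still resolves to the same entry.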
Lexical Gaps. Some things translate poorly. Good examples are things like usual foods; bad examples include things like a particular emotional stage.
And rephrasing Wikipedia, there are all these categories of utterances that have overlapping qualities with words, in the sense of things you need to memorize and might want in a dictionary, regardless of the particular linguistic taxonomy or mental model you have for language.
lexeme – a single word ignoring inflected forms.
polylexemic word group
Words, e.g., “cat”, “tree”.
Phrasal verbs, such as “put off” or “get out”. A Germanic language thing.
Separable verb – Phrasal verbs that appear both as a single word and as a phrasal verb
Polywords, e.g., “by the way”, “inside out”.
Collocations, e.g., “motor vehicle”, “absolutely convinced”.
Institutionalized utterances, e.g., “I’ll get it”, “We’ll see”, “That’ll do”, “If I were you”, “Would you like a cup of coffee?”
Idioms, e.g., “break a leg”, “was one whale of a”, “a bitter pill to swallow”.
Sentence frames and heads, e.g., “That is not as…as you think”, “The problem was”.
Text frames, e.g., “In this paper we explore…; Firstly…; Secondly…; Finally …”.
Noun-modifier semantic relations, e.g., “cold virus” vs. a virus that is cold.
‘Free Combination’ ↔ ‘Bound Collocation’ ↔ ‘Frozen Idiom’
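To see how these categories might land in a schema, here is a rough sketch (again Python with sqlite3) of a single entry table. It has a kind column drawn from the list above and an optional fixedness column for the free/bound/frozen scale. The table name, column names, and kind labels are hypothetical, and the fixedness values I assign to the sample rows are my own guesses rather than anything authoritative.

```python
import sqlite3

# Hypothetical labels based on the categories listed above.
ENTRY_KINDS = (
    "word", "phrasal_verb", "separable_verb", "polyword", "collocation",
    "institutionalized_utterance", "idiom", "sentence_frame", "text_frame",
)
FIXEDNESS = ("free", "bound", "frozen")


def quoted(values):
    """Render a tuple of fixed labels as a SQL list of string literals."""
    return ", ".join(f"'{v}'" for v in values)


def build_entry_db():
    """Create an in-memory entry table constrained to the known labels."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        f"""
        CREATE TABLE entry (
            id        INTEGER PRIMARY KEY,
            language  TEXT NOT NULL,
            form      TEXT NOT NULL,
            kind      TEXT NOT NULL CHECK (kind IN ({quoted(ENTRY_KINDS)})),
            -- NULL is allowed: not every entry needs a fixedness rating
            fixedness TEXT CHECK (fixedness IN ({quoted(FIXEDNESS)}))
        )
        """
    )
    return conn


def add_entry(conn, language, form, kind, fixedness=None):
    """Insert one entry; raises sqlite3.IntegrityError on an unknown label."""
    conn.execute(
        "INSERT INTO entry (language, form, kind, fixedness) VALUES (?, ?, ?, ?)",
        (language, form, kind, fixedness),
    )


def forms_of_kind(conn, kind):
    """Return all stored forms of a given kind, in alphabetical order."""
    rows = conn.execute(
        "SELECT form FROM entry WHERE kind = ? ORDER BY form", (kind,)
    )
    return [form for (form,) in rows]


conn = build_entry_db()
add_entry(conn, "English", "cat", "word")
add_entry(conn, "English", "motor vehicle", "collocation", "bound")
add_entry(conn, "English", "break a leg", "idiom", "frozen")
add_entry(conn, "English", "a bitter pill to swallow", "idiom", "frozen")
print(forms_of_kind(conn, "idiom"))
```

Keeping everything in one table means a dictionary lookup doesn't care whether the thing is a word or an idiom, which fits the idea that all of these are just things you need to memorize.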