What is an erasing cross-compiler? Is it conlang related?

Imagine you had a script that took every English sentence and replaced “ain’t” with “isn’t”. That would be a one-rule cross-compiler. I think this idea is very powerful and applicable to evolving a conlang without actually teaching people fancy words like “isn’t”.
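Just to make the idea concrete, here is a minimal sketch of that one-rule cross-compiler. The function name compileAint is made up for this post, and the rule only knows about this single word.

```typescript
// A one-rule "cross-compiler": rewrite every "ain't" as "isn't".
// It accepts both the straight (') and the curly (’) apostrophe
// and keeps a leading capital letter if there is one.
function compileAint(sentence: string): string {
  return sentence.replace(
    /\b(a)in(['’])t\b/gi,
    (_match: string, first: string, apostrophe: string) =>
      (first === "A" ? "Isn" : "isn") + apostrophe + "t"
  );
}

console.log(compileAint("That ain't right.")); // prints "That isn't right."
```

Everything that is not “ain’t” passes through untouched, which is the whole point: the output is ordinary English that any existing brain can already run.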

All compilers create code that runs in a runtime or an actual computer. Natural languages, like zombies, run on brains. If a runtime can’t make sense of “isn’t”–maybe they’re from Alabama–then we would need a new runtime or would have to train the person to understand the new syntax.

In the world of web browsers, there is a programming language called JavaScript. It executes code in your browser and might do one of a million things. The JavaScript runtime is the only runtime that exists on virtually every machine in the world, so software developers are keen to write for it. But they have to use the existing syntax, which is clunky and feature-impoverished because the standard was set a very long time ago. People upgrade their browsers slowly. When a new feature is added to JavaScript, it needs to be added to about five different runtime flavors: Internet Explorer, Chrome, Firefox and so on. This takes huge amounts of time, so developers tend to target the oldest runtimes to make sure everyone’s browser understands the syntax well enough to execute it. This barrier to progress has led to many strategies for dealing with the crappy, but frozen, language specification. (JavaScript’s real politics are more complicated; I’m oversimplifying.)

In the world of languages, people learn a language’s syntax and have a hard time dealing with innovations. Norwegian and Swedish are very close, but different enough for people to act like they are different languages. Changing a language by decree is pretty hard: there are so many runtimes (read: brains) out there that are already set in their habits, using the old syntax, vocabulary and so on.

In the JavaScript world, a smart guy working for Microsoft wrote a cross-compiler and language called TypeScript. Cross-compilers translate one language to another. In this case, the source language is a superset of the target language.

A superset language is like the language of lawyers as compared to the language of children. A lawyer can understand a child’s language, but the child is going to be blown away by the vocabulary and fiercely complex discourse of a lawyer. In computing, C++ is (very nearly) a superset of C: a C++ compiler can compile most C code, but not vice versa, because C++ has too many additions for C compilers to understand.

When a superset language compiles to a subset, things are erased or replaced with the corresponding idiom.

For example, in TypeScript, a function can signal to the compiler what type each of its parameters is. This allows the compiler and tooling to catch mistakes. Plain JavaScript has no way of saying that the first call below is a mistake:

function multiply(a, b) { return a * b }
// invalid
multiply(5, "cat")
// valid
multiply(5, 5)

In TypeScript it is written

function multiply(a: number, b: number) { return a * b }

In this case, it is obvious that multiply doesn’t involve breeding cats, but only numbers. This is also where the name type erasing comes from: when the code is compiled, the type annotations and the extra syntax are erased.

But we can’t give TypeScript to browsers. No browser understands TypeScript. So we compile it down to

function multiply(a, b) { return a * b }

Note that it is exactly the same as ordinary JavaScript. It is not only executable by all existing browsers, but human readable as well. Some cross-compilers achieve their result by creating something that runs, but is otherwise wildly different from what handwritten code looks like.
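To show what “erasing” means mechanically, here is a toy sketch. This is not how the real TypeScript compiler works (it parses the code properly); it is a regular-expression hack that assumes each parameter annotation is a single simple type name such as number or string. The name eraseParamTypes is made up for this post.

```typescript
// Toy type eraser: strips ": type" from the first parameter list it finds.
// Assumption: each annotation is one plain identifier (number, string, ...).
function eraseParamTypes(source: string): string {
  return source.replace(/\(([^()]*)\)/, (_match: string, params: string) => {
    const erased = params
      .split(",")
      // Drop a trailing ": type" from each parameter, then tidy the spacing.
      .map((p) => p.replace(/\s*:\s*[A-Za-z_]\w*\s*$/, "").trim())
      .join(", ");
    return "(" + erased + ")";
  });
}

const typed = "function multiply(a:number, b: number) { return a * b }";
console.log(eraseParamTypes(typed));
// prints "function multiply(a, b) { return a * b }"
```

The output is the same line a person would have written by hand in JavaScript, which is what makes an erasing compiler pleasant to live with.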

How can we use this idea for a conlang? If the conlang is suitable for description with a formal grammar, then you can create a parse tree for a sentence. This allows you to do interesting things like colorizing certain words by part of speech, machine glossing to English, formatting as interlinear gloss and so on. But you still have to work within the constraints of the existing syntax.

When I wrote the parser for toki pona, I realized that it is extremely hard to identify prepositions, along with a few other situations. So I essentially created a few annotations and conventions, such as using # for numbers, putting commas before all prepositions when used as prepositions, quotes for direct speech and quotes for foreign text, and dashes for compound words. These narrow the number of alternative parsings down to a manageable point where an amateur can create a parse tree for toki pona.

So where is this heading? Wouldn’t it be cool to write a toki pona syntax that is a superset of existing toki pona, but compiles to ordinary toki pona readable by anyone with a basic understanding of toki pona?
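Here is a rough sketch of what the erasing step could look like for three of the annotations mentioned above. The exact spellings are assumptions made for this example (a # glued to the front of a number word, a comma right before a preposition, a dash inside a compound word), and so are the preposition list and the name eraseAnnotations; my actual parser may differ in the details.

```typescript
// Prepositions to look for after a comma (assumed list for this sketch).
const PREPOSITIONS = ["lon", "tawa", "tan", "kepeken", "sama"];

// Erase the annotations so that ordinary toki pona is left over.
function eraseAnnotations(annotated: string): string {
  // Matches a comma (plus surrounding spaces) that sits right before a preposition.
  const commaBeforePreposition = new RegExp(
    "\\s*,\\s*(?=(?:" + PREPOSITIONS.join("|") + ")\\b)",
    "g"
  );
  return annotated
    .replace(commaBeforePreposition, " ")    // "tawa, kepeken" -> "tawa kepeken"
    .replace(/#(?=[a-z])/g, "")              // "#tu" -> "tu"
    .replace(/([a-z])-(?=[a-z])/g, "$1 ");   // "tomo-tawa" -> "tomo tawa"
}

console.log(eraseAnnotations("jan #tu li tawa, kepeken tomo-tawa"));
// prints "jan tu li tawa kepeken tomo tawa"
```

Commas that are not followed by a preposition are left alone, so sentences that never used the annotations come out unchanged.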

With this sort of toki pona, you can write more tools for toki pona word processing and simplify certain steps. For example, one point of complexity is dealing with proper modifiers. There are thousands of cities, and their names are slightly different in each language. If they could be marked and written in regular English or French, then the toki pona compiler could automatically convert French to Kanse and Washington to Wasinton. This is just the tip of the iceberg; more in the next post.
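As a small taste of the proper-modifier idea, here is a sketch that uses a lookup table. The square-bracket marker, the table and the name compileProperNames are all invented for this example; a real tool would need a far bigger table or actual transliteration rules.

```typescript
// Hypothetical lookup table: marked English name -> toki pona proper modifier.
const PROPER_NAMES = new Map<string, string>([
  ["French", "Kanse"],
  ["Washington", "Wasinton"],
]);

// Replace every [Name] marker with its toki pona form.
// Unknown names keep their brackets so the writer can spot them.
function compileProperNames(text: string): string {
  return text.replace(
    /\[([^\[\]]+)\]/g,
    (whole: string, name: string) => PROPER_NAMES.get(name) ?? whole
  );
}

console.log(compileProperNames("mi tawa ma tomo [Washington]"));
// prints "mi tawa ma tomo Wasinton"
```

Leaving unknown names marked instead of guessing keeps the compiler honest: the output is either ordinary toki pona or visibly unfinished.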
