An experiment with parsing Japanese

It’s really impressive that Dialog can handle the transition from an SVO to an SOV language so easily. I was also surprised that you included so many variations on the words, even half-width kana and romaji written with Japanese input. Keep up the good work. :slight_smile:

1 Like

Thank you! It would be nice if there were a way to convert the katakana and romaji into hiragana so that (dict $) predicates could be shorter; that might be possible with rewrite rules, if words could be broken down into characters.

Tried implementing directions, but I encountered something unfortunate:

./dialogc -o JP.aastory -t aa test.dg japanese.dg stdlib.dg
Error: Too many distinct unicode characters in the text.

I guess this is a limitation of Dialog. I can of course remove katakana and full-width romaji characters to free up some unicode space. Worst case, I’ll drop kanji support and do everything in hiragana, like an old NES-era RPG. But I’d like to avoid that if I can, so I’m going to dig through Dialog’s source to see if the number can be increased, or at least learn the limit so I can plan which characters to use wisely.

EDIT: Looking at backend_aa.c, it seems a total of 127 unique unicode characters can be used. This is hard-coded, and I’m betting it wouldn’t be easy to change. I’m going to do some thinking about how best to use this limited palette.

EDIT 2: Hiragana alone takes 87 unicode slots for all variations, so katakana is out. That leaves 40 characters for punctuation and kanji. For punctuation, I think 「。、」 are the only ones really necessary, which leaves 36 for kanji. There are more verbs in the standard library than that (even allowing that some would share a kanji, like LOOK and SEARCH), so I can’t cover all of them. Relations and directions wouldn’t take too many kanji, but it would be a bit odd to have only those in kanji. So I think I am going to limit myself to hiragana only for now.

1 Like

Aww. That’s too bad. :confused: I’m guessing the limit is there to adhere to Inform standards? I doubt JavaScript would require that limit.

I wonder though… If you’re coming up against the wall in just the standard library, does that mean the entire game itself will have to be written in just hiragana? That’s kind of rough.

The 127 ASCII characters are exempt from the unicode limit, so thankfully I can still write the code itself using the standard library as a basis.

Ah. That’s good to hear. However, what I actually meant was the game (like room descriptions and the story and such), not the code itself. Will all room descriptions be in hiragana?

A user of the library can still put kanji or katakana into (look *) and the like; they’ll just have a limit on how many unique kanji they can use (depending on how many unique hiragana I use in the library), if that’s what you were asking. ASCII is allowed as well.
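For example, a game using the library could still define something like this (a made-up room, purely for illustration), with each distinct kanji drawing from the remaining unicode budget:

#台所
(name *)	台所
(room *)
(look *)	小さな台所です。北に扉があります。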

Kind of related: I am thinking now that it might be relatively easy to allow switching between hiragana and romaji output in the library, via some sort of (romaji output) flag tested with (if) … (then), plus a romaji on/off command. The main requirement would be that I have both a (romaji *) and a (hiragana *) name for everything, and define something like:

(name $thing)
    (if) (romaji output) (then) (romaji $thing) (else) (hiragana $thing) (endif)
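The on/off command itself could then be a small pair of actions. A sketch, assuming (romaji output) is a global flag toggled with (now); the action names and messages are placeholders:

%% Hypothetical toggle commands for the (romaji output) global flag.
(understand [romaji on] as [romaji on])
(perform [romaji on])
	(now) (romaji output)
	Romaji output is now on.

(understand [romaji off] as [romaji off])
(perform [romaji off])
	(now) ~(romaji output)
	Romaji output is now off.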

I might just do this.

1 Like

Looking through the specification of the Å-machine, there are several references to the word size being “currently always 2”, which suggests to me that @lft has thought about allowing greater word sizes, which could permit a greater number of unicode characters per story. So maybe eventually a full kana + kanji Dialog library isn’t completely impossible after all. : )

Hi!

Yes, this is a limitation of the current version of the Å-machine. The long-term plan is to support both 16-bit and 32-bit stories in the same file format, and have the compiler pick the larger memory model if necessary.

In the 32-bit format, there will be no limits on the number of unicode characters. In addition, the maximum number of objects, dictionary words, and heap storage cells will increase from about 8000 to something ridiculously large.

Interpreters on resource-limited systems can refuse to run 32-bit stories upfront (as they can already do e.g. if the file is too big).

There is no 128-character limit in the interactive debugger, so you should be able to develop the library and run stories in that environment, until a 32-bit Å-machine is available.

4 Likes

Cool! Good to know all of this!

So I don’t know much about the internals of Dialog—but on the Z-machine version 5 and later, it’s possible to ask the interpreter to read in a line of input, put it in a text buffer, let Z-code mess around with that buffer, and then let the interpreter continue its tokenizing and such.

This wouldn’t be very elegant, but wouldn’t it be possible for Dialog to have an entry point (call it preprocess $in into $out or whatever) that takes a list of integers and returns a list of integers, representing the codepoints of the characters? If this entry point were defined, it would be called in between the input-reading and tokenizing steps, and could (say) convert all hiragana and katakana losslessly to roumaji through some tedious-but-boring arithmetic. All dicts could then contain only roumaji, while the player could type in whatever orthography they liked (fullwidth roumaji, katakana, etc…) and have the game understand it.
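To give a flavor of the arithmetic: folding katakana into hiragana, at least, is a fixed codepoint offset of 96 (the full kana-to-roumaji conversion would just be more of the same, case by case). A rough sketch of that one step, with the hypothetical entry point written as a real predicate:

%% Hypothetical hook over codepoint lists. Katakana ァ..ヶ
%% (codepoints 12449..12534) sits exactly 96 above hiragana ぁ..ゖ,
%% so shifting those values down folds katakana into hiragana;
%% everything else passes through unchanged.
(preprocess [] into [])
(preprocess [$C | $In] into [$D | $Out])
	(if) ($C > 12448) ($C < 12535) (then)
		($C minus 96 into $D)
	(else)
		($D = $C)
	(endif)
	(preprocess $In into $Out)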

This wouldn’t get around the 128-character limit, and I have no idea what the Å-machine does when it reads input, but it seems like a not-too-impossible place to start?

That seems like a pretty cool trick, but ideally you’d want kanji along with kana for both input and output. I don’t think your method would be able to handle kanji if the dict was in romaji.

A lot of Japanese words are pronounced the same, and the kanji help clarify which meaning you are after. For example, 着る, 切る, and 斬る mean wear, cut, and kill, but they’re all きる (“kiru”) in hiragana. You can also make the distinction through context, which is how speech works, but it’s just a lot more direct with kanji. Children’s games like Pokemon write all text in hiragana with spaces (since young players don’t know many kanji), but adults actually find this more difficult to read.

Sorry if you knew all that. Just trying to clarify why kanji is important along with kana.

1 Like

Oh, no, it’s a very good point! I’m just not sure how that could best be handled if we want players to be able to type in different orthographies.

The easiest way to handle it is probably to not do anything at all to kanji in preprocessing, and then have synonyms handled within the game code: the verb “kill” could be given the synonyms “kiru” and “斬ru”, for example. That would make the preprocessor completely general, so it wouldn’t need to know which particular kanji the game author was using.
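Concretely, since preprocessing would leave kanji alone while folding kana to roumaji, 斬る would arrive as the single word 斬ru, and both spellings could map to the same action. A sketch using the standard library’s (understand …) hook, with a made-up action name:

%% Both the all-roumaji spelling and the kanji+roumaji spelling
%% are understood as the same (hypothetical) action.
(understand [kiru] as [kill])
(understand [斬ru] as [kill])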

It’s always tempting to allow low-level platform details to seep up into a high-level language, especially when it seems to solve a practical problem right now, at a low cost.

But there’s a good reason for maintaining a strict separation here: Dialog is not tied to a particular character encoding. This allows the same stories to run on the Z-machine, which uses the ZSCII character set (a peculiar 8-and-10-bit hybrid with 97 user-definable glyphs) and the current Å-machine, which uses an 8-bit encoding (ASCII + 128 user-definable glyphs). The same Dialog code can also run in the interactive debugger, which uses UTF-8 internally, and it could in principle be compiled to run on a UCS-2-based system (like Javascript, or perhaps a future version of the Å-machine in 32-bit mode).

If we expose codepoints to the high-level program, then they would be different depending on the platform. For instance, å is usually represented by character code 201 on the Z-machine, but it’s 229 in Unicode/UCS-2, and it can be any value in the range 128-255 on the current Å-machine. Therefore we would end up with either 1. stories and libraries that are tied to a particular platform, or 2. a compatibility layer (or “shim”) in every platform, translating codepoints back and forth between Unicode and the native encoding. In my opinion, both of these options are problematic.

When targeting the current version of the Å-machine, the compiler maps every character appearing in the source code to a unique byte value. The resulting 8-bit encoding scheme performs well on vintage hardware, but as a consequence, different storyfiles can have different character encodings. In fairness, Unicode does play a part in this: There is a table in the storyfile, mapping each non-ASCII character to a Unicode glyph. But interpreters can consult this table at startup, piece together a font based on it, and then throw the table away. And they can refuse upfront to run a story if it contains an unsupported glyph. There is never a need to print an arbitrary unicode character that wasn’t in the table, so there is never a need to keep a huge full-unicode font around at runtime. That would be necessary if codepoints were exposed to the high-level program.

That is why I think the cleanest option is to allow the Å-machine to run in one of two modes: a small memory model (16-bit words and 8-bit characters) and a large memory model (32-bit words and 16-bit characters). They should be fully compatible at the Dialog source-code level. The story author shouldn’t have to worry about what memory model to use. The compiler should just automatically select the most appropriate memory model based on the number of characters, words, and objects in the story. When the story grows too large, it won’t run on old hardware anymore—but the semantics of the language will remain exactly the same, regardless of how a particular character is represented at the bit level.

3 Likes

If I’m reading the Z-machine spec right, the “Unicode translation table” can be modified dynamically. I have no idea how well this is supported and whether it can help with the issue here.

I am doing some thinking about the pronoun system.

Japanese has three ways to say “it” that I am aware of: これ (“this thing”, something close), それ (“that thing”, something a bit farther away), and あれ (“that thing over there”, something far away). While I could just have all of these be the same “it”, I wonder if it would be possible to do something smart. Maybe これ could be things on your person, それ things on someone else’s person, and あれ anything in another room? What’s not clear to me is how to treat things that are in the room but not on anyone’s person: it’s nebulous who is closer to those objects, so we can’t make sense of it spatially. I’ll have to think some more.
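For what it’s worth, the clear-cut cases could look something like this. A sketch with a made-up (pronoun $ could mean $) helper; nothing like it exists in the standard library:

%% Hypothetical mapping from each pronoun to its candidate objects.
(pronoun @これ could mean $Obj)
	(current player $Player)
	{ *($Obj is #heldby $Player) (or) *($Obj is #wornby $Player) }
(pronoun @それ could mean $Obj)
	*(animate $Person)
	~(current player $Person)
	{ *($Obj is #heldby $Person) (or) *($Obj is #wornby $Person) }
(pronoun @あれ could mean $Obj)
	(current room $Here)
	*(room $Room)
	~($Room = $Here)
	*($Obj is #in $Room)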

I am also thinking about changing what object gets “noticed” by the pronoun system: because Japanese has an explicit topic marker は, maybe it would make sense to have that be what gets noticed for pronouns. I’d probably need to rephrase certain messages, like “物 を 持ちます” (you take the thing) into “物 は 持たれます” (the thing is taken). Again, something I’ll be thinking about.

I’m not sure that それ (sore) means “something a bit farther away”; I think it means “something close to the listener”, as opposed to これ (kore), which is “something close to the speaker” (i.e. “this”). I agree that あれ (are) corresponds to “that”.

So when the voice of the parser says “you’re already holding that”, perhaps それ is appropriate, and when it says “that’s fixed in place”, I would imagine that the choice of あれ or これ would affect whether we perceive the voice of the parser to be embodied in the room or not. But this is speculation on my part, and I’m certainly no expert. How is this usually handled in descriptive passages in books and voice-overs?

As you may know, the Dialog standard library tracks the narrator’s “it” separately from the player’s “it”, so in many situations “it” will actually be ambiguous. A Japanese library could perhaps do the same, and allow the player to refer to any of these objects using any of the three words これ / それ / あれ, but then look at the specific word choice when computing the likelihood of each interpretation. If これ was used, and the narrator’s “it” refers to something worn by the player character, but the player’s “it” refers to something in the room, then the object referred to by the narrator’s “it” was probably intended, and so on.

The topic marker is interesting! I haven’t thought about its implications for a parser game. One way to approach it could be to define ($ は) to print the name of the given object followed by は, but only if it is distinct from the current topic. The same predicate should set the current topic, as a side-effect.
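A minimal sketch of that idea, assuming (current topic $) has been declared as a global variable (the name is invented):

%% Print “NAME は” only when NAME isn’t already the topic, and
%% make it the current topic either way.
($Obj は)
	(if) ~(current topic $Obj) (then)
		(name $Obj) は
	(endif)
	(now) (current topic $Obj)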

3 Likes

I have found a way to parse Japanese without artificially requiring spaces between words. The trick is to break the sentence down into individual symbols, then transform that list of symbols into a list of recognized word-tokens. Currently this requires defining a (token $) predicate for every recognizable Japanese word-form, which feels very onerous. I tried telling Dialog that every dict word is a token (so only verbs and particles would need to be explicitly defined), but couldn’t figure out the right syntax for that. In any case, here is the proof of concept, which only lets you (try to) look north or look at an apple.

> りんごを見る
TODO: Look at object #りんご

> 北を見る
TODO: Look in direction #北

And the code:

%% (grab $N from $List into $Taken leaving $Rest): take the first
%% $N elements of a list.
(grab 0 from $Everything into [] leaving $Everything)
(grab $N from [$Head | $MoreIn] into [$Head | $MoreOut] leaving $Remainder)
	($N minus 1 into $Nm1)
	(grab $Nm1 from $MoreIn into $MoreOut leaving $Remainder)

%% (tokenize $Symbols into $Tokens): match a known token against the
%% front of the symbol list, then recurse on whatever remains.
(tokenize [] into [])
(tokenize $Symbols into [$CandidateToken | $RemainingTokens])
	*(token $CandidateToken)
	(split word $CandidateToken into $CandidateSymbols)
	(length of $CandidateSymbols into $N)
	(grab $N from $Symbols into $CandidateSymbols leaving $RemainingSymbols)
	(tokenize $RemainingSymbols into $RemainingTokens)

%% For now, every recognizable word-form must be listed explicitly.
(token @きた)
(token @北)
(token @みる)
(token @見る)
(token @りんご)
(token @を)

#北
(direction *)
(name *)	北
(dict *)	きた

(understand $Sentence as [$Target を 見る])
	%% Re-glue the input words and break them into single symbols.
	(join words $Sentence into $Glob)
	(split word $Glob into $Symbols)
	(tokenize $Symbols into $Tokens)
	%% The noun phrase is whatever precedes the particle を.
	(split $Tokens by [を] into $NounPhrase and $Verb)
	*($Candidate is one of [みる 見る])
	([$Candidate] = $Verb)
	{
		(understand $NounPhrase as direction $Target)
	(or)
		(understand $NounPhrase as any object $Target)
	}

(perform [$Target を 見る])
	(if) (direction $Target) (then)
		TODO: Look in direction $Target (line)
	(else)
		TODO: Look at object $Target (line)
	(endif)

(current player #player)
(room #room)
(#player is #in #room)

#りんご
(name *)	りんご
(item *)
(* is #in #room)

There is obviously a mountain of work remaining to make something useful out of this, but I am excited by the possibilities.

3 Likes

There’s (unknown word $) for testing whether a word exists in the dictionary or not. But it won’t be efficient.

If there were a corresponding (known word $) that could be multi-queried to backtrack over words in the dictionary, then that might have been handy. Alas, there isn’t, and it would still involve a lot of brute-force iteration.

The most useful addition would be a built-in for splitting a word into a known dictionary word and an arbitrary ending (backtracking over every possible match). Such a built-in could also be used to implement removable word endings at the library level. Hmm, I’ll think about this.

3 Likes

A quick update, and a question. I am working on compound noun phrases (starting with directions) in my space-free parsing; so far you can say “northとsouth” and you will move north, then south. (I’m not 100% sure と is the right particle to use as shorthand for “…行って…行く”, but then “north and south” as shorthand for “go north then go south” is a little unusual in English, too.)

Right now I’ve hardcoded @north and @south as recognizable tokens. I would like to be able to say “every dict word is a token”, but I’ve tried many things and can’t quite figure out the syntax. Does anyone know how I can do this? Thanks in advance. : )

(grab 0 from $Everything into [] leaving $Everything)
(grab $N from [$Head | $MoreIn] into [$Head | $MoreOut] leaving $Remainder)
        ($N minus 1 into $M)
        (grab $M from $MoreIn into $MoreOut leaving $Remainder)

(tokenize [] into [])
(tokenize $Symbols into [$CandidateToken | $RemainingTokens])
        *(token $CandidateToken)
        (split word $CandidateToken into $CandidateSymbols)
        (length of $CandidateSymbols into $N)
        (grab $N from $Symbols into $CandidateSymbols leaving $RemainingSymbols)
        (tokenize $RemainingSymbols into $RemainingTokens)


(token @north)
(token @south)
(token $Token)
        *(compounder-token $Token)

(compounder-token @と)
(compounder-token @、)
(compounder-token @,)


(parse direction list $Words [$Head | $Tail])
        (join words $Words into $Glob)
        (split word $Glob into $Symbols)
        *(tokenize $Symbols into $Tokens)
        *(compounder-token $Compounder)
        (split $Tokens by $Compounder into $Left and $Right)
        *(parse direction $Left $Head)
        *(parse direction list $Right $Tail)

#southroom
(room *)
(from * go #north to #northroom)

#northroom
(room *)
(from * go #south to #southroom)

#player
(current player *)
(* is #in #southroom)

Unfortunately there is no way to do this at the moment. I have been meaning to add a built-in for backtracking over every way of splitting a word into two parts, where the first part is always a recognized dictionary word, starting with the longest. So if you have north and northwest in the dictionary, and you get northwestern as part of the player’s input, then a multi-query to this predicate would first return northwest and ern, and then north and western. Such a mechanism could be used for your Japanese parser (with a multi-query), and also to implement removable word endings at the library level (with a normal query, always going with the longest match).
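Purely hypothetically, with an invented name for the built-in, the behavior would be:

%% *(split word @northwestern into $W plus $E)
%%   first solution:  $W = northwest, $E = ern
%%   second solution: $W = north,     $E = western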

2 Likes