Do any Z-machine interpreters not support Unicode translation tables?

andrewj · May 24, 2025, 10:45am

I believe that is the best way to go, and I’ll probably do the same thing for Hintweaver (which currently assumes that a provided Unicode table will always work).

Another idea is a new in-story directive like (new unicode table) instead of (or in addition to) the command-line option.

Mike_G · May 24, 2025, 3:39pm

Well, I’m off to write a zcode game that communicates entirely through emojis.

Draconis · May 24, 2025, 8:39pm

Sadly, the Z-machine does not currently support emojis. Maybe in spec 1.2!

(It assumes all Unicode codepoints fit into a 16-bit word, so BMP only.)
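To make the 16-bit limit concrete, here is a minimal Python sketch. The function name `fits_in_zmachine_word` is made up for illustration and is not part of any interpreter or compiler; it only checks whether a code point lies in the Basic Multilingual Plane.

```python
def fits_in_zmachine_word(ch: str) -> bool:
    """Return True if the character's code point fits in one 16-bit word (BMP)."""
    return ord(ch) <= 0xFFFF


if __name__ == "__main__":
    # U+0103 and U+0142 are in the BMP; U+1F600 (an emoji) is not.
    for ch in ["\u0103", "\u0142", "\U0001F600"]:
        print(f"U+{ord(ch):04X} fits in 16 bits: {fits_in_zmachine_word(ch)}")
```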

Draconis · May 24, 2025, 8:40pm

It would be cool if some compilation directives could be included in the source text (e.g. including one file from another), but sadly the compiler architecture doesn’t currently support it.

Draconis · May 24, 2025, 10:35pm

This is now implemented on the z-unicode branch:

- All Unicode characters found in dictionary words are added to the end of the Unicode translation table.
- The table is normally preloaded with the default values; the --no-zscii or -Z command-line option makes it start with an empty table instead.
- The Unicode translation table (and header extension table) is only included in the output if it’s non-empty and differs from the default one.

For example:

(current player #player)
(#player is #in #room)
(room #room)
(look #room) A room built to hold ăppłes and ĂPPŁES.

#apple
(* is #in #room)
(name *) ăppłe
(an *)
(item *)
(* is handled)

Here, ă and ł are added to ZSCII, while Ă and Ł are not, since they only appear in output (a room description), not input (an object name).
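The three rules listed earlier in this post can be sketched roughly as follows. This is a Python illustration of the described behaviour, not the compiler's actual code: `build_unicode_table` and `DEFAULT_TABLE` are hypothetical names, the default table is an abbreviated stand-in (the real one has more entries), and skipping characters that are already in the table is an assumption.

```python
# Abbreviated stand-in for the default Unicode translation table (assumption:
# the real default table is longer; only its first few entries are listed here).
DEFAULT_TABLE = list("\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc\u00df")


def build_unicode_table(dictionary_words, start_empty=False):
    """Return the table to write to the story file, or None if none is needed.

    start_empty models the effect of the --no-zscii / -Z option.
    """
    table = [] if start_empty else list(DEFAULT_TABLE)
    for word in dictionary_words:
        for ch in word:
            if ord(ch) < 128:
                continue  # plain ASCII needs no table entry
            if ord(ch) > 0xFFFF:
                raise ValueError(f"U+{ord(ch):X} does not fit in 16 bits")
            if ch not in table:
                table.append(ch)  # new characters go at the end
    # Only emit the table if it is non-empty and differs from the default.
    if not table or table == DEFAULT_TABLE:
        return None
    return table


if __name__ == "__main__":
    print(build_unicode_table(["\u0103pp\u0142e"]))
    print(build_unicode_table(["apple"]))
```

With the dictionary word from the example, only the two lowercase characters are appended; the uppercase ones never reach the function because they appear only in output text.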

As before, Dialog still doesn’t handle casing properly for characters outside ASCII, so if an object’s name is capitalized, you’ll have to explicitly include a lowercase synonym. That shouldn’t be too hard of a change, though. Dialog already includes a database of Unicode case pairs for Å-machine purposes, which it could consult while building the dictionary.

Mike_G · May 25, 2025, 2:28am

I wasn’t serious.

andrewj · May 25, 2025, 4:55am

It could output them as surrogate pairs, but that assumes interpreters support surrogate pairs, which they probably don’t :-(
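For reference, splitting a code point above U+FFFF into a surrogate pair is standard UTF-16 arithmetic. The Python sketch below shows only that arithmetic; the function name `to_utf16_units` is made up here, and nothing in it implies that any Z-machine interpreter accepts such pairs.

```python
def to_utf16_units(code_point: int) -> list[int]:
    """Split a Unicode code point into one or two 16-bit UTF-16 code units."""
    if code_point <= 0xFFFF:
        return [code_point]
    offset = code_point - 0x10000          # 20 bits remain after the offset
    high = 0xD800 + (offset >> 10)         # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits -> low surrogate
    return [high, low]


if __name__ == "__main__":
    # U+1F600 is an emoji; U+10400 is a Deseret letter.
    for cp in (0x0103, 0x1F600, 0x10400):
        print(f"U+{cp:04X} ->", [f"{u:04X}" for u in to_utf16_units(cp)])
```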

Draconis · May 25, 2025, 5:02am

And unfortunately it wouldn’t work for input, because the Unicode-to-ZSCII mapping also assumes 16-bit codepoints. Bad news for anyone who wants to play IF in Deseret!

andrewj · May 25, 2025, 5:07am

There may be a way…

Draconis · May 25, 2025, 5:09am

Oh? Do tell! I don’t know how much audience there would be for IF with non-BMP input, but I’m always in favor of fewer restrictions.

andrewj · May 25, 2025, 5:39am

Just surrogate pairs again – but yeah, not sure any interpreter would bother doing it.

Some info here:

Mike_G · May 26, 2025, 12:12am

There’s always a way… the question is how much work it is and what the tradeoffs are.

Sure, although this adds a number of pain points. For non-UTF-16 systems (aka anything but Windows), it’s extra work to encode/decode the input/output. It raises the issue of being unable to type that last character of input even though there’s one byte in the buffer because it would take two bytes. Surrogates also raise other edge cases, such as missing halves or missing entries in the extra characters table.
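The missing-halves problem can be illustrated with a small decoder. This is a Python sketch under stated assumptions: `decode_utf16_units` is a hypothetical helper, and substituting U+FFFD for an unpaired surrogate is just one possible policy, not something any Z-machine specification prescribes.

```python
REPLACEMENT = 0xFFFD  # assumption: unpaired surrogates become U+FFFD


def decode_utf16_units(units):
    """Combine 16-bit units into code points, tolerating unpaired surrogates."""
    out = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: only valid if a low surrogate follows.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                low = units[i + 1]
                out.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
                i += 2
                continue
            out.append(REPLACEMENT)
        elif 0xDC00 <= unit <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            out.append(REPLACEMENT)
        else:
            out.append(unit)
        i += 1
    return out


if __name__ == "__main__":
    print([hex(cp) for cp in decode_utf16_units([0xD83D, 0xDE00])])
    # A buffer that was cut off after the first half of a pair:
    print([hex(cp) for cp in decode_utf16_units([0x41, 0xD83D])])
```

The second call models the buffer case: if only one slot is left, the pair cannot be stored whole, so the decoder sees a lone high surrogate.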

Dannii · May 26, 2025, 6:09am

The point where you’re considering non-standard z-code hacks like surrogate pairs is the point where you should just switch to Glulx.

Draconis · June 12, 2025, 5:23am

Oh, I almost forgot—I implemented a whole system for adding lowercase synonyms, then discovered the comments in the code are wrong. Dialog actually already does that.

So I reverted that, and casing works fine. The one thing that doesn’t work is calling (uppercase) and then printing a Unicode character. I haven’t found an elegant way of handling this; probably the best would be a table of lowercase-to-uppercase mappings in ROM right after the Unicode translation table, which would be consulted when printing an extended ZSCII character. (Wouldn’t handle any Unicode characters outside ZSCII, though.)
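As a rough idea of what such a table might contain, here is a Python sketch that pairs each lowercase entry of a Unicode translation table with its uppercase counterpart when both are present. The name `build_uppercase_map` is hypothetical, Python's `str.upper` stands in for Dialog's case-pair database, and the only layout fact relied on is that extra characters start at ZSCII 155.

```python
def build_uppercase_map(unicode_table, first_zscii=155):
    """Map lowercase ZSCII codes to uppercase ZSCII codes within the table.

    unicode_table is a list of single characters in table order.
    """
    index = {ch: first_zscii + i for i, ch in enumerate(unicode_table)}
    mapping = {}
    for ch, code in index.items():
        upper = ch.upper()
        # Skip characters with no single-character uppercase form, characters
        # that are already uppercase, and uppercase forms absent from the table.
        if len(upper) == 1 and upper != ch and upper in index:
            mapping[code] = index[upper]
    return mapping


if __name__ == "__main__":
    # Table holding U+00E4, U+00C4 and U+0103 (whose uppercase is absent).
    table = ["\u00e4", "\u00c4", "\u0103"]
    print(build_uppercase_map(table))
```

In this example the third entry gets no mapping because its uppercase form is not in the table, which matches the limitation noted above.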