Do any Z-machine interpreters not support Unicode translation tables?

I believe that is the best way to go, and I’ll probably do the same thing for Hintweaver (which currently assumes that a provided Unicode table will always work).

Another idea is a new in-story directive like (new unicode table) instead of (or in addition to) the command-line option.

Well, I’m off to write a zcode game that communicates entirely through emojis. :wink:

2 Likes

Sadly, the Z-machine does not currently support emojis. Maybe in spec 1.2!

(It assumes all Unicode codepoints fit into a 16-bit word, so BMP only.)

It would be cool if some compilation directives could be included in the source text (e.g. including one file from another), but sadly the compiler architecture doesn’t currently support it.

This is now implemented on the z-unicode branch:

  • All Unicode characters found in dictionary words are added to the end of the Unicode translation table
  • The table is normally preloaded with the default values; the --no-zscii or -Z command-line option makes it start with an empty table instead
  • The Unicode translation table (and header extension table) is only included in the output if it’s non-empty and differs from the default one

For example:

(current player #player)
(#player is #in #room)
(room #room)
(look #room) A room built to hold ăppłes and ĂPPŁES.

#apple
(* is #in #room)
(name *) ăppłe
(an *)
(item *)
(* is handled)

Here, ă and ł are added to ZSCII, while Ă and Ł are not, since they only appear in output (a room description), not input (an object name).

As before, Dialog still doesn’t handle casing properly for characters outside ASCII, so if an object’s name is capitalized, you’ll have to explicitly include a lowercase synonym. That shouldn’t be too hard of a change, though. Dialog already includes a database of Unicode case pairs for Å-machine purposes, which it could consult while building the dictionary.

3 Likes

I wasn’t serious. :slight_smile:

It could output them as surrogate pairs, but that assumes interpreters support surrogate pairs, which they probably don’t :-(

1 Like

And unfortunately it wouldn’t work for input, because the Unicode-to-ZSCII mapping also assumes 16-bit codepoints. Bad news for anyone who wants to play IF in Deseret!

There may be a way…

Oh? Do tell! I don’t know how much audience there would be for IF with non-BMP input, but I’m always in favor of fewer restrictions.

Just surrogate pairs again – but yeah, not sure any interpreter would bother doing it.

Some info here:

There’s always a way…the question is how much work is it and what are the tradeoffs.

Sure, although this adds a number of pain points. For non-UTF-16 systems (aka anything but Windows), it’s extra work to encode/decode the input/output. It raises the issue of being unable to type that last character of input even though there’s one byte in the buffer because it would take two bytes. Dealing with surrogates raises other edge cases dealing with missing halves, or missing entries in the extra characters table too.

1 Like

The point where you’re considering non-standard z-code hacks like surrogate pairs is the point where you should just switch to Glulx.

Oh, I almost forgot—I implemented a whole system for adding lowercase synonyms, then discovered the comments in the code are wrong. Dialog actually already does that.

So I reverted that, and casing works fine. The one thing that doesn’t work is calling (uppercase) and then printing a Unicode character. I haven’t found an elegant way of handling this; probably the best would be a table of lowercase-to-uppercase mappings in ROM right after the Unicode translation table, which would be consulted when printing an extended ZSCII character. (Wouldn’t handle any Unicode characters outside ZSCII, though.)