As my work on Dialog continues, I’m currently rigging up the Å-machine to accept non-BMP characters in its character set. Unfortunately, though, this won’t work on the Z-machine, because the current spec limits it to the Basic Multilingual Plane—it can’t handle characters above U+FFFF.
I’m curious if this might change at some point. I know the Z-machine spec hasn’t been updated in over a decade now and the format is generally stable; but supporting Unicode at all is a post-Infocom change. It would be nice to go all the way.
Since the Z-machine works with 16-bit words, using UTF-16 seems the most promising. It seems like the changes (specification-wise) would be small:
Either specify that you can use two @print_unicode instructions to print a surrogate pair, or create a new @print_unicode2 (and @check_unicode2) opcode that takes two words instead of one
Specify that a certain bit being set in Flags 2 or Flags 3 means the Unicode translation table contains 3-byte values instead of 2-byte
In the absence of a gestalt system, games could check for support via the Standard version number in the header, and error out if necessary (in particular, if they need non-BMP codepoints in the translation table).
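For anyone who wants to see what the surrogate-pair option would actually put in the two words, here’s a minimal Python sketch of standard UTF-16 encoding. The function names are mine, purely for illustration; nothing here comes from the Z-machine spec, it only shows which two 16-bit values a game would have to pass to two @print_unicode calls (or one hypothetical @print_unicode2).

```python
def to_utf16_words(code_point):
    """Encode one Unicode code point as one or two 16-bit words (UTF-16)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not characters")
    if code_point <= 0xFFFF:
        # BMP characters fit in a single word, as they do today.
        return [code_point]
    offset = code_point - 0x10000          # 20 bits remain
    high = 0xD800 | (offset >> 10)         # top 10 bits
    low = 0xDC00 | (offset & 0x3FF)        # bottom 10 bits
    return [high, low]


def from_utf16_words(words):
    """Decode one or two 16-bit words back into a single code point."""
    if len(words) == 1:
        return words[0]
    high, low = words
    return 0x10000 + (((high & 0x3FF) << 10) | (low & 0x3FF))


if __name__ == "__main__":
    print([hex(w) for w in to_utf16_words(0x1F600)])
```

So U+1F600 would go out as the pair $D83D, $DE00, and an interpreter would have to hold on to the first word until the second one arrives.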
What do people think? Is there interest in a hypothetical Standard 1.2 (I know @Dannii was working on this at some point)? Or should the Z-machine be considered a legacy format at this point that won’t see any further development?
I wouldn’t mind seeing an extension, but I think it needs a lot of careful thought. Using Flags 3 (not Flags 2) is fine, and a three-byte translation table sounds doable. I think it’s better NOT to specify input or output as UTF-16 or any other encoding; leave that up to the interpreter. That way the standard doesn’t need to involve itself with surrogate pairs and all that ugliness, except to consider them invalid.
At this point in time the Z-Machine is a format that exists largely for the sake of legacy platforms. Non-BMP isn’t going to run on them. Extending the Z-Machine and keeping legacy platforms working are opposing goals.
I don’t think this can be gated by standard 1.2. As you note, we’ve had 1.2 proposals kicking around for a while, so there’s a concept of what 1.2 “should” look like. That would have to be resurrected so there’s complete agreement on what a claim of 1.2 actually means, and I’m not sure that’s feasible.
That includes adding entries to Flags 3, since that’s nominally controlled by the standard.
One backward-compatible possibility is to define some new EXT opcodes. For example, one could be used to “upgrade” the Unicode translation table: pass it the address of the 3-byte-entry table, and now the interpreter uses that as the translation table. It’d have to also include a pointer for the interpreter to write out a value saying “yes, I handled this opcode”, so you could detect whether it is supported. Then decide whether to halt or continue in “legacy” mode.
On the @print_unicode side, same deal. A new EXT opcode which takes two operands (so no need for UTF-16/surrogate pairs), and, as with the translation table, a pointer to be written to if it’s supported. This would make writing out Unicode values a bit verbose in the code, but doable. As in, if you really wanted to you could do this (wonderful pseudo-code mix of C and Inform):
! 😀 is U+1F600
top = $0001;
bottom = $f600;
*result = 0;
@print_full_unicode top bottom result;
if (*result == 0) {
    print ":)";   ! opcode not supported: fall back to plain ASCII
}
Is this worth it? I dunno. But it’d be pretty easy to add support for these to an interpreter.
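If it helps, the top/bottom split in that pseudo-code is just a shift and a mask. Here’s a small Python sketch (helper names are made up for illustration) showing how a compiler could derive the two operands and how an interpreter could put them back together:

```python
def split_code_point(code_point):
    """Split a code point into (top, bottom) 16-bit operands."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return code_point >> 16, code_point & 0xFFFF


def join_code_point(top, bottom):
    """Rebuild the code point from the two 16-bit operands."""
    return (top << 16) | bottom


if __name__ == "__main__":
    top, bottom = split_code_point(0x1F600)
    print(hex(top), hex(bottom))
    print(chr(join_code_point(top, bottom)))
```

Since the highest code point is U+10FFFF, the top operand never exceeds $0010, so there’s plenty of headroom in a 16-bit word.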
The main takeaway is that you’d just have to check for success/failure on each EXT call. Or, alternatively (and probably better), tie the two together: make the “upgrade Unicode table” a generic “check for extended Unicode” that takes an optional Unicode translation table, but always “returns” a value saying “yes I support all these Unicode EXT opcodes”. That way you don’t have to have the new @print_unicode-style opcode “return” anything. You test once on startup and know it’s available or not.
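To make the “test once on startup” idea concrete, here’s a toy Python model. Everything in it is hypothetical: the opcode name, the memory layout, and especially the assumption that an interpreter without support quietly ignores the unknown EXT opcode rather than halting (real interpreters may well differ on that).

```python
class ToyInterpreter:
    """Stand-in for an interpreter; not modelled on any real one."""

    def __init__(self, supports_extended_unicode):
        self.supports = supports_extended_unicode
        self.memory = bytearray(16)
        self.table_addr = None

    def check_extended_unicode(self, result_addr, table_addr=0):
        """Hypothetical EXT opcode: optional table upgrade plus a support flag."""
        if not self.supports:
            return  # assumed behaviour: unknown opcode leaves memory untouched
        if table_addr:
            self.table_addr = table_addr  # adopt the 3-byte-entry table
        self.memory[result_addr:result_addr + 2] = (1).to_bytes(2, "big")


def game_startup(interp):
    """Probe once; the game then knows whether the new opcodes are safe to use."""
    result_addr = 0
    interp.memory[result_addr:result_addr + 2] = b"\x00\x00"
    interp.check_extended_unicode(result_addr, table_addr=8)
    return int.from_bytes(interp.memory[result_addr:result_addr + 2], "big") != 0


if __name__ == "__main__":
    print(game_startup(ToyInterpreter(True)))
    print(game_startup(ToyInterpreter(False)))
```

The game zeroes the result word itself before the call, so a value of zero afterwards means “not supported” and it can halt or drop into legacy mode.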
Are they necessarily incompatible ones, though? A Commodore 64 will never be able to print the entire Basic Multilingual Plane, but that capability was still added to the Z-machine because Frotz and Parchment can.
Switching to Glulx just for non-BMP characters is a lot of work for very little payoff, though. Dialog’s current architecture is based around 16-bit words, so it couldn’t take much advantage of Glulx’s increased word size; it would increase the amount of ROM available for routines and strings, certainly, but for bigger Dialog projects, the Å-machine is the better target.
(A new VM would mean rewriting 4,750 lines of assembly code, so the benefits have to be pretty substantial to be worth that effort! And the Glulx style system is a notable downgrade from the Z-machine’s.)
Is UTF-16 or UTF-32 a viable option for the Å-machine?
But I’m not really in the Z-Machine interpreter game anymore (ZVM isn’t even used by Parchment now). If the community wants to try this, go ahead. Maybe there’ll be enough momentum for it to take off this time.
I don’t think there are any interpreters in current use that implement any version of 1.2. The only thing my proposal contained was a @gestalt opcode, so a minimum implementation is easy to make.
The lesson I learnt from Standard 1.1 (whose new features basically no one has ever used) is that it’s not a great approach to add things just because you can, or because you think it might be a good idea in the abstract. If authors have no real interest in using the features, it’s just pointless work. If we start seeing games built in Dialog that target the Å-machine only, because they use non-BMP characters, then that would be the time to think about extending the Z-Machine too. Until then I don’t see the point.
The Å-machine actually has no equivalent to @print_unicode at all; all input and output uses a single-byte game-specific encoding defined by a Z-machine-style Unicode translation table. The only difference is that table has three-byte entries instead of two-byte.
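Roughly, decoding game text through a table like that looks like the Python sketch below. To be clear, the layout is an assumption made for illustration (big-endian 3-byte entries, with byte values from 0x80 upward indexing the table); it is not taken from the actual Å-machine file format.

```python
FIRST_EXTENDED = 0x80  # assumption: bytes below this are plain ASCII


def parse_table(raw):
    """Turn a run of 3-byte big-endian entries into a list of code points."""
    if len(raw) % 3 != 0:
        raise ValueError("table length must be a multiple of 3")
    return [int.from_bytes(raw[i:i + 3], "big") for i in range(0, len(raw), 3)]


def decode(data, table):
    """Map single-byte game text to a Unicode string via the table."""
    out = []
    for byte in data:
        if byte < FIRST_EXTENDED:
            out.append(chr(byte))
        else:
            out.append(chr(table[byte - FIRST_EXTENDED]))
    return "".join(out)


if __name__ == "__main__":
    # Two entries: U+00C5 (Å) and U+1F600 (😀).
    table = parse_table(bytes([0x00, 0x00, 0xC5, 0x01, 0xF6, 0x00]))
    print(decode(b"\x80-machine \x81", table))
```

Three bytes are enough because the largest code point, U+10FFFF, needs only 21 bits.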
(That’s how it can get non-BMP characters on a Commodore 64. The C64 bundling tool looks at the Unicode translation table and converts it into an appropriate C64 font containing those characters.)