Updating the Z-Machine Standard Documents

Marvin · March 2, 2024, 6:36pm

A few of the problems with clarity in the 1.1 Standard come from how it was put together. Originally, 1.1 was a separate document which detailed how 1.1 differed from 1.0, and expected that the 1.0 Standard was understood. When I put together the full 1.1 Standard Document, I was reluctant to make too many changes to the text of either the 1.0 Standard or the 1.1 changes, since I didn’t actually write either.

At this point, though, it’s definitely worth trying to rewrite a few parts to make things easier to understand, as long as we can all agree on what was meant and that the new text makes it clear.

Marvin · March 3, 2024, 2:15am

§ 3.8.5.4.5 states:

Unicode characters U+0000 to U+001F and U+007F to U+009F are control codes, and must not be used.

Unicode characters U+D800 to U+DFFF are surrogates for use in UTF-16 encoding, and should also not be used.

Mike_G · March 3, 2024, 3:24am

Being pedantic, code point is more accurate than character.

Marvin · March 3, 2024, 3:38am

§ 3.8.5.4.1 states:

The interpreter is required to be able to print representations of every defined Unicode character under $0100 (i.e. of every defined ISO 8859-1 Latin1 character). If no suitable letter forms are available, textual equivalents may be used (such as “ss” in place of German sharp “s”).

Converting a single character into multiple characters causes problems with fixed-width fonts. I have no solution for this.

Marvin · March 3, 2024, 5:00am

@check_unicode is supposed to check for support for a given unicode character, but different fonts are likely (in some cases certain) to support printing for different ranges of characters.

Is it reasonable to change the description to say that, if possible, interpreters should return the value for the current font, in the current window?

Hanna · March 3, 2024, 7:45am

I think it’s slightly more reasonable than the current wording, but the whole concept of asking for font support is flawed. Something in the software stack has to know the right answer for text rendering purposes, but that’s usually very far removed from the interpreter and not readily (or at all) available to it. With a modern text rendering stack that solves all the hard parts for you and has access to all the system fonts, the most likely outcome is that you’ll still get something rendered even if the font the interpreter wanted to use doesn’t have all the glyphs. Of course, not all interpreters work that way today, but I think that’s the direction everything has been moving towards for good reasons. Note that asking about code point + font combinations is still an inaccurate oversimplification: it’s really about glyphs, which don’t have any simple or 1:1 relationship with code points. (Although the fanciest text rendering implementation also have use this simplification internally to some extent - one reason why Text Rendering Hates You)

prevtenet · March 3, 2024, 8:32am

To underscore this point: running @check_unicode on some runic codepoints (e.g., codepoint 5842) yields extremely inconsistent results across modern interpreters. Parchment behaves as expected; the default configuration of Windows Frotz claimes not to support this codepoint but actually does; and the default configuration of Gargoyle claims to support this codepoint but actually does not.

This is a problem even in the Glulx realm, which is presumably why Unicode-heavy games like Aotearoa perform a manual sanity-check (“player, can you read these characters?”) instead of relying on a programmatic check.

Hanna · March 3, 2024, 8:56am

Also note that Parchment just claims to support all code points, which is pretty much the only thing you can do in a browser and will usually be correct (because browsers try hard to render as much text as possible and system fonts usually cover a lot of glyphs) but can also give completely wrong result. The other interpreters likely aren’t much more sophisticated, just wrong in different cases for equally simple reasons.

Mike_G · March 3, 2024, 1:43pm

Yeah, I don’t think there is a reliable solution. The standard says “may be used” and doesn’t require it, so maybe just add a note that if done, it may break the layout of some games.

Edit: Or it could just be illegal when printing with fixed pitch (font, style, upper window, etc.)

borg323 · March 3, 2024, 4:10pm

Why not? They can be used (at least for output) without any issues, see Z-Machine 1.2 Proposal (again) - #30 by borg323

Mike_G · March 3, 2024, 4:37pm

I wouldn’t mind seeing an addition to the standard to allow output of additional unicode, but the idea that it can work without issues is flawed. It may work on Windows through the incidental fact that Windows uses UTF-16, but practically everyone else in the world uses UTF-8. Just outputting two surrogate characters side by side doesn’t work on linux.

Marvin · March 3, 2024, 5:33pm

Because it would be a major change to the Z-Machine. It’s a great idea for a future update to the Standard, but not for 1.1.

cas · March 3, 2024, 6:38pm

The latest Gargoyle (2023.1) should support this—finally—because it now uses Unifont to look up glyphs that are missing in the main fonts, meaning there should be total Unicode coverage. Although I recently learned that the Debian (and thus Ubuntu) package patches Gargoyle to not install its Unifont, and also doesn’t patch Gargoyle to use any existing system-wide Unifont, so things will not work right if you’re using that package. I’ll be doing a bug report once I get a Debian install in a VM, so hopefully future releases will work better regarding fonts.

borg323 · March 3, 2024, 7:57pm

Bad wording on my part, I wrote that in a hurry. Having coded that for Linux, I meant to say it is easy to get it to work.

In any other interpreter, surrogate pairs are like any other unprintable character, so why do we need to treat them specially?

Mike_G · March 3, 2024, 8:24pm

Understood.

I think adding the ability to output additional unicode should belong to a new version of the standard. Also, the combining of surrogates into a single character should be done in the interpreter and not rely on OS level translation like the output of two side by side surrogate characters.

In my interpreter written in Rust, output goes roughly like this:
zchars → zscii → unicode
where zchars to zscii is many-to-one and zscii to unicode is one-to-one.

Rust uses UTF-8 for its strings and surrogate code points are illegal. I’d have to change my code so that zscii to unicode becomes many-to-one. This has several knock-on implications for my interpreter because it supports saving state even mid-output.

Marvin · March 3, 2024, 11:22pm

I wrote up a test program for unicode output. It’s a little limited, in that it prints everything in the upper window, so it doesn’t test the proportional font, but I figured it might be helpful for people. I put it up at http://frobnitz.co.uk/zmachine/unicode.z5

Dannii · March 3, 2024, 11:38pm

There’s already a unicode.z5 test - maybe call it all-unicode.z5?

Mike_G · March 8, 2024, 6:11pm

Referring to memory streams, Section 7.1.2.1 of the standard says:

Output stream 3 writes to a table in dynamic memory. When the stream is selected, the table may have any contents (even the initial ‘size’ word will be ignored by the interpreter). While the stream is selected, the table’s contents are unspecified (and a game cannot safely read or write to it). When the stream is deselected, the initial word of the table holds the number of characters printed and subsequent bytes hold those characters. Similarly, in Version 6, the total width of printing (in units) will then be stored in the word at $30 in the header. (It is the programmer’s responsibility to make the table large enough: the interpreter performs no overflow checking.)

One thing not made clear is whether multiple simultaneous streams open to the same or an overlapping table is supported. The primary importance of this being where the interpreter stores the number of charcters written while the stream is open. Storing the count in the first word of the table while the stream is open would mean the last stream printed to would determine the final count when all streams are closed. If the count is stored in the interpreter’s own memory, then the first stream opened / last closed would determine the count.

Forbidding duplicate memory streams would remove the issue altogether…sort of. If the interpreter checks for duplicate table addresses, then it would be prevented (unless the tables are off by just one byte). There’s still the issue of overlapping tables, but with no initial size information given, there is no way to detect this unless the interpreter checks for overlap on every character printed to stream 3.

zarf · March 8, 2024, 6:37pm

This feels like an excellent candidate for “the behavior is implementation-defined and not guaranteed”.

Mike_G · March 8, 2024, 6:46pm

Matching addresses, running into another stream’s table, or both?

Obviously there’s the possibility of weirdness here and I just want to be able to catch and report undefined behavior.