The only characters which can be read from the keyboard are ZSCII characters defined for input (see S3).
However, I don’t see anything saying what should happen if a user enters a Unicode character that’s not defined for ZSCII. Should the interpreter reject the input entirely, change it to a ?, delete it, or is the behavior implementation-defined?
(Don’t forget to allow Unicode characters that exist in the extra characters table.)
I think that the standard saying they cannot be read definitely implies they should never, ever, ever end up in the text buffer that is read by lexical analysis. What the interpreter chooses to do (reject, alter, or ignore) would seem to be implementation-defined. My interpreter library skips the illegal character and signals a fault to the front-end, which can handle it however it pleases: beep, error message, quit, format the hard drive, etc.
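Reduced to a sketch, that filter looks something like this (simplified, not my actual code; unicode_to_zscii is a stand-in name for whatever lookup your terp uses, including the extra characters table):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical lookup: maps a Unicode keypress to ZSCII, consulting the
   story's extra characters table, and returns 0 if there's no mapping. */
extern uint16_t unicode_to_zscii(uint32_t ucs);

/* ZSCII codes legal for input, per section 3.8 of the Z-Machine Standard. */
static bool is_zscii_input(uint16_t z)
{
    return z == 8 || z == 13 || z == 27          /* delete, newline, escape */
        || (z >= 32 && z <= 126)                 /* standard characters     */
        || (z >= 129 && z <= 154)                /* cursor/function/keypad  */
        || (z >= 155 && z <= 251);               /* extra characters        */
}

/* Returns the ZSCII code to append to the input buffer, or 0 to tell the
   front-end to skip the keypress and report a fault however it pleases. */
uint16_t filter_keypress(uint32_t ucs)
{
    uint16_t z = unicode_to_zscii(ucs);
    return (z != 0 && is_zscii_input(z)) ? z : 0;
}
```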
The reason I ask is: the Dialog compiler can’t always prove that certain words will not be used for input-matching (if they appear in a closure), so it generates dictionary words for them to be safe. If these dictionary words contain Unicode that’s not in ZSCII (including the extra characters table), this is currently an error, which halts compilation.
I want to change that to a warning, because if the author can guarantee they’re not being used for input-matching, then there’s no reason for it to be an error. However, I’m not sure if it should generate dictionary words with a ? in the middle of them, or generate dictionary words with that character removed, or just not create a dictionary word at all. (Ideally, it shouldn’t matter.)
Unknown characters get converted to a question mark, so removing them from the dictionary word would mean they’d never match. Generating dictionary words with question marks in them should probably work, as long as question marks aren’t in the word separators list. If you did that, the words could actually be used.
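In other words, something like this on the compiler side (a sketch only, not what Dialog actually does; zscii_for is a made-up lookup that also checks the extra characters table):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lookup: returns the ZSCII code for a codepoint, or 0 if it
   has no encoding (not even via the extra characters table). */
extern uint16_t zscii_for(uint32_t codepoint);

/* Build the ZSCII form of a dictionary word, substituting '?' for anything
   unencodable, so it can still match the '?' the interpreter is assumed to
   put in the text buffer for an unknown input character. */
void prepare_dict_word(const uint32_t *src, size_t len, uint16_t *out)
{
    for (size_t i = 0; i < len; i++) {
        uint16_t z = zscii_for(src[i]);
        out[i] = (z != 0) ? z : '?';
    }
}
```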
Huh, you might be right; I can’t see it either. But it’s a convention at least, since Frotz and Bocfel do it too.
It’s also not so clear how a Glk library is meant to handle Unicode data for glk_request_line_event.
However, I did just notice that several (most?) Glk libraries get Latin-1 character events wrong too: RemGlk turns higher Unicode into a ?, while GlkOte applies % 0xFF, instead of returning keycode_Unknown as section 2.4 indicates. @Zarf, would you consider that to be the right reading?
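For what it’s worth, my reading of 2.4 amounts to something like this on the library side (just a sketch of how I read the spec, not a quote from any existing library; char_event_value isn’t a real Glk call):

```c
#include "glk.h"

/* When the game asked for character input with the non-Unicode request
   call, a keypress outside Latin-1 should come back as keycode_Unknown,
   rather than being truncated or replaced with '?'. */
static glui32 char_event_value(glui32 keypress, int requested_unicode)
{
    if (keypress >= keycode_Func12)      /* special keycodes pass through   */
        return keypress;
    if (requested_unicode || keypress <= 0xFF)
        return keypress;                 /* full Unicode or Latin-1 is fine */
    return keycode_Unknown;              /* not "% 0xFF", and not '?'       */
}
```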
Thinking about this some more:
Don’t generate a word if you can prove it won’t be used.
If there’s no way around the ambiguity, then question marks are probably the way to go. Omitting characters could lead to collisions with other words (for example, stripping the ï from “naïve” leaves “nave”, which is a different word entirely).
Something I’ve been aware of for years, but have never seen mentioned, is this:
While illegal characters are forbidden from being received as input, there’s still a sneaky way for them to end up getting parsed and/or screwing up read operations. Version 5’s pre-loaded input allows reads to start with data already in the text buffer. There’s nothing preventing code from populating that buffer with illegal values before calling read. In a true torture test that I imagine most interpreters would assuredly fail, we pre-populate this buffer with a mix of legitimate ZSCII and nulls. Since nulls are guaranteed to never affect output streams, and the game itself is responsible for drawing pre-loaded input, what’s in the buffer and what’s on the screen will not agree, even down to the count of characters. Press backspace during that read at your peril!
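If an interpreter wanted to defend against that, one option (purely a sketch; I don’t know of any terp that actually does this) is to sanitize the pre-loaded portion before the read begins, using the V5+ text buffer layout from the standard (byte 0 is the capacity, byte 1 the number of pre-loaded characters, data from byte 2 onward):

```c
#include <stdint.h>

/* Drop nulls and anything else outside ZSCII that could legally have been
   typed, so the buffer at least stays self-consistent, even though it may
   still not match what the game drew on screen. */
void sanitize_preloaded_input(uint8_t *buf)
{
    uint8_t capacity = buf[0];
    uint8_t count    = buf[1];
    if (count > capacity)
        count = capacity;

    uint8_t kept = 0;
    for (uint8_t i = 0; i < count; i++) {
        uint8_t c = buf[2 + i];
        if ((c >= 32 && c <= 126) || (c >= 155 && c <= 251))
            buf[2 + kept++] = c;
    }
    buf[1] = kept;
}
```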