What happens if you type an unrecognized character into the Z-machine?

Per the Standard:

The only characters which can be read from the keyboard are ZSCII characters defined for input (see S3).

However, I don’t see anything specifying what should happen if a user enters a Unicode character that’s not defined for ZSCII. Should the interpreter reject the input entirely, change it to a ?, delete it, or is the behavior implementation-defined?

(Don’t forget to allow Unicode characters that exist in the extra characters table.)

I think that the standard saying they cannot be read definitely implies they should never, ever, ever end up in the text buffer that is read by lexical analysis. What the interpreter chooses to do (reject, alter, or ignore) would seem to be implementation-defined. My interpreter library skips the illegal character and signals a fault to the front-end, which can handle it however it pleases: beep, error message, quit, format the hard drive, etc. :slight_smile:
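
A minimal sketch of that filtering step in C, assuming hypothetical is_zscii_input() and on_input_fault() helpers (neither name comes from a real library):

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical callback the front-end registers to be told about bad input. */
extern void on_input_fault(uint32_t ch);

/* Hypothetical check: true if ch is a ZSCII character defined for input
   (Standard S3), including anything mapped in the extra characters table. */
extern bool is_zscii_input(uint32_t ch);

/* Append one keystroke to the text buffer, or skip it and signal a fault. */
static void accept_keystroke(uint8_t *buf, int *len, int max, uint32_t ch)
{
    if (!is_zscii_input(ch)) {
        on_input_fault(ch);   /* front-end may beep, show an error, etc. */
        return;               /* the character never reaches the buffer */
    }
    if (*len < max)
        buf[(*len)++] = (uint8_t)ch;  /* ZSCII input codes fit in a byte */
}
```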


The reason I ask is: the Dialog compiler can’t always prove that certain words will never be used for input-matching (if they appear in a closure), so it generates dictionary words for them to be safe. If these dictionary words contain Unicode characters that aren’t in ZSCII (including the extra characters table), that’s currently an error and halts compilation.

I want to change that to a warning, because if the author can guarantee they’re not being used for input-matching, then there’s no reason for it to be an error. However, I’m not sure whether it should generate dictionary words with a ? in the middle of them, generate dictionary words with that character removed, or just not create a dictionary word at all. (Ideally, it shouldn’t matter.)

Unknown characters get converted to a question mark, so removing them from the dictionary word would mean it could never match. Generating dictionary words with question marks in them should probably work, as long as the question mark isn’t in the word separators list. If you did that, the words could actually be used.
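
As a sketch of that substitution, assuming a hypothetical zscii_for_unicode() lookup that consults the extra characters table and returns 0 when there’s no mapping:

```c
#include <stdint.h>

/* Hypothetical: map a Unicode code point to ZSCII via the normal range
   and the extra characters table; returns 0 if there is no mapping. */
extern uint8_t zscii_for_unicode(uint32_t cp);

/* Build the ZSCII form of a dictionary word, substituting '?' for any
   character outside ZSCII. Since the interpreter makes the same
   substitution on input, the generated word can still match. */
static void make_dict_word(const uint32_t *word, int n, uint8_t *out)
{
    for (int i = 0; i < n; i++) {
        uint8_t z = zscii_for_unicode(word[i]);
        out[i] = z ? z : '?';   /* mirror the input side's substitution */
    }
}
```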

Are you saying that is standard behavior for Z-machine input? Because I don’t see anywhere in the standard that would suggest that to be the case.

Huh, you might be right; I can’t see it either. But it’s a convention at least, since Frotz and Bocfel both do it.

It’s also not so clear how a Glk library is meant to handle Unicode data for glk_request_line_event.

However, I did just notice that several (most?) Glk libraries get Latin-1 character events wrong too: RemGlk turns higher Unicode into a ?, while GlkOte does a % 0xFF, instead of returning keycode_Unknown as section 2.4 indicates. @Zarf, would you consider that to be the right reading?
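
For reference, here’s the handling section 2.4 seems to call for, sketched in C against the real glui32 type and keycode_Unknown constant (the helper function itself is made up):

```c
#include "glk.h"

/* Assuming special keys (arrows, return, etc.) have already been mapped
   to their keycode_* values by the caller, reduce a keystroke's Unicode
   code point to what a Latin-1 char event should report: a key with no
   Latin-1 code becomes keycode_Unknown, not '?' or ch % 0xFF. */
static glui32 latin1_char_event_value(glui32 ch)
{
    if (ch <= 0xFF)
        return ch;          /* representable in Latin-1: report as-is */
    return keycode_Unknown; /* no Latin-1 code: report the key as unknown */
}
```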

Yeah, I agree.

The spec wording is a bit poor:

keycode_Unknown (any key which has no Latin-1 or special code)

I must have failed to update that line when the Unicode API went in, but it should still be true for Latin-1 char input.