The only characters which can be read from the keyboard are ZSCII characters defined for input (see S3).
However, I don’t see anything saying what should happen if a user enters a Unicode character that’s not defined for ZSCII. Should the interpreter reject the input entirely, change it to a ?, delete it, or is the behavior implementation-defined?
(Don’t forget to allow Unicode characters that exist in the extra characters table.)
I think that the standard saying they cannot be read definitely implies they should never, ever, ever end up in the text buffer that is read by lexical analysis. What the interpreter chooses to do (reject, alter, or ignore) would seem to be implementation-defined. My interpreter library skips the illegal character and signals a fault to the front-end, which can handle it however it pleases: beep, error message, quit, format the hard drive, etc.
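Reduced to a sketch, that filter looks something like this (simplified, not my actual code; unicode_to_zscii is a stand-in name for whatever lookup your terp uses, including the extra characters table):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical lookup: maps a Unicode keypress to ZSCII, consulting the
   story's extra characters table, and returns 0 if there's no mapping. */
extern uint16_t unicode_to_zscii(uint32_t ucs);

/* ZSCII codes legal for input, per section 3.8 of the Z-Machine Standard. */
static bool is_zscii_input(uint16_t z)
{
    return z == 8 || z == 13 || z == 27          /* delete, newline, escape */
        || (z >= 32 && z <= 126)                 /* standard characters     */
        || (z >= 129 && z <= 154)                /* cursor/function/keypad  */
        || (z >= 155 && z <= 251);               /* extra characters        */
}

/* Returns the ZSCII code to append to the input buffer, or 0 to tell the
   front-end to skip the keypress and report a fault however it pleases. */
uint16_t filter_keypress(uint32_t ucs)
{
    uint16_t z = unicode_to_zscii(ucs);
    return (z != 0 && is_zscii_input(z)) ? z : 0;
}
```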
The reason I ask is: the Dialog compiler can’t always prove that certain words will not be used for input-matching (if they appear in a closure), so it generates dictionary words for them to be safe. If these dictionary words contain Unicode that’s not in ZSCII (including the extra characters table), this is currently an error, which halts compilation.
I want to change that to a warning, because if the author can guarantee they’re not being used for input-matching, then there’s no reason for it to be an error. However, I’m not sure if it should generate dictionary words with a ? in the middle of them, or generate dictionary words with that character removed, or just not create a dictionary word at all. (Ideally, it shouldn’t matter.)
Unknown characters get converted to a question mark, so removing them from the dictionary word would mean they’d never match. Generating dictionary words with question marks in them should probably work, as long as question marks aren’t in the word separators list. If you did that, the words could actually be used.
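In other words, something like this on the compiler side (a sketch only, not what Dialog actually does; zscii_for is a made-up lookup that also checks the extra characters table):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lookup: returns the ZSCII code for a codepoint, or 0 if it
   has no encoding (not even via the extra characters table). */
extern uint16_t zscii_for(uint32_t codepoint);

/* Build the ZSCII form of a dictionary word, substituting '?' for anything
   unencodable, so it can still match the '?' the interpreter is assumed to
   put in the text buffer for an unknown input character. */
void prepare_dict_word(const uint32_t *src, size_t len, uint16_t *out)
{
    for (size_t i = 0; i < len; i++) {
        uint16_t z = zscii_for(src[i]);
        out[i] = (z != 0) ? z : '?';
    }
}
```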
Huh, you might be right; I can’t see it either. But it’s a convention at least, since Frotz and Bocfel do it too.
It’s also not so clear how a Glk library is meant to handle Unicode data for glk_request_line_event.
However, I did just notice that several (most?) Glk libraries get Latin-1 character events wrong too: RemGlk turns higher Unicode into a ?, while GlkOte applies % 0xFF, instead of returning keycode_Unknown as section 2.4 indicates. @Zarf, would you consider that to be the right reading?
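For what it’s worth, my reading of 2.4 amounts to something like this on the library side (just a sketch of how I read the spec, not a quote from any existing library; char_event_value isn’t a real Glk call):

```c
#include "glk.h"

/* When the game asked for character input with the non-Unicode request
   call, a keypress outside Latin-1 should come back as keycode_Unknown,
   rather than being truncated or replaced with '?'. */
static glui32 char_event_value(glui32 keypress, int requested_unicode)
{
    if (keypress >= keycode_Func12)      /* special keycodes pass through   */
        return keypress;
    if (requested_unicode || keypress <= 0xFF)
        return keypress;                 /* full Unicode or Latin-1 is fine */
    return keycode_Unknown;              /* not "% 0xFF", and not '?'       */
}
```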
Thinking about this some more:
Don’t generate a word if you can prove it won’t be used.
If there’s no way around the ambiguity, then question marks are probably the way to go. Omitting characters could lead to collisions with other words (for example, stripping the ï from “naïve” leaves “nave”, which is a different word entirely).
Something I’ve been aware of for years, but have never seen mentioned, is this:
While illegal characters are forbidden from being received as input, there’s still a sneaky way for them to end up getting parsed and/or screwing up read operations. Version 5’s pre-loaded input allows reads to start with data already in the text buffer. There’s nothing preventing code from populating that buffer with illegal values before calling read. In a true torture test that I imagine most interpreters would assuredly fail, we pre-populate this buffer with a mix of legitimate ZSCII and nulls. Since nulls are guaranteed to never affect output streams, and the game itself is responsible for drawing pre-loaded input, what’s in the buffer and what’s on the screen will not agree, even down to the count of characters. Press backspace during that read at your peril!
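If an interpreter wanted to defend against that, one option (purely a sketch; I don’t know of any terp that actually does this) is to sanitize the pre-loaded portion before the read begins, using the V5+ text buffer layout from the standard (byte 0 is the capacity, byte 1 the number of pre-loaded characters, data from byte 2 onward):

```c
#include <stdint.h>

/* Drop nulls and anything else outside ZSCII that could legally have been
   typed, so the buffer at least stays self-consistent, even though it may
   still not match what the game drew on screen. */
void sanitize_preloaded_input(uint8_t *buf)
{
    uint8_t capacity = buf[0];
    uint8_t count    = buf[1];
    if (count > capacity)
        count = capacity;

    uint8_t kept = 0;
    for (uint8_t i = 0; i < count; i++) {
        uint8_t c = buf[2 + i];
        if ((c >= 32 && c <= 126) || (c >= 155 && c <= 251))
            buf[2 + kept++] = c;
    }
    buf[1] = kept;
}
```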