The only characters which can be read from the keyboard are ZSCII characters defined for input (see S3).
However, I don’t see any guidance on what should happen if a user enters a Unicode character that’s not defined for ZSCII. Should the interpreter reject the input entirely, change it to a ?, delete it, or is the behavior implementation-defined?
(Don’t forget to allow Unicode characters that exist in the extra characters table.)
I think the standard saying they cannot be read definitely implies they should never, ever end up in the text buffer that is read by lexical analysis. What the interpreter chooses to do (reject, alter, or ignore) would seem to be implementation-defined. My interpreter library skips the illegal character and signals a fault to the front-end, which can handle it however it pleases: beep, error message, quit, format the hard drive, etc.
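Roughly, the filtering step looks like this (a minimal sketch with hypothetical names; zscii_for_input stands in for whatever lookup the interpreter does against the default set and the extra characters table):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lookup: the ZSCII code for a Unicode code point, or -1
   if the character is not defined for input.  ASCII 32..126 maps
   directly; the story's extra characters table occupies ZSCII 155+. */
static int zscii_for_input(uint32_t ch,
                           const uint32_t *extra, size_t extra_count)
{
    if (ch >= 32 && ch <= 126)
        return (int)ch;
    for (size_t i = 0; i < extra_count; i++)
        if (extra[i] == ch)
            return 155 + (int)i;
    return -1;
}

/* On each keypress: legal characters go into the text buffer; illegal
   ones are skipped and reported to the front-end (hypothetical
   append_to_buffer / report_fault callbacks). */
static void accept_keypress(uint32_t ch,
                            const uint32_t *extra, size_t extra_count,
                            void (*append_to_buffer)(uint8_t),
                            void (*report_fault)(uint32_t))
{
    int z = zscii_for_input(ch, extra, extra_count);
    if (z < 0)
        report_fault(ch);   /* beep, message, etc. -- front-end's call */
    else
        append_to_buffer((uint8_t)z);
}
```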
The reason I ask is that the Dialog compiler can’t always prove that certain words will never be used for input-matching (if they appear in a closure), so it generates dictionary words for them to be safe. If these dictionary words contain Unicode characters that aren’t in ZSCII (including the extra characters table), that’s currently an error, and it halts compilation.
I want to change that to a warning, because if the author can guarantee those words aren’t being used for input-matching, there’s no reason for it to be an error. However, I’m not sure whether it should generate dictionary words with a ? in the middle of them, generate dictionary words with that character removed, or just not create a dictionary word at all. (Ideally, it shouldn’t matter.)
Unknown characters get converted to a question mark, so removing them from the dictionary word would mean those words could never match. Generating dictionary words with question marks in them should probably work, as long as question marks aren’t in the word separators list. If you did that, the words could actually be used.
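A minimal sketch of that substitution (hypothetical names; zscii_for_char stands in for the compiler’s Unicode-to-ZSCII mapping, including the extra characters table):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mapping: ZSCII code for a Unicode code point, or -1 if
   it's outside ZSCII (default set and extra characters table). */
int zscii_for_char(uint32_t ch);

/* Encode a dictionary word, substituting '?' for anything outside
   ZSCII.  Since the interpreter also turns unknown input into '?',
   the word stays typeable and matchable, provided '?' is not in the
   word separators list. */
static void encode_dict_word(const uint32_t *word, size_t len, uint8_t *out)
{
    for (size_t i = 0; i < len; i++) {
        int z = zscii_for_char(word[i]);
        out[i] = (z >= 0) ? (uint8_t)z : (uint8_t)'?';
    }
}
```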
Huh, you might be right; I can’t see it either. But it’s a convention at least: Frotz and Bocfel do it too.
It’s also not so clear how a Glk library is meant to handle Unicode data for glk_request_line_event.
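For reference, here’s the shape of the two line-input requests (a sketch; the unclear part is what the library should do with non-Latin-1 keystrokes when the plain char-buffer request is active):

```c
#include "glk.h"

static char buf[256];      /* Latin-1 buffer for the plain request */
static glui32 ubuf[256];   /* code-point buffer for the _uni request */

/* Prefer Unicode line input when the library supports it; otherwise
   fall back to the Latin-1 request, where the handling of non-Latin-1
   keystrokes is the open question. */
void request_line(winid_t win)
{
    if (glk_gestalt(gestalt_Unicode, 0))
        glk_request_line_event_uni(win, ubuf, 256, 0);
    else
        glk_request_line_event(win, buf, 256, 0);
}
```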
However, I did just notice that several (most?) Glk libraries get Latin-1 character events wrong too: RemGlk turns higher Unicode characters into a ?, while GlkOte does a % 0xFF, instead of returning keycode_Unknown as section 2.4 indicates. @Zarf, would you consider that to be the right reading?
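My reading of 2.4, as library-side logic (a sketch, not taken from any actual library):

```c
#include "glk.h"

/* Deliver a typed Unicode code point "ch" for a pending char-input
   request.  Under this reading of spec section 2.4, a plain (Latin-1)
   request should get keycode_Unknown for anything above 0xFF, rather
   than a truncated or substituted character. */
static void deliver_char(event_t *ev, winid_t win, glui32 ch,
                         int uni_request)
{
    ev->type = evtype_CharInput;
    ev->win = win;
    ev->val2 = 0;
    if (uni_request || ch <= 0xFF)
        ev->val1 = ch;                  /* representable: pass through */
    else
        ev->val1 = keycode_Unknown;     /* not Latin-1: unknown key */
}
```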