Any way to recognise input as a Unicode character in 6M62?

This bug report for Counterfeit Monkey points out that the current code won’t recognise that the player has tried to set the letter-remover to a wide Unicode character, but instead thinks that the player typed a two-character word.

“SET REMOVER TO 💩” will be parsed as “SET REMOVER TO ??”.

My understanding is that while this is trivial in Inform 10.2, there is really no way to do it in 9.3 (6M62). Is this correct?

💩 is a single Unicode code point (U+1F4A9), so the Glk library should turn it into a single question mark when the game asks for Latin-1 line input. So this behavior may be due to a bug in the library (e.g., GlkOte has some known problems around this). If that’s the case, new Inform versions with Unicode input/parser would probably still see two (mangled) characters.

Outside of that specific emoji, there are many things that appear as a single “character” (glyph, grapheme) or emoji to a human but are encoded as multiple Unicode code points, and thus will have a length greater than one in Inform. I don’t know a great solution for this. Usually the best choice is to side-step it by not asking the question “is this a single character or multiple?” in the first place.

Well, the purpose is really to give a helpful error message. Trying to set the letter-remover to a single invalid character should give the response “Only the 26 letters of the English alphabet are available to the letter-remover”, while trying to set it to anything with more characters should give “We can only tune the letter-remover device to one letter at a time.”

However, I realise just now that there is something else going wrong, as trying to set it to an emoji gives the built-in “I can’t see any such thing” response (or variants) instead.

Apparently the [word] token (i.e. the Inform 6 WORD_TOKEN routine) in the lines below won’t match emojis:

The current release, 10.1.2, has the same parser guts as 6M62. The next release (10.2) will have the Unicode-clean parser, which is what makes this work.

However:

Quickly checking (I should know this!): Yes, GlkOte reads emoji as a UTF-16 pair even when calling glk_request_line_event_uni(). That will be true for every version of Inform; it’s deeply wired into JavaScript’s string type.

>get uniline
Enter line (uni):
>>SET REMOVER TO 💩
Got 17 characters:
S E T R E M O V E R T O ? ?
Hex: $53 $45 $54 $20 $52 $45 $4D $4F $56 $45 $52 $20 $54 $4F $20 $D83D $DCA9

In 10.2, you could write an after-reading-a-command routine to go through the input buffer (as a --> array) and convert UTF-16 pairs to 32-bit values ($D83D $DCA9 to $1F4A9). This would be pretty easy. But, as Hanna says, that wouldn’t catch graphemes that are multiple Unicode characters, like ẙ ($79 $30A) or 👋🏿 ($1F44B $1F3FF).
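
For illustration only, here is the pair-to-code-point arithmetic as a minimal TypeScript sketch. It is not Inform code: the function name `combineSurrogates` and the plain number array standing in for the input buffer are assumptions made for this example.

```typescript
// Hypothetical helper: merge UTF-16 surrogate pairs found in an array of
// character codes into single 32-bit code points.
function combineSurrogates(units: number[]): number[] {
  const out: number[] = [];
  for (let i = 0; i < units.length; i++) {
    const hi = units[i];
    const lo = i + 1 < units.length ? units[i + 1] : -1;
    const isHigh = hi >= 0xd800 && hi <= 0xdbff;
    const isLow = lo >= 0xdc00 && lo <= 0xdfff;
    if (isHigh && isLow) {
      // Ten bits come from each half, offset by 0x10000.
      out.push(0x10000 + ((hi - 0xd800) << 10) + (lo - 0xdc00));
      i++; // skip the low surrogate that was just consumed
    } else {
      out.push(hi); // ordinary value or unpaired surrogate: keep unchanged
    }
  }
  return out;
}

// The end of the buffer from the transcript: "O", space, then the pair.
const tail = combineSurrogates([0x4f, 0x20, 0xd83d, 0xdca9]);
console.log(tail.map((c) => "$" + c.toString(16).toUpperCase()).join(" "));
// prints "$4F $20 $1F4A9"
```

Values outside the surrogate ranges pass through untouched, so a buffer that already holds proper 32-bit code points comes out unchanged.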

I guess it’s due to using charAt somewhere? JavaScript gained proper solutions for this some years ago (codePointAt and for … of over a string have been widely implemented since 2015). So fixing this should be more feasible nowadays than when most of that code was originally written.
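
To make the difference concrete, here is a small TypeScript sketch contrasting code-unit indexing with code-point iteration. It is not taken from GlkOte, whose actual code I have not checked; the variable names are made up for the example, and it assumes an ES2015-or-later runtime.

```typescript
// "TO " followed by U+1F4A9, written out as its UTF-16 pair.
const input = "TO \uD83D\uDCA9";

// Indexing by UTF-16 code unit splits the emoji in two.
const unitCount = input.length; // 5 code units
const lastUnit = input.charCodeAt(4); // 0xDCA9, the low surrogate alone

// for...of walks the string by code point instead.
const codePoints: number[] = [];
for (const ch of input) {
  codePoints.push(ch.codePointAt(0) as number);
}

console.log(unitCount, lastUnit.toString(16), codePoints.length);
// prints "5 dca9 4"
```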

Oh yeah, good idea. It wouldn’t even risk mangling valid input from a library that doesn’t have this kind of bug, since surrogate code points never represent characters by themselves; they’re only supposed to occur in UTF-16.

Since the Glk library is supposed to turn each non-Latin-1 codepoint into a ?, you could just check if the topic understood contains a ?; I don’t expect false positives like SET REMOVER TO WHAT? to come up much, since parser players rarely type question marks.
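
As a rough sketch of that heuristic (in TypeScript rather than Inform, with the hypothetical name `mayContainNonLatin1`), the check is just a substring test on the text the game receives:

```typescript
// Hypothetical check: after Latin-1 conversion, any "?" in the topic may
// stand for a character the library could not represent.
function mayContainNonLatin1(topic: string): boolean {
  return topic.indexOf("?") !== -1;
}

console.log(mayContainNonLatin1("??")); // true: what the emoji arrives as
console.log(mayContainNonLatin1("WHAT?")); // true: the false positive mentioned above
console.log(mayContainNonLatin1("Q")); // false
```

This only works if the question marks are still in the text when the check runs, so anything that strips punctuation from the command beforehand would defeat it.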

I’m not sure why [word] doesn’t catch it, though…

Oh. Hm! I am behind the times, I see.

Yes, that should be pretty doable. I’ll take a look.

Turns out the problem was Punctuation Removal cutting out all the question marks.
