Debugging the player-typed-a-tab problems in Inform 7 for Glulx

neroden · January 25, 2024, 3:55am

So I went way down the rabbithole last week trying to trace the source of the problem which led to the creation of extensions/Nathanael Nerode/Tab Removal-v2.i7x at 10.1 · i7/extensions · GitHub .

Here’s what’s going on. The Glk specification prohibits printing tab characters. This is enforced by the Inform 6 compiler’s “veneer.c” file, which is the file which will throw an error.

Inform 7 initially acquires commands in the snippet type. Unfortunately, in order to convert the snippet type into the text type, which is necessary to do a search-and-replace on them, Inform 7 uses the Glk “print” opcode (to print to a buffer, I guess), and Inform 6 correctly barfs, and prints an error message in the buffer instead of printing the desired text! (The text ends up containing the error message from the I6 veneer code instead of the command the user typed !!!)

So any user input which contains a character which is not legal to print under Glk cannot be converted to the text type at all, and so we can’t use Inform 7’s pattern-matching search and replace on it to remove the not-legal-to-print characters!

This seems undesirable. But I am not sure where it should be fixed. Fixes could be in the Glk spec, since the Glk implementations I’ve tested WILL print tabs, combined with a fix in the Inform 6 veneer layer. Or the fix could be deep in the Inform 7 compiler.

I think probably ideally the Inform 7 compiler would avoid using the glk print opcode to convert snippets to text, but I’m not sure where to start on that. @Zed ? @Dannii ?

Zed · January 25, 2024, 4:01am

What with the parser accepting Unicode (for Glulx) in the forthcoming version, I had suggested that any Unicode whitespace character (which includes tabs) be converted to a plain decimal 32 space prior to tokenization. It didn’t take.

zarf · January 25, 2024, 4:07am

The obvious fix is to write an I6 routine to turn tabs to spaces in the buffer directly, without trying to convert them to I7 texts.

I guess that’s what your extension is already doing, on the G side?

neroden · January 25, 2024, 4:12am

Bingo. That’s what my extension is already doing. It’s ugly; I have to replace a large hunk of I6 template/kit code.

I realized the following:
(1) the same issue will come up if the player somehow types another control character (this is possible, depending on the interpreter implementation, since there’s no Glk ban on putting the control characters in input). I didn’t fix this. I also didn’t test interpreters to see if they pass through any other control characters.

(2) the game writer might actually want to do something to interpret those keys if typed by the player. For instance, some sort of “if the player types a tab, interpret it as command completion” or “if the player types tab anywhere in the command, interpret it as wait”, or even “if the player types control-C, snark at them and then ask whether they want to quit” or something. Right now this is flat out impossible because of what’s happening.

So I thought further discussion of appropriate behavior might make sense.

Dannii · January 25, 2024, 4:22am

I have a vague memory of a prior discussion in which we considered printing invalid characters to memory streams. Considering that the Glk spec does say the output restrictions are for windows or files in text mode, anything should be legal in memory streams.

Glk §2.3 says that the line input buffer must only contain printable characters. If a tab character is making it through, then the Glk library is at fault.

Depending on how many interpreters are affected, it would be reasonable for Inform 7 to filter the input buffer itself to remove those characters.

RT__ChPrintC isn’t anywhere near smart enough to know if it’s printing to a memory stream or not. I can’t really see any way around that other than to either stop calling it, or replacing it with code that doesn’t do any checks.

Well, glk_stream_get_current would allow us to get the current stream, but there’s no Glk function to return the type of a stream. So the most that we could do is to check if the current stream is a known window stream.

neroden · January 25, 2024, 4:26am

Thank you, I hadn’t caught that part of the spec – my mistake!

So given that, maybe we should simply filter any Glulx input to eliminate Glk-non-printable characters. If so, we should basically submit the code from Tab Removal as a patch to Core Inform.

Except we should also eliminate all the other control characters. Is there consensus on which ones should be turned into " " (certainly the whitepace ones) and what should be done with the other control characters?

(On second thought, though, I’m not sure that provision of Glk 2.3 applies. We’re accepting full Unicode input now and not using “the basic text API” which restricts the input buffer to Latin-1.)

Arguably the cleanest thing to do is to not call RT_ChPrintC in the code which converts a snippet into a text, so that texts derived form a snippet can have whatever in them. (These are normally only used internally and not printed to the player unless they match patterns, so I think it is unlikely that this will trigger a Glk attempt to print an invalid character; if it does, the game writer can address it with a custom regex filter before printing.) I am not sure how to go about this though.

Dannii · January 25, 2024, 4:29am

No idea. Perhaps whitespace characters can be turned into spaces, and others into question marks?

Yes, 4.2 says

Request input of a line of Unicode characters. This works the same as glk_request_line_event(), except the result is stored in an array of glui32 values instead of an array of characters, and the values may be any valid Unicode code points.

neroden · January 25, 2024, 4:33am

Right, so the control characters are, in fact, valid Unicode code points.

Arguably the cleanest thing to do is to not call RT_ChPrintC in the code which converts a snippet into a text, so that texts derived form a snippet can have any unicode code points in them.

(Texts derived from a snippet are normally only used internally and not printed to the player unless they match patterns, so I think it is unlikely that this will trigger a Glk attempt to print an invalid character; if it does, the game writer can address it with a custom regex filter before printing.)

I am not sure how to go about this though. I was chasing the code upward and downward and I am not currently capable of writing such a patch.

Dannii · January 25, 2024, 4:47am

Found the older discussion I was thinking of:

neroden · January 25, 2024, 5:54am

FWIW this is as far as I got in the code chase:

``inform7/Internal/Inter/BasicInformKit/Sections/Text.i6t```

[ SNIPPET_TY_to_TEXT_TY to_txt snippet;
  return BlkValueCast(to_txt, SNIPPET_TY, snippet);
];

…

=
[ TEXT_TY_Cast to_txt from_kind from_value;
  if (from_kind == TEXT_TY) {
    BlkValueCopy(to_txt, from_value);
  } else if (from_kind == SNIPPET_TY) {
    TEXT_TY_Transmute(to_txt);
    TEXT_TY_CastPrimitive(to_txt, true, from_value);
  } else BlkValueError("impossible cast to text");
];

…

@h Data Conversion.
We use a single routine to handle two kinds of format translation: a
packed I6 string into an unpacked text, or a snippet into an unpacked text.

In each case, what we do is simply to print out the value we have, but with
the output stream set to memory rather than the screen. That gives us the
character by character version, neatly laid out in an array, and all we have
to do is to copy it into the text and add a null termination byte.

Nope nope nope. We want to very much not do this.

Drilling down to the Glulx version:

#ifnot; ! TARGET_ZCODE
[ TEXT_TY_CastPrimitive to_txt from_snippet from_value
  len i stream saved_stream news buffer buffer_size memory_to_free results;

  if (to_txt == 0) BlkValueError("no destination for cast");

  buffer_size = (TEXT_TY_BufferSize + 2)*WORDSIZE;

  RawBufferSize = TEXT_TY_BufferSize;
  buffer = RawBufferAddress + TEXT_TY_CastPrimitiveNesting*buffer_size;
  TEXT_TY_CastPrimitiveNesting++;
  if (TEXT_TY_CastPrimitiveNesting > TEXT_TY_NoBuffers) {
    buffer = VM_AllocateMemory(buffer_size); memory_to_free = buffer;
    if (buffer == 0)
      FlexError("ran out with too many simultaneous text conversions");
  }

  SuspendRTP();
  .RetryWithLargerBuffer;
  saved_stream = glk_stream_get_current();
  stream = glk_stream_open_memory_uni(buffer, RawBufferSize, filemode_Write, 0);
  glk_stream_set_current(stream);

  @push say__p; @push say__pc;
  ClearParagraphing(7);
  if (from_snippet) print (PrintSnippet) from_value;
  else print (PrintI6Text) from_value;
  @pull say__pc; @pull say__p;

  results = buffer + buffer_size - 2*WORDSIZE;
  glk_stream_close(stream, results);
  if (saved_stream) glk_stream_set_current(saved_stream);
  ResumeRTP();

  len = results-->1;
  if (len > RawBufferSize-1) {
    ! Glulx had to truncate text output because the buffer ran out:
    ! len is the number of characters which it tried to print
    news = RawBufferSize;
    while (news < len) news=news*2;
    i = VM_AllocateMemory(news*WORDSIZE);
    if (i ~= 0) {
      if (memory_to_free) VM_FreeMemory(memory_to_free);
      memory_to_free = i;
      buffer = i;
      RawBufferSize = news;
      buffer_size = (RawBufferSize + 2)*WORDSIZE;
      jump RetryWithLargerBuffer;
    }
    ! Memory allocation refused: all we can do is to truncate the text
    len = RawBufferSize-1;
  }
  buffer-->(len) = 0;

  TEXT_TY_CastPrimitiveNesting--;
  BlkValueMassCopyFromArray(to_txt, buffer, 4, len+1);
  if (memory_to_free) VM_FreeMemory(memory_to_free);
];
#endif;

This has to be rewritten – specifically this line has to be replaced:

  if (from_snippet) print (PrintSnippet) from_value;

I am not, however, sure what to replace it with.

Alternatively, if the glk implementations are intelligent enough to know that we can write anything at all to memory buffers, then we just need to remove the check for valid characters in the I6 veneer code. I could probably test that with an extension which replaces the I6 veneer routine. Maybe.

zarf · January 25, 2024, 10:10pm

This is obviously complicated by the veneer checking. You can bypass the veneer and use @streamunichar opcodes. But it’s a bit finicky to say “memory streams should be able to handle arbitrary bytes” when printing arbitrary bytes requires hacking around.

zarf · January 25, 2024, 10:15pm

PrintSnippet() is a simple loop that prints characters from the correct part of the global buffer.

Pretty sure what you want is to replace the entire chunk of code, though. You don’t want to call glk_stream_open_memory_uni(). You want to get straight on with copying bytes from buffer (the global array) to, er, the buffer in this function.

(You’ll definitely want to rename that local variable to avoid confusion!)

zarf · January 25, 2024, 11:25pm

More detail, although I’m afraid just in pseudocode:

Have an if (from_snippet) branch right at the top, after the if ((to_txt == 0)) check. Handle the snippet case and skip the whole stream-based mishegos.

I think it will just be a single call to BlkValueMassCopyFromArray(), giving the buffer position and total desired length. You won’t need to allocate or free memory.

EDIT: Whoops, I forgot about converting from bytes to words. Which won’t be a problem in the next I7 release, but it is a problem now! You’ll need to allocate a temporary array after all, or else dig into block guts. Sorry.

neroden · January 27, 2024, 6:29am

I’m willing to only implement this for the next I7 release, version 11 or whatever it’s numbered, and leave the current state for the current I7 release (hey, it’s taken enough years to fix as it is).

So I think maybe I forget about converting from bytes to words, call BlkValueMassCopyFromArray() and then it’s easy enough? I think I can code that.

I should add a test case while I’m doing that. Unfortunately I still find the Inform 7 testsuite structure confusing so I’m not sure how to do that.

zarf · January 27, 2024, 4:42pm

I figured it out at one point. Let me look at my notes.

Build Inform. (If you haven’t figured out how to do that, I have notes on that too.)

In the top-level inform directory, run a single test by typing

../intest/Tangled/intest inform7 Acidity

Or, if you want to see every command executed,

../intest/Tangled/intest inform7 -verbose Acidity

To create a new test, add a file named inform7/Tests/Test Cases/MyNewText.txt. This might look like:

Test: MyNewTest
For: Glulx

The Kitchen is a room.
The lamp is in the Kitchen.

Test me with "get lamp / i / drop lamp".

After the first two lines, this is just I7 source with a “test me” included.

You then type:

../intest/Tangled/intest inform7 -do -bless MyNewTest

This generates a new file inform7/Tests/Test Cases/MyNewTest--I.txt which contains the test transcript. Verify that it looks right and then check in both MyNewText.txt and MyNewTest--I.txt.

To re-run the test and check the output against MyNewTest--I.txt:

../intest/Tangled/intest inform7 MyNewTest

That’s it.

(There’s some additional hassle if you then type ../intest -from inform7 all – maybe there’s a step I didn’t quite understand – but I don’t think that’s your problem. You can run the test and check it in.)

If you’re creating a group of tests, you can add a group list in Tests/Groups. (I did this when writing Unicode tests.) But it’s optional.

neroden · January 29, 2024, 12:14pm

Thanks. That should help me a lot when trying to make test cases for some of the other stuff I’m patching.

Unbelievably, however, I can’t use this to create a test case for this situation because it behaves differently if there’s a tab in the source text like:

test me with "take	widget"

than it behaves if it’s actually typed at the command line.

Also, this doesn’t work either:

test me with "take[unicode 9]widget"

I think I’ll just have to accept that it’s not possible to make a test case for this.

I’ll make a patch with no test case and submit it.

zarf · January 29, 2024, 2:53pm

Don’t look at me, man. My test framework (regtest.py) can handle that sort of thing.

neroden · January 29, 2024, 3:28pm

Well, anyway, I have it working… better. If not perfectly.

On my fix-control-character-input-bug branch, this will work now:

let cmdln be text;
now cmdln is the player's command;

This won’t:

let cmdln be text;
now cmdln is the substituted form of "[the player's command]";

Because the latter tries to “say” the player’s command, running it through PrintSnippet, which I didn’t change because it has to use print to print to output, and it can’t tell whether it’s printing to an intermediate buffer or to output.

I think arguably the correct thing to do is to get rid of the value checks in the I6 veneer code, replacing this:

      /*  RT__ChPrintC: Check at run-time that it's safe to print (char)
            and do so.
      */
        "RT__ChPrintC",
        "c;\
           if (c<10 || (c>10 && c<32) || (c>126 && c<160))\
             return RT__Err(33,c);\
           if (c>=0 && c<256)\
             @streamchar c;\
           else\
             @streamunichar c;\
         ]", "", "", "", "", ""
    },

With a routine which will pass through Unicode control characters and print them rather than showing errors. But this has potentially large consequences – for all I know, it may be entirely undesirable for I6 users – so I don’t know whether this is an advisable change. Suggestions welcome.

The fix I have working is passing ‘make check’ and makes it possible to:

The tab removal rule is listed first in the after reading a command rulebook.
After reading a command (this is the tab removal rule):
	let cmdln be text;
	let cmdln be the player's command;
	if alternate_cmdln matches the regular expression "\t":
		replace the regular expression "\t" in alternate_cmdln with " ";

Which at any rate corresponds to the documented behavior in Writing With Inform, which claims that you can attempt to match tabs in player input. It should also be faster than the old algorithm, since it does a simple copy rather than fooling around with print statements and then doing a copy. So I’ll submit this as a patch for now…

Zed · January 29, 2024, 7:12pm

Where does it say this? I couldn’t readily find it.

I think I’d be inclined to do it in I6 in Parser.i6t in the Keyboard function:

! Print the prompt, and read in the words and dictionary addresses
                PrintPrompt();
                KeyboardPrimitive(a_buffer, a_table, DrawStatusLine);

                ! Set nw to the number of words
                #Iftrue CHARSIZE == 1;
                nw = a_table->1;
                #Ifnot;
                nw = a_table-->0;
                #Endif;

becoming:

! Print the prompt, and read in the words and dictionary addresses
                PrintPrompt();
                KeyboardPrimitive(a_buffer, a_table, DrawStatusLine);

                ! Set nw to the number of words ; strip tabs
                #Iftrue CHARSIZE == 1;
                nw = a_table->1;
                for (i = 2 : i <= (buffer->1) + 1 : i++) if ((buffer->i) == 9) buffer->i = ' ';
                #Ifnot;
                nw = a_table-->0;
                for (i = 1 : i <= (buffer-->0) : i++)  if ((buffer-->i) == 9) buffer-->i = ' ';
                #Endif;

(Not tested.)

Of course, this is much harder for an author to undo than your rule approach. But one could also do…

The tab removal rule is defined by Inter as "TAB_REMOVAL_R".
The tab removal rule is listed first in the after reading a command rulebook.

[ TAB_REMOVAL_R i;
 #Iftrue CHARSIZE == 1;
                for (i = 2 : i <= (buffer->1) + 1 : i++) if ((buffer->i) == 9) buffer->i = ' ';
                #Ifnot;
                for (i = 1 : i <= (buffer-->0) : i++)  if ((buffer-->i) == 9) buffer-->i = ' ';
                #Endif;
];

(also untested)

Draconis · January 29, 2024, 7:22pm

Veneer routines can be replaced by the library (indeed, that’s how all the action processing is implemented in the I6 standard library), so that wouldn’t have to be a change impacting I6 users at all.