A couple of recent threads have lamented the difficulty of parsing non-English languages in I7.
This is hard because the entire parser kit uses 8-bit character arrays.
So I wrote up a proposal to fix that – to use 32-bit character arrays instead. And then I fixed it. The change has now been accepted into the I7 main repository.
(This essentially imports the old “Unicode Parser” extension and makes it part of core Inform.) (It’s a Glulx-only change; I didn’t touch the Z-machine parser code.)
This means it will be in the next I7 release. If you’re really excited, you can build I7 from source and try it now. (You’ll also have to be excited about running bash scripts from the command line; building I7 from source is a bit of work.)
The point is, you will be able to write Inform code like this (for Glulx only):
The player carries a 灯笼.
Understand "φανός" as the 灯笼.
Which runs:
>i
You are carrying:
a 灯笼
>drop φανός
Dropped.
>get 灯笼
Taken.
>x 灯笼
You see nothing special about the 灯笼.
Of course this does nothing at all to understand Greek or Chinese grammar. I’m using the standard English parser here. The game just thinks these objects have very strangely spelled names. But it demonstrates that the parser can recognize Unicode characters in player input and match them against the game dictionary.
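To see why the old 8-bit arrays couldn't hold these names, here is a small sketch in Python (not Inform; the helper names are made up for illustration). It just checks whether a character's Unicode code point fits in one byte or in one 32-bit word.

```python
# Illustrative model only, not Inform code.
# An 8-bit array slot holds values 0..255; a 32-bit slot holds 0..0xFFFFFFFF.

def fits_in_8_bits(ch):
    """True if the character's code point fits in a single byte."""
    return ord(ch) <= 0xFF

def fits_in_32_bits(ch):
    """True if the character's code point fits in a 32-bit word."""
    return ord(ch) <= 0xFFFFFFFF

# Print the code point of each sample character and where it fits.
for ch in "lφ灯":
    print(ch, hex(ord(ch)), fits_in_8_bits(ch), fits_in_32_bits(ch))
```

Plain ASCII letters fit either way, but the Greek and Chinese characters only fit in the wider slots, which is why the buffer had to change.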
Now, because I’ve reworked the parser guts, this change will break any I6 code that deals directly with the buffer array. I’ve updated everything in the Inform repository that does this. But the change will require updates for some extensions.
(If you aren’t writing I6 code in extensions, ignore this section.)
For an example, look at the “Punctuation Removal” extension. This is part of the Inform distribution so I’ve already fixed it, but it demonstrates what’s going on.
The old version of the extension had functions like this:
[ PeriodStripping i j;
    for (i = WORDSIZE : i <= (buffer-->0)+(WORDSIZE-1) : i++)
    {
        if ((buffer->i) == '.')
        {
            buffer->i = ' ';
        }
    }
    VM_Tokenise(buffer, parse);
];
Now they look like this:
[ PeriodStripping i j;
    for (i = 1 : i <= (buffer-->0) : i++)
    {
        if ((buffer-->i) == '.')
        {
            buffer-->i = ' ';
        }
    }
    VM_Tokenise(buffer, parse);
];
In the new Unicode world, you have to access the buffer array using -->, and the math for looping through it looks slightly different.
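If the index math in the two loops above is hard to picture, here is a rough model of both layouts in Python (again, not Inform). It assumes the old buffer is a byte array whose first WORDSIZE bytes hold the length as a big-endian word, and the new buffer is an array of words with the length in word 0. The function names are hypothetical, and the real routines also call VM_Tokenise afterward, which this sketch leaves out.

```python
import struct

WORDSIZE = 4  # bytes per Glulx word

def make_old_buffer(text):
    """Assumed old layout: 4-byte length word, then one byte per character."""
    data = text.encode("latin-1")
    return bytearray(struct.pack(">I", len(data)) + data)

def make_new_buffer(text):
    """Assumed new layout: word 0 is the length, then one word per character."""
    return [len(text)] + [ord(ch) for ch in text]

def strip_periods_old(buf):
    length = struct.unpack_from(">I", buf, 0)[0]   # models buffer-->0
    # Characters live at byte offsets WORDSIZE .. length + (WORDSIZE - 1).
    for i in range(WORDSIZE, length + WORDSIZE):
        if buf[i] == ord("."):                     # models buffer->i
            buf[i] = ord(" ")

def strip_periods_new(buf):
    length = buf[0]                                # models buffer-->0
    # Characters live at word indexes 1 .. length.
    for i in range(1, length + 1):
        if buf[i] == ord("."):                     # models buffer-->i
            buf[i] = ord(" ")

old = make_old_buffer("x lamp.")
strip_periods_old(old)
old_text = bytes(old[WORDSIZE:]).decode("latin-1")

new = make_new_buffer("x 灯笼.")
strip_periods_new(new)
new_text = "".join(chr(c) for c in new[1:])
```

In the old layout the loop index is a byte offset, so it has to skip the length word; in the new layout the index counts whole words, so the characters simply start at 1.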