Unicode parsing is coming to I7

zarf · June 3, 2023, 2:50pm

A couple of recent threads have lamented the difficulty of parsing non-English languages in I7.

This is hard because the entire parser kit uses 8-bit character arrays.

So I wrote up a proposal to fix that – to use 32-bit character arrays instead. And then I fixed it. The change has now been accepted into the I7 main repository.

(This essentially imports the old “Unicode Parser” extension and makes it part of core Inform.) (It’s a Glulx-only change; I didn’t touch the Z-machine parser code.)

This means it will be in the next I7 release. If you’re really excited, you can build I7 from source and try it now. (You’ll also have to be excited about running bash scripts from the command line; building I7 from source is a bit of work.)

The point is, you will be able to write Inform code like this (for Glulx only):

The player carries a 灯笼.
Understand "φανός" as the 灯笼.

Which runs:

>i
You are carrying:
  a 灯笼

>drop φανός
Dropped.

>get 灯笼
Taken.

>x 灯笼
You see nothing special about the 灯笼.

Of course this does nothing at all to understand Greek or Chinese grammar. I’m using the standard English parser here. The game just thinks these objects have very strangely spelled names. But it demonstrates that the parser can recognize Unicode characters in player input and match them against the game dictionary.

Now, because I’ve reworked the parser guts, this change will break any I6 code that deals directly with the buffer array. I’ve updated everything in the Inform repository that does this. But the change will require updates for some extensions.

(If you aren’t writing I6 code in extensions, ignore this section.)

For an example, look at the “Punctuation Removal” extension. This is part of the Inform distribution so I’ve already fixed it, but it demonstrates what’s going on.

The old version of the extension had functions like this:

[ PeriodStripping i j;
	for (i = WORDSIZE : i <= (buffer-->0)+(WORDSIZE-1) : i++)
	{ 
		if ((buffer->i) == '.') 
		{	buffer->i = ' ';  
		}
	}
	VM_Tokenise(buffer, parse);
];

Now they look like this:

[ PeriodStripping i j;
	for (i = 1 : i <= (buffer-->0) : i++)
	{ 
		if ((buffer-->i) == '.') 
		{	buffer-->i = ' ';  
		}
	}
	VM_Tokenise(buffer, parse);
];

In the new Unicode world, you have to access the buffer array using -->, and the math for looping through it looks slightly different.
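
To illustrate why the wider buffer matters, here is a minimal sketch in the same style. It is not taken from the extension, and the routine name GreekQuestionMarkFix is made up for this example. It assumes the new layout shown above, where buffer-->0 holds the character count and each following word holds one full code point, so a character above 255 such as the Greek question mark (U+037E) can be compared directly:

```inform6
! Hypothetical routine, for illustration only.
! Assumes the new layout: buffer-->0 is the character count,
! and buffer-->1 onward hold one 32-bit code point each.
[ GreekQuestionMarkFix i;
	for (i = 1 : i <= (buffer-->0) : i++)
	{
		! $037E is the Greek question mark, which looks like a semicolon.
		if ((buffer-->i) == $037E)
		{	buffer-->i = ' ';
		}
	}
	! Re-tokenise so the parse array reflects the edited buffer.
	VM_Tokenise(buffer, parse);
];
```

With the old 8-bit buffer this comparison could never succeed, because each entry held a single byte.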

mathbrush · June 3, 2023, 5:12pm

Thanks, this is great! I hope to see some Inform Chinese games in the future…

Draconis · June 3, 2023, 5:19pm

Excellent! I’m so glad this is happening!

Ntinakos_Sit · June 4, 2023, 12:35pm

This looks excellent! Do we have any idea when the next update is going to be released?

zarf · June 4, 2023, 2:46pm

Not a clue!

GJMen · June 4, 2023, 2:52pm

Fantastic!

Maybe I’ll implement the next iteration of Stygian Dreams to understand Greek commands for absolutely no reason other than technically being able to…

ramstrong · June 4, 2023, 3:14pm

I see this is for I7. Does this also work for I6?

Ntinakos_Sit · June 4, 2023, 3:22pm

I don’t believe that such a problem exists in I6, so I guess that this only applies to I7.

zarf · June 4, 2023, 3:29pm

The problem does exist for the I6 library. But the parsers have diverged enough that you couldn’t just copy the changes over. Someone will have to pick up the task of porting the changes, and I’m afraid it won’t be me.

ramstrong · June 4, 2023, 3:38pm

Hmmm. So the problem with I6 doesn’t lie in the interface, but in the parser library. In other words, it’s possible to have Unicode I6, but somebody will have to overhaul the whole parser library, right?

You’re right that it is a major undertaking. Thank you for answering.

Draconis · June 4, 2023, 4:13pm

Not as much of an overhaul as it is for I7, thankfully; the basic idea is to make buffer a word array instead of a byte array, which means changing a whole lot of -> to -->, but no fundamental restructuring.

The reason this is such a big deal for I7 is that the I7 compiler couldn’t handle Unicode in Understand lines, which meant you could make all these changes to the parser (and in fact zarf did quite some time ago), but I7 still wouldn’t let you put Unicode in your object names or verb definitions. This is the big change that’s being announced now.

The I6 compiler on the other hand is fine with Unicode (at least if you use -Cu, I believe) so only the library needs changing. Once the parse buffer uses words instead of bytes to hold each character, everything else should just work.

(This is one of those things I could probably do myself, but I don’t know enough about the internals of the I6 library to be confident I wouldn’t break anything. But if nobody else wants to take it I’ll give it a shot.)

zarf · June 4, 2023, 5:01pm

You also need $DICT_CHAR_SIZE=4.

And remember to call glk_request_line_event_uni() instead of glk_request_line_event() when getting the player input line.
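
Put together, the I6 side of those settings might look roughly like the sketch below. This is only a sketch under stated assumptions, not code from any library. The names UNI_BUFFER_WORDS, uni_buffer, uni_event, and ReadUnicodeLine are made up for illustration. It assumes the glk_* wrapper functions and the gg_mainwin window global are already defined by whatever library you are building on. The !% lines are compiler header comments and only take effect at the very top of the main source file.

```inform6
!% -G
!% -Cu
!% $DICT_CHAR_SIZE=4

! Hypothetical names throughout; gg_mainwin and the glk_* wrappers
! are assumed to come from the library.
Constant UNI_BUFFER_WORDS = 257;
Array uni_buffer --> UNI_BUFFER_WORDS;  ! word 0 = length, words 1.. = characters
Array uni_event --> 4;                  ! Glk event: type, window, val1, val2

[ ReadUnicodeLine;
	! Input lands in words 1 onward, one 32-bit character per word.
	! The length argument counts characters, not bytes.
	glk_request_line_event_uni(gg_mainwin, uni_buffer + WORDSIZE,
		UNI_BUFFER_WORDS - 1, 0);
	do {
		glk_select(uni_event);
	} until (uni_event-->0 == 3);       ! 3 = evtype_LineInput
	! val1 of a line input event is the number of characters entered.
	uni_buffer-->0 = uni_event-->2;
];
```

A real library would also need to handle other event types (window rearrangement, timers) inside that loop; they are simply ignored here.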

Ntinakos_Sit · November 17, 2023, 7:10pm

@zarf Which version are you using to work on this? I’m getting error messages saying that the syntax on some lines was withdrawn in April 2022.

Thanks!

zarf · November 17, 2023, 8:01pm

Could you be more specific? What are you trying to do? I6 or I7?

Ntinakos_Sit · November 17, 2023, 8:34pm

I am using I7, and I want to start developing an extension for the Greek translation of I7. Also, I saw that version 6L38 was used for the French translation, but I would like to test your extension for Greek characters to make sure that I can keep building on it.

zarf · November 17, 2023, 9:02pm

The changes that I talk about at the top of the thread are not yet released. The current version of I7 cannot handle Greek text input. (At least not without a great deal of hackery.)

I still don’t know when the next release is.

Ntinakos_Sit · November 18, 2023, 9:54am

I’ve built I7 from source though, and I thought that this way I would be able to test your changes.

In any case, for my goal, I could skip this issue for now and work with transliteration to avoid using Greek characters, so it is not a big issue right now.

zarf · November 18, 2023, 3:44pm

If you build from source, you should be able to. See the example “UnicodeUnderstanding-G.txt” here.

Ntinakos_Sit · January 18, 2025, 12:07am

Hi @zarf,

I’m currently developing the language kit for Greek, and I’ve encountered an issue with displaying and processing Greek characters in constants. For example, after replacing “by” with the Greek word “από”:

Constant BY__WD = 'από';

The output appears garbled, like this:

An Interactive Fiction Î±Ï?Ï? Ntinakos

It seems that Greek characters are not being handled properly when used in the Language.w file. They are neither displayed correctly nor understood during gameplay (I tried with other commands, such as “ξανά” - which means “again”).

What would you recommend? Should I avoid modifying Language.w with Greek characters and handle this entirely within the extension file? Or is there a workaround to make Greek characters work seamlessly in constants?

Thanks!

zarf · January 18, 2025, 3:06pm

I don’t think I tested Unicode processing in .w files. Or looked at that part of the code at all.

Please file a bug.