Cannot parse Greek characters in Inform 7

Ntinakos_Sit · May 2, 2023, 10:50am

Hi everyone!

I have recently started working on my university thesis, which is a Greek translation of Inform 7, built as an extension. However, I soon realized that there are a number of issues regarding the Greek language.

As stated in this topic, non-Latin languages are actually not supported by the parser. The entire Greek alphabet is different from the Latin one, even though some characters are written in exactly the same way (for example the capital letters A, B, etc.). As a result, when I try to write some Greek text, the parser does not recognize it.
The only place where I managed to make Greek characters work is in descriptions or printed names. For example, if I write “The printed name of [room] is [Greek name of room]”, then it is shown correctly in the story.
The Unicode extensions seem to make no difference: I face the same issue with or without them.

This is a serious problem that I am facing right now, and I don’t really know how to tackle it. Is it possible to modify the parser to support Greek characters as normal input? Is there another solution that I should try to look for?

Any ideas?

rovarsson · May 2, 2023, 1:42pm

I can’t help you with this, I just wanted to say that the task you’ve set yourself is valuable and impressive. I hope some people more knowledgeable about Inform’s workings can help you out.

zarf · May 2, 2023, 2:00pm

A long time ago, I wrote an experimental parser update which enabled full Unicode input:

github.com

erkyrath/i7-exts/blob/master/Unicode Parser.i7x

Version 8/150625 of Unicode Parser (for Glulx only) by Andrew Plotkin begins here.

[Modernized for 6L38.]

[Tell the I6 compiler to generate a dictionary containing Unicode values rather than 8-bit characters. This requires I6 version 6.32 or later.]
Use DICT_CHAR_SIZE of 4.

[Change the three buffer arrays from length/byte format to simple word arrays.]
Include (-
Array gg_event --> 4;
Array gg_arguments buffer 28;
Global gg_mainwin = 0;
Global gg_statuswin = 0;
Global gg_quotewin = 0;
Global gg_scriptfref = 0;
Global gg_scriptstr = 0;
Global gg_savestr = 0;
Global gg_commandstr = 0;
Global gg_command_reading = 0;      ! true if gg_commandstr is being replayed
Global gg_foregroundchan = 0;

This file has been truncated.

The included example code demonstrates some Greek letters in verbs and nouns.

However, (a) you need some terrible workarounds to define objects with non-Latin words, and (b) this is for an old version of Inform (9.2, also known as 6L38). It won’t work for the current version.

We expect that something like this extension will be integrated into Inform eventually, but the process hasn’t started as far as I know.

BadParser · May 2, 2023, 2:06pm

Back in 2016, @halkun was working on an Inform extension to parse Japanese input; however, this was for the Japanese language written in Romaji (i.e. Latin characters), not actual Japanese characters. You might find this post interesting, if not entirely useful:
https://intfiction.org/t/japanese-input-parser-once-more-into-the-breech/10677

Note that it was a bit out of date even at the time with regard to what was then the latest version of Inform 7. The main idea of this post, however, is converting the input at the “After reading a command” stage, and that might be a useful place for you to start.

Maybe @halkun or someone else familiar with their work will see this and step in.

otistdog · May 2, 2023, 2:12pm

A more useful thread can be found at unicode problem.

zarf may not be giving himself enough credit. Using the version of Unicode Parser linked there (which is still compatible with 6M62, though I’m not sure about 10.1.2), I got the following transcript:

Adventure Lab
You can see a ΔV here.

> x
(the ΔV)
You see nothing special about the ΔV.

> x ΔV
You see nothing special about the ΔV.

> παίρνω
(the ΔV)
Taken.

The source code used was:

Include Unicode Parser by Andrew Plotkin.

The Adventure Lab is a room.

A thing has a text called greek-name. Understand the greek-name property as describing a thing. [see Unicode Parser documentation]

A ΔV is in Adventure Lab. The printed name of it is "ΔV". The greek-name of it is "ΔV".

Include
(-	Verb '@{3C0}@{3B1}@{3AF}@{3C1}@{3BD}@{3C9}' = 'get';
	Verb '@{3C0}@{3B1}@{3B9}@{3C1}@{3BD}@{3C9}' = 'get';
	Verb '@{11D}et' = 'get';
-) 	after "Grammar" in "Output.i6t". [see Unicode Parser documentation]

and to get it to compile, it was necessary to:

1. Compile I7 source code.
2. ~~Go to the project’s “Build” directory (where auto.inf was generated).~~
3. ~~Issue the command /usr/lib/x86_64-linux-gnu/gnome-inform7/inform6 -wxE2kSDGC7 $huge auto.inf output.ulx~~
4. Test the compiled game with a non-IDE interpreter.

Note that the C7 switch is not normally included. Someone may be able to offer a way to get the IDE to add it; I don’t know of one. [EDIT: Also note that the C7 switch does not appear to be necessary for Glulx. See post below.]
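
As an aside on the Verb lines in the source above: each @{...} escape appears to be simply the hexadecimal Unicode code point of one character. Typing these by hand is error-prone, so here is a minimal Python sketch that generates them. The helper name i6_escape is made up for illustration and is not part of Inform or of Unicode Parser.

```python
import unicodedata


def i6_escape(word: str) -> str:
    """Write every non-ASCII character as an @{HEX} escape (hypothetical helper)."""
    # Normalise to precomposed characters so an accented vowel is one code point.
    composed = unicodedata.normalize("NFC", word)
    return "".join(
        ch if ord(ch) < 128 else "@{%X}" % ord(ch)
        for ch in composed
    )


# The Greek verb from the transcript, written with explicit code points.
print(i6_escape("\u03c0\u03b1\u03af\u03c1\u03bd\u03c9"))
```

Run on παίρνω, this reproduces the escapes in the first Verb line; run on ĝet, it gives the third.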

It sounds like you have your work cut out for you. Good luck!

P.S. You may also want to take special note of a mention in the Glk specification section 2.6.1:

The initial decomposition is only necessary because of a historical error in the Unicode spec: character 0x0345 (COMBINING GREEK YPOGEGRAMMENI) behaves inconsistently.

zarf · May 2, 2023, 2:25pm

I’m glad it still works with 6M62! However, it will definitely require updating for 10.1.

This is a hard problem for two reasons:

(1) The deepest parts of the parser, meaning the code that reads in player input and operates on it, all use byte arrays. That is, they all operate on 8-bit characters. Unicode characters just don’t fit.

(2) The Inform compiler knows this, so it doesn’t allow Unicode characters in verb synonyms, noun synonyms, etc.

The fix for (1) is to replace all those arrays with arrays of 32-bit values, and then replace all the code that operates on them. This is a lot of code. It’s not conceptually difficult, it’s just a lot of plumbing.

Fixing (2) requires a compiler update after (1) is settled.

You could weasel around (1) by changing just the first step of player input to re-encode the input into Latin-7, which can be stored in byte arrays. This would work with only a few lines of the parser changed. But you’d have to write all your verb/noun synonyms in ASCII equivalent letters: “bq\wor” instead of “βράχος”. It would be pretty unpleasant.

(Someone is going to suggest UTF-8 in 8-bit arrays. Yay, UTF-8! Don’t go there. It’s more work than changing the arrays to 32-bit arrays.)
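
To make the size problem concrete, here is a small Python illustration (not Inform code, and only a sketch) of the two storage options discussed above. It shows that every Greek letter has a code point above 255, that a 32-bit array keeps one slot per character, and that UTF-8 in a byte array does not.

```python
import unicodedata

# "Rock" in Greek, normalised so the accented vowel is a single code point.
word = unicodedata.normalize("NFC", "βράχος")

# Every Greek code point is above 0xFF, so none fits in one 8-bit slot.
code_points = [ord(ch) for ch in word]
fits_in_byte = [cp <= 0xFF for cp in code_points]

# One 32-bit value per character: slot i always holds character i.
utf32 = word.encode("utf-32-be")

# UTF-8 in a byte array: variable width, two bytes per Greek letter here,
# so byte offsets no longer line up with character positions.
utf8 = word.encode("utf-8")

print(len(word), len(utf32) // 4, len(utf8))  # 6 6 12
```

With UTF-8, any parser code that steps through the buffer one byte at a time would have to be taught about multi-byte sequences, which is the extra work the warning above refers to.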

zarf · May 2, 2023, 2:27pm

I don’t think the -C7 switch helps with the parsing part, but I may have missed something.

Zed · May 2, 2023, 4:32pm

I started porting this to 10.1 at one point, but it’s not just a matter of replacing instead of’s with replacing’s: the code being replaced has changed in enough places that it’s finicky to get right. But not a huge or an especially hard problem, just a tedious one.

Draconis · May 2, 2023, 5:05pm

I suspect the best option for right now—i.e. the best option barring an update to treat Unicode as a first-class citizen for parsing—is to add a little hack to the parser that reads input into a word array, then passes it through a translation table to convert it into ASCII which is saved in the byte array. Then you would write all your verb and noun synonyms in Betacode or something like it.

Distinctly not ideal! But until the compiler is updated, I think it’s the best we can do. It’ll be easier than defining all your verb and noun synonyms in Inform 6, certainly.

EDIT: Specifically, this works for Greek because the Greek alphabet has fewer characters than the English one. So you can replace each Greek letter with one English letter without losing any information (and even have two letters left over, which you can use for the tonos and the final sigma). For other languages, this hack wouldn’t work as well.

zarf · May 2, 2023, 5:37pm

Aha, yeah. I hadn’t heard of Betacode but it would be less grating than Latin-7.

Draconis · May 2, 2023, 7:50pm

There’s probably an equivalent for Modern Greek that doesn’t have to worry about lunate sigma and polytonic accents, but betacode is what I know best. The two letters it doesn’t use are J and V (except in really old dialectal texts where V is used for digamma), so I would use J for final sigma (or just ignore the difference between the different sigmata) and V for the tonos. The advantage of V over the slashes and parentheses of standard betacode is that you don’t have to worry about the parser mistaking them for word boundaries.

At which point the question becomes, are you willing to type Understand “biblivo” as the βιβλίο all over the place? Or does that undermine the natural-language-ness too much?
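
For illustration, here is a minimal Python sketch of the translation table described above. It assumes the usual Beta Code letter assignments plus the two additions proposed in this thread (J for final sigma, V for the tonos), and it is only a model of the mapping, not parser code. The names BETA and to_ascii are made up for this example.

```python
import unicodedata

# Beta Code style letter assignments (assumed), one Latin letter per Greek letter.
BETA = dict(zip("αβγδεζηθικλμνξοπρστυφχψω", "abgdezhqiklmncoprstufxyw"))
BETA["ς"] = "j"       # final sigma, as proposed above
TONOS = "\u0301"      # combining acute that NFD splits off an accented vowel


def to_ascii(command: str) -> str:
    """Map a Greek command to the ASCII form the byte-array parser would see."""
    out = []
    for ch in unicodedata.normalize("NFD", command.lower()):
        if ch == TONOS:
            out.append("v")               # the tonos follows its vowel
        else:
            out.append(BETA.get(ch, ch))  # unmapped characters pass through
    return "".join(out)


print(to_ascii("βιβλίο"))  # biblivo
```

One caveat of this sketch: a player who types a Latin v or j directly would collide with the tonos and final-sigma codes, so a real implementation would need to decide how to treat mixed input.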

otistdog · May 2, 2023, 10:04pm

I think you’re right – that switch is more applicable to Z-Machine.

I was confused by the fact that within the IDE interpreter I got responses like:

Adventure Lab
You can see a ΔV here.

>x
That's not a verb I recognise.

>x ΔV
That's not a verb I recognise.

>παίρνω
That's not a verb I recognise.

but the same compiled output.ulx file (made without the C7 switch) seems to work just fine with Gargoyle, so I guess the issue is something within the Linux IDE interpreter for 6M62.

Ntinakos_Sit · May 3, 2023, 4:34pm

First of all I would really like to thank all of you for your answers! @zarf @Draconis @rovarsson @otistdog

Of course, it is too much at this time to focus on Ancient Greek. Even for us Greeks, reading Ancient Greek is quite difficult, because it is much more complicated than our modern language. I don’t wanna think about all the foreign people who want to get involved with it.

My primary goal would be to work for the Modern Greek language, and make Inform accessible to people that are not familiar with English. However, I understand that this is quite difficult, so I could focus more on developing a “greeklish” version, using for instance “vivlio” or “biblio” instead of the word “βιβλίο” (which means book for anyone interested). By providing instructions to the user, regarding the correct “translation” from Greek to greeklish, I could see that becoming reality in the next months probably.

I would really appreciate it if you could also help me get started with those changes, because I am a relatively new user of Inform and I have little to no idea how to modify the relevant code to make this work.

Lastly, in your opinion, is the “translation” part (analyzing the grammar and adapting everything that has already been created for other languages) going to take the most time?

otistdog · May 4, 2023, 2:30am

I think the expertise that you are seeking is most likely to be found from someone who has already worked on a translation to a non-English language. Regrettably, I am not one of them. @Natrium729 may be willing to offer some guidance.

WWI 27.27 Translating the language of play basically suggests reading the Inform Designer’s Manual, 4th edition (aka DM4) and then inspecting the built-in extension “English Language by Graham Nelson” and seeing what you can do. That’s only part of the job, though. You would then want to go through the entire Standard Rules and note all of the responses there, then issue replacements (perhaps via an extension of your own).

The good news is that this forum is a pretty good place to get help.

Natrium729 · May 4, 2023, 3:57pm

If you’re quite new to Inform, it’s understandable you’re having trouble getting your head around all that.

You can get away with a lot only with Inform 7, but at one point you’ll have to dive into Inform 6 (now called Inter) and Preform, which are 2 lower level parts of the whole Inform ecosystem – especially if your language’s grammar is quite different from English (for example doesn’t have a subject-verb-complement order or has cases like in German or Latin).

Regarding the parsing of Greek characters, since proper Unicode support is not really here for understand lines, I suggest staying within Inform 7 and exclusively using a Latin transliteration in your understand lines (especially if there’s an “official” or widely-known one). After that it would be “easy” to make the parser accept Greek characters with @Draconis’s suggestion. (But it won’t really be easy, hence the quotation marks.)

For example, in French, all the understand lines are written without diacritics, and there’s an Inform 6 routine that strips them before parsing. (It’s just more difficult to implement that kind of conversion when Unicode characters are involved.)
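
As a rough illustration of the diacritic-stripping idea (in Python, and not a copy of the French extension’s Inform 6 routine), Unicode decomposition separates each accented letter into a base letter plus combining marks, which can then be dropped. The function name strip_diacritics is made up for this sketch.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("éléphant"))  # elephant
```

The same function turns παίρνω into παιρνω, so a Greek translation could accept commands typed with or without the tonos and match them against a single unaccented form.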

I guess you’re translating 10.1? What have you got translated right now?

In addition to reading the DM4 as suggested above, I fear the only way to learn how to translate Inform is to decipher the other translations, and check a few threads on this forum.

I’m working on a guide for translating Inform right now, but it’s quite a lot of work, so in the meantime, feel free to ask questions!

Ntinakos_Sit · May 4, 2023, 4:58pm

Well, I haven’t really started working on it because I didn’t know how I should continue with that parser issue that I first mentioned. I started tinkering with 10.1 and tried to understand the way that other translations have been developed so far.

Regarding the transliteration, there is ISO 843 (https://www.translitteration.com/transliteration/en/greek/iso-843), which can be used to convert Greek characters to Latin, but not vice versa. So I guess you could say that this is an “official” way; however, it might not be the most intuitive choice for the user.

Something that I also need to mention is that some work has already been done on a Greek translation of Inform 6, at least in part, but I don’t know if that can be applied to the newer Inform 7.

Do you think that the best way to continue is transliteration?
Thank you a lot for your support!

Natrium729 · May 4, 2023, 5:25pm

I believe the transliteration is the way to go for everything related to user input. (As you found out, Unicode is OK in text output like descriptions).

If there’s a standard like that ISO one, you won’t have to explain the rules.
It’s true it won’t be intuitive for users, but you will be able to add a way to transliterate Greek characters in commands in a subsequent version of your translation. In the end only authors would need to care about that, and you can start working on your translation right now.
It makes it easier for people without a Greek keyboard to play games in Greek.
The day Inform 7 supports understand lines containing Unicode, it will be trivial (if a bit tedious) to add them.

Yes, the work done on an Inform 6 translation can be used. In the best case, some parts will be copy-pastable; in the worst case, you would be able to take some inspiration from it.

I consider that 10.1 is not quite ready for translation yet, but it should be OK. (I decided to stay on 6L38 for French, but the Spanish and Italian translators updated their translations, I think.) The next Inform version will have some nice features regarding extensions, but it’s possible to start working on a translation right now.

rovarsson · May 4, 2023, 5:37pm

Nifty link, that! The only thing that’s not intuitive to me is the substitution of “bèta” into “v” where I would expect “b”. A Latin “b” is just preserved as-is when I transliterate to Greek.

I have to add that I don’t have any experience with modern Greek. I studied classical Greek in high school, and I’m aware that there are significant differences. Perhaps the modern “bèta” does sound more like a “v”. (The “b” and “v” sounds are very close to each other anyway, with a single letter to represent both, or a single in-between sound, in a number of languages I believe?)

Ntinakos_Sit · May 4, 2023, 5:40pm

Why is the newer version not the best one for the translation?
Do you think that I should go for 10.1, or stick to a previous version?

Ntinakos_Sit · May 4, 2023, 5:45pm

This is exactly the case!
The Greek “beta” sounds like “v”, and the English “b” sounds like the Greek characters “μπ”.

For example, the word “ball” in Greek is “μπάλα”. Unfortunately you cannot see the result in the link that I provided, because ISO 843 only applies to “Greek to Latin” and not vice versa.
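
To show how the “beta sounds like v, μπ sounds like b” rule could work in a greeklish scheme, here is a minimal Python sketch. The tables below are a hypothetical, sound-based mapping chosen for this example; they are not the ISO 843 table, and the names LETTERS, DIGRAPHS and greeklish are made up.

```python
import unicodedata

# Hypothetical sound-based table: note that beta maps to "v", not "b".
LETTERS = {
    "α": "a", "β": "v", "γ": "g", "δ": "d", "ε": "e", "ζ": "z",
    "η": "i", "θ": "th", "ι": "i", "κ": "k", "λ": "l", "μ": "m",
    "ν": "n", "ξ": "x", "ο": "o", "π": "p", "ρ": "r", "σ": "s",
    "ς": "s", "τ": "t", "υ": "y", "φ": "f", "χ": "ch", "ψ": "ps",
    "ω": "o",
}
# Letter pairs that stand for a single Latin sound (only one shown here).
DIGRAPHS = {"μπ": "b"}


def greeklish(text: str) -> str:
    """Convert Greek text to a greeklish spelling using the tables above."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    for greek, latin in DIGRAPHS.items():   # handle pairs before single letters
        plain = plain.replace(greek, latin)
    return "".join(LETTERS.get(ch, ch) for ch in plain)


print(greeklish("βιβλίο"), greeklish("μπάλα"))  # vivlio bala
```

In this particular table both η and ι become “i” and both ο and ω become “o”, so the result cannot be converted back to Greek unambiguously. That is fine for matching player input, but it means the author-facing spelling rules would need to be documented.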