[log] Porting Inform 6 to Chinese

Hi all!

I’m trying to do a Chinese translation of Inform 6. (context: IFictionFR) In order not to lose my head, I’ll note my progress and ideas here. I’d greatly appreciate any ideas or comments.

I set out to add a new option (-Cu) to allow typing the program text in UTF-8. (Instead of ‘@{304A}@{3059}@{3082}’, one should be allowed to type ‘おすも’ directly.) It quickly turned out that the current string compression mechanism isn’t designed for Chinese. (MAX_UNICODE_CHARS is 64, not 3000, and the compression would be quite inefficient with 3000.) So currently I’m trying to change the semantics of strings: when -Cu is on, strings will become polyvalent. If a string is entirely within ISO8859-1, it will stay compressed (E1). If the string contains a Unicode character > 0xFF, it will be uncompressed (E2). I think this will not be a great nuisance: the major use of a string is to be printed, and printing is done with @streamstr, which takes either string type.
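To make the polyvalent-string rule concrete, here is a minimal sketch in C. It is not taken from the compiler or the patch: choose_string_type and the enum names are hypothetical, and it assumes the string has already been turned into an array of Unicode code points. The only facts it relies on are the two Glulx string type bytes mentioned above (E1 compressed, E2 uncompressed Unicode).

```c
#include <stddef.h>
#include <stdint.h>

/* Glulx string type bytes referred to in the post. */
enum {
    STRING_COMPRESSED = 0xE1,  /* compressed string (E1) */
    STRING_UNICODE    = 0xE2   /* uncompressed Unicode string (E2) */
};

/* Hypothetical helper, not a real compiler routine: pick the string
   type for a string given as n Unicode code points. Any character
   above 0xFF forces the uncompressed Unicode form. */
static int choose_string_type(const uint32_t *cps, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (cps[i] > 0xFF)
            return STRING_UNICODE;
    }
    return STRING_COMPRESSED;
}
```

With this rule, a string such as "café" (all characters at or below 0xFF) would stay compressed, while any string containing おすも would be stored as E2.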

I’ll come back to say if the change works.

I think it would be better if there were a separate string compression option to control whether strings are polyvalent. People might want to use -Cu with Roman alphabets, or even within English. (I’ve wanted -Cu for years now.)

For the past few years, we’ve mostly been adding compilation options as $SETTING entries rather than command-line switches. (Too many command-line switches!) So the string compression option could be called $COMPRESS_UNICODE_CHARS, with a default value of 1, but you’d set it to 0 for your system.

I also think the compression system will work better than you expect for Chinese text. But you can test that when you get there.

Thanks for working on this.

Great zarf replying in person!

Thanks for the info. I initially wanted to add a command-line option to switch the system to “Unicode mode”, but neither -u nor -U is available. It’s conceptually ugly to have -Cu as a “global Unicode switch”, so indeed, it’s better to leave the internal workings to $SETTING entries.

Now you can have UTF-8 source input with Inform 6. It’s a dirty hack, though I hope a bearable one.

Cu.patch (on the I6 compiler tree of I7)
unidicttest.inf (Test file with $MAX_UNICODE_CHARS=4000 and a Chinese text famous for containing no repeated characters. The compression algorithm does in fact work quite well with Chinese, as zarf suggested.)

The idea is to extend the ISO type to represent UTF-8 bytes, and to do the UTF-8 -> Unicode expansion in the text_* routines.
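For readers unfamiliar with the expansion step, this is roughly what decoding one UTF-8 sequence into a Unicode code point looks like. It is an illustrative sketch only, not the code in Cu.patch: utf8_decode is a hypothetical name, and the sketch does not reject overlong encodings or surrogate code points.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence starting at s, with n
   bytes available. Stores the code point in *out and returns the number
   of bytes consumed, or 0 if the sequence is malformed or truncated.
   Overlong forms and surrogates are not checked in this sketch. */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *out)
{
    size_t len;
    uint32_t cp;

    if (n == 0)
        return 0;

    if (s[0] < 0x80)                { len = 1; cp = s[0]; }         /* 0xxxxxxx */
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; }  /* 110xxxxx */
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; }  /* 1110xxxx */
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; }  /* 11110xxx */
    else
        return 0;                   /* stray continuation or invalid lead byte */

    if (len > n)
        return 0;                   /* sequence cut off */

    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)  /* continuation bytes are 10xxxxxx */
            return 0;
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    *out = cp;
    return len;
}
```

For example, the three bytes E3 81 8A decode to U+304A (お), the same character that would otherwise be written as @{304A} in the source.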


I’ve finally pulled this into my work tree and tested it. Looks good, except for a couple of bugs in the 2-byte and 3-byte UTF-8 cases, which were easily fixed.

See this post: https://intfiction.org/t/i6-patch-for-utf-8-source-files/3316/1

Thanks again!

I’ve pulled this into the I6-in-I7 tree here: https://github.com/DavidKinder/Inform6 (this will eventually become Inform 6.33).

By the way, do you have a name that you’d like to appear in the credits?

@DavidK: Thanks! You can put “Xun Gong”.

How did this project turn out? Is it finished?