Support for CJK Input?

halkun · February 11, 2013, 5:24am

I was playing with Inform and noticed that it accepts non-Latin characters (I assume via Unicode). However, I was wondering if there is a way to alter the parser to accept a verb/object swap under certain circumstances.

I would like to be able to input Japanese (using the OS’s CJK input system), but here’s the kicker: only sometimes.

The problem is that with Japanese, the verb goes at the end of the sentence, and the object in the middle. Here’s an example…

Open the mailbox
becomes
郵便箱を開ける

Breaking down the Japanese, it looks like this:
郵便箱 (Yūbinbako) [Mailbox] {Noun}
を (o) [marks the preceding word as the direct object] {particle}
開ける (Akeru) [open] {verb}

Going further:

Read the leaflet
手紙を読む
手紙 (Tegami) [Letter] {Noun}
を (o) [marks the preceding word as the direct object] {particle}
読む (Yomu) [Read] {verb}

My plan is to have an NPC who can’t speak English, so you must speak to her in Japanese. The parser would pick up the character’s name (which will always come first) and then “swap” the object/verb order to parse the rest of the input (or, if possible, override the parser entirely). An example:

Kaori, pick up the paper.
香さん、新聞を取って

The first part 「香さん」 is her name, and can hopefully trigger the parser to switch modes.
From a feasibility standpoint, I’m just thinking out loud. Is this possible?

zarf · February 11, 2013, 6:15am

It’s possible, because (worst case) the parser is completely replaceable.

Just swapping the last word around to the beginning is pretty easy. If you want to do a more fluent parser, which relies on features of the Japanese language (like the “o” thing you mentioned), it will probably be a lot of work. People have done this (more or less) for various European languages – German, Russian, Spanish – but not Japanese that I know of.
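To make the "swap the last word to the beginning" idea concrete, here is a minimal sketch in Python rather than Inform, purely as an illustration. The function names and the separator list are hypothetical, and it assumes every command has the simple shape "name、object を verb" from the examples in this thread. Japanese is written without spaces, so the sketch splits on the を particle instead of on whitespace; a real parser would need far more than this.

```python
OBJECT_PARTICLE = "を"          # marks the preceding noun as the direct object
NAME_SEPARATORS = ("、", ",")   # assumed: a comma follows the addressee's name


def swap_verb_object(command):
    """Turn '<object>を<verb>' into [verb, object] (verb-first order)."""
    obj, particle, verb = command.strip().rpartition(OBJECT_PARTICLE)
    if not particle:
        # No object particle found: treat the whole input as a bare verb.
        return [command.strip()]
    return [verb.strip(), obj.strip()]


def split_addressee(line):
    """Split 'name、rest' into (name, rest); name is None if no separator."""
    for sep in NAME_SEPARATORS:
        name, found, rest = line.partition(sep)
        if found:
            return name.strip(), rest.strip()
    return None, line.strip()


name, rest = split_addressee("香さん、新聞を取って")
print(name, swap_verb_object(rest))
```

The leading name is what would switch the parser into "Japanese mode"; input without a name separator falls through unchanged, which matches the "only sometimes" requirement.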

Another significant problem: at the moment, the parser’s internal command buffer is an array of bytes. So it can’t actually hold Japanese Unicode characters. Rewriting this as an array of 32-bit integers is additional work. (Definitely work which should be done, but again, nobody’s done it as far as I know.)
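The byte-buffer limitation can be illustrated outside Inform. The following Python sketch is not the parser's actual code; the helper names are made up for the example. It only shows why one byte per character is too small for Japanese text and what a wider buffer would hold instead.

```python
from array import array


def fits_in_byte_buffer(text):
    """True if every character's code point fits in one unsigned byte."""
    return all(ord(ch) < 256 for ch in text)


def to_wide_buffer(text):
    """Store one code point per slot; 'L' slots are at least 32 bits wide."""
    return array("L", (ord(ch) for ch in text))


command = "郵便箱を開ける"
print(fits_in_byte_buffer("open mailbox"))  # plain ASCII fits
print(fits_in_byte_buffer(command))         # Japanese code points do not
print([hex(cp) for cp in to_wide_buffer(command)])

try:
    array("B", (ord(ch) for ch in command))  # one unsigned byte per slot
except OverflowError as err:
    print("byte buffer rejected the input:", err)
```

For instance, を is U+3092, well above the 255 limit of a single byte, so a byte array has to either reject or mangle it.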

zarf · February 12, 2013, 5:30pm

Thinking about it, I definitely plan to tackle the latter problem – writing an I7 extension that changes line input (and command processing) to be 32-bit-clean.

However, not until the next I7 build ships. It will be all under-the-hood work, and a major I7 release will probably break it, so I might as well wait.

(It will not be a panacea, because I7 “Understand ‘…’ as…” lines do not currently accept Unicode. So even with this patch, you’ll have to go down to I6 to define nouns and verbs. But one step at a time.)