An experiment with parsing Japanese

It’s really impressive that Dialog can handle the transition from an SVO to an SOV language so easily. I was also surprised that you included so many variations on the words, even half-width kana and romaji written with Japanese input. Keep up the good work. :slight_smile:

1 Like

Thank you! It would be nice if there were a way to convert the katakana and romaji into hiragana so that (dict $) predicates could be shorter; that might be possible with rewrite rules, if words could be broken down into characters.

Tried implementing directions, but I encountered something unfortunate:

./dialogc -o JP.aastory -t aa test.dg japanese.dg stdlib.dg
Error: Too many distinct unicode characters in the text.

I guess this is a limitation of Dialog. I can of course remove katakana and full-width romaji characters to free up some unicode space. Worst case, I’ll drop kanji support and do everything in hiragana, like an old NES-era RPG. But I’d like to avoid that if I can, so I’m going to dig through Dialog’s source to see if the number can be increased, or at least learn the limit so I can plan which characters to use wisely.

EDIT: Looking at backend_aa.c, it seems a total of 127 unique unicode characters can be used. This is hard-coded, and I’m betting it wouldn’t be easy to change. I’m going to do some thinking about how best to use this limited palette.

EDIT 2: Hiragana alone takes 87 unicode slots for all variations, so katakana is out. That leaves 40 characters for punctuation and kanji. For punctuation, I think 「。、」 are the only ones really necessary, which leaves 36 for kanji. There are more verbs in the standard library than that (even allowing that some would share a kanji, like LOOK and SEARCH), so I can’t cover all of them. Relations and directions wouldn’t take too many kanji, but it would be a bit odd to have only those in kanji. So I think I am going to limit myself to hiragana only for now.

1 Like

Aww. That’s too bad. :confused: I’m guessing the limit is there to adhere to Inform standards? I doubt JavaScript would require that limit.

I wonder though… If you’re coming up against the wall in just the standard library, does that mean the entire game itself will have to be written in just hiragana? That’s kind of rough.

The 127 ASCII characters are exempt from the unicode limit, so thankfully I can still write the code itself using the standard library as a basis.

Ah. That’s good to hear. However, what I actually meant was the game (like room descriptions and the story and such), not the code itself. Will all room descriptions be in hiragana?

A user of the library can still put kanji or katakana into (look *) and the like; they’ll just have a limit on how many unique kanji they can use (depending on how many unique hiragana I use in the library), if that’s what you were asking. ASCII is allowed as well.
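For example, a game using the library could still define something like this (a made-up room, purely for illustration), with each distinct kanji drawing from the remaining unicode budget:

#台所
(name *)	台所
(room *)
(look *)	小さな台所です。北に扉があります。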

Kind of related: I am thinking now that it might be relatively easy to allow switching between hiragana and romaji output in the library, via some sort of (romaji output) flag tested with (if) … (then), plus a romaji on/off command. The main requirement would be that I have both a (romaji *) and a (hiragana *) name for everything, and define something like:

(name $thing)
    (if) (romaji output) (then) (romaji $thing) (else) (hiragana $thing) (endif)
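The on/off command itself could then be a small pair of actions. A sketch, assuming (romaji output) is a global flag toggled with (now); the action names and messages are placeholders:

%% Hypothetical toggle commands for the (romaji output) global flag.
(understand [romaji on] as [romaji on])
(perform [romaji on])
	(now) (romaji output)
	Romaji output is now on.

(understand [romaji off] as [romaji off])
(perform [romaji off])
	(now) ~(romaji output)
	Romaji output is now off.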

I might just do this.

1 Like

Looking through the specification of the Å-machine, there are several references to the word size being “currently always 2”, which suggests to me that @lft has thought about allowing greater word sizes, which could permit a greater number of unicode characters per story. So maybe eventually a full kana + kanji Dialog library isn’t completely impossible after all. : )

Hi!

Yes, this is a limitation of the current version of the Å-machine. The long-term plan is to support both 16-bit and 32-bit stories in the same file format, and have the compiler pick the larger memory model if necessary.

In the 32-bit format, there will be no limits on the number of unicode characters. In addition, the maximum number of objects, dictionary words, and heap storage cells will increase from about 8000 to something ridiculously large.

Interpreters on resource-limited systems can refuse to run 32-bit stories upfront (as they can already do e.g. if the file is too big).

There is no 128-character limit in the interactive debugger, so you should be able to develop the library and run stories in that environment, until a 32-bit Å-machine is available.

4 Likes

Cool! Good to know all of this!

So I don’t know much about the internals of Dialog—but on the Z-machine version 5 and later, it’s possible to ask the interpreter to read in a line of input, put it in a text buffer, let Z-code mess around with that buffer, and then let the interpreter continue its tokenizing and such.

This wouldn’t be very elegant, but wouldn’t it be possible for Dialog to have an entry point (call it preprocess $in into $out or whatever) that takes a list of integers and returns a list of integers, representing the codepoints of the characters? If this entry point were defined, it would be called in between the input-reading and tokenizing steps, and could (say) convert all hiragana and katakana losslessly to roumaji through some tedious-but-boring arithmetic. All dicts could then contain only roumaji, while the player could type in whatever orthography they liked (fullwidth roumaji, katakana, etc…) and have the game understand it.
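To give a flavor of the arithmetic: folding katakana into hiragana, at least, is a fixed codepoint offset of 96 (the full kana-to-roumaji conversion would just be more of the same, case by case). A rough sketch of that one step, with the hypothetical entry point written as a real predicate:

%% Hypothetical hook over codepoint lists. Katakana ァ..ヶ
%% (codepoints 12449..12534) sits exactly 96 above hiragana ぁ..ゖ,
%% so shifting those values down folds katakana into hiragana;
%% everything else passes through unchanged.
(preprocess [] into [])
(preprocess [$C | $In] into [$D | $Out])
	(if) ($C > 12448) ($C < 12535) (then)
		($C minus 96 into $D)
	(else)
		($D = $C)
	(endif)
	(preprocess $In into $Out)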

This wouldn’t get around the 128-character limit, and I have no idea what the Å-machine does when it reads input, but it seems like a not-too-impossible place to start?

That seems like a pretty cool trick, but ideally you’d want kanji along with kana for both input and output. I don’t think your method would be able to handle kanji if the dict was in romaji.

A lot of Japanese words are pronounced the same, and the kanji help clarify which meaning you are after. For example, 着る, 切る, and 斬る mean wear, cut, and kill, but they’re all きる (“kiru”) in hiragana. You can also make the distinction through context, which is how speech works, but it’s just a lot more direct with kanji. Children’s games like Pokemon write all text in hiragana with spaces (since young players don’t know many kanji), but adults actually find this more difficult to read.

Sorry if you knew all that. Just trying to clarify why kanji is important along with kana.

1 Like

Oh, no, it’s a very good point! I’m just not sure how that could best be handled if we want players to be able to type in different orthographies.

The easiest way to handle it is probably to not do anything at all to kanji in preprocessing, and then have synonyms handled within the game code: the verb “kill” could be given the synonyms “kiru” and “斬ru”, for example. That would make the preprocessor completely general, so it wouldn’t need to know which particular kanji the game author was using.
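Concretely, since preprocessing would leave kanji alone while folding kana to roumaji, 斬る would arrive as the single word 斬ru, and both spellings could map to the same action. A sketch using the standard library’s (understand …) hook, with a made-up action name:

%% Both the all-roumaji spelling and the kanji+roumaji spelling
%% are understood as the same (hypothetical) action.
(understand [kiru] as [kill])
(understand [斬ru] as [kill])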

It’s always tempting to allow low-level platform details to seep up into a high-level language, especially when it seems to solve a practical problem right now, at a low cost.

But there’s a good reason for maintaining a strict separation here: Dialog is not tied to a particular character encoding. This allows the same stories to run on the Z-machine, which uses the ZSCII character set (a peculiar 8-and-10-bit hybrid with 97 user-definable glyphs) and the current Å-machine, which uses an 8-bit encoding (ASCII + 128 user-definable glyphs). The same Dialog code can also run in the interactive debugger, which uses UTF-8 internally, and it could in principle be compiled to run on a UCS-2-based system (like Javascript, or perhaps a future version of the Å-machine in 32-bit mode).

If we expose codepoints to the high-level program, then they would be different depending on the platform. For instance, å is usually represented by character code 201 on the Z-machine, but it’s 229 in Unicode/UCS-2, and it can be any value in the range 128-255 on the current Å-machine. Therefore we would end up with either 1. stories and libraries that are tied to a particular platform, or 2. a compatibility layer (or “shim”) in every platform, translating codepoints back and forth between Unicode and the native encoding. In my opinion, both of these options are problematic.

When targeting the current version of the Å-machine, the compiler maps every character appearing in the source code to a unique byte value. The resulting 8-bit encoding scheme performs well on vintage hardware, but as a consequence, different storyfiles can have different character encodings. In fairness, Unicode does play a part in this: There is a table in the storyfile, mapping each non-ASCII character to a Unicode glyph. But interpreters can consult this table at startup, piece together a font based on it, and then throw the table away. And they can refuse upfront to run a story if it contains an unsupported glyph. There is never a need to print an arbitrary unicode character that wasn’t in the table, so there is never a need to keep a huge full-unicode font around at runtime. That would be necessary if codepoints were exposed to the high-level program.

That is why I think the cleanest option is to allow the Å-machine to run in one of two modes: a small memory model (16-bit words and 8-bit characters) and a large memory model (32-bit words and 16-bit characters). They should be fully compatible at the Dialog source-code level. The story author shouldn’t have to worry about what memory model to use. The compiler should just automatically select the most appropriate memory model based on the number of characters, words, and objects in the story. When the story grows too large, it won’t run on old hardware anymore—but the semantics of the language will remain exactly the same, regardless of how a particular character is represented at the bit level.

3 Likes

If I’m reading the Z-machine spec right, the “Unicode translation table” can be modified dynamically. I have no idea how well this is supported and whether it can help with the issue here.

I am doing some thinking about the pronoun system.

Japanese has three ways to say “it” that I am aware of: これ (“this thing”, something close), それ (“that thing”, something a bit farther away), and あれ (“that thing over there”, something far away). While I could just have all of these be the same “it”, I wonder if it would be possible to do something smart. Maybe これ could be things on your person, それ things on someone else’s person, and あれ anything in another room? What’s not clear to me is how to treat things that are in the room but not on anyone’s person: it’s nebulous who is closer to those objects, so we can’t make sense of it spatially. I’ll have to think some more.
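For what it’s worth, the clear-cut cases could look something like this. A sketch with a made-up (pronoun $ could mean $) helper; nothing like it exists in the standard library:

%% Hypothetical mapping from each pronoun to its candidate objects.
(pronoun @これ could mean $Obj)
	(current player $Player)
	{ *($Obj is #heldby $Player) (or) *($Obj is #wornby $Player) }
(pronoun @それ could mean $Obj)
	*(animate $Person)
	~(current player $Person)
	{ *($Obj is #heldby $Person) (or) *($Obj is #wornby $Person) }
(pronoun @あれ could mean $Obj)
	(current room $Here)
	*(room $Room)
	~($Room = $Here)
	*($Obj is #in $Room)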

I am also thinking about changing what object gets “noticed” by the pronoun system: because Japanese has an explicit topic marker は, maybe it would make sense to have that be what gets noticed for pronouns. I’d probably need to rephrase certain messages, like “物 を 持ちます” (you take the thing) into “物 は 持たれます” (the thing is taken). Again, something I’ll be thinking about.

I’m not sure that それ (sore) means “something a bit farther away”; I think it means “something close to the listener”, as opposed to これ (kore), which is “something close to the speaker” (i.e. “this”). I agree that あれ (are) corresponds to “that”.

So when the voice of the parser says “you’re already holding that”, perhaps それ is appropriate, and when it says “that’s fixed in place”, I would imagine that the choice of あれ or これ would affect whether we perceive the voice of the parser to be embodied in the room or not. But this is speculation on my part, and I’m certainly no expert. How is this usually handled in descriptive passages in books and voice-overs?

As you may know, the Dialog standard library tracks the narrator’s “it” separately from the player’s “it”, so in many situations “it” will actually be ambiguous. A Japanese library could perhaps do the same, and allow the player to refer to any of these objects using any of the three words これ / それ / あれ, but then look at the specific word choice when computing the likelihood of each interpretation. If これ was used, and the narrator’s “it” refers to something worn by the player character, but the player’s “it” refers to something in the room, then the object referred to by the narrator’s “it” was probably intended, and so on.

The topic marker is interesting! I haven’t thought about its implications for a parser game. One way to approach it could be to define ($ は) to print the name of the given object followed by は, but only if it is distinct from the current topic. The same predicate should set the current topic, as a side-effect.
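A minimal sketch of that idea, assuming (current topic $) has been declared as a global variable (the name is invented):

%% Print “NAME は” only when NAME isn’t already the topic, and
%% make it the current topic either way.
($Obj は)
	(if) ~(current topic $Obj) (then)
		(name $Obj) は
	(endif)
	(now) (current topic $Obj)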

3 Likes

I have found a way to parse Japanese without artificially requiring spaces between words. The trick is to break the sentence down into individual symbols, then transform that list of symbols into a list of recognized word-tokens. Currently this requires defining a (token $) predicate for every recognizable Japanese word-form, which feels very onerous. I tried telling Dialog that every dict word is a token (so only verbs and particles would need to be explicitly defined), but couldn’t figure out the right syntax for that. In any case, here is the proof of concept, which only lets you (try to) look north or look at an apple.

> りんごを見る
TODO: Look at object #りんご

> 北を見る
TODO: Look in direction #北

And the code:

%% (grab $N from $List into $Taken leaving $Rest): take the first
%% $N elements of a list.
(grab 0 from $Everything into [] leaving $Everything)
(grab $N from [$Head | $MoreIn] into [$Head | $MoreOut] leaving $Remainder)
	($N minus 1 into $Nm1)
	(grab $Nm1 from $MoreIn into $MoreOut leaving $Remainder)

%% (tokenize $Symbols into $Tokens): match a known token against the
%% front of the symbol list, then recurse on whatever remains.
(tokenize [] into [])
(tokenize $Symbols into [$CandidateToken | $RemainingTokens])
	*(token $CandidateToken)
	(split word $CandidateToken into $CandidateSymbols)
	(length of $CandidateSymbols into $N)
	(grab $N from $Symbols into $CandidateSymbols leaving $RemainingSymbols)
	(tokenize $RemainingSymbols into $RemainingTokens)

%% For now, every recognizable word-form must be listed explicitly.
(token @きた)
(token @北)
(token @みる)
(token @見る)
(token @りんご)
(token @を)

#北
(direction *)
(name *)	北
(dict *)	きた

(understand $Sentence as [$Target を 見る])
	%% Re-glue the input words and break them into single symbols.
	(join words $Sentence into $Glob)
	(split word $Glob into $Symbols)
	(tokenize $Symbols into $Tokens)
	%% The noun phrase is whatever precedes the particle を.
	(split $Tokens by [を] into $NounPhrase and $Verb)
	*($Candidate is one of [みる 見る])
	([$Candidate] = $Verb)
	{
		(understand $NounPhrase as direction $Target)
	(or)
		(understand $NounPhrase as any object $Target)
	}

(perform [$Target を 見る])
	(if) (direction $Target) (then)
		TODO: Look in direction $Target (line)
	(else)
		TODO: Look at object $Target (line)
	(endif)

(current player #player)
(room #room)
(#player is #in #room)

#りんご
(name *)	りんご
(item *)
(* is #in #room)

There is obviously a mountain of work remaining to make something useful out of this, but I am excited by the possibilities.

3 Likes

There’s (unknown word $) for testing whether a word exists in the dictionary or not. But it won’t be efficient.

If there were a corresponding (known word $) that could be multi-queried to backtrack over words in the dictionary, then that might have been handy. Alas, there isn’t, and it would still involve a lot of brute-force iteration.

The most useful addition would be a built-in for splitting a word into a known dictionary word and an arbitrary ending (backtracking over every possible match). Such a built-in could also be used to implement removable word endings at the library level. Hmm, I’ll think about this.

3 Likes

A quick update, and a question. I am working on compound noun phrases (starting with directions) in my space-free parsing; so far you can say “northとsouth” and you will move north, then south. (I’m not 100% sure と is the right particle to use as shorthand for “…行って…行く”, but then “north and south” as shorthand for “go north then go south” is a little unusual in English, too.)

Right now I’ve hardcoded @north and @south as recognizable tokens. I would like to be able to say “every dict word is a token”, but I’ve tried many things and can’t quite figure out the syntax. Does anyone know how I can do this? Thanks in advance. : )

(grab 0 from $Everything into [] leaving $Everything)
(grab $N from [$Head | $MoreIn] into [$Head | $MoreOut] leaving $Remainder)
        ($N minus 1 into $M)
        (grab $M from $MoreIn into $MoreOut leaving $Remainder)

(tokenize [] into [])
(tokenize $Symbols into [$CandidateToken | $RemainingTokens])
        *(token $CandidateToken)
        (split word $CandidateToken into $CandidateSymbols)
        (length of $CandidateSymbols into $N)
        (grab $N from $Symbols into $CandidateSymbols leaving $RemainingSymbols)
        (tokenize $RemainingSymbols into $RemainingTokens)


(token @north)
(token @south)
(token $Token)
        *(compounder-token $Token)

(compounder-token @と)
(compounder-token @、)
(compounder-token @,)


(parse direction list $Words [$Head | $Tail])
        (join words $Words into $Glob)
        (split word $Glob into $Symbols)
        *(tokenize $Symbols into $Tokens)
        *(compounder-token $Compounder)
        (split $Tokens by $Compounder into $Left and $Right)
        *(parse direction $Left $Head)
        *(parse direction list $Right $Tail)

#southroom
(room *)
(from * go #north to #northroom)

#northroom
(room *)
(from * go #south to #southroom)

#player
(current player *)
(* is #in #southroom)

Unfortunately there is no way to do this at the moment. I have been meaning to add a built-in for backtracking over every way of splitting a word into two parts, where the first part is always a recognized dictionary word, starting with the longest. So if you have north and northwest in the dictionary, and you get northwestern as part of the player’s input, then a multi-query to this predicate would first return northwest and ern, and then north and western. Such a mechanism could be used for your Japanese parser (with a multi-query), and also to implement removable word endings at the library level (with a normal query, always going with the longest match).
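Purely hypothetically, with an invented name for the built-in, the behavior would be:

%% *(split word @northwestern into $W plus $E)
%%   first solution:  $W = northwest, $E = ern
%%   second solution: $W = north,     $E = western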

2 Likes