An experiment with parsing Japanese

Inspired by Mikawa’s work on a German translation of Dialog’s standard library, I’ve tried parsing Japanese in Dialog (so far only one verb). Here is the payoff (please excuse anything that is ungrammatical; I am not fully proficient in the language):

> ポット から 花 を 持つ
花 を 持ちました。

> 花 を 持つ
もう 花 を 持っています が。

Translation:

> take flower from pot
You took the flower.

> take flower
But you are already holding the flower.

Note that there are spaces between words, which is not typical of Japanese writing; this is necessary because Dialog can’t (yet?) split text into characters.

Here is the code:

(understand $文 as [$親 から $物 を 持つ])
	(reverse $文 $逆文)
	(split $逆文 by [を] into $動詞 and $逆名詞句)
	{
		($動詞 = [持つ])
	(or)
		($動詞 = [持って])
	(or)
		($動詞 = [持て])
	(or)
		($動詞 = [持ちます])
	}
	(reverse $逆名詞句 $名詞句)
	(split $名詞句 by [から] into $親言葉 and $物言葉)
	*(understand $親言葉 as single object $親)
	*(understand $物言葉 as object $物 preferably child of $親)

(understand $文 as [$親 から $物 を 持つ])
	(reverse $文 $逆文)
	(split $逆文 by [から] into $動詞 and $逆名詞句)
	{
		($動詞 = [持つ])
	(or)
		($動詞 = [持って])
	(or)
		($動詞 = [持て])
	(or)
		($動詞 = [持ちます])
	}
	(reverse $逆名詞句 $名詞句)
	(split $名詞句 by [を] into $物言葉 and $親言葉)
	*(understand $親言葉 as single object $親)
	*(understand $物言葉 as object $物 preferably child of $親)

(understand $文 as [$物 を 持つ])
	(reverse $文 $逆文)
	(split $逆文 by [を] into $動詞 and $逆名詞句)
	{
		($動詞 = [持つ])
	(or)
		($動詞 = [持って])
	(or)
		($動詞 = [持て])
	(or)
		($動詞 = [持ちます])
	}
	(reverse $逆名詞句 $名詞句)
	*(understand $名詞句 as object $物 preferably takable)

(unlikely [$物 を 持つ])
	~(item $物) (or) ($物 has relation $関係) ($関係 is one of [#partof #heldby #wornby])

(unlikely [$親 から $物 を 持つ])
	~($物 has ancestor $親)

(prevent [$親 から $物 を 持つ])
	~($物 has ancestor $親)
	(if) (animate $親) (then)
		(name $親) は (name $物) を 持ちません。
	(elseif) (container $親) (then)
		(name $物) は (name $親) の 中 に
		(if) (animate $物) (then) いません。 (else) ありません。 (endif)
	(elseif) (supporter $親) (then)
		(name $物) は (name $親) の 上 に
		(if) (animate $物) (then) いません。 (else) ありません。 (endif)
	(else)
		(name $物) は そこ に
		(if) (animate $物) (then) いません。 (else) ありません。 (endif)
	(endif)

(perform [$親 から $物 を 持つ])
	(current player $プレイヤー)
	($物 is recursively worn by $プレイヤー)
	{
		($親 = $プレイヤー)
	(or)
		($親 has ancestor $プレイヤー)
	}
	(try [remove $物])

(perform [$ から $物 を 持つ])
	(try [$物 を 持つ])

(prevent [$物 を 持つ])
	($物 を 持っている 時)
	(or) ($物 の 場所 が いい 時)
	(or) ($物 は 何か の 一部 だ 時)
	(or) ($物 は 誰か で 持つ 時)
	(or) ($物 は 誰か で 着る 時)
	(or) ($物 は 持てない 時)

($物 を 持つ 話)
	(name $物) を 持ちました。

(perform [$物 を 持つ])
	($物 を 持つ 話)
	(current player $プレイヤー)
	(now) ($物 is #heldby $プレイヤー)
	(now) ($物 is handled)

($物 を 持っている 時)
	(current player $プレイヤー)
	{
		($物 is #heldby $プレイヤー)
		もう (name $物) を 持っています が。
	(or)
		($物 is recursively worn by $プレイヤー)
		もう (name $物) を 着ています が。
	}

($物 の 場所 が いい 時)
	(fine where it is $物)
	(name $物) の 場所 が いい です が。

($物 は 何か の 一部 だ 時)
	($物 is #partof $親)
	(name $物) は (name $親) の 一部 です が。

($物 は 誰か で 持つ 時)
	($物 is #heldby $親)
	(name $物) は (name $親) で 持っています が。

($物 は 誰か で 着る 時)
	($物 is #wornby $親)
	(name $物) は (name $親) で 着ています が。

($物 は 持てない 時)
	~(item $物)
	(name $物) は 持てない です が。

Right now I haven’t figured out how to handle multi-objects delimited by と; I need to look more into how that is done with “and” in the standard library. Still, I think it’s very cool that Dialog can handle a language as different in syntax from English as Japanese with relatively minor changes.

Hey, congrats!

You might search for:

(parse direction list $Words [$Head | $Tail]) *(split $Words by [, und] into $Left and $Right)

and

(parse noun list $Words as $ObjList $Policy $AllAllowed) (split $Words by [, und] into $Left and $Right)

These two occurrences deal with multiple objects; in German, the delimiters are the comma and “und”.

This is really amazing. :)

I also don’t think it would be too annoying for Japanese speakers to use, because I’ve found that if you get into a conversation with a Japanese speaker who is aware that you’re not very good at Japanese, they’ll sometimes be considerate and put spaces between their words to help you understand. But of course, no spaces would be ideal.

Will you also allow for the kana version of words, or will you require them to use the correct kanji every time?

(Forgive me for the double post. I’m new to posting on this site and wasn’t sure if I was replying to the correct person.)

Mikawa: Cool, I’ll take a look at that and see if I can get it working!

tayruh: I originally allowed kana, but for some reason a slash expression such as ($動詞 = [待つ/まつ/マツ/ﾏﾂ]) didn’t work, so I switched to the (or) expression and left most of the kana options out of this example to keep it short. If I ever release something I’ll definitely allow kana input, maybe even romaji.

Slash expressions only work in rule heads, at least for now.

Went down a rabbit hole trying to translate all the functions of the standard library into Japanese; today I came to the conclusion that what is important is the text that appears in game, and the programming itself can stay in English. Right now a lot of actions that need to be implemented require relations, so my next goal is to translate them.

%% Word choice predicates

(います・あります $x)
	(if) (animate $x) (then) います (else) あります (endif)
(いません・ありません $x)
	(if) (animate $x) (then) いません (else) ありません (endif)

%% 見る
%% Based on LOOK

(understand $sentence as [見る])
	*($candidate is one of [
		見る みる ミル ﾐﾙ miru ｍｉｒｕ
		見ろ みろ ミロ ﾐﾛ miro ｍｉｒｏ
		見て みて ミテ ﾐﾃ mite ｍｉｔｅ
		見ます みます ミマス ﾐﾏｽ mimasu ｍｉｍａｓｕ])
	([$candidate] = $sentence)

(perform [見る])
	(current player $player)
	(current visibility ceiling $ceiling)
	(div @roomheader) (location headline)
	(if) (player can see) (then)
		(look $ceiling)
		($player is $relation $location)
		(make appearances $relation $location)
		(par)
	(else)
		(narrate darkness)
	(endif)

(narrate darkness)
	今、 闇 に います。
	(current visibility ceiling $ceiling)
	(notice $ceiling)

%% $Y から $X を 持つ
%% Based on TAKE $X FROM $Y

(understand $sentence as [$parent から $thing を 持つ])
	(split $sentence by [を ヲ ｦ wo ｗｏ]
		into $noun-phrase and $verb)
	*($candidate is one of [
		持つ もつ モツ ﾓﾂ motsu ｍｏｔｓｕ
		持て もて モテ ﾓﾃ mote ｍｏｔｅ
		持って もって モッテ ﾓｯﾃ motte ｍｏｔｔｅ
		持ちます もちます モチマス ﾓﾁﾏｽ mochimasu ｍｏｃｈｉｍａｓｕ
		motu ｍｏｔｕ motimasu ｍｏｔｉｍａｓｕ])
	([$candidate] = $verb)
	(split $noun-phrase by [から カラ ｶﾗ kara ｋａｒａ]
		into $parent-words and $thing-words)
	*(understand $parent-words as single object $parent)
	*(understand $thing-words as object $thing preferably child of $parent)

(understand $sentence as [$parent から $thing を 持つ])
	(split $sentence by [から カラ ｶﾗ kara ｋａｒａ]
		into $noun-phrase and $verb)
	*($candidate is one of [
		持つ もつ モツ ﾓﾂ motsu ｍｏｔｓｕ
		持て もて モテ ﾓﾃ mote ｍｏｔｅ
		持って もって モッテ ﾓｯﾃ motte ｍｏｔｔｅ
		持ちます もちます モチマス ﾓﾁﾏｽ mochimasu ｍｏｃｈｉｍａｓｕ
		motu ｍｏｔｕ motimasu ｍｏｔｉｍａｓｕ])
	([$candidate] = $verb)
	(split $noun-phrase by [を ヲ ｦ wo ｗｏ]
		into $thing-words and $parent-words)
	*(understand $parent-words as single object $parent)
	*(understand $thing-words as object $thing preferably child of $parent)

(unlikely [$parent から $thing を 持つ])
	~($thing has ancestor $parent)

(prevent [$parent から $thing を 持つ])
	~($thing has ancestor $parent)
	(if) (animate $parent) (then)
		(name $parent) は (name $thing) を 持たない。
	(elseif) (container $parent) (then)
		(name $thing) は (name $parent) の 中 に
			(いません・ありません $thing)。
	(elseif) (supporter $parent) (then)
		(name $thing) は (name $parent) の 上 に
			(いません・ありません $thing)。
	(else)
		(name $thing) は そこ に (いません・ありません $thing)。
	(endif)

(perform [$parent から $thing を 持つ])
	(current player $player)
	($thing is recursively worn by $player)
	{
		($parent = $player)
	(or)
		($parent has ancestor $player)
	}
	(try [remove $thing])

(perform [$ から $thing を 持つ])
	(try [$thing を 持つ])

%% $X を 持つ
%% Based on TAKE $X

(understand $sentence as [$thing を 持つ])
	(split $sentence by [を ヲ ｦ wo ｗｏ]
		into $noun-phrase and $verb)
	*($candidate is one of [
		持つ もつ モツ ﾓﾂ motsu ｍｏｔｓｕ
		持て もて モテ ﾓﾃ mote ｍｏｔｅ
		持って もって モッテ ﾓｯﾃ motte ｍｏｔｔｅ
		持ちます もちます モチマス ﾓﾁﾏｽ mochimasu ｍｏｃｈｉｍａｓｕ
		motu ｍｏｔｕ motimasu ｍｏｔｉｍａｓｕ])
	([$candidate] = $verb)
	*(understand $noun-phrase as object $thing preferably takable)

(unlikely [$thing を 持つ])
	{
		~(item $thing)
	(or)
		($thing has relation $relation)
		($relation is one of [#partof #heldby #wornby])
	}

(prevent [$thing を 持つ])
	(when $thing is already held)
	(or) (when $thing is fine where it is)
	(or) (when $thing is part of something)
	(or) (when $thing is held by someone)
	(or) (when $thing is worn by someone)
	(or) (when $thing can't be taken)

(narrate $thing を 持つ)
	(current player $player)
	(if) ($thing has parent $parent) ~($player has ancestor $parent) (then)
		(name $parent) から
	(endif)
	(name $thing) を 持ちました。

(perform [$thing を 持つ])
	(narrate $thing を 持つ)
	(current player $player)
	(now) ($thing is #heldby $player)
	(now) ($thing is handled)

%%

(when $thing is out of sight)
	~(player can see $thing)
	(if) (player can see) (then)
		その 物 を 見えません けど。
	(else)
		闇 に ぜんぜん 見えません けど。
	(endif)

(when $thing is already held)
	(current player $player)
	{
		($thing is #heldby $player)
		まだ (name $thing) を もっています けど。
	(or)
		($thing is recursively worn by $player)
		まだ (name $thing) を 着ています けど。
	}

(when $thing isn't directly held)
	(current player $player)
	~($thing is #heldby $player)
	(name $thing) を 持てません けど。

(when $thing is not here)
	(not here $thing)
	(name $thing) は (いません・ありません $thing) けど。

(when $thing is out of reach)
	~($thing is reachable by player)
	(if) (player can see $thing) (then)
		(name $thing) を 握られません けど。
	(else)
		(name $thing) は (いません・ありません $thing) けど。
	(endif)

(when (intangible $thing) is out of reach)
	(name $thing) は 無体 です けど。

(when $thing is part of something)
	($thing is #partof $parent)
	その (name $thing) は (name $parent) の 一部 です けど。

(when $thing is held by someone)
	($thing is #heldby $parent)
	その (name $thing) は (name $parent) の です けど。

(when $thing is worn by someone)
	($thing is #wornby $parent)
	(name $parent) は その (name $thing) を 着ています けど。

(when $thing is fine where it is)
	(fine where it is $thing)
	(name $thing) の 場所 は いい です けど。

(when ~(item $thing) can't be taken)
	(name $thing) を 持てません けど。

~(when (supporter $) won't accept #on)

~(when (container $) won't accept #in)

(when $thing won't accept $relation)
	(if) ($relation is one of [#under #behind]) (then)
		TODO
	(else)
		TODO
	(endif)

~(when (actor supporter $) won't accept actor #on)

~(when (actor container $) won't accept actor #in)

(when $thing won't accept actor $relation)
	(if) ($relation is one of [#under #behind]) (then)
		TODO
	(else)
		TODO
	(endif)

(when $thing is already $relation $destination)
	($thing is $relation $destination)
	TODO

(when $thing is $relation $destination)
	($thing is $relation $destination)
	TODO

(when $thing is closed)
	($thing is closed)
	(name $thing) は 閉まっています けど。

(when $thing blocks passage)
	($thing blocks passage)
	(if) ($thing is closed) (then)
		(name $thing) は 閉まっています けど。
	(else)
		(name $thing) は 許可していません けど。
	(endif)

Relations are now somewhat-in. I’m particularly unsure of how to translate #wornby, #heldby and #partof, but at least the other relations should make sense.

> ポット の 中 を 見る
ポット の 中 には 花 と 虫 です。

Translation:

> Look in the pot.
A flower and an insect are in the pot.

Here’s the new code:

%% Lists

(と-listing $list)
	(listing $list {(name $)} @と 0)

%% $X の $R を 見る
%% Based on LOOK $R $X

(understand $sentence as [$thing の $relation を 見る])
	(split $sentence by [を ヲ ｦ wo ｗｏ]
		into $noun-phrase and $verb)
	*($candidate is one of [
		見る みる ミル ﾐﾙ miru ｍｉｒｕ
		見ろ みろ ミロ ﾐﾛ miro ｍｉｒｏ
		見て みて ミテ ﾐﾃ mite ｍｉｔｅ
		見ます みます ミマス ﾐﾏｽ mimasu ｍｉｍａｓｕ])
	([$candidate] = $verb)
	(split $noun-phrase by [の ノ ﾉ no ｎｏ]
		into $thing-words and $relation-words)
	*(understand $thing-words as single object $thing)
	*(split $relation-words by relation $relation into [] and [])

(unlikely [$thing の #in を 見る])
	~(container $thing)

(unlikely [$thing の #on を 見る])
	~(supporter $thing)

(refuse [$thing の $ を 見る])
	(just)
	{
		(when $thing is not here)
		(or) (when $thing is out of sight)
	}

(before [$thing の #in を 見る])
	($thing is opaque)
	($thing is closed)
	(first try [open $thing])

(instead of [(current room $) の #in を 見る])
	(try [見る])

(instead of [$thing の #in を 見る])
	{ (room $thing) (or) (door $thing) }
	(current room $here)
	{
		(from $here go $direction to $thing)
	(or)
		(from $here through $door to $thing)
		(from $here go $direction to $door)
	}
	(direction $direction)
	(try [look $direction])

(prevent [$thing の #in を 見る])
	~(container $thing)
	(name $thing) は 入られません けど。

(prevent [$thing の #in を 見る])
	($thing is opaque)
	($thing is closed)
	~(current visibility ceiling $thing)
	(if) (openable $thing) (then)
		(name $thing) を 閉まっています けど。
	(else)
		(name $thing) の 中 に 見えません けど。
	(endif)

(prevent [(room $thing) の #behind を 見る])
	(name $thing) の 後 を 見えません けど。

(perform [$thing の $relation を 見る])
	(collect $C)
		*($C is $relation $thing)
		(now) ($C is revealed)
	(into $children)
	(if) (empty $children) (then)
		(name $thing) の (name $relation) に は 何でも ありません。
	(else)
		(name $thing) の (name $relation) には
		(と-listing $children) です。
		(notice $children)
	(endif)

%% RELATIONS

(name #in)	中
(dict #in)	(just) 中 なか ナカ ﾅｶ naka ｎａｋａ 内 ない ナイ ﾅｲ nai ｎａｉ
(name #on)	上
(dict #on)	(just) 上 うえ ウエ ｳｴ ue ｕｅ
(name #under)	下
(dict #under)	(just) 下 した シタ ｼﾀ shita ｓｈｉｔａ sita ｓｉｔａ
(name #behind)	後
(dict #behind)	(just) 後 後ろ
		うしろ ウシロ ｳｼﾛ ushiro ｕｓｈｉｒｏ usiro ｕｓｉｒｏ
		あと アト ｱﾄ ato ａｔｏ
(name #partof)	一部
(name #heldby)	持つ
(name #wornby)	着る

It’s really impressive that Dialog can handle the transition from an SVO to an SOV language so easily. I was also surprised that you included so many variations on the words, even half-width kana and romaji written with Japanese input. Keep up the good work. :)

Thank you! It would be nice if there was a way to convert the katakana and romaji into hiragana so (dict $) predicates could be shorter; that might be possible with rewrite rules if words could be broken down into characters, maybe.
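
As a rough illustration of the katakana half of that idea, here is a Python sketch (outside Dialog, which cannot yet split words into characters). It relies on the fact that the Unicode katakana block sits exactly 0x60 code points above the hiragana block. The function name `to_hiragana` is made up for this example, and romaji and half-width kana are not handled.

```python
# Sketch: fold full-width katakana into hiragana by shifting code points.
KATAKANA_START = 0x30A1  # ァ
KATAKANA_END = 0x30F6    # ヶ
OFFSET = 0x60            # distance between the katakana and hiragana blocks


def to_hiragana(word: str) -> str:
    """Return word with katakana ァ..ヶ replaced by the matching hiragana."""
    out = []
    for ch in word:
        cp = ord(ch)
        if KATAKANA_START <= cp <= KATAKANA_END:
            out.append(chr(cp - OFFSET))
        else:
            out.append(ch)  # kanji, hiragana, ASCII and the rest pass through
    return "".join(out)


print(to_hiragana("モッテ"))  # もって
```

With a step like this, a (dict $) rule would only need the kanji and hiragana spellings of each word.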

Tried implementing directions, but I encountered something unfortunate:

./dialogc -o JP.aastory -t aa test.dg japanese.dg stdlib.dg
Error: Too many distinct unicode characters in the text.

I guess this is a limitation of Dialog. I can of course remove katakana and full-width romaji characters to free up some unicode space. Worst case, I’ll drop kanji support and do everything in hiragana, like an old NES era RPG. But I’d like to avoid that if I can, so I’m going to dig through Dialog’s source to see if the number can be increased, or at least to know the limit so I can plan which characters to use wisely.

EDIT: It looks like backend_aa.c allows a total of 127 unique unicode characters. This is hard-coded, and I’m betting it wouldn’t be easy to change. I’m going to do some thinking about how best to use this limited palette.

EDIT 2: Hiragana alone takes 87 unicode slots for all variations, so katakana is out. That leaves 40 characters for punctuation and kanji. For punctuation I think 「。、」 are the only ones really necessary, so 36 for kanji. There are more verbs in the standard library than that (even allowing that some would use the same kanji, like LOOK and SEARCH), so I can’t handle all of them. Relations and directions wouldn’t take too many kanji, but it’s a bit weird having only those in kanji. So, I think I am going to limit myself to hiragana only for now.
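
The budget above can be checked with a few lines of Python. This is only a sketch: the figures 127 and 87 are the ones quoted in this post, not values verified against the compiler, and the helper names are invented for illustration. The second helper counts how many distinct non-ASCII characters a piece of story text would consume.

```python
SLOT_LIMIT = 127       # limit as reported in this post (assumption)
HIRAGANA_SLOTS = 87    # hiragana count as reported in this post (assumption)
PUNCTUATION = "「。、」"


def kanji_budget(limit=SLOT_LIMIT, hiragana=HIRAGANA_SLOTS, punctuation=PUNCTUATION):
    """Slots left for kanji after reserving hiragana and punctuation."""
    return limit - hiragana - len(set(punctuation))


def distinct_non_ascii(text):
    """Set of characters in text that fall outside 7-bit ASCII."""
    return {ch for ch in text if ord(ch) > 127}


print(kanji_budget())  # 36
print(len(distinct_non_ascii("ポット の 中 には 花 と 虫 です。")))  # 13
```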

Aww. That’s too bad. :/ I’m guessing the limit is there to adhere to Inform standards? I doubt JavaScript would require that limit.

I wonder though… If you’re coming up against the wall in just the standard library, does that mean the entire game itself will have to be written in just hiragana? That’s kind of rough.

The ASCII 127 characters are exempt from the unicode limit so thankfully I can still write the code itself using the standard library as a basis.

Ah. That’s good to hear. However, what I actually meant was the game (like room descriptions and the story and such), not the code itself. Will all room descriptions be in hiragana?

A user of the library can still put kanji or katakana into (look *) and the like, they just will have a limit on how many kanji they can use (depending on how many unique hiragana I use in the library), if that’s what you were asking. ASCII is allowed as well.

Kind of related, I am thinking now that it might be relatively easy to allow switching between hiragana and romaji output from the library by some sort of (if) (romaji output) (then) variable and a romaji on/off command. The main requirement would be that I have both a (romaji *) and (hiragana *) name for everything, and define something like:

(name $thing)
    (if) (romaji output) (then) (romaji $thing) (else) (hiragana $thing) (endif)

I might just do this.

Looking through the specification of the Å-machine, there are several references to word size being “currently always 2”, which suggests to me that @lft has thought of allowing greater word sizes, which could allow for a greater number of unicode characters per story. So maybe eventually a full kana + kanji Dialog library isn’t completely impossible after all. : )

Hi!

Yes, this is a limitation of the current version of the Å-machine. The long-term plan is to support both 16-bit and 32-bit stories in the same file format, and have the compiler pick the larger memory model if necessary.

In the 32-bit format, there will be no limits on the number of unicode characters. In addition, the maximum number of objects, dictionary words, and heap storage cells will increase from about 8000 to something ridiculously large.

Interpreters on resource-limited systems can refuse to run 32-bit stories upfront (as they can already do e.g. if the file is too big).

There is no 128-character limit in the interactive debugger, so you should be able to develop the library and run stories in that environment, until a 32-bit Å-machine is available.

Cool! Good to know all of this!

So I don’t know much about the internals of Dialog—but on the Z-machine version 5 and later, it’s possible to ask the interpreter to read in a line of input, put it in a text buffer, let Z-code mess around with that buffer, and then let the interpreter continue its tokenizing and such.

This wouldn’t be very elegant, but wouldn’t it be possible for Dialog to have an entry point (call it preprocess $in into $out or whatever) that takes a list of integers and returns a list of integers, representing the codepoints of the characters? If this entry point were defined, it would be called in between the input-reading and tokenizing steps, and could (say) convert all hiragana and katakana losslessly to roumaji through some tedious-but-simple arithmetic. All dicts could then contain only roumaji, while the player could type in whatever orthography they liked (fullwidth roumaji, katakana, etc.) and have the game understand it.

This wouldn’t get around the 128-character limit, and I have no idea what the Å-machine does when it reads input, but it seems like a not-too-impossible place to start?
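
To make the shape of such a hook concrete, here is a Python sketch of a code-point-list to code-point-list function. Dialog has no such entry point, so `preprocess` is purely hypothetical. Instead of the full kana-to-roumaji arithmetic, it does a smaller job with a different technique: Unicode NFKC normalization from the standard `unicodedata` module, which folds fullwidth roumaji to ASCII and half-width katakana to full-width katakana. NFKC also rewrites some other compatibility characters, so it is broader than strictly needed.

```python
import unicodedata


def preprocess(codepoints):
    """Hypothetical hook: take a list of code points, return a folded list."""
    text = "".join(chr(cp) for cp in codepoints)
    folded = unicodedata.normalize("NFKC", text)
    return [ord(ch) for ch in folded]


def preprocess_text(line):
    """Convenience wrapper so the hook can be tried on ordinary strings."""
    return "".join(chr(cp) for cp in preprocess([ord(ch) for ch in line]))


# Fullwidth "miru" (U+FF4D U+FF49 U+FF52 U+FF55) becomes plain ASCII.
print(preprocess_text("\uff4d\uff49\uff52\uff55"))  # miru
# Half-width katakana mi-ru (U+FF90 U+FF99) becomes full-width katakana.
print(preprocess_text("\uff90\uff99"))
```

Kanji pass through unchanged, so the dictionary would still need one entry per kanji spelling.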

That seems like a pretty cool trick, but ideally you’d want kanji along with kana for both input and output. I don’t think your method would be able to handle kanji if the dict was in romaji.

A lot of Japanese words are pronounced the same and the kanji help clarify which meaning you are after. For example, 着る, 切る, and 斬る mean wear, cut, and kill, but they’re all written as きる (“kiru”) in hiragana. You can also make the distinction through context, which is how speech works, but it’s just a lot more direct with kanji. Children’s games like Pokémon also write all text in hiragana with spaces (since children don’t know many kanji), but adults actually find this more difficult to read.

Sorry if you knew all that. Just trying to clarify why kanji is important along with kana.

Oh, no, it’s a very good point! I’m just not sure how that could best be handled if we want players to be able to type in different orthographies.

The easiest way to handle it is probably to not do anything at all to kanji in preprocessing, and then have synonyms handled within the game code: the verb “kill” could be given the synonyms “kiru” and “斬ru”, for example. That would make the preprocessor completely general so it wouldn’t need to know about which particular kanji the game author was using.
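
A toy Python sketch of that split between a general preprocessing step and game-side synonyms follows. The two-entry kana table and the names `to_romaji`, `VERB_SYNONYMS` and `candidate_verbs` are all invented for illustration; a real table would have to cover every kana.

```python
# Tiny illustrative kana table; kanji are deliberately absent from it.
KANA_TO_ROMAJI = {"き": "ki", "る": "ru"}

# Game-side synonyms, keyed by the preprocessed spelling.
VERB_SYNONYMS = {
    "kiru": ["wear", "cut", "kill"],  # all-kana input stays ambiguous
    "着ru": ["wear"],
    "切ru": ["cut"],
    "斬ru": ["kill"],
}


def to_romaji(word):
    """Replace known kana with romaji and leave everything else untouched."""
    return "".join(KANA_TO_ROMAJI.get(ch, ch) for ch in word)


def candidate_verbs(word):
    """Look up the verbs that a typed word could refer to."""
    return VERB_SYNONYMS.get(to_romaji(word), [])


print(to_romaji("斬る"))        # 斬ru
print(candidate_verbs("斬る"))  # ['kill']
print(candidate_verbs("きる"))  # ['wear', 'cut', 'kill']
```

The preprocessing function never mentions a kanji, which is the generality argued for above; only the synonym table is specific to the game.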

It’s always tempting to allow low-level platform details to seep up into a high-level language, especially when it seems to solve a practical problem right now, at a low cost.

But there’s a good reason for maintaining a strict separation here: Dialog is not tied to a particular character encoding. This allows the same stories to run on the Z-machine, which uses the ZSCII character set (a peculiar 8-and-10-bit hybrid with 97 user-definable glyphs) and the current Å-machine, which uses an 8-bit encoding (ASCII + 128 user-definable glyphs). The same Dialog code can also run in the interactive debugger, which uses UTF-8 internally, and it could in principle be compiled to run on a UCS-2-based system (like JavaScript, or perhaps a future version of the Å-machine in 32-bit mode).

If we expose codepoints to the high-level program, then they would be different depending on the platform. For instance, å is usually represented by character code 201 on the Z-machine, but it’s 229 in Unicode/UCS-2, and it can be any value in the range 128-255 on the current Å-machine. Therefore we would end up with either 1. stories and libraries that are tied to a particular platform, or 2. a compatibility layer (or “shim”) in every platform, translating codepoints back and forth between Unicode and the native encoding. In my opinion, both of these options are problematic.
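
The å example can be reproduced in a few lines of Python. The string below is my recollection of the start of the default ZSCII extra-characters table, which begins at code 155; treat it as an assumption and check it against the Z-machine standard before relying on it. The function names are made up for this sketch.

```python
# Prefix of the default ZSCII "extra characters" table (assumed from memory).
ZSCII_EXTRA_START = 155
ZSCII_EXTRA_PREFIX = "äöüÄÖÜß»«ëïÿËÏáéíóúýÁÉÍÓÚÝàèìòùÀÈÌÒÙâêîôûÂÊÎÔÛåÅ"


def zscii_code(ch):
    """Code of ch under the default table; raises ValueError if not in the prefix."""
    return ZSCII_EXTRA_START + ZSCII_EXTRA_PREFIX.index(ch)


def unicode_code(ch):
    """Unicode code point of the same character."""
    return ord(ch)


print(zscii_code("å"), unicode_code("å"))  # 201 229
```

A Dialog program that received 201 on one platform and 229 on another for the same keystroke would need platform-specific logic, which is the problem described above.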

When targeting the current version of the Å-machine, the compiler maps every character appearing in the source code to a unique byte value. The resulting 8-bit encoding scheme performs well on vintage hardware, but as a consequence, different storyfiles can have different character encodings. In fairness, Unicode does play a part in this: There is a table in the storyfile, mapping each non-ASCII character to a Unicode glyph. But interpreters can consult this table at startup, piece together a font based on it, and then throw the table away. And they can refuse upfront to run a story if it contains an unsupported glyph. There is never a need to print an arbitrary unicode character that wasn’t in the table, so there is never a need to keep a huge full-unicode font around at runtime. That would be necessary if codepoints were exposed to the high-level program.
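
A simplified Python model of that mapping, not the compiler's actual code: ASCII keeps its own values and each new non-ASCII character takes the next free byte from 128 upward. The function names and the `slots` parameter are assumptions for illustration; the error text is borrowed from the compiler message quoted earlier in the thread.

```python
def build_story_encoding(source_text, first_code=128, slots=128):
    """Assign each distinct non-ASCII character a byte value, in order of appearance."""
    table = {}
    for ch in source_text:
        if ord(ch) < 128 or ch in table:
            continue
        if len(table) >= slots:
            raise ValueError("Too many distinct unicode characters in the text.")
        table[ch] = first_code + len(table)
    return table


def encode(text, table):
    """Encode text as bytes using ASCII plus the story-specific table."""
    return bytes(ord(ch) if ord(ch) < 128 else table[ch] for ch in text)


story_a = build_story_encoding("花 を 持つ")
story_b = build_story_encoding("持つ 花")
print(story_a["花"], story_b["花"])  # 128 130
```

The same glyph gets a different byte in each story, which is why the byte values mean nothing outside the storyfile that defines them.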

That is why I think the cleanest option is to allow the Å-machine to run in one of two modes: a small memory model (16-bit words and 8-bit characters) and a large memory model (32-bit words and 16-bit characters). They should be fully compatible at the Dialog source-code level. The story author shouldn’t have to worry about what memory model to use. The compiler should just automatically select the most appropriate memory model based on the number of characters, words, and objects in the story. When the story grows too large, it won’t run on old hardware anymore—but the semantics of the language will remain exactly the same, regardless of how a particular character is represented at the bit level.
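
The selection rule could look something like the following Python sketch. The thresholds are taken loosely from this thread (128 user-definable glyphs, "about 8000" objects and dictionary words) and are assumptions, as are the names `StoryStats` and `pick_memory_model`; the real compiler may weigh other quantities such as heap size.

```python
from dataclasses import dataclass

SMALL_CHAR_LIMIT = 128    # user-definable glyphs in the 8-bit encoding (assumed)
SMALL_COUNT_LIMIT = 8000  # "about 8000" objects or words per the thread (assumed)


@dataclass
class StoryStats:
    distinct_chars: int    # distinct non-ASCII characters in the story
    dictionary_words: int
    objects: int


def pick_memory_model(stats):
    """Return "small" when the story fits the 16-bit model, else "large"."""
    fits_small = (
        stats.distinct_chars <= SMALL_CHAR_LIMIT
        and stats.dictionary_words <= SMALL_COUNT_LIMIT
        and stats.objects <= SMALL_COUNT_LIMIT
    )
    return "small" if fits_small else "large"


print(pick_memory_model(StoryStats(100, 500, 50)))   # small
print(pick_memory_model(StoryStats(2000, 500, 50)))  # large
```

The story source is identical in both cases; only the compiler's choice of output format changes.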