An experiment with parsing Japanese

If I’m reading the Z-machine spec right, the “Unicode translation table” can be modified dynamically. I have no idea how well this is supported and whether it can help with the issue here.

I am doing some thinking about the pronoun system.

Japanese has three ways to say “it” that I am aware of: これ (“this thing”, something close), それ (“that thing”, something a bit farther away) and あれ (“that thing over there”, something far away). While I could just have all of these be the same “it”, I wonder if it would be possible to do something smart. Maybe これ could be things on your person, それ could be things on someone else’s person, あれ could be anything in another room? The thing that’s not clear to me is what to consider things that are in the room but not on your person or on the person of an NPC: it’s nebulous who is closer to the objects so we can’t make sense of it spatially. I’ll have to think some more.

I am also thinking about changing what object gets “noticed” by the pronoun system: because Japanese has an explicit topic marker は, maybe it would make sense to have that be what gets noticed for pronouns. I’d probably need to rephrase certain messages, like “物 を 持ちます” (you take the thing) into “物 は 持たれます” (the thing is taken). Again, something I’ll be thinking about.

I’m not sure that それ (sore) means “something a bit farther away”; I think it means “something close to the listener”, as opposed to これ (kore), which is “something close to the speaker” (i.e. “this”). I agree that あれ (are) corresponds to “that”.

So when the voice of the parser says “you’re already holding that”, perhaps それ is appropriate, and when it says “that’s fixed in place”, I would imagine that the choice of あれ or これ would affect whether we perceive the voice of the parser to be embodied in the room or not. But this is speculation on my part, and I’m certainly no expert. How is this usually handled in descriptive passages in books and voice-overs?

As you may know, the Dialog standard library tracks the narrator’s “it” separately from the player’s “it”, so in many situations “it” will actually be ambiguous. A Japanese library could perhaps do the same, and allow the player to refer to any of these objects using any of the three words これ / それ / あれ, but then look at the specific word choice when computing the likelihood of each interpretation. If これ was used, and the narrator’s “it” refers to something worn by the player character, but the player’s “it” refers to something in the room, then the object referred to by the narrator’s “it” was probably intended, and so on.
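
To make the likelihood idea concrete, here is a rough sketch in Python rather than Dialog, so it can be run on its own. Everything in it is an assumption for illustration: the three distance classes, the mapping from これ / それ / あれ to those classes, and the names `rank_referents`, `narrator_it` and `player_it`. It is not how the Dialog standard library scores anything.

```python
# Hypothetical sketch: rank candidate referents for a Japanese demonstrative.
# The distance classes and the mapping below are assumptions, not library behavior.

NEAR_SPEAKER = "near_speaker"    # e.g. held or worn by the player character
NEAR_LISTENER = "near_listener"  # e.g. held by an NPC
FAR = "far"                      # e.g. elsewhere in the room or beyond

# Which distance class each demonstrative prefers (assumed mapping).
PREFERRED = {"これ": NEAR_SPEAKER, "それ": NEAR_LISTENER, "あれ": FAR}


def rank_referents(pronoun, candidates):
    """Return candidate names, best match first.

    `candidates` maps a name to its distance class. No candidate is
    discarded; the word choice only reorders them.
    """
    preferred = PREFERRED[pronoun]
    # False sorts before True, and sorted() is stable, so ties keep
    # their original insertion order.
    return sorted(candidates, key=lambda name: candidates[name] != preferred)


candidates = {
    "narrator_it": NEAR_SPEAKER,  # something the player character is wearing
    "player_it": FAR,             # something in the room
}
print(rank_referents("これ", candidates))
print(rank_referents("あれ", candidates))
```

The point of reordering instead of filtering is that a player who picks the "wrong" demonstrative is still understood; the word choice only tips the balance between interpretations.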

The topic marker is interesting! I haven’t thought about its implications for a parser game. One way to approach it could be to define ($ は) to print the name of the given object followed by は, but only if it is distinct from the current topic. The same predicate should set the current topic, as a side-effect.
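
The behaviour I have in mind for ($ は), sketched in Python so it can be run standalone. `TopicTracker` and `wa` are made-up names; in Dialog the state would presumably live in a global variable instead of an object.

```python
class TopicTracker:
    """Hypothetical stand-in for a ($ は) predicate plus a current-topic variable."""

    def __init__(self):
        self.current_topic = None

    def wa(self, name):
        """Return 'name は' if this changes the topic, otherwise ''.

        Side effect: `name` becomes the current topic.
        """
        if name == self.current_topic:
            return ""  # topic unchanged, so nothing is printed
        self.current_topic = name
        return f"{name} は"


tracker = TopicTracker()
print(repr(tracker.wa("りんご")))  # new topic: name and marker are produced
print(repr(tracker.wa("りんご")))  # same topic again: empty string
print(repr(tracker.wa("鍵")))      # topic changes, so it is named again
```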


I have found a way to parse Japanese without artificially requiring spaces between words. The trick is to break the sentence down into individual symbols, then transform this list of symbols into a list of recognized word-tokens. Currently it requires a (token $) rule for every recognizable Japanese word-form, which feels very onerous. I have tried telling Dialog that every dict word is a token (so only verbs and particles would need to be explicitly defined) but couldn’t figure out the right syntax for that. In any case, here is the proof of concept, which only lets you (try to) look north or look at an apple.

> りんごを見る
TODO: Look at object #りんご

> 北を見る
TODO: Look in direction #北

And the code:

%% Take the first $N elements of a list, leaving the remainder.
(grab 0 from $Everything into [] leaving $Everything)
(grab $N from [$Head | $MoreIn] into [$Head | $MoreOut] leaving $Remainder)
	($N minus 1 into $Nm1)
	(grab $Nm1 from $MoreIn into $MoreOut leaving $Remainder)

%% Turn a list of symbols into a list of known tokens,
%% backtracking over every candidate token.
(tokenize [] into [])
(tokenize $Symbols into [$CandidateToken | $RemainingTokens])
	*(token $CandidateToken)
	(split word $CandidateToken into $CandidateSymbols)
	(length of $CandidateSymbols into $N)
	(grab $N from $Symbols into $CandidateSymbols leaving $RemainingSymbols)
	(tokenize $RemainingSymbols into $RemainingTokens)

%% For now, every recognizable word-form has to be listed by hand.
(token @きた)
(token @北)
(token @みる)
(token @見る)
(token @りんご)
(token @を)

#北
(direction *)
(name *)	北
(dict *)	きた

%% "<noun phrase> を 見る": look in a direction or at an object.
(understand $Sentence as [$Target を 見る])
	(join words $Sentence into $Glob)
	(split word $Glob into $Symbols)
	(tokenize $Symbols into $Tokens)
	(split $Tokens by [を] into $NounPhrase and $Verb)
	*($Candidate is one of [みる 見る])
	([$Candidate] = $Verb)
	{
		(understand $NounPhrase as direction $Target)
	(or)
		(understand $NounPhrase as any object $Target)
	}

(perform [$Target を 見る])
	(if) (direction $Target) (then)
		TODO: Look in direction $Target (line)
	(else)
		TODO: Look at object $Target (line)
	(endif)

%% A minimal world to test with.
(current player #player)
(room #room)
(#player is #in #room)

#りんご
(name *)	りんご
(item *)
(* is #in #room)

There is obviously a mountain of work remaining to make something useful out of this, but I am excited by the possibilities.
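
For anyone who wants to play with the tokenizing idea outside Dialog, here is roughly the same backtracking written as a Python generator. This is only a sketch of the technique: the function name `tokenize` and the `TOKENS` list mirror the rules above, but Python strings stand in for Dialog's symbol lists.

```python
def tokenize(symbols, tokens):
    """Yield every way of splitting `symbols` (a string) into known tokens.

    Mirrors the backtracking in the Dialog rules: try each known token as
    a prefix of the input, then tokenize whatever is left.
    """
    if not symbols:
        yield []  # nothing left: one valid (empty) tokenization
        return
    for token in tokens:
        if symbols.startswith(token):
            for rest in tokenize(symbols[len(token):], tokens):
                yield [token] + rest


TOKENS = ["きた", "北", "みる", "見る", "りんご", "を"]

# First successful tokenization of "look at the apple".
print(next(tokenize("りんごを見る", TOKENS)))
# An input containing an unknown word has no tokenization at all.
print(list(tokenize("りんごをたべる", TOKENS)))
```

Asking for only the first result corresponds to a normal query; iterating over all of them corresponds to a multi-query.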


There’s (unknown word $) for testing whether a word exists in the dictionary or not. But it won’t be efficient.

If there were a corresponding (known word $) that could be multi-queried to backtrack over words in the dictionary, then that might have been handy. Alas, there isn’t, and it would still involve a lot of brute-force iteration.

The most useful addition would be a built-in for splitting a word into a known dictionary word and an arbitrary ending (backtracking over every possible match). Such a built-in could also be used to implement removable word endings at the library level. Hmm, I’ll think about this.


A quick update, and a question. I am working on compound nouns (starting with directions) in my space-free parser; so far you can say “northとsouth” and you will move north, then south. (I’m not 100% sure と is the right particle to use as shorthand for “…行って…行く”, but then “north and south” as shorthand for “go north then go south” is slightly unusual English too.)

Right now I’ve hardcoded @north and @south as recognizable tokens. I would like to be able to say “every dict word is a token”, but I’ve tried many things and can’t quite figure out the syntax. Does anyone know how I can do this? Thanks in advance. : )

(grab 0 from $Everything into [] leaving $Everything)
(grab $N from [$Head | $MoreIn] into [$Head | $MoreOut] leaving $Remainder)
        ($N minus 1 into $M)
        (grab $M from $MoreIn into $MoreOut leaving $Remainder)

(tokenize [] into [])
(tokenize $Symbols into [$CandidateToken | $RemainingTokens])
        *(token $CandidateToken)
        (split word $CandidateToken into $CandidateSymbols)
        (length of $CandidateSymbols into $N)
        (grab $N from $Symbols into $CandidateSymbols leaving $RemainingSymbols)
        (tokenize $RemainingSymbols into $RemainingTokens)


(token @north)
(token @south)
(token $Token)
        *(compounder-token $Token)

(compounder-token @と)
(compounder-token @、)
(compounder-token @,)


(parse direction list $Words [$Head | $Tail])
        (join words $Words into $Glob)
        (split word $Glob into $Symbols)
        *(tokenize $Symbols into $Tokens)
        *(compounder-token $Compounder)
        (split $Tokens by $Compounder into $Left and $Right)
        *(parse direction $Left $Head)
        *(parse direction list $Right $Tail)

#southroom
(room *)
(from * go #north to #northroom)

#northroom
(room *)
(from * go #south to #southroom)

#player
(current player *)
(* is #in #southroom)
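
The direction-list step on its own, sketched in Python under the assumption that the input has already been tokenized. The `DIRECTIONS` table and the function name `parse_direction_list` are invented for the example; unlike the Dialog rule it walks the tokens in a loop instead of recursing on each split.

```python
COMPOUNDERS = {"と", "、", ","}
# Hypothetical token-to-object table standing in for the direction parser.
DIRECTIONS = {"north": "#north", "south": "#south"}


def parse_direction_list(tokens):
    """Split `tokens` on compounders and map each piece to a direction.

    Returns a list of direction objects, or None if any piece is not
    exactly one known direction token.
    """
    result = []
    piece = []
    for token in tokens + [None]:  # None marks the end of the input
        if token is None or token in COMPOUNDERS:
            if len(piece) != 1 or piece[0] not in DIRECTIONS:
                return None
            result.append(DIRECTIONS[piece[0]])
            piece = []
        else:
            piece.append(token)
    return result


print(parse_direction_list(["north", "と", "south"]))
print(parse_direction_list(["north", "と"]))  # dangling compounder is rejected
```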

Unfortunately there is no way to do this at the moment. I have been meaning to add a built-in for backtracking over every way of splitting a word in two parts, where the first part is always a recognized dictionary word—starting with the longest. So if you have north and northwest in the dictionary, and you get northwestern as part of the player’s input, then a multi-query to this predicate would first return northwest and ern; and then north and western. Such a mechanism could be used for your Japanese parser (with a multi-query), and also to implement removable word endings at the library level (with a normal query, always going with the longest match).
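
To pin down the semantics I have in mind, here is the proposed splitting written as a Python generator. This is only a model of behaviour that does not exist in Dialog yet; the name `split_known_prefix` is made up, and a plain set stands in for the game dictionary.

```python
def split_known_prefix(word, dictionary):
    """Yield (known_prefix, ending) pairs, longest known prefix first.

    The ending may be empty when the whole word is in the dictionary.
    """
    for end in range(len(word), 0, -1):
        prefix = word[:end]
        if prefix in dictionary:
            yield prefix, word[end:]


DICTIONARY = {"north", "northwest"}

# Multi-query: every split, longest match first.
print(list(split_known_prefix("northwestern", DICTIONARY)))
# Normal query: just the longest match, as for removable word endings.
print(next(split_known_prefix("northwestern", DICTIONARY)))
```

Applied repeatedly to the remaining ending, the multi-query form would give a space-free tokenizer without any hand-written (token $) rules.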
