An experiment with parsing Japanese

MultidimensionalStep · September 8, 2020, 6:22am

I have found a way to parse Japanese without artificially requiring spaces between words. The trick is to break the sentence down into individual symbols, then transform this list of symbols into a list of recognized word-tokens. Currently it requires adding a (token $) rule for every recognizable Japanese word-form, which feels very onerous. I tried telling Dialog that every dict word is a token (so that only verbs and particles would need to be defined explicitly), but couldn’t figure out the right syntax for that. In any case, here is the proof of concept, which only lets you (try to) look north or look at an apple.

> りんごを見る
TODO: Look at object #りんご

> 北を見る
TODO: Look in direction #北

And the code:

%% Take the first $N elements of a list, leaving the rest in $Remainder.
(grab 0 from $Everything into [] leaving $Everything)
(grab $N from [$Head | $MoreIn] into [$Head | $MoreOut] leaving $Remainder)
	($N minus 1 into $Nm1)
	(grab $Nm1 from $MoreIn into $MoreOut leaving $Remainder)

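%% Turn a list of single symbols into a list of known word-tokens.
(tokenize [] into [])
(tokenize $Symbols into [$CandidateToken | $RemainingTokens])
	%% Try each known token in turn; backtrack if the rest cannot be tokenized.
	*(token $CandidateToken)
	(split word $CandidateToken into $CandidateSymbols)
	(length of $CandidateSymbols into $N)
	%% $CandidateSymbols is already bound, so this succeeds only on a matching prefix.
	(grab $N from $Symbols into $CandidateSymbols leaving $RemainingSymbols)
	(tokenize $RemainingSymbols into $RemainingTokens)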

(token @きた)
(token @北)
(token @みる)
(token @見る)
(token @りんご)
(token @を)

#北
(direction *)
(name *)	北
(dict *)	きた

(understand $Sentence as [$Target を 見る])
	(join words $Sentence into $Glob)
	(split word $Glob into $Symbols)
	(tokenize $Symbols into $Tokens)
	(split $Tokens by [を] into $NounPhrase and $Verb)
	*($Candidate is one of [みる 見る])
	([$Candidate] = $Verb)
	{
		(understand $NounPhrase as direction $Target)
	(or)
		(understand $NounPhrase as any object $Target)
	}

(perform [$Target を 見る])
	(if) (direction $Target) (then)
		TODO: Look in direction $Target (line)
	(else)
		TODO: Look at object $Target (line)
	(endif)

(current player #player)
(room #room)
(#player is #in #room)

#りんご
(name *)	りんご
(item *)
(* is #in #room)

There is obviously a mountain of work remaining to make something useful out of this, but I am excited by the possibilities.
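
For anyone who wants to play with the tokenizing idea outside Dialog, here is a rough Python sketch of the same approach: try each known token as a prefix of the remaining symbols, recurse on what is left, and fall back to the next candidate if the rest cannot be tokenized. This is only an illustration and not a translation of the Dialog code. The `TOKENS` list and the `tokenize` function are names made up for this sketch, and returning `None` on failure is an assumption standing in for a Dialog query that fails.

```python
from typing import List, Optional, Sequence

# Hypothetical vocabulary mirroring the (token $) rules in the post.
TOKENS = ["きた", "北", "みる", "見る", "りんご", "を"]


def tokenize(symbols: str, vocabulary: Sequence[str] = TOKENS) -> Optional[List[str]]:
    """Return the first segmentation of `symbols` into known tokens, or None."""
    if not symbols:
        # Nothing left to consume: the empty input has the empty tokenization.
        return []
    for candidate in vocabulary:
        # Skip empty entries so every recursive call consumes at least one symbol.
        if candidate and symbols.startswith(candidate):
            rest = tokenize(symbols[len(candidate):], vocabulary)
            if rest is not None:
                return [candidate] + rest
            # Otherwise fall through and try the next candidate (backtracking).
    return None


if __name__ == "__main__":
    print(tokenize("りんごを見る"))
    print(tokenize("北を見る"))
```

As in the Dialog version, candidates are tried in the order they are listed, so the first segmentation that consumes the whole input wins. Input containing a word that is not in the vocabulary gives no tokenization at all.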