I’m coming back to Dialog and was pleased to learn that it’s now possible to split a word into a list of characters, something I had wanted so that I could attempt to parse Japanese input, which has no spaces between words. Now that I have this, though, I’m a bit stumped.
What I’d like to write is code that turns a list of symbols into all possible interpretations of those symbols as tokens, in a multi-query sort of way. So for example, if I had:
(token @りんご)
(token @を)
(token @食べる)
…and had input split into characters like [@り @ん @ご @を @食 @べ @る], I would like to be able to transform that into the list [@りんご @を @食べる], multi-querying for other possible token interpretations. I think this is possible; I’m just unsure how to write the code.
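To make the goal concrete, here is the same idea sketched in Python rather than Dialog: enumerate every way to segment a character list into known tokens, analogous to backtracking through a multi-query. (The function and variable names are my own, and the token set here is just the three example words.)

```python
# Token dictionary, standing in for the (token $) facts above.
TOKENS = {"りんご", "を", "食べる"}

def tokenizations(chars):
    """Yield every segmentation of `chars` (a list of characters)
    into tokens drawn from TOKENS, by trying each token as a prefix
    and recursing on the remainder."""
    if not chars:
        yield []
        return
    for token in TOKENS:
        n = len(token)
        if chars[:n] == list(token):
            for rest in tokenizations(chars[n:]):
                yield [token] + rest

print(list(tokenizations(list("りんごを食べる"))))
# → [['りんご', 'を', '食べる']]  (only one segmentation with this dictionary)
```

With an ambiguous dictionary (say, both @たべ and @たべる), the generator would yield one list per valid segmentation, which is the behaviour the multi-query is meant to reproduce.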
Turned out not to be as difficult as I suspected. Here’s my code:
(drop 0 from $Everything leaving $Everything)
(drop $N from [$Head | $MoreIn] leaving $Remainder)
	($N minus 1 into $Nm1)
	(drop $Nm1 from $MoreIn leaving $Remainder)
(tokenize [] as [])
(tokenize $Symbols as [$Candidate | $Tail])
	*(token $Candidate)
	(split word $Candidate into $CandidateSymbols)
	(length of $CandidateSymbols into $N)
	(take $N from $Symbols into $Glob)
	(join words $Glob into $Candidate)
	(drop $N from $Symbols leaving $Remainder)
	(tokenize $Remainder as $Tail)
(token @りんご)
(token @を)
(token @たべる)
(current player #player)
(#player is #in #room)
(room #room)
(look #room)
	(tokenize [り ん ご を た べ る] as $Tokens)
	$Tokens
Now that you have something that works, it’s time to consider performance. First, the following:
(length of $CandidateSymbols into $N)
(take $N from $Symbols into $Glob)
(join words $Glob into $Candidate)
(drop $N from $Symbols leaving $Remainder)
performs some unnecessary data transformations. This is more efficient (and terser):
(append $CandidateSymbols $Remainder $Symbols)
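Queried this way, with its first and third parameters bound, append simultaneously checks that the candidate’s characters are a prefix of the input and binds $Remainder to whatever is left, failing otherwise. A rough Python analogue of that one query (the function name is mine, and None stands in for failure):

```python
def match_prefix(candidate_symbols, symbols):
    """Rough analogue of querying
    (append $CandidateSymbols $Remainder $Symbols)
    with $CandidateSymbols and $Symbols bound: return the remainder
    if the candidate is a prefix of the input, else None (failure)."""
    n = len(candidate_symbols)
    if symbols[:n] == candidate_symbols:
        return symbols[n:]   # the $Remainder
    return None              # the query fails

print(match_prefix(list("りんご"), list("りんごを")))
# → ['を']
```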
But there is a bigger problem: you iterate over every valid word and compare it against the input. This will become slow as the dictionary grows. A better approach is to present the input as the first parameter of a query. This is how the standard library deals with e.g. (understand $ as $) and (perform $), and it lets the compiler optimize the query into a fast indexed lookup. Thus:
%% Note the spaces between characters:
(match [り ん ご | $More] as [り ん ご] and $More)
(match [を | $More] as [を] and $More)
(match [た べ る | $More] as [た べ る] and $More)
(tokenize [] as [])
(tokenize $Chars as [$Head | $Tail])
	*(match $Chars as $TokenChars and $MoreChars)
	(join words $TokenChars into $Head)
	*(tokenize $MoreChars as $Tail)
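The indexed lookup the compiler performs can be pictured as dispatching on the first character of the input, so that only candidates that could possibly start there are even considered, instead of scanning every rule. A rough Python analogue (the index structure and names are my own, not anything Dialog exposes):

```python
from collections import defaultdict

TOKENS = ["りんご", "を", "たべる"]

# Index the dictionary by first character, mimicking how the compiler
# can dispatch on a bound first parameter of (match $ as $ and $).
INDEX = defaultdict(list)
for tok in TOKENS:
    INDEX[tok[0]].append(list(tok))

def tokenize(chars):
    """Yield every segmentation of `chars`, consulting only the
    tokens indexed under the first character."""
    if not chars:
        yield []
        return
    for token_chars in INDEX.get(chars[0], []):
        n = len(token_chars)
        if chars[:n] == token_chars:
            for rest in tokenize(chars[n:]):
                yield ["".join(token_chars)] + rest

print(list(tokenize(list("りんごをたべる"))))
# → [['りんご', 'を', 'たべる']]
```

With a large dictionary this consults only a handful of candidates per position, which is the point of putting the input in the first parameter.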
Unfortunately, the match table is a bit messy to maintain by hand. But it can probably be simplified using recursive rewrites, similar to how (grammar $ for $) is handled by the standard library. (Incidentally, that is why the match predicate returns a list of characters instead of a complete token: an access predicate won’t be able to join characters at compile time.)