What is a good way to process sequences of characters into tokens?

Hello there!

I am coming back to Dialog and was pleased to learn that it’s now possible to split a word into a list of characters, something I had wanted so that I could attempt to parse Japanese input, which doesn’t put spaces between words. Now that I have this, though, I’m a bit stumped.

What I would like to write is code that turns a list of symbols into all possible interpretations of those symbols as tokens, in a multi-query sort of way. So, for example, if I had:

(token @りんご)
(token @を)
(token @食べる)

…and had input split into characters like [@り @ん @ご @を @食 @べ @る], I would like to be able to transform that into the list [@りんご @を @食べる], with further multi-queries yielding any other possible interpretations as tokens. I think this is possible; I’m just unsure how to write the code.

Any insights would be greatly appreciated!


It turned out not to be as difficult as I was suspecting; here’s my code:

%% Drop the first $N elements of a list.
(drop 0 from $Everything leaving $Everything)
(drop $N from [$Head | $MoreIn] leaving $Remainder)
        ($N minus 1 into $Nm1)
        (drop $Nm1 from $MoreIn leaving $Remainder)
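For example, querying (drop 2 from [@り @ん @ご] leaving $R) should bind $R to [@ご].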

(tokenize [] as [])
%% Peel one known token off the front of $Symbols, then tokenize the rest.
(tokenize $Symbols as [$Candidate | $Tail])
        *(token $Candidate)
        (split word $Candidate into $CandidateSymbols)
        (length of $CandidateSymbols into $N)
        %% Succeed only if the next $N symbols spell out $Candidate:
        (take $N from $Symbols into $Glob)
        (join words $Glob into $Candidate)
        (drop $N from $Symbols leaving $Remainder)
        (tokenize $Remainder as $Tail)

(token @りんご)
(token @を)
(token @たべる)

(current player #player)
(#player is #in #room)
(room #room)
(look #room)
        (tokenize [り ん ご を た べ る] as $Tokens)
        $Tokens
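Querying (look #room) then prints $Tokens, which should come out as the three tokens りんご, を, and たべる.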

Excellent!

Now that you have something that works, it’s time to consider performance. First, the following:

        (length of $CandidateSymbols into $N)
        (take $N from $Symbols into $Glob)
        (join words $Glob into $Candidate)
        (drop $N from $Symbols leaving $Remainder)

performs some unnecessary data transformations. This single line is more efficient (and terser):

        (append $CandidateSymbols $Remainder $Symbols)
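With that substitution, the recursive rule from above would look something like this (a sketch, keeping your original predicate names):

(tokenize [] as [])
(tokenize $Symbols as [$Candidate | $Tail])
        *(token $Candidate)
        (split word $Candidate into $CandidateSymbols)
        (append $CandidateSymbols $Remainder $Symbols)
        (tokenize $Remainder as $Tail)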

But there is a bigger problem: you iterate over every valid word and compare each one against the input. This will become slow as the dictionary grows. A better approach is to present the input as the first parameter of a query. This is how the standard library deals with e.g. (understand $ as $) and (perform $), and it allows the compiler to optimize the query into a fast indexed lookup. Thus:

%% Note the spaces between characters:
(match [り ん ご | $More] as [り ん ご] and $More)
(match [を | $More] as [を] and $More)
(match [た べ る | $More] as [た べ る] and $More)

(tokenize [] as [])
(tokenize $Chars as [$Head | $Tail])
    *(match $Chars as $TokenChars and $MoreChars)
    (join words $TokenChars into $Head)
    *(tokenize $MoreChars as $Tail)
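Note that the recursive call is now also a multi-query, so backtracking explores every possible segmentation. To list them all, one could multi-query tokenize inside an exhaust loop, e.g. with a hypothetical test rule like:

(look #room)
    (exhaust) {
        *(tokenize [り ん ご を た べ る] as $Tokens)
        $Tokens (line)
    }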

Unfortunately, the match table is a bit messy to maintain by hand. But it can probably be simplified using recursive rewrites, similar to how (grammar $ for $) is handled by the standard library. (Incidentally, that is why the match predicate returns a list of chars instead of a complete token: an access predicate wouldn’t be able to join chars at compile-time.)
