NLP / PEG and other parser approaches

dcsan · July 10, 2020, 5:12am

I’m working on an input parser for a javascript game, and looking at various approaches.

In 2020 I wondered what “giants shoulders” there are to stand on from NLP to compilers

I’m familiar with the NLP approach but wanted to ask what other options people have used, and thought I’d share some other ideas I’ve come across so far.

RegEx parsers

The basic crude approach is to use regex and matches. This actually gets quite a long way! But it’s a pain to write. An excerpt looks like this with named captures:

route(/^(?<action>take|get) (?<item>.*)$/i, Actions.take),
route(/^(?<action>drop) (?<item>.*)$/i, Actions.give),

which matches on take cake

This will get a lot more ugly with sentences with compound clauses:

route(/^(?<action>use) (?<item>.*) on (?<item>.*) $/i, Actions.give),

match: use key on door

Lots of code for the basics eh.

RiveScript is an example chatbot script language that has a “simple regex” format that allows you to do things like this and also deals with regex captures

+ what is your (home|office|cell) number
- You can reach me at: 1 (800) 555-1234.

+ i am # years old
- A lot of people are <star> years old.

But there must be a better way than asking creative writers to deal with this stuff amirite?
Some type of spec languages to say what works where?

DSLs

I’ll just mention “domain specific languages” here, as a way for users to talk to computers.
Usually this is where you define a specific way to communicate and the human has to learn that language to talk to the computer.
This actually seems very similar to how text adventures walk go west or use key on chest are more computer-friendly than “natural” language.
I’ve written some DSLs using ruby as its a language where code basically reads almost like prose. method_missing ftw!

Smokestack.define do
  factory User do
    name "Gabe BW"
    pet_name "Toto"
  end
end

It’s great when you write them but can be a nightmare using someone else’s, eg like AppleScript

Grammar based parsers

Given a fairly structured language.
There is a whole field of CS for more structured computer languages and “grammar definitions” and parsers for them.

ISHML

@bikibird has created something really interesting in ISHMaeL

if you haven’t I suggest you read this thread. You heard it here first folks!

example:

input:	take the glass slipper
verb	take / take
adj	glass / slipper
noun	slipper / slipper

I tried integrating into my nodeJS engine.

PegJS

PegJS is pretty well known in this domain.

I am still getting to grips with things like defining your lexicon and grammars but it was a very approachable on-ramp.
But I’m not quite sure how to define an adventure game grammar for PEG.
I couldn’t find much information on this topic. Here’s a grammar for JS itself.
Anyone have an idea?

other compiler parsers

Earley parsers are another approach here.

https://joshuagrams.github.io/pep/
which @JoshGrams is an expert on.

NLP approach

Separately I’ve looked into NLP parsers such as spaCy which give quite a good breakdown of parts of speech.
The results seems really good and because its based on existing training data, you don’t need to define your own grammar.
The output is less customizable for the specific domain of intfic

parts of speech (POS tagging) is an NLP way to figure out what are nouns/verbs etc. in a sentence. basically tokens per word.
Then dependency parsing will show you the sentence breakdown ie the relations between those tokens

Spacy even has a displacy for visualizing

And there are API/services like TextRazor for those who don’t want to run it all locally

LD A, (de)

In researching I found this great video about writing a parser from … back in the day. Man it used to be so much harder! The more things change…

can you remember jump tables?

What are you doing?

So those are some of the approaches I’ve found today.
It would be great to hear what other approaches people are taking, and specifically any details on better ways to use grammars like PEG

Sorry for the long post.

zarf · July 10, 2020, 3:34pm

For simple algorithmic approaches, the gold standard remains the original Infocom approach: separate noun phrase parsing from verb phrase parsing. Verbs forms look like

TAKE nounphrase
PUT nounphrase IN nounphrase
GIVE nounphrase TO nounphrase
GIVE nounphrase nounphrase

… and other patterns like that, which are really pretty simple in concept. Noun phrases are more complicated; the Inform manual gives you a good idea of what authors and players expect to be able to do. But this is how they get strung together.

The reason why IF systems tend not to use regexps, CFGs, or other programming-language standards is that the above grammar is inherently ambiguous! This is not something you can fix by improving your tools. You really do wind up with inputs like GIVE RED STEVE RED SWORD or RECITE SERMON TO THE EPHESIANS TO BOB and then backtracking is on the table. (Inform isn’t very good at this; Dialog is more sophisticated.)

Programming-language grammars are great at handling deeply-nested expressions like (3 * X + FOO(1, Y+2)). But this isn’t the kind of power that IF games usually need. Even the classic PLANT THE POT PLANT IN THE PLANT POT isn’t really a nesting problem (or an ambiguity problem, once you take the common assumption that the verb is first).

As for NLP, I think that’s wide open. The hard part is connecting the output of the super-duper-advanced NLP engine to the same-as-it-ever-was IF world model.

zarf · July 10, 2020, 3:56pm

I should say, this remains the best approach for English-language IF! I really don’t have a good sense of what problems arise in other languages that this plan is bad at.

bikibird · July 10, 2020, 4:05pm

I played around with PegJS prior to writing the ISHML parser, thinking that it was a shortcut to creating a parser for a new IF system. It’s cool, but the Javascript it generates is pretty much unreadable, which I didn’t like.

tayruh · July 10, 2020, 4:39pm

Parsing the input is just step 1 of creating a text adventure engine. It’s also one of the simplest parts, IMO. I accomplished the parser in my JS engine using only two functions, IIRC. It was everything else that made the engine difficult to code.

But as Zarf said, the parsing is tricky because you can’t just divide up your input based on typical sentence structure. You have to do it based on your defined verbs, nouns and adjectives, which may also have spaces in them. It’s still not that difficult, but it’s not something a single pass of regexp will be able to handle.

dcsan · July 10, 2020, 4:50pm

PLANT THE POT PLANT IN THE PLANT POT

I’m not gonna put potted plants in my game then

For the parts of speech detection in the sentence the context isn’t used I guess, its just overall stats.

Words	Contextual Entailment	Contextual Score	Prior Score	Total Score
plant	plant	1	1	1
plant	production	0.2709	0.01055	0.4194
plant	facility	0.4248	0.01662	0.6023
plant	factory	0.5586	0.01927	0.7089
plant	company	0.003919	0.01721	0.4458
plant	flower	0.7529	0.01085	0.6182
plant	tree	0.3611	0.01998	0.644
plant	site	0.1394	0.008447	0.3247
plant	grow	0.3979	0.008377	0.4267
plant	crop	0.4565	0.01255	0.5336

I’ll check later what spacy makes of it.

Dialog looks interesting but I can’t find much just on the parser functionality. The distro seems to come with a JS interpreter though which could be cool to try out.

dcsan · July 10, 2020, 4:52pm

but doesn’t that just become a library function you call to .parse() on?
still nice if it’s readable but you wouldn’t hand edit it after generating?

Or do you mean the API to traverse the parsed result (unlike your nice “fluent” api) ?

bikibird · July 10, 2020, 5:51pm

Yes, you sort of have to treat the generated parser as a black box. The grammar you provide is essentially the source code for it. If you ever misplace it and want to make changes to the parser, you are back at square one. It may not be a deal breaker for a lot of people, but it was for me.

jkj_yuio · July 10, 2020, 7:22pm

Zarf makes the relevant point here;

It’s all about resolving your syntax against your world model (semantics).

And nounphrases are the problem, not the overall sentence grammar. Which is why fancy tools aren’t the answer. In fact, you can make a game parser with something like bison or lemon, even though they aren’t actually designed for that.

However, you sometimes want to break the rules. For example to have certain word introduce patterns;

So, I’m using a parser to define words as well as to parse them.
eg “set stone feels cold”, parses as “set NOUNPHRASE prop value” where both prop and value are, not yet in the dictionary. The parser adds the new words as it goes along. This only happens for the “set” construct and is a special case and is used for game authoring.

Doing that with a traditional parser is more tricky as you’d, at least, have to fiddle with the guts of it. So it’s actually easier to write your own recursive grammar parser for games, then customise it as you go.

The output of your parser will be the syntax tree (AST) and your next problem is how to resolve this against your world model. The difficulty is that there is not just one AST, in general each ambiguous interpretation will be a different tree.

Your choices are:

have your parser emit a set of ASTs, then resolve each one later.
your parser resolves as it goes, and backtracks both for syntax and semantics.
parser emits a “canonical” AST, which can be transformed by context.

(1) is what programming languages do (mostly).
(2) is hairy and conflates your parser with your game, but doable.
(3) is my suggestion.

Let’s take an example;

put the pot plant in the plant pot.

Will parse syntactically as; (put (the pot plant (in the plant pot))).

Which is trying to say "put the plant that is in the pot … ", rather than “put the plant INTO the pot”

When you resolve, you’ll either have a plant in the pot, in which case you’re missing an indirect object, or you don’t have a plant in the pot, in which case your nounphrase will resolve to nothing;

Your game will either say “into what?” or “there isn’t a plant in the pot”.

Both are no good! And just when you thought you were doing well!

So this is the problem with (2), because do you resolve as you go eagerly or what?

Consider;

put the red cup and saucer in the bag on the table.

This has the most horrible number of syntax possibilities. In fact, it made one of my parsers keel over with recursion.

First bit;

does “red” scope over cup or over cup and saucer ?
same for “the”

(red cup) and saucer
red (cup and saucer)

(the (red cup)) and saucer
the (red (cup and saucer))
the ((red cup) and saucer))

And don’t think you can ignore “the”, consider

get the red cup and saucer on the table
get the red cup and THE saucer on the table.

The latter saucer isn’t red.

Second bit

put (X in the bag) on (the table)
put (X in the (bag on (the table)))

Where X is all the first parts.

But Also

We can combine saucer and bag;

put (the red cup and (saucer in the bag)) on the table.
put (the red cup and ((saucer in the bag)) on the table)

These last combinations ignore the other permutations with the scope of “red” and “the”, which are also there. eg is the saucer in the bag red?

Anyhow, if you enumerate all the permutations there are a lot, some aren’t obvious at all. I seem to remember it being more than 20 or so.

So this sort of thing is a good reason not to do (1) and also trying (2) will make your head hurt and your code into spaghetti.

So, another approach is (3), where your parser returns the “canonical” form; eg. (put (the pot plant (in the plant pot))) as a tree.

Then your resolver rearranges the tree as it tries various possibilities against the world.

Now, obviously, how does it know which tree rearrangements are valid? The answer is it doesn’t and that’s where you’re into a game domain rather than an NLP domain.

Things that you do in a game domain are;

change adjective scoping (ie adjective associativity)
indirect object lifting.
prepositional relative phrase scoping.

So, (2) means changing;

(put (the pot plant (in the plant pot))) → (put (the pot plant) in (the plant pot))

So now we have two nounphrases instead of one. Naturally, this transform is driven by “Put” requiring an indirect object, whereas “get” doesn’t.

consider;

put the pot plant in the plant pot
get the pot plant in the plant pot

(3) is where there might be more than one relativiser;

put key in the bag on the table

Is there a bag on the table or there a key in bag? You have to made putative transforms and test them against the world.

Conclusion

The meat is in the resolver, it’s almost all with nounphrases and what exactly is your world model anyway!

dcsan · October 13, 2021, 6:20am

I recently came across Ohm, another parser tool…