During a discussion in a different thread, it was mentioned that Inform’s parser doesn’t deal well with syntactic ambiguity. That, along with Graham Nelson’s recent (and exciting!) talk on the future of Inform, and the fact that my new job involves parsing natural language, has made me wonder…
Why doesn’t the Inform parser use some standard parsing algorithm, like Earley, and produce proper parse trees?
This seems like it could improve disambiguation by orders of magnitude, and Glulx’s memory model offers plenty of room for the data structures involved.
Is there a strong reason why this is a bad idea, aside from the fact that it wouldn’t fit in the Z-machine? And how much hubris would be involved in trying to craft a replacement?
Plugging in an off-the-shelf parser isn’t totally straightforward. You have to deal with resolution of context against the world model, and if you do that too eagerly, you can commit to the wrong reading and fail.
Consider a “simple” world with just “on” and “in”:
> put the red cup and saucer in the bowl on the table.
put (the red (cup and saucer)) in (the bowl on the table).
put (the (red cup) and saucer) in (the bowl on the table).
put (the (red (cup and saucer)) in the bowl) on (the table).
put (the (red cup) and (saucer in the bowl)) on (the table).
put (the ((red cup) and saucer) in the bowl) on (the table).
put (the (red (cup and saucer in the bowl))) on (the table).
So it depends on which things are “red” and whether there’s a bowl on the table or not, amongst other things. It’s worse when words can serve as multiple parts of speech, and when words like “and” can also be used to break up commands.
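To make the combinatorics concrete, here’s a toy chart-style parser. It’s a sketch, not anything from Inform: the grammar, word lists, and function names are all invented for illustration. It enumerates every bracketing of the command above, treating “and”, “in”, and “on” as ways to join noun phrases, and stripping the articles for simplicity:

```python
from functools import lru_cache

# Toy vocabulary -- invented for illustration, not Inform's dictionary.
NOUNS = {"cup", "saucer", "bowl", "table"}
ADJS = {"red"}
JOINERS = {"and", "in", "on"}  # words that can glue two NPs together

def np_parses(tokens):
    """Enumerate every bracketing of `tokens` as a noun phrase under
    the toy grammar: NP -> noun | adj NP | NP joiner NP."""
    @lru_cache(maxsize=None)
    def parse(i, j):
        out = []
        if j - i == 1 and tokens[i] in NOUNS:
            out.append(tokens[i])
        if j - i >= 2 and tokens[i] in ADJS:
            out += [f"({tokens[i]} {p})" for p in parse(i + 1, j)]
        for k in range(i + 1, j - 1):
            if tokens[k] in JOINERS:
                for left in parse(i, k):
                    for right in parse(k + 1, j):
                        out.append(f"({left} {tokens[k]} {right})")
        return out
    return parse(0, len(tokens))

def command_parses(cmd):
    """All readings of 'put NP in/on NP', splitting at each preposition
    that could divide the direct object from the indirect object."""
    verb, *rest = cmd.split()
    rest = [w for w in rest if w != "the"]  # drop articles for simplicity
    readings = []
    for k, w in enumerate(rest):
        if w in {"in", "on"} and 0 < k < len(rest) - 1:
            for left in np_parses(rest[:k]):
                for right in np_parses(rest[k + 1:]):
                    readings.append(f"{verb} {left} {w} {right}")
    return readings
```

Run against “put the red cup and saucer in the bowl on the table”, this finds seven readings rather than six, because the last bracketing listed above is itself ambiguous (“cup and (saucer in bowl)” vs. “(cup and saucer) in bowl”). Without consulting a world model, all of them are equally plausible.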
Is there an up-to-date article anywhere that compares and contrasts the existing parsers used in IF systems today?
If attempting a new parser, I would think the goals are neither to reinvent the wheel nor to solve all the problems of natural language processing. Instead, the focus should be simply on making something that is at least a little bit better than what exists today.
I guess what I’m looking for is a table of all the different types of sentences different systems can process today.
That question is hard to answer in a table. I don’t know how to describe I7’s capabilities without a long lecture with a lot of examples.
Here’s a quote from Inform’s Standard Rules. This gives you an idea of what the Inform parser can handle:
This tells you that Inform’s parser understands a VERB followed by zero or more tokens, each of which is a PREPOSITION (perhaps with variations) or a NOUN-PHRASE. This is simple and basically true. (Leave aside numbers and text topics for the moment.)
What this doesn’t tell you is how noun phrases work. This is rather more complicated – most of I7’s advancement over I6 is in the area of noun-phrase support: conditional synonyms, property-based and relation-based synonyms, and so on.
The other thing it doesn’t tell you is how disambiguation is handled between possibilities.
The improvements mentioned in this thread are basically the intersection of all these domains. “Drop the plant in the pot in the garbage” is either “DROP [noun]” or “DROP [noun] IN [noun]”, depending on how you slice out the noun phrases.
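Here’s a sketch of that slicing problem (the token-matching function and the `[noun]` placeholder notation are invented for illustration, not Inform’s internals): given a tokenized command and a grammar line, enumerate every way the noun slots can carve up the words.

```python
def match_line(tokens, line):
    """Enumerate all ways `tokens` can instantiate a grammar line.
    A line is a list of literal words and "[noun]" placeholders;
    each placeholder matches any nonempty run of words."""
    results = []
    def go(ti, li, bound):
        if li == len(line):
            if ti == len(tokens):          # whole command consumed
                results.append(tuple(bound))
            return
        item = line[li]
        if item == "[noun]":
            # try every nonempty span as the noun phrase
            for end in range(ti + 1, len(tokens) + 1):
                go(end, li + 1, bound + [" ".join(tokens[ti:end])])
        elif ti < len(tokens) and tokens[ti] == item:
            go(ti + 1, li + 1, bound)
    go(0, 0, [])
    return results

cmd = "drop the plant in the pot in the garbage".split()
drop_1 = ["drop", "[noun]"]                   # DROP [noun]
drop_2 = ["drop", "[noun]", "in", "[noun]"]   # DROP [noun] IN [noun]
```

Here `match_line(cmd, drop_1)` yields one reading (everything after the verb is a single noun phrase), while `match_line(cmd, drop_2)` yields two, split at either “in” – three candidate readings in all, and only the world model can say which one is meant.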
Thanks, I think I’m finally getting it. It’s not enough to figure out what the list of noun phrases is. You also have to figure out how the noun phrases nest, in order to find the dividing line between the direct object and the indirect object. And that is only knowable in the context of the world model, taking into consideration whether the verb takes 0, 1, or 2 objects.
I tried some experiments in Inform 7, and the example I wrote didn’t seem to do a very good job of handling noun phrases or disambiguation. Most likely I’m making a beginner’s mistake, but the Inform parser seems less powerful than I expected.
The Empty Lot is a room.
A garbage heap is in the Empty Lot.
The box is on the garbage heap.
A blue tin can is in the box.
A green tin can is in the Empty Lot.
You can see a garbage heap (on which is a box (in which is a blue tin can)) and a green tin can here.
put tin can in box on garbage heap.
I only understood you as far as wanting to put the green tin can in the box.
It did not ask me to disambiguate between “putting the tin can, which is in the box, on the garbage heap” and “putting the tin can in the box, which is on the garbage heap.”
So, did I just write bad code? Or does the Inform parser just not understand commands with noun phrases modifying other noun phrases?
The Inform parser doesn’t understand that “tin can in box” means the tin can that is in the box by default. The only one of those that is understood by default is “take can from box” and a few synonyms–and that involves the dreaded “[things inside]” token which has a truly incredible amount of parser code devoted to it, for just that one case.
To make “in” and “on” understood by default you have to add Understand lines for them, as discussed here:
Understand "in [something related by reversed containment]" as a thing.
Understand "on [something related by reversed support]" as a thing.
So the parser isn’t that powerful by default, but you can build in some extra power using understanding by relations.
But then, as discussed further down that thread, you get the problems with “put can in box on heap” (or even “put the can in the box,” which really is unambiguous to ordinary readers). The parser grabs as much of the command as it can for the initial noun phrase, leaving nothing for the preposition or second noun. So both those commands get processed as, effectively, “put can,” and the parser asks what you want to put the can in.
The command isn’t parsed as well as we’d like, unfortunately! If you answer the question it’ll still say “I only understood you as far as wanting to put the green tin can in the box.”
I think what’s going on here is that in both cases the parser hits “tin can” and tries to figure out what it might mean before processing the rest of the command. In the original example, there’s a somewhat obscure algorithm that decides that we’re more likely to mean the green tin can than the blue tin can. (The parser usually prefers things directly in the room to things on supporters or in containers.) Then, having picked the green tin can, it processes as much of the rest of the command as it can–but the grammar line it’s working on is “put [something] in [something],” and that only fits “put tin can in box.” Since there are leftover words in the command, it gives you the “I only understood you as far as…” error.
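A much-simplified model of that behavior (the vocabulary set and function are invented for illustration, not Inform’s actual code): a single left-to-right pass where each noun slot greedily eats words from a noun vocabulary, with no backtracking, so whatever the grammar line can’t cover is left dangling at the end.

```python
# Toy noun vocabulary -- stands in for the game's dictionary.
NOUN_WORDS = {"tin", "can", "box", "garbage", "heap", "green", "blue"}

def parse_once(tokens, line):
    """One left-to-right pass, no backtracking: "[noun]" slots greedily
    consume vocabulary words. Returns (bindings, words_consumed)."""
    ti, bindings = 0, []
    for item in line:
        if item == "[noun]":
            start = ti
            while ti < len(tokens) and tokens[ti] in NOUN_WORDS:
                ti += 1
            if ti == start:
                return None, start  # slot matched nothing
            bindings.append(" ".join(tokens[start:ti]))
        elif ti < len(tokens) and tokens[ti] == item:
            ti += 1
        else:
            return None, ti  # expected literal word is missing
    return bindings, ti

cmd = "put tin can in box on garbage heap".split()
bindings, upto = parse_once(cmd, ["put", "[noun]", "in", "[noun]"])
# bindings == ["tin can", "box"]; upto == 5 of 8 words --
# three words left over, hence "I only understood you as far as
# wanting to put the tin can in the box."
```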
In the second example, it tries to process “tin can” again–but now it’s unable to choose a can for itself, since there are two cans directly in the room. So it asks a disambiguation question. (The disambiguation process is called directly from the internal routine that tries to figure out what thing matches a string of words at a given part of the grammar line.) But the disambiguation isn’t restricted to the things directly in the room–when it’s actively asking a disambiguation question, it lists every noun that the words in question could refer to. So it asks about all three cans.
But then, when it’s got its answer, it still has to process the rest of the command… and it still uses up the grammar line and has words left over in the command. So you get the “only understood you as far as…” error again. (In fact, what the disambiguation does is insert the answer you give into the original command and process it again, so the result of answering “green” to the disambiguation question is basically what you’d get if you typed “put green tin can in box on garbage heap”… or it might be “put tin can green in box on garbage heap,” I’m not sure.)
…if the terminology I’m using is unclear, btw, “grammar lines” are those Understand statements.