Parser disambiguation: TWO bags of words?

JoshGrams · February 7, 2024, 2:34pm

This is mostly a thought experiment: I’m not proposing that anyone rush out and try to implement this in Inform or whatever. But I’ve always thought that a “noun/adjective” approach had potential. And it’s been nagging at me since the middle of last month so I finally sat down and put together some test code.

It’s fiddly to get the details right, but what part of an IF parser isn’t fiddly? And you can get it to the point where it handles a bunch of the common issues without author intervention, which is cool. So I thought I’d discuss it a little, even if it’s not practical for Inform now (or ever).

So. What if an object had two bags of words that name it: one set for “main” nouns, and the other for auxiliary descriptors (adjectives-ish)? And when trying to match a user’s input against an object’s name, the nouns score higher for disambiguation purposes. The object still has to match all of the player’s words (barring probably articles and prepositions) to be considered at all.

In English we can usually take the last word of a name as the core noun, and then the rest go in the “adjectives” bag. If there’s a prepositional phrase (do we only care about “of” or are there other common ones? maybe “with”?) we can take the first word before the preposition as the core noun:

“shovel” in “blue plastic snow shovel”
“portrait” in “big gaudy portrait of lord dimwit flathead”
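
A minimal sketch of that rule, assuming only “of” and “with” count as prepositions. The names `PREPOSITIONS` and `findHead` are made up for illustration and don’t come from any IF library:

```javascript
// Hypothetical helper names; the preposition list is an assumption.
const PREPOSITIONS = new Set(['of', 'with'])

// Return { head, modifiers } for a single noun phrase.
const findHead = (phrase) => {
  const words = phrase.toLowerCase().trim().split(/\s+/)
  // The head is the word just before the first preposition, or else the last word.
  const prepAt = words.findIndex(w => PREPOSITIONS.has(w))
  const headAt = prepAt > 0 ? prepAt - 1 : words.length - 1
  // Everything else (minus the prepositions themselves) is a modifier.
  const modifiers = words.filter((w, i) => i !== headAt && !PREPOSITIONS.has(w))
  return { head: words[headAt], modifiers }
}

console.log(findHead('blue plastic snow shovel').head)                    // shovel
console.log(findHead('big gaudy portrait of lord dimwit flathead').head)  // portrait
```

Only the first preposition matters here, so everything after “of” ends up as a modifier.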

In my code I also allow you to mark arbitrary words as core (+elephant+) or auxiliary (-purple-).

When a user types a name, we check the input words against the object’s name, scoring a large amount when one matches a noun (10? 100? any number that’s more than the largest number of adjectives that a player will ever type) and scoring 1 when it matches an adjective.

This lets “snow” prefer an object named “snow” or “heavy wet snow” to one named “snow shovel” because in the former case it’s matching a noun (more important) and the latter only an adjective.

To give us a tiny bit of word-order handling (“pot plant” vs. “plant pot”) we can also recognize the core noun of the input and double the score when it matches a core noun of the object name (only score it as an adjective otherwise).

“plant pot” matches itself with a 2-noun, 1-adjective score (we score the noun double because it’s the core noun in both the input and the object name).

But “plant pot” matches “pot plant” with a 1-noun, 1-adjective score: it has all the right words, but the core noun in the input doesn’t match up so we don’t get that double noun score.

And if we limit the object’s core nouns so they can match multiple times but only count score once, then we can handle adjective/noun ambiguity like “light light” too.

This is the tricky case: if we didn’t track which object nouns we’d already scored, then if you typed the name “light light” it’d match both words against the noun for the “heavy light” object, making it match just as well as the “light light” one (or maybe even better if one counts as an adjective in the latter case).

But if we only score the object’s nouns once, then “heavy light” scores 1-noun (the other “light” matches but doesn’t score), while “light light” scores 1-noun, 1-adjective and is preferred.

If any of the input words aren’t matched at all, that object doesn’t match.

It handles synonyms in pretty much the same way as the single bag of words model: if you’re playing at the beach and you have a “plastic shovel” that’s also a “small spade” then the user can refer to it as a “plastic spade” just fine.

And of course, this fails to do the fancy stuff in some cases if (as in Inform, I gather?) it doesn’t have the input to score against by the time it gets to disambiguation. But if you have a parser that keeps that information around, this seems like a neat little upgrade? Maybe?

JavaScript code

Compute a score with:

inputMatchesObject(parseObjectName("INPUT"), parseObjectName("OBJECT"))

Some test output:

ok 15 "snow" prefers "snow" (20) to "snow shovel" (1)
ok 16 "plant pot" matches "pot plant"
ok 17 "pot" prefers "plant pot" (20) to "pot plant" (1)
ok 18 "plant" prefers "pot plant" (20) to "plant pot" (1)
ok 19 "plant pot" prefers "plant pot" (21) to "pot plant" (11)
ok 20 "light light" matches "light light"
ok 21 "heavy light" doesn't match "light light"
ok 22 "light light" prefers "light light" (21) to "heavy light" (20)

And the actual code:

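// Words that end the head noun phrase: the head is the word just before the first one.
const preposition = new Set(['of', 'with'])

// Split a name into { core, aux } word sets. Passing an existing `out`
// merges another phrase (e.g. a synonym) into the same sets.
const parseObjectName = (str, out) => {
	let core, prev
	const addWord = (word, implicitCore) => {
		if(word == null) return
		let isCore = implicitCore
		// Explicit markers override position: +word+ is core, -word- is auxiliary.
		const m = /^(?:\+.*\+|-.*-)$/.exec(word)
		if(m) {
			isCore = m[0].charAt(0) === '+'
			word = word.slice(1, -1)
		}
		if(isCore) {
			if(implicitCore) core = word
			out.core.add(word)
		} else out.aux.add(word)
	}
	out ??= { core: new Set(), aux: new Set() }
	const words = str.toLowerCase().trim().split(/\s+/g)
	// Words are added one step late so we know what follows each of them.
	for(const word of words) {
		if(preposition.has(word)) {
			// The word before the first preposition is the implicit head.
			addWord(prev, core == null)
			prev = null
		} else {
			addWord(prev)
			prev = word
		}
	}
	// With no preposition, the last word is the implicit head.
	addWord(prev, core == null)
	return out
}

// Score how well a parsed input phrase matches a parsed object name.
// Returns 0 when any input word is left unmatched.
const inputMatchesObject = (input,object) => {
	let match = true, score = 0, core = new Set()
	// iCore: points for hitting an object core noun; iAux: points for an auxiliary word.
	const check = (set, iCore, iAux) => set.forEach(word => {
		const seenCore = core.has(word)
		// Each of the object's core nouns scores only once.
		if(object.core.has(word) && !seenCore) {
			core.add(word);  score += iCore
		} else if(object.aux.has(word)) score += iAux
		// A repeat of an already-scored core noun matches but adds nothing.
		else if(!seenCore) match = false
	})
	check(input.core, 20, 1)  // the input's head noun scores double on a core noun
	check(input.aux, 10, 1)
	return match ? score : 0
}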
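
To turn scores into a disambiguation decision, something like the following could sit on top. This is only a sketch: `pickBest` is a hypothetical helper, and `scoreOf` stands in for a scorer such as `inputMatchesObject` applied to each candidate in scope. More than one winner means the game still has to ask.

```javascript
// Hypothetical helper: `scoreOf` is any function returning a numeric score,
// where 0 means "did not match at all".
const pickBest = (candidates, scoreOf) => {
  let best = 0, winners = []
  for (const candidate of candidates) {
    const score = scoreOf(candidate)
    if (score <= 0) continue                      // skip non-matches
    if (score > best) { best = score; winners = [candidate] }
    else if (score === best) winners.push(candidate)
  }
  return winners
}

// Stub scores for the input "plant pot": the first two come from the test
// output above, the third is an object that doesn't match at all.
const scores = { 'plant pot': 21, 'pot plant': 11, 'snow shovel': 0 }
console.log(pickBest(Object.keys(scores), name => scores[name]))  // [ 'plant pot' ]
```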

bkirwi · February 7, 2024, 3:16pm

Haven’t checked your code out yet, but this sounds quite close to Dialog’s concept of noun phrase “heads”!

zarf · February 7, 2024, 4:06pm

I think this bit needs a closer look. What is the “core noun” of the input? A word can be “core” for one object and not another. An object can have several core words, so the input could have many of them.

JoshGrams · February 7, 2024, 6:04pm

@bkirwi ah, nice. Figures that Dialog would have a similar pattern. I like “heads” as a name, too. I’ll have to look up the code there.

Ah, sorry, one of those bits was intended as a definition. The last word of a noun phrase (or the last word before the first prepositional phrase). So it’s specific to a noun phrase, either an input argument or an object name.

An input argument can have only one, but objects can have multiple heads (if you declare them explicitly, or if you give the object multiple separate noun phrases as synonyms).

I’d like to see examples where an input has more than one head noun, especially examples where it would matter for disambiguation. I’m not trying to magically capture every nuance of grammar: just enough to maybe match players’ intuition a little better with less work from authors, without being so complex that it has tons of weird corner cases.

Draconis · February 7, 2024, 6:12pm

Figuring out the “core” of a phrase (what linguists might call the head) is actually a remarkably difficult problem to solve in the general case! It can be the last word (heavy book) or the first word (book of poems) or in the middle (heavy book of poems), and sometimes there’s no marker like “of” to tell you which (surgeon general).

Then you have “coordinate” structures, which arguably have multiple heads (bat and ball), and “exocentric” structures, where none of the pieces really acts as the core (a road hog is neither a road nor a hog, it’s a person who hogs the road)…it’s a mess!

jkj_yuio · February 7, 2024, 6:27pm

Prepositions

Firstly, prepositions break up noun phrases; they are not part of the phrase.

so;

(big gaudy portrait) of (lord dimwit flathead)

is two noun phrases. and,

(frame) of (big gaudy portrait) of (lord dimwit flathead)

is three.

This is the same as for normal usage, such as

put (the key) into (the bucket)

Some prepositions are for noun phrase resolution

put (the (key) in (the bag)) onto (the table).

The preposition “in” here is part of the resolution of which key, when there might be another key around.

Of course this leads to problems such as

put the key in the bag on the table.

which might mean

put (the (key) in (the bag)) on (the table).

or

put (the key) in ((the bag) on (the table)).

Depending on whether there’s a key in a bag or a bag on the table, for example.
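
One way to surface both readings is to list every place the command could be split at a preposition and then try to resolve each split against the game world. A rough sketch, where the preposition list and the name `splitsAtPrepositions` are assumptions made for illustration:

```javascript
// Hypothetical sketch: the preposition list and function name are assumptions.
const COMMAND_PREPOSITIONS = new Set(['in', 'into', 'on', 'onto'])

// List every way to split "<object> <preposition> <target>" at a preposition.
const splitsAtPrepositions = (words) => {
  const splits = []
  words.forEach((word, i) => {
    // A preposition at either end cannot separate two noun phrases.
    if (!COMMAND_PREPOSITIONS.has(word) || i === 0 || i === words.length - 1) return
    splits.push({
      object: words.slice(0, i).join(' '),
      preposition: word,
      target: words.slice(i + 1).join(' '),
    })
  })
  return splits
}

for (const s of splitsAtPrepositions('the key in the bag on the table'.split(' '))) {
  console.log(`(${s.object}) ${s.preposition} (${s.target})`)
}
// (the key) in (the bag on the table)
// (the key in the bag) on (the table)
```

A parser could then keep only the splits where both halves resolve to something in scope.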

TWO bags of words

It is correct to have a bag of words for the adjectives. But you do not need a “bag” for the “core noun” (as you call it). There is one word for the noun and it precedes the preposition or is at the end.

You do not need a scoring for the main noun word. It is either a hit or not.

If you perversely want to have pure adjectives match noun phrases (like some people here do), you might want to treat this as a second pass when otherwise no such phrase is matched.

For example;

get orange

could match “orange juice” only when there are no orange things around.

So if we also had a “fresh orange” here, then get orange would never match the juice.
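
A minimal sketch of that two-pass idea for a one-word input. The object shape (`noun` plus a set of `adjectives`) and the name `twoPassMatch` are assumptions for illustration only:

```javascript
// Hypothetical object shape: { name, noun, adjectives: Set }.
const twoPassMatch = (word, objects) => {
  // First pass: objects whose noun is the typed word.
  const byNoun = objects.filter(o => o.noun === word)
  if (byNoun.length > 0) return byNoun
  // Second pass: only if no noun matched, fall back to adjectives.
  return objects.filter(o => o.adjectives.has(word))
}

const juice = { name: 'orange juice', noun: 'juice', adjectives: new Set(['orange']) }
const fruit = { name: 'fresh orange', noun: 'orange', adjectives: new Set(['fresh']) }

console.log(twoPassMatch('orange', [juice, fruit]).map(o => o.name))  // [ 'fresh orange' ]
console.log(twoPassMatch('orange', [juice]).map(o => o.name))         // [ 'orange juice' ]
```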

johnnywz00 · February 7, 2024, 6:35pm

In TADS3, nouns and adjectives are differentiated by the library, with nouns automatically taking precedence if there is a clash (fortress vs. fortress gate, etc.)

JoshGrams · February 7, 2024, 6:37pm

Aha! That gives me some ideas and search terms to try making bad corner case examples, thanks!

…off the top of my head I can’t make those break worse than with a simpler model, and they’re susceptible to the usual author work-arounds: “general” prefers “army general” to “+surgeon+ -general-” for instance. But that gives me places to keep looking.

Thanks! I’ll have to look at that code.

Yeah, poor word choice on my part. What I mean here is something like “object name”: what’s the linguistics term for the full thing?

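In the case of parser IF you do, because objects can have multiple synonyms, and you want to be able to mix-and-match adjectives between different head nouns. As I said in my initial post:

“if you’re playing at the beach and you have a ‘plastic shovel’ that’s also a ‘small spade’ then the user can refer to it as a ‘plastic spade’ just fine.”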

Both “shovel” and “spade” are head nouns for this object, even though you will generally only use one of them at any given time (unless you’re typing “rusty rusty knife rusty knife” for fun, or something like that).

My code already handles that in a single pass, thanks.

Draconis · February 7, 2024, 7:10pm

“Noun phrase” is correct. Linguistically speaking, “big gaudy portrait of lord dimwit flathead” is a noun phrase, since you can use it where a noun is expected: for an IF example, “take (big gaudy portrait of lord dimwit flathead)”.

Not true in general, unfortunately. “Surgeon general” has it at the beginning. In English, this is mostly restricted to borrowings from other languages, but we have plenty of those! If I’m writing a game about a biologist, it’s not correct to refer to a Canis latrans as a LATRANS, but it is correct to call it a CANIS. (Or C latrans, if need be.)

Even in native words, though, there are multi-word compounds that resist analysis this way. Referring to orange juice as ORANGE is certainly no more wrong than referring to a hot dog (a type of sausage) as a DOG, or a nest egg (money that’s been saved) as an EGG. A nest egg isn’t a type of egg, or a type of nest; you need both words to get the meaning. A game that says “which do you mean, the dog or the hot dog” is no better than one that says “which do you mean, the orange or the orange juice”!

JoshGrams · February 7, 2024, 8:27pm

Of course, in an IF context…

> PASS HOG
Which do you mean:
  1. The obstreperous road hog
  2. The Harley-Davidson

Zed · February 7, 2024, 8:27pm

All this sounds pretty much ideal to me.

Let’s say that in the kitchen we have:

an open can of cat food with a plastic snap-on lid lying beside it
a carrot
a can of soup
an open can of beans
a plastic food take-out container with leftover risotto in it and a clear plastic lid on it
another clear plastic lid matching the take-out container lying around loose
a plastic cutting board
a torn piece of plastic cling wrap
a roll of plastic cling wrap
a 10cm per side cut piece of Lexan

I figure “get food” should require asking for disambiguation among:
the cat food can, the carrot, the can of soup, the can of beans, and the container of risotto

For “get plastic”, the Lexan seems to me to be obviously the strongest candidate, being the only thing for which “plastic”, singly, as a noun would be a natural usage for a native English speaker. But probably not strong enough that I wouldn’t want the game to ask for disambiguation.

How about “get food cover” ? That would be a weird usage for the cat food can cover: it requires considering both “cat food” and “can” as modifying “cover” instead of what I, as a native speaker of American English, think is the more obvious parsing of it, that “cat food can” is a noun phrase whose head is “can”. Yet, still, I wouldn’t consider “food cover” wrong.

And if the cat food can cover weren’t here, I think it’d be reasonable for the game to assume the player meant the unattached take-out lid and to not ask about disambiguation with the risotto lid. (This suggests a criterion for consideration: if there are two things and completing the action for one of them but not the other would require an implicit action, removing the lid, favor the one that doesn’t involve the action.)
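
That criterion could be sketched as a tie-breaker applied after name matching. The candidate shape (an `implicitActions` list per candidate) and the function name are hypothetical, not taken from any existing library:

```javascript
// Hypothetical candidate shape: { name, implicitActions: [...] }.
// Keep only the candidates that need the fewest implicit actions.
const preferFewestImplicitActions = (candidates) => {
  if (candidates.length === 0) return []
  const fewest = Math.min(...candidates.map(c => c.implicitActions.length))
  return candidates.filter(c => c.implicitActions.length === fewest)
}

const looseLid = { name: 'loose plastic lid', implicitActions: [] }
const risottoLid = {
  name: 'lid on the take-out container',
  implicitActions: ['remove lid from container'],
}

console.log(preferFewestImplicitActions([risottoLid, looseLid]).map(c => c.name))
// [ 'loose plastic lid' ]
```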

I don’t really have a conclusion here; I’m just rambling about: this stuff is hard.

Draconis · February 7, 2024, 8:42pm

Oh, and here’s an especially pathological case.

You can see a cat and The Cat in the Hat here.

Should you be able to use anything less than the full title to refer to the book?

Zed · February 7, 2024, 8:54pm

I think “cat in the hat” and “cat in hat” should be close enough to count as the book. Even “cat” or “hat” alone if there are no other candidates for those words in scope, but, in the presence of an actual cat or an actual hat, “cat” or “hat” should be taken to mean the actual thing without need of asking for disambiguation.

So let’s make it more pathological! There’s not just a cat, but a cat curled up asleep in a ushanka! And, its name is Book!

This discussion suggests to me a useful development aid. For each word the game has defined as possibly referring to something the player might mention in a command, if that word could match multiple things, output the word and those things. That is, offer the author some guidance to where they might want to tune disambiguation.
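
A rough sketch of such a report, assuming each object simply carries a list of the words that can refer to it (the `{ name, words }` shape and the name `ambiguousWords` are made up for illustration):

```javascript
// Hypothetical input shape: each object is { name, words: [...] }.
const ambiguousWords = (objects) => {
  const byWord = new Map()
  for (const object of objects) {
    // De-duplicate so an object listing a word twice is only counted once.
    for (const word of new Set(object.words)) {
      if (!byWord.has(word)) byWord.set(word, [])
      byWord.get(word).push(object.name)
    }
  }
  // Keep only the words that could refer to more than one object.
  return new Map([...byWord].filter(([, names]) => names.length > 1))
}

const report = ambiguousWords([
  { name: 'Book the cat', words: ['book', 'cat', 'kitten'] },
  { name: 'The Cat in the Hat', words: ['cat', 'hat', 'book'] },
  { name: 'ushanka', words: ['ushanka', 'hat'] },
])
for (const [word, names] of report) console.log(`${word}: ${names.join(', ')}`)
// book: Book the cat, The Cat in the Hat
// cat: Book the cat, The Cat in the Hat
// hat: The Cat in the Hat, ushanka
```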

johnnywz00 · February 7, 2024, 11:02pm

In the adv3 library, you can declare an object’s vocabulary like this:

fortress: Building 'old stone fortress/castle/keep' ;

fortressGate: Door 'wooden decaying fortress gate/gateway/doors' ;

fortressKey: Thing 'fortress castle key/keyring' ;

Words separated by spaces near the beginning of the string are loaded into the dictionary as adjectives; words separated by slashes at the end of the string are loaded as nouns.
If you were in scope of the fortress itself, X FORTRESS would not disambiguate with the gate or the key, because they are only “fortress” as adjective. If you left the fortress to a different location and still had the key, X FORTRESS would register as examining the fortress key.
If we could imagine that the gate and key were in the same scope without the fortress itself, X FORTRESS would ask for disambig between the gate and the key, since they are both “fortress” by adjective (unless one of them had a vocabLikelihood adjustment.)
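
To make the string format concrete, here is a JavaScript illustration of just the two rules described above. It is not how the TADS library itself is implemented, the real vocabulary strings accept more than this, and `splitVocab` is a made-up name:

```javascript
// Illustrative only: handles just the two rules described above
// (space-separated adjectives, then slash-separated nouns).
const splitVocab = (vocab) => {
  const tokens = vocab.trim().split(/\s+/)
  // The final space-separated token holds the slash-separated nouns.
  const nouns = tokens.pop().split('/')
  return { adjectives: tokens, nouns }
}

const gate = splitVocab('wooden decaying fortress gate/gateway/doors')
console.log(gate.adjectives.join(' '))  // wooden decaying fortress
console.log(gate.nouns.join(' '))       // gate gateway doors
```

So “fortress” lands among the gate’s adjectives, while for 'old stone fortress/castle/keep' it lands among the nouns.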

pieartsy · February 8, 2024, 8:15pm

Yeah I read this and was like “TADS has this on lock”.

pieartsy · February 8, 2024, 8:30pm

in adv3lite

cat: Actor 'cat;cute fuzzy little furry orange small; kitty kitten animal'
    "A cute little kitten with orange fur. "
;

catbook: Thing 'cat in the hat;kids kid kiddie short;book'
   "A short book for kids."
;

and if you want to be extra sure you can put a disambigName so that if the parser is confused, it can ask more clearly.

cat: Actor 'cat;cute fuzzy little furry orange small; kitty kitten animal'
    "A cute little kitten with orange fur. "
    disambigName = "cute kitten"
;


catbook: Thing 'cat in the hat;kids kid kiddie child short;book'
   "A short book for kids."
   disambigName = "cat book"
;

I haven’t tested this to see how robust it is.

edit:

I have no idea how well this would work either but here’s how I’d do it as a first pass (and then bugtest)

ushanka: Thing 'ushanka;russian furry warm fluffy winter;hat cap'
   "An ushanka, one of those warm Russian winter hats."
   contType = In // determines the type of container it is
;

// the + means that Book is contained in the ushanka
+ cat: Actor 'Book;cute fuzzy little furry orange small; kitty kitten animal cat'
    "A cute little cat named Book, curled up in an ushanka. "
    disambigName = "Book the cat"
;

catbook: Thing 'Cat in the Hat;kids kid kiddie child cat hat short;book story'
   "A short book for kids."
   disambigName = "kids book"
;