Please beta test this new parser -- unconstrained input using an LLM/AI (fun!)

Hello fellow IF fans,

We are trying to draft LLMs into the IF world, and we thought a useful step might be using the tech to allow for uncontrolled and wide-ranging user input. When I tried to get my kids to enjoy the games I loved in my youth, I found they quickly became frustrated learning and staying within the confines of the narrow syntax some games require. So we thought we could use the latest LLM/AI engines to allow for both unconstrained input and more varied error messaging. Our theory is that it will accelerate the learning curve to get players more deeply invested in each game – and lower abandonment due to initial frustration.

That’s the theory, at least... please try it here and let us know whether this enhances gameplay – or detracts from it:

We wrapped Adam Cadre’s great game “9:05” for this first test: one, because it’s simply a terrific game; and two, because it was short enough that we could cover all the exceptions/prompts needed for a full playthrough, start to finish.

We have reached out to Mr. Cadre but were not able to contact him, so we hope it is OK to test on his story. Mr. Cadre – if you read this and have any objections, we will pull it down immediately. Or conversely, please contact us; we would love your thoughts on how LLMs might further be drafted into the service of IF.


It doesn’t seem to be working too well even with commands that the underlying parser already accepts. On my first turn I tried X PHONE and got an error message because it didn’t recognize ‘X’. I then tried three times to EXAMINE PHONE. Twice it translated this input into EXAMINE NOTE and once into EXAMINE WALLET.


My experience was similar to Daniel’s. Commands often got translated into something incongruous or changed the direct object to something completely different. In some cases, the translation omitted objects entirely, so “examine clothing” became “examine” or “drop watch” became “drop”. It was impossible to examine the desk in the cubicle, even though the desk has a working description in the original game. When I reached a “continue” prompt, the game stopped responding altogether.


A general thought on the idea, irrespective of how smoothly the implementation works: I don’t think an LLM as a frontend to a traditional Inform-style parser is going to provide value except possibly as a tutorial to help the newest of newbies learn how to phrase text adventure commands, at which point they’ll want to turn off the LLM and just issue those commands directly. Traditional parser syntax isn’t exactly intuitive, but it isn’t difficult either, and most people pick it up in a handful of minutes once they’ve been taught.

Parsers have never been the limiter of how rich a text adventure’s world can be. Decades before there were LLMs, classical parsing algorithms had already gotten pretty good at diagramming English sentences. They weren’t perfect, because human language is messy, but they were good enough to get through most newspaper articles without error. Parsing a newspaper is a much more challenging problem than anything a text adventure needs to be concerned with, because most English sentences have no reasonable interpretation as text adventure commands. For example, how on Earth should a text adventure be expected to respond to a player who types, “Parsing a newspaper is a much more challenging problem than anything a text adventure needs to be concerned with, because most English sentences have no reasonable interpretation as text adventure commands”? Implementing the semantics of more complex sentences and fitting them into the context of the game world has always been the difficult part of the endeavor, and an LLM-as-frontend does nothing to move that forward.

Finally, if your goal is to assess whether the LLM can be helpful to newbies, then you should probably ask literally anywhere but here for beta testers, where you’ll mostly get answers from a mix of old farts like me who started out playing ADVENT at age 7 by dialing into their dad’s BBS account, and older farts who played it on a teletype terminal at their university lab. It’s quite difficult for me to put myself back into a newbie mindset to interact with an IF parser, and even if I try my best you shouldn’t trust the feedback of me or people like me to be representative of any actual newbie.


Thank you for the feedback – I very much appreciate it. I added the noun “phone” to the list of known nouns – I had forgotten that one – so the LLM was trying to guess around it. It should be fixed now. Thanks again.

Thank you so much for the feedback. I very much appreciate it. This is the frustrating part of working with LLMs: I cannot seem to recreate your errors. When I type in “drop watch”, it puts in “drop watch”. Though when I just put in “drop”, it incorrectly translates it as “down”, so I need to work on that. I get the correct phrase for “examine clothing” as well. If you would, could you try this again and see if the errors repeat?

Many thanks for taking a look at this.

Daniel – thank you for your thoughtful and thorough response – this is an excellent critique. Very much appreciated.

And I think you are right – though I would argue that perhaps parsing is not the right frame here. We came out of the machine translation (MT) world, so we approached this issue as a translation problem; that’s why we thought LLMs might work, as they (unlike neural models) could be drafted to do English-to-English translation.

For example, someone could write “roll over and go back to sleep”, “yell at the phone to shut up”, “pull the covers over my head and ignore the phone”, or “do nothing”, and the LLM (in theory) should translate those to “wait” – a command the game understands. So the thought (and I get your point, we might be off target here) is that we could eliminate the frustrating error messaging that takes players out of the game and into “trying to figure out the constrained syntax” mode.

As a fellow old fart, I found that when I came back to IF after a decade-long hiatus, I was frustrated with relearning the syntax and/or getting 10 error messages in a row (mostly due to my misspellings, which are getting worse as my phone’s autocorrect gets better).

Thank you for the insights –

Best,

michael

I think autocomplete and autocorrect dictionaries informed by the game world would be great additions to any parser, although the details of what should be in the dictionary are a bit fraught because you need to prevent it from accidentally spoiling puzzles. I think there have been some forum discussions of this recently. My phone’s built-in autocorrect is actually really frustrating when I try to play parser games on my phone, because it often autocorrects things inappropriately and only after I’ve already hit enter (e.g., I recently went north three times in a row, and now my phone assumes a later NE must have been a typo and should have been another N). But something specialized could certainly do better.


A quick and dirty way to do this (for parser) might be to have a standardized command like $words that prints out a list of dictionary words the author selected as being non-spoilery, with the interpreter being smart enough to utilize this output to feed its own autocorrect database for that game. It would require work by game authors, terp authors, and the player typing the command – or an interpreter option that triggers the command automatically.

I like $words because it looks like Swords. :slight_smile:
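To make the idea concrete, here is a minimal sketch of the interpreter side, with all names hypothetical: assuming the proposed $words command emits a whitespace-separated word list, the terp could use simple edit-distance matching (Python’s standard-library difflib) to correct typos before the input ever reaches the game’s parser.

```python
import difflib

def load_dictionary(words_output: str) -> set:
    """Parse the (hypothetical) $words command output into a vocabulary."""
    return set(words_output.lower().split())

def autocorrect(command: str, vocab: set) -> str:
    """Replace each word not in the game's vocabulary with its closest
    dictionary match, if one is similar enough; otherwise leave it alone."""
    corrected = []
    for word in command.lower().split():
        if word in vocab:
            corrected.append(word)
        else:
            matches = difflib.get_close_matches(word, vocab, n=1, cutoff=0.8)
            corrected.append(matches[0] if matches else word)
    return " ".join(corrected)

vocab = load_dictionary("take drop examine lamp mailbox leaflet north south")
print(autocorrect("examien the lamp", vocab))  # "examine the lamp"
```

The cutoff matters: set too low, it would “correct” words the author never listed (and possibly spoil puzzles); set too high, it misses real typos. A per-game tuning knob seems unavoidable.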

This time I was able to examine the desk, but I had trouble leaving the cubicle. “Go out,” “exit,” and “leave the cubicle” all failed. Then I tried “go out” a second time and it worked.

“Yell at the phone to stop ringing” got translated into “No match found.”

I’ve long been curious about how LLMs might be used to enhance parsers. Unfortunately, in this example, the LLM layer seems to create more problems than it solves.

First and foremost, I’d expect the same command in the same circumstance to produce the same result. (I don’t know the details of your implementation, but in general terms, setting the temperature at or near zero might help here.) Second, the LLM layer seems like it’s prone to mistakes simply because it doesn’t have enough insight into the world model and its current state. Third, the error messages actually wind up being less helpful, because the parser is responding to the translation instead of the actual user input.

You might be able to get better results by feeding the LLM more context and giving it stricter guardrails. At some point, however, it might become more like a reimplementation of the existing parser instead of an enhancement to it.
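As one way of illustrating that “more context” suggestion: a sketch of a prompt builder that serializes the current room, visible objects, and allowed verbs into the LLM request. Every field name and the prompt wording here are my own assumptions, not details of the project under test; when calling the model, most APIs also let you set temperature to 0 for more deterministic output.

```python
def build_translation_prompt(location: str, visible_objects: list,
                             known_verbs: list, player_input: str) -> str:
    """Assemble an LLM prompt grounded in the current world state.
    The fields and wording are illustrative assumptions, not the
    original project's actual implementation."""
    return (
        "Translate the player's input into a single parser command.\n"
        f"Current room: {location}\n"
        f"Visible objects: {', '.join(visible_objects)}\n"
        f"Allowed verbs: {', '.join(known_verbs)}\n"
        "Only refer to objects in the list above. If the input cannot be "
        "mapped to an allowed verb and a visible object, answer exactly: "
        "CANNOT_TRANSLATE\n"
        f"Player input: {player_input}\n"
    )

prompt = build_translation_prompt(
    "cubicle", ["desk", "phone", "watch"],
    ["examine", "take", "drop", "go"],
    "have a look at that desk thing",
)
```

Constraining the model to a closed object list, plus an explicit failure token, is what gives the guardrails: “examine clothing” can no longer silently become “examine wallet” if “wallet” isn’t in view.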


I actually explored a very similar idea a few weeks ago and managed to make a rough Python prototype in an hour.

I do think there are benefits to such a vision, beyond the applications that people mentioned here: for instance, correcting typos, helping with synonyms or near misses, helping younger players who don’t want to constantly be told they didn’t word it quite right, and helping clean up voice input.

On the other hand, leveraging LLMs for this has its own issues (trained on stolen content, large environmental footprint, large/unscalable cost). I believe it would be feasible to develop a smaller, self-contained model trained on labelled content, which could then run locally on any GPU; but this would take longer than an hour to prototype. (Great idea for a project to submit to IFTF microgrants next year!)

I had envisioned this as a tool that authors could choose to use (e.g. a Vorple library), and in that vein was thinking it should only be called when a command fails to parse (via the right parser hook). If you’re trying to do this for any game, i.e. as a terp feature, it’s harder, because parser failures aren’t super easy to detect in text.

I think efforts to give it more context might be a lot of work that won’t lead to widespread adoption by authors. Interface-wise, I think more transparency is better: “Parsing failed, let me try to reformulate it…” and then “Reformulation failed, try another way!” would be acceptable for a lot of players. And with that, it’s not expected to reach a 100% success rate; it’s just a helper that isn’t infallible.

Leaving my Gemini prompt here in case anyone is interested in playing with it:

I want you to simplify English language input for a text-based adventure game, by converting phrases like 'I want to grab the shovel' into concise commands like 'take shovel'. Please provide the simplified command and consider context, grammar, and punctuation. Correct spelling mistakes. Omit articles and adverbs but leave prepositions. Prioritize verbs from the following list: 'ask', 'attack', 'blow', 'burn', 'buy', 'climb', 'close', 'consult', 'cut', 'dig', 'disrobe', 'drink', 'drop', 'throw', 'eat', 'empty', 'enter', 'examine', 'exit', 'fill', 'take', 'remove', 'give', 'go', 'insert', 'jump', 'kiss', 'listen', 'lock', 'look under', 'search', 'open', 'unlock', 'pull', 'push', 'wear', 'rub', 'search', 'show', 'smell', 'squeeze', 'swing', 'switch on', 'switch off', 'remove', 'taste', 'tie', 'touch', 'turn', 'wave', 'wear'. If there are several actions, separate them with a period, for example 'unlock the door then open it' becomes 'unlock door. open door'. Expand the pronoun 'it' to the last noun you encountered. The input I want you to reformulate is: '"+str(input)+"'. What should be the simplified command? Answer with the simplified command and just the simplified command.

it should only be called when a command fails to parse (via the right parser hook). If you’re trying to do this for any game, i.e. as a terp feature, it’s harder, because parser failures aren’t super easy to detect in text.

I’d think a necessary step for something like this to be useful is to only step in once the traditional parser has tried but failed to understand the input.

EDIT: I’m envisioning something like Smarter Parser by Aaron Reed, which intercepts the printing of the standard parser error messages, and re-analyses the player input, mainly looking for common beginner mistakes. That is really all the parser hook you need.
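That intercept-on-failure flow can be sketched in a few lines. This is a toy illustration, not anyone’s actual implementation: the parser and the “LLM” are stub lookup tables standing in for the real game engine and model call, but the control flow is the point – the LLM is consulted only after a parse failure, and the player is shown the reformulation.

```python
from typing import Optional

def try_parse(command: str) -> Optional[str]:
    """Stand-in for the game's real parser: returns output text, or None
    on a parse failure. A real hook (as in Smarter Parser) would intercept
    the library's error-message printing instead."""
    known = {"wait": "Time passes.", "look": "You are in a cubicle."}
    return known.get(command.strip().lower())

def llm_reformulate(command: str) -> str:
    """Stand-in for the LLM call; a tiny lookup table plays its role here."""
    return {"roll over and go back to sleep": "wait"}.get(command.lower(), "")

def handle_input(command: str) -> str:
    # First, do no harm: if the parser understands the input, use it untouched.
    result = try_parse(command)
    if result is not None:
        return result
    # Otherwise ask the LLM for a reformulation, and be transparent about it.
    guess = llm_reformulate(command)
    if guess and (result := try_parse(guess)) is not None:
        return f'[Trying "{guess}"] {result}'
    return "Reformulation failed, try another way!"

print(handle_input("roll over and go back to sleep"))
```

Because the LLM never sees commands the parser already accepts, the determinism and mistranslation problems reported upthread simply can’t affect well-formed input.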

On a tangent, I wonder if an LLM or similar machine learning might be used as a programming tool, where the human programmer first creates a command the standard way, and the “AI” then automatically generates the game code for a lot of synonyms of this command, which the human programmer then can edit and curate.

I’ve contemplated writing a linting tool that uses WordNet to suggest missing synonyms for authors to add to their game, and I do think this would be useful. An earlier version of this idea, which I rejected as soon as I dug into it, was that authors could associate actions with WordNet word-senses and then a tool would automatically insert every relevant synonym. What caused me to reject this was that WordNet includes many word-senses that are highly idiomatic and differ strongly from the word’s ordinary meaning. For example, one sense of “touch” is as a synonym for “consume”: “She didn’t touch her food all night”. I would be very annoyed as a player if, when I typed “touch widget”, the game decided I must have meant “eat widget”.
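A sketch of why the linting framing works where the automatic one doesn’t. To keep this self-contained, a tiny hand-rolled sense table stands in for WordNet (real WordNet has far more senses per word, which is exactly the problem described); the data and function names are mine, purely for illustration.

```python
# Miniature stand-in for WordNet: each verb maps to (sense label, synonyms).
SENSES = {
    "touch": [
        ("make physical contact", ["feel", "finger"]),
        ("consume (idiomatic)", ["eat", "partake"]),
    ],
}

def suggest_synonyms(verb: str, exclude_labels: set) -> list:
    """Suggest synonyms for the author to review, skipping flagged senses.
    A linting tool proposes; the human author curates."""
    suggestions = []
    for label, synonyms in SENSES.get(verb, []):
        if label not in exclude_labels:
            suggestions.extend(synonyms)
    return suggestions

# Naively taking every sense would equate "touch widget" with "eat widget":
print(suggest_synonyms("touch", exclude_labels=set()))
# Filtering the idiomatic sense leaves only the safe synonyms:
print(suggest_synonyms("touch", exclude_labels={"consume (idiomatic)"}))
```

The automatic-insertion version fails because the exclusion set can’t be built without human judgment per word-sense; surfacing the list for review sidesteps that.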


Well, I’d say that depends on how many of these irrelevant synonyms are generated, and how hard they are for a human to identify and remove. As long as getting rid of all the bad ones manually is not a prohibitively huge amount of work, it might still be a useful tool.

I’ll repeat here what I wrote about synonyms on the French IF Discord server. The context is a discussion about the similar experiment @mulehollandaise mentioned above. I think it’s pertinent for this conversation.

While reflecting on the same idea some time ago, I had a thought about synonyms: it’s quite common to intentionally limit the number of synonyms recognized for a given object to avoid questions from the parser.

For example, in a game where you’ve collected various notes found throughout the gameplay, you’d want to have all of them in your inventory without making it tedious to consult any one of them.

So, you intentionally give them a specific name reserved just for them: a note, a leaflet, a piece of paper, a message, a flyer, etc.

Of course, some might go further and add “paper” as a synonym for all the objects, but they’d ensure that priority is given to the piece of paper without raising questions.

That was just to share a reflection on the idea that, in some cases, what we call the rigidity of the parser is also the precision of the parser (this can also apply to verbs in certain cases).

However, I remain convinced that having assistance—at least as part of onboarding (while explaining how the command was translated, both to avoid misunderstandings and to guide users toward mastering the parser’s syntax)—and as a support system for people who struggle with spelling or make input errors, will be very useful.


Thanks again, Gamefic. Adding the cubicle commands – interestingly, in the original game I needed to write ‘get out’ twice in order for it to fire... not sure why that glitch exists. But I added the equivalents you suggested – thanks.
Michael

This is very good advice – thank you. @Gamefic @dfranke @Mike_G @Angstsmurf I sense you are all correct that we need to let any command that delivers a valid response from the interpreter stand on its own – with no interference from the LLM.

We are going to implement that change – should take a few days depending on our schedules – and I would love to put that back in front of you all if you would be willing to have another look.

I suppose this would be the Hippocratic Oath of IF – First, do no harm. Or rather: First, don’t mess with the parser if the entry is right. :slight_smile:


Daniel, I think you’re talking Chomsky here – and don’t get me started on how badly Chomsky underestimated the vastness of the combinatorial possibilities of language. Though if we had a universe of a million players and assembled all the ways they phrased things, I’ll bet we could get the possible ways of playing a game down to 10 place values.

Petter - thank you. Will take a look at Smarter Parser – excellent place for me to learn.

Michael

If you think there is some trace of Chomsky’s influence in my thinking about linguistics, please identify it so I can purge it with fire.
