Letting an LLM play interactive fiction

A guy named Fernando Borretti recently blogged about letting Claude (an LLM) play Anchorhead.

For me, this is one of those “obvious in retrospect” use cases. Not that we need AI to play our games for us, but text adventures are non-trivial problems that are nevertheless very easy to “plug” into an LLM. It’s all text! And some of the interpreters are CLI-based. Fernando’s harness is something like 30 lines of Python code.
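Fernando's actual harness isn't reproduced here, but the core loop is easy to sketch. Assuming a `game_step` function that wraps the interpreter's I/O and an `ask_llm` function that wraps a model API (both hypothetical names, not his code), it looks roughly like this:

```python
# Hypothetical sketch of an LLM-plays-IF loop; game_step and ask_llm are
# placeholder names, not Fernando's actual code. game_step(cmd) feeds one
# command to the interpreter and returns its text output; ask_llm(history)
# sends the transcript so far to a model and returns the next command.
def play(game_step, ask_llm, max_turns=100):
    history = [game_step(None)]          # opening text, before any command
    for _ in range(max_turns):
        cmd = ask_llm(history)           # the model picks the next move
        history.append(f"> {cmd}")
        history.append(game_step(cmd))   # the game's response to that move
    return history
```

A real harness would spawn the interpreter with `subprocess` or `pexpect` and read until the next prompt; the loop itself is the whole trick, which is why it fits in about 30 lines.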

His whole experiment is open source, by the way:

The results are noteworthy, in my opinion. Claude, despite being considered “super-human” by some people, really struggles with text adventures. It can solve tiny puzzles, but pretty soon you either have to pay through the nose for tokens (so that the AI has access to the whole game history on every turn), or you impose some kind of memory limit, and then the AI starts chasing its own tail.
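The “pay through the nose” part follows from simple arithmetic: if the full transcript is resent every turn, the prompt grows linearly per turn, so the cumulative token bill grows quadratically. A toy illustration (the numbers here are made up, not from the experiment):

```python
# Hypothetical illustration only: if each turn adds roughly
# `tokens_per_turn` tokens to the transcript and the whole history is
# resent on every turn, the cumulative prompt-token bill is quadratic.
def total_prompt_tokens(turns, tokens_per_turn):
    # on turn i the prompt contains i turns' worth of transcript
    return sum(i * tokens_per_turn for i in range(1, turns + 1))
```

Doubling the length of a session roughly quadruples the bill, which is exactly why memory-limited windows get used instead, with the tail-chasing that follows.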

I guess this is all more interesting for the AI crowd than for the IF crowd. And I’ve personally been quite annoyed with the “let’s put AI in everything” hype. But I thought I’d share it here nevertheless since it’s such an obvious application (that nevertheless fails, at least with a naive approach).

3 Likes

This has been pointed out before, but there is actually one very practical use-case for this: testing unfinished games. Watching an LLM fail to solve your WIP is a great way to discover what kinds of mistakes beginners are likely to make and how well your game responds to unexpected input.

8 Likes

Interesting. I am not that technical so is there any transcript or video showing the results?

The standard advice for every competition, and for every newbie, is to make sure games are well beta tested. The forum is a great resource for that, but having a permanent LLM, well-trained on how to play text adventures, available for everyone to use as a first round of testing for stability and responsiveness — that would be an awesome step. Perhaps comps could even ask for the LLM transcript to ensure a level of quality. Sounds like it could be a great step for the community. Maybe an IFTF-sponsored project?

The thing is… people aren’t just beta-testing for depth. They’re testing for puzzles, game-breaking bugs, writing, cohesion, reactions, QOL, etc. An LLM trained to play parser games wouldn’t stop to point out any of that.

I feel like this is worse than just asking submitters to get testers.

What does it accomplish? A more distant community? Because that’s another thing testing does—creates people who would play WIPs for each other because others played WIPs for them.

6 Likes

I would not want my work to be judged based on the ability of an LLM to play it.

5 Likes

I feel like some people aren’t aware that thorough testing and “hey can you play through this and tell me if there are any errors?” are two separate things.

If I’m asked to playtest, I’m going to try to go the extra mile. I’ll try to input commands people aren’t expecting me to input just to see what was and wasn’t accounted for. I’ll click one option repeatedly to see how the system can handle that. One of my standard procedures for checking input limits in games which have you name a character is pasting the Bee Movie script into the field. I’ll try to catch typos, visual alignment issues, glitches, bugs, whatever errors there might be, and then I’ll report on what I saw to the person I did the testing for. Naturally, I’m only human and I won’t catch everything, but I do my job the best I can.

The issue is that an LLM won’t do this for you. If you trained an LLM to play through a classic parser game, then it’ll probably only know an optimal way to play through a classic parser game. It’ll know it can GET LAMP to progress so it’ll GET LAMP, but it won’t check if you can softlock yourself if you DROP LAMP or THROW LAMP, it probably won’t even think about LICK LAMP, and I don’t think it’ll particularly care about X LAMP. What if this isn’t a classic parser but something else entirely? What if you need to EAT LAMP? What if you need to TALK TO LAMP? And gods forbid if there isn’t a LAMP to be GET!

An LLM also can’t enjoy itself, which is already damning enough for me, because it means that even if there were a theoretical LLM which could deal with every single type of puzzle and parser and choice-based IF you could ever throw at it, it wouldn’t tell you whether going through any of it was enjoyable. If there were a theoretical LLM which could play through every IF ever and even test it in detail, all you’d have would be “so, those are the technical things you messed up”, and maybe that’s enough for some, but a technically flawless game can absolutely be a slog to get through. It can be unrewarding. It can be great in theory and an absolute chore in practice. LLMs, which are heavily geared towards praising the user, won’t tell you “this is flawless but it wasn’t fun to play”.

9 Likes

Yeah, we’re writing games for humans, not for LLMs. LLMs can give useful, fast technical feedback, but if you aren’t eventually asking humans how they experience your game, who are you even writing it for?

1 Like

Perhaps LLMs could judge competitions as well

And write the games for them! A whole ecosystem.

Yeah, I could see the value of an LLM for stress-testing a parser game, but I worry that people would come to see it as a substitute for human testing.

2 Likes

With regard to automated testing, this is one advantage that choice-based games have, as it can be relatively easy to write scripts that brute-force possible outcomes (no LLMs required). Three that I’m aware of are:

  • ChoiceScript Randomtest - for testing ChoiceScript, by @dfabulich, you can have it do thousands of steps and iterations through a game and you can see exactly what text is seen, or not, and how frequently.
  • Ink Tester - for testing Ink scripts. Works similarly to the above script (but for Ink).
  • ChoiceScript Tools - builds on top of Randomtest and provides additional tools (was just discussed in a recent Narrascope talk).

I did some searching and I couldn’t find a comparable tool for parser-based games. I suspect that there could be value in a fuzzing tool, for game engine developers, to ensure that the engine can handle malformed input (at least). Maybe a script could collect all verbs and nouns and try them in random combinations, to see what outcomes emerge and find unexpected edge cases. However, I suspect that the value for actually testing games would be more limited (the combinatorial explosion would be too great).
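The random-combination idea can be sketched in a few lines; `verbs` and `nouns` here are assumed inputs (harvesting them from a game’s vocabulary is engine-specific and not shown):

```python
import random

# Sketch of a naive command fuzzer for a parser game; the verb and noun
# lists are assumed to be extracted beforehand (engine-specific step).
def fuzz_commands(verbs, nouns, n=1000, seed=0):
    rng = random.Random(seed)  # seeded so failing runs are reproducible
    return [f"{rng.choice(verbs)} {rng.choice(nouns)}" for _ in range(n)]
```

Even this single-turn version has `len(verbs) * len(nouns)` possible commands, and the state space multiplies again with every turn of depth — which is the combinatorial explosion in question.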

5 Likes

For Inform 7, there’s the Automated Testing and Object Response Tests extensions.

2 Likes

I’m in the ‘not automatically opposed to AI’ camp. I had the opportunity to use an AI agent in small portions of game testing. My game, by IF Comp standards, didn’t do great, so that may be enough for you to draw your own conclusions on how beneficial AI is to a final product.

I also got some human play testers. Their comments were brief and polite. The transcripts were helpful. With the AI agent, I watched live input. It played like a novice IF user and offered feedback about what it could reason out, based on the clarity of my writing. True, it didn’t come with the rigor and effort of a human, but in using this tool, my intent was to improve the user experience before humans played it. Maybe some folks in this community have established great connections with play testers that allow for honest exchanges of feedback, and that is definitely what I’d aim for, but human time is hard to come by.

Again I am in the camp that isn’t automatically opposed to AI and I am just posting to share, I am not interested in debating merits of its use.

4 Likes