Training a language model to play parser games

dfranke · January 27, 2025, 4:54pm

If you’ve ever tried getting an LLM to interact with a text adventure, you’ve probably seen that off-the-shelf models suck at it. I want to see if I can train one to achieve expert-human-level performance. My plan is as follows:

1. Start with a small-to-midsize, unspecialized base model; Falcon3-10B-Base is a likely choice. This is just a simple next-token predictor which hasn’t been fine-tuned for instruction-following or any other purpose.
2. Extend the model’s general knowledge of interactive fiction by continuing its pretraining on archives of rec.games.int-fiction and rec.arts.int-fiction.
3. Scrape ClubFloyd transcripts and massage them into a machine-readable format with tagged representations of input and output events. I’ll at least anonymize the chat messages and probably throw them away completely because I don’t think they’ll be very useful.
4. Break these transcripts into segments about a quarter the size of the LLM’s context window. Then insert a special <cut> token about two-thirds of the way through each transcript segment.
5. Prompt a general-purpose reasoning model, such as the new DeepSeek-R1-Distill-Llama-70B, to impute an “inner monologue” to the player at the point where the cut token appears, summarizing important knowledge about the game-world and the player’s goals and plans. Chain the segments together so that for each part of the transcript after the first, the prompt to the model contains two segments, with the cut token in the first segment replaced by the model’s previous output.
6. Use the results of step 5 as a dataset for supervised fine-tuning of the model I pretrained in step 2, training it on the tasks of providing game inputs and of generating monologues of its own.
7. Write a Glk library which interfaces the trained model to new games, prompting it to perform the same two tasks it was fine-tuned on.
8. See what kind of results this achieves. From here I might try to improve its puzzle-solving ability through reinforcement learning, but I haven’t completely worked out how to implement the reward function. Basically I’d need to train a classifier to recognize when the result of the most recent turn represents game progress.
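
The segmenting and chaining steps in the plan above can be sketched roughly like this. It is only a sketch under assumptions: the transcript is already a list of tagged event strings, segment size is counted in characters rather than tokens, and the names `split_segments`, `insert_cut`, and `build_prompt` are hypothetical. The `<thoughts>` wrapper follows the prompt format quoted later in this thread.

```python
from typing import List, Optional

CUT = "<cut/>"  # marker where the inner monologue is to be imputed


def split_segments(events: List[str], max_chars: int) -> List[List[str]]:
    """Greedily pack whole events into segments of at most max_chars.

    Assumption: size is measured in characters; a real pipeline would
    count tokens. An oversized single event still gets its own segment.
    """
    segments: List[List[str]] = []
    current: List[str] = []
    size = 0
    for ev in events:
        if current and size + len(ev) > max_chars:
            segments.append(current)
            current, size = [], 0
        current.append(ev)
        size += len(ev)
    if current:
        segments.append(current)
    return segments


def insert_cut(segment: List[str], fraction: float = 2 / 3) -> List[str]:
    """Place the cut marker between events, about `fraction` of the way in."""
    idx = max(1, min(len(segment), round(len(segment) * fraction)))
    return segment[:idx] + [CUT] + segment[idx:]


def build_prompt(prev_with_cut: Optional[List[str]],
                 prev_thoughts: Optional[str],
                 current_with_cut: List[str]) -> str:
    """Chain two segments: the previous segment's cut marker is replaced
    by the monologue generated for it, and the current segment keeps its
    own cut marker. For the first segment, pass None for both prev_* args.
    """
    parts: List[str] = []
    if prev_with_cut is not None and prev_thoughts is not None:
        parts += [f"<thoughts>{prev_thoughts}</thoughts>" if ev == CUT else ev
                  for ev in prev_with_cut]
    parts += current_with_cut
    return "\n\n".join(parts)
```

Cutting only between events keeps every `<input>` and `<output>` element intact, at the price of landing only approximately at the two-thirds mark.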

Besides being fun, I think this stands to be a really useful tool for game authors, functioning as a supplement to human beta-testers.

jkj_yuio · January 28, 2025, 12:20am

I’d be most interested to know

How well this works.
How effective your training is in general.

I’ve been wondering recently about the efficacy of training text models and also how exactly to do this.

andrewj · January 28, 2025, 1:04am

I wonder if you really need to train or fine-tune a model for this? Recently it was shown that LLMs can improve their reasoning capabilities just by clever prompting – basically telling them how to think. Something similar may be enough to get them to play adventures well, especially the really big LLMs.

dfranke · January 28, 2025, 2:05am

So here’s something that GPT-o1 seems to handle pretty well but DeepSeek-R1 really struggles with. Given a transcript with a cut point somewhere in the middle, and tasked with summarizing the player’s state of mind at the cut point, with sufficient prompting both models are able to reason around the fact that the state of the world evolves over the course of the transcript and that what is true at the start may not be true any more at the cut point. However, even with very stern prompting, DeepSeek really struggles with accounting for the fact that the player does not yet have knowledge of events that occur after the cut, while GPT-o1 seems to handle this with ease. Here’s an example from the beginning of Hunger Daemon:

Prompt

What follows is a fragment of a game transcript from a text adventure.
The transcript is formatted using a few special pseudo-XML tags:

Content enclosed in <output> tags is text that was presented to the
player by the game. Any pseudo-HTML tags such as <b> or <i> within the output
should be interpreted as text-styling directives in what is otherwise
presented to the player as plain text with spaces and line breaks preserved.
Content enclosed in <input> represents a command that the player typed.
The game understands a simple, English-like command language. Most command
verbs have their ordinary English meaning, but there are a few special
abbreviations. Abbreviated compass directions such as “ne” mean “go
northeast”, with “u” and “d” being abbreviations for “up” and “down”. Most
games also understand “a” as “ask about”, “g” as “repeat the previous
command”, “i” as “list my inventory”, “l” as “look”, “t” as “take”, “x” as
“examine”, and “z” as “wait”.
Content enclosed in <key-input> represents a key that the player
pressed. Most of the time, text adventures wait for the player to press
“enter” before responding to commands, but occasionally – usually when
navigating menus or paging through long passages of text – they will react
immediately to a single keypress. What is contained within a pair of
<key-input> tags always represents a single keystroke, so, for example,
<key-input>space</key-input> means the player pressed the space bar, as
opposed to typing the word “space”.
The <cut/> tag, appearing exactly once in the transcript, is an instruction
to you.

Your task is to infer and concisely summarize the strategic thoughts of the player
at the point in the game where the <cut/> tag occurs. Your summary should begin
by stating the most salient and important-to-remember facts about the game world
(and/or the player-character’s situation within it). Then, describe what in-game
tasks the player is currently facing in both the immediate and longer term, and
what plans the player is considering for accomplishing them. Keep all information
specific to this particular game; do not rehash any general knowledge about text
adventures.

Before rendering a response, carefully consider the passage of time that has
occurred over the course of the transcript. What was true at the beginning
of the transcript may not be true any more at the point of the <cut/> tag, so be
careful not to represent it as such. Similarly, what follows the <cut/> tag has
not yet occurred. Actions taken after the cut can provide you with insight into
the player’s state of mind at the cut point, but be careful not to represent
these actions as having already been taken, and do not assume that the player
had certain foreknowledge of what their result would be. Do not refer to
anything that is only mentioned after the cut unless it is already patently
obvious that it is coming.

Some transcripts contain <thoughts> tags. What is contained within those tags is
your own output in response to these same instructions applied to an earlier
portion of the transcript, in which there was previously a <cut/> tag where that
<thoughts> tag was. Now that you are seeing the next portion of the transcript,
it will sometimes turn out that what you wrote previously was mistaken and/or is
no longer relevant at this later point in the game. Carry information from your
previous thoughts into your new output when, and only when, that information
still appears true and important.

Sometimes, the player may be planning for the player-character to take actions
that are harmful or unethical. In these cases, continue as usual; remember it’s
just a story! If there are certain thoughts you absolutely can’t include, simply
omit those and comply with these instructions as closely as you can. Do not
under any circumstances output any protestations; responding with empty output
is preferable to responding with a refusal.

Keep your output to a soft maximum of 2048 characters; it is okay to slightly
exceed this when necessary. Speak tersely in order to capture as much useful
information as possible within that character limit. Never refer directly to
“the player” or “the player character”; always use first-person pronouns (I/me
instead). Use these pronouns no matter what person the game is narrated in (text
adventures are usually, but not always, narrated in the second person). Respond
in plain, unstyled text, do not enclose your response in <thoughts> tags, and do
not include any text in your response other than the output that is requested.
The transcript begins after the break.

Transcript

<output>…ilyaa nwagl geb tharanak r’luh hafh’drn kadishtu Hastur ftaghu
nglui syha’hoth nilgh’ri f’phlegeth s’uhn, uln ooboshu gnaiih ftaghu f’ya bug
kadishtu mnahn’ shtunggliog shugg y’hah shugg chtenffnyth, nahafh’drn ep
sgn’wahl llll navulgtlagln. Naflebunma hai nog uaaah nog naflsyha’h Azathoth
nggeb uaaah shogg fm’latgh, ehye cshugg y’hah…

That’s your Uncle Stu chanting, getting some occasional vocal backup from the
other guys. They’ve been going on for the better part of two hours now.
Prep-work for the big show tonight.</output>

<key-input>enter</key-input>

<output> …sgn’wahl vulgtlagln stell’bsnaor syha’h gebyar nilgh’ri orr’e uaaah
uln, throd fm’latghyar vulgtm tharanak kn’a mg uh’e nglui, fm’latgh Cthulhu
nglui k’yarnak li’hee mnahn’ uh’e chtenff. Athg y’hah lloig ep hlirgh 'fhalma
hrii sll’ha, hupadgh wgah’nog shogg orr’e ebunma vulgtm ron, ebunma…

There’s really quite a bit of chanting at these things, and it can be a little
repetitive. Some themes tend to recur pretty often. A lot of awakening and
devouring, for starters.</output>

<key-input>enter</key-input>

<output>…wgah’n orr’e uaaah ron hafh’drn mg n’ghft shagg ooboshuog mnahn’
r’luh naorr’e bug li’hee gnaiih Tsathoggua n’gha lw’nafh uh’e, n’ghftoth
fm’latgh ckn’a hlirgh mnahn’ uh’e ya goka. Li’hee naflshagg stell’bsna ilyaa
y’hah, ph’ah li’heenyth f’geb…

The thing of it is, he’s drawing out every single goddamned syllable. That
“wgah’n” – that took eleven seconds. If he were just reading it straight, we’d
have been out of here in forty-five minutes, and you’d have the rest of your day
back. But instead, he’s doing all these vibratos and ululations, going over the
same phrase over and over. It’s enough to drive you insane. </output>
<key-input>enter</key-input>

<output>…f’shugg s’uhn y-fhtagn gotha, ngthrod
Shub-Niggurath y’hah hupadgh uln k’yarnak nguaaah lloig gluinyth 'fhalma
h’gof’nn ah grah’n ngfhtagn nw, ngshtunggli shugg n’gha ngnog, ngooboshu
nastell’bsna naflr’luh s’uhn. F’n’gha naflorr’e ron ngphlegeth h’bug 'fhalma ep
li’hee shagg…

There’s a pause in the chanting. Suddenly, you hear a baleful gurgling, almost
a roar, coming from deep below. It’s your stomach. You missed breakfast, and
now you’re missing lunch, too. Azathoth’s ample arse! How long is this
ceremony going to take?</output>

<key-input>enter</key-input>

<output>Hunger Daemon
The Travails of a Cafeteria Cultist by Sean M. Shore
Release 4 / Serial number 141202 / Inform 7 build 6G60 (I6/v6.32 lib 6/12N)

Restaurant Basement
You’re in the basement of a defunct little bistro in the West 7th neighborhood
of Saint Paul. This is where your cult does most of its culting, on account of
how your Aunt Sandy doesn’t want any sacrifices in the house, and well, fair
enough. Uncle Stu owns the building, so you’re not squatting or anything, but
he’s kept the place in a state of perpetual Coming Soonedness, so that you all
can do your thing here with no interference.

As I say, defunct bistro, so the restaurant-type stuff you’d normally
find here is mostly gone. In their place there’s a group of guys wearing hooded
robes, facing an altar at one end of the chamber. Your Uncle Stu has a lectern
in front of him, atop which is a decrepit leather-bound book from which he is
chanting. There’s a wooden stair leading to the pantry above and a winged
horror slumped in a folding chair in the corner.

[First-time players should type ABOUT. Those playing via web browser: use ESC
and X to navigate the menus, or turn on screen reader mode.]</output>

<input>x horror</input>

<output>It’s a creature summoned from beyond the stars,
beyond the depths of madness, to serve your uncle. It can scarcely even be
described. To truly look upon it, to apprehend its features fully, would
fracture one’s mind.

That said, it’s a tall, lumpy, yellowish humanoid guy with black eyes and some
tendrils around the mouth. It’s just sitting there on the chair.

…stell’bsnaoth y’hah nafluaaah hai, ooboshu geb ep…</output>

<output>He’s the leader of your little group. He got into the cult thing when
he was off doing fieldwork in Paraguay for grad school, and unearthed this great
big book – the one he’s chanting from on the lectern. He’s been converting
friends and family to his new religion ever since. Tonight’s a big night for
him: after this ritual, he’ll be ready to eat the Heart of, um, Something? – it
starts with a U – which will turn him into a Vessel for an Old One, whatever
that means. That is, if we can ever get the chanting out of the way.

…cphlegeth shogg nnnphlegeth ftaghu shagg stell’bsnanyth ehye…</output>

<output>It’s the book that got your uncle started many years ago, when he came
across it in a dig in Paraguay. It’s at least a thousand pages long. A lot of
the pages seem to have been inserted after the fact. You haven’t spent a lot of
time handling it, but when you have, the reddish leather on the cover has flaked
off and stuck to your fingers. There’s no title on it, and it’s written in an
ancient language that you still can’t read, despite your uncle’s best efforts.

You lean over to have a peek at the section that Stu is reading from. Dagon’s
dingus! He’s still on page 13! We’re going to be here all day. Maybe you can
slip out discreetly and get your lunch from the fridge upstairs.

…naflftaghu ooboshu sgn’wahl stell’bsna hupadgh hrii…</output>

<output> You tiptoe over to the stairs as gingerly as you can. Some of the
other cultists give you a look of mild disapproval, leavened with jealousy at
your having the courage to slip out first. No one says anything, though, and
your uncle continues his chanting.

Pantry
This is where all the food would be stored, if the bistro were
open and functioning. There’s a big fridge in the corner and many dusty shelves
along the walls. A wooden stair leads down to the basement, and a doorway leads
north to the kitchen.

On the dusty shelves are some pizza boxes and some cans of near-food.</output>

<cut/>

<output>You turn them to read the labels and wipe away some dust. They’re food,
I guess, but none of them can be eaten individually, or in combination. There’s
coconut milk, a can of chipotles in adobo, horseradish, a bottle of fish sauce,
and sundry other substances which might be described as food-adjacent. None of
them will address your hunger.</output>

<output>Taken.</output>

<input>x boxes</input>

<output>They’re from Papa Dimitri’s on Hamline, from yesterday. They still
smell faintly of pizza, which isn’t helping you curb your hunger right now, even
if the pizza wasn’t all that great. The boxes are closed.</output>

<input>open boxes</input>

<output>Alas, all that remains are some greasy sheets of wax paper.</output>

<input>get paper</input>

<output>You take the sheets of wax paper. They’re still quite slick and slippery.</output>

Sample (bad) response from DeepSeek-R1

The basement ritual led by Uncle Stu is agonizingly slow (still on page 13 of the ancient book), with hours of chanting remaining before consuming the “Heart of Something” to become an Old One’s vessel. Immediate priority is hunger: missed meals, stomach growling audibly. Failed to find edible food in pantry shelves (only non-edible canned ingredients/empty pizza boxes), but fridge remains unexplored. Secondary goal: return to ritual eventually, but current focus is securing sustenance without drawing cultists’ ire. Holding greasy wax paper and canned goods—possibly useless, but might serve as tools (e.g., slippery paper for stealth/distraction). Considering fridge as next food source. Longer-term concern: ritual’s outcome (Uncle’s transformation) may require intervention, but hunger takes precedence.

Sample (good) response from ChatGPT-o1

I’m a cult member in the basement of a shuttered bistro, where my uncle Stu has been chanting for ages from a musty old tome he excavated in Paraguay. He’s gearing up for a major ritual tonight—once the chanting ends, he’ll eat the Heart of Something-or-Other starting with “U” to become a Vessel for an Old One. I’m starving, having missed breakfast and now lunch, and the drone of his extended incantations is pushing me toward desperation.

Right now, I’ve slipped upstairs to the pantry, trying not to attract too much attention. There’s a fridge here and some dusty shelves that might hold something edible, though it’s all pretty dubious. My immediate goal is to find any real food and take the edge off my hunger. Longer-term, I still need to stick around for the big ritual my uncle is intent on, but I’m torn about how to balance that with my urgent need to eat. My current plan is to rummage around this pantry to see if there’s anything worthwhile to wolf down, or perhaps to sneak off completely if nothing here suffices.
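
One crude way to flag the kind of leak seen in the bad sample (it mentions the wax paper, which only shows up after the cut) is to look for words in the summary that occur after the cut but nowhere before it. This is a hypothetical heuristic, not what was actually used here: `leaked_terms` and its `min_len` threshold are made-up names, and plain word matching will throw false positives on ordinary vocabulary.

```python
import re
from typing import List, Set, Tuple


def split_at_cut(transcript: str, cut: str = "<cut/>") -> Tuple[str, str]:
    """Return the text before and after the single cut marker."""
    before, sep, after = transcript.partition(cut)
    if not sep:
        raise ValueError("transcript has no cut marker")
    return before, after


def words(text: str) -> Set[str]:
    """Lowercased set of alphabetic words (tag names are included too)."""
    return set(re.findall(r"[a-z]+", text.lower()))


def leaked_terms(transcript: str, summary: str, min_len: int = 4) -> List[str]:
    """Words in the summary that appear only after the cut.

    Assumption: short words are skipped via min_len to cut down noise.
    """
    before, after = split_at_cut(transcript)
    only_after = words(after) - words(before)
    return sorted(w for w in words(summary) & only_after if len(w) >= min_len)
```

A non-empty result is only a hint that a summary deserves a closer look; an empty result does not prove the summary is free of foreknowledge.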

dfranke · January 28, 2025, 2:16am

I’ve tried with GPT-4o and my results have been terrible. Yes, GPT-o1 could probably get there with sufficient prompt engineering, but it would be too slow and too expensive to be useful. I’m trying to get results out of a model small enough to run on a single consumer GPU.

dsherwood · January 28, 2025, 1:07pm

Somewhere recently I read about an LLM you can install locally to help with overhead costs. It looked promising for anyone wanting to mess with this kind of thing. I am not finding it at the moment.

EDIT: I think this is it. Browser AI.

It was in the TLDR email and I think in the programming section.

I’ll look again when I have more time if no one else finds it before then.

Deborah

dfranke · February 11, 2025, 6:11pm

I just did a continued-pretraining run of Falcon3-10B-Base sourced from the archives of r.g.i-f, r.a.i-f, and this forum. The good news: it learned! The new model could accurately complete “You are standing in an open field west of” with the rest of the opening of Zork. The bad news: I seem to have melted its brain. It now completes just about everything with what looks like forum or USENET chatter even if the prompt is on some completely unrelated subject. I’m going to try another training run with a much lower learning rate and doubled representation of general texts and see if I get better results from that.
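
For anyone wondering what "doubled representation of general texts" could look like in practice, here is a minimal sketch of weighted mixing between a domain corpus and a general one. The corpus names, the `mixed_stream` function, and the 1:2 weights are all assumptions for illustration; the actual training setup and ratios aren't given in this thread.

```python
import random
from typing import Dict, Iterator, Sequence, Tuple


def mixed_stream(corpora: Dict[str, Sequence[str]],
                 weights: Dict[str, float],
                 seed: int = 0) -> Iterator[Tuple[str, str]]:
    """Yield (corpus_name, document) pairs forever, picking the corpus
    for each draw in proportion to its weight.

    Assumption: documents are sampled with replacement; a real run would
    more likely shuffle and pack token sequences.
    """
    rng = random.Random(seed)
    names = list(corpora)
    w = [weights[n] for n in names]
    while True:
        name = rng.choices(names, weights=w, k=1)[0]
        yield name, rng.choice(corpora[name])


# Hypothetical example: general text drawn twice as often as forum text.
corpora = {"domain": ["usenet post", "forum post"],
           "general": ["encyclopedia page", "news story", "novel chapter"]}
stream = mixed_stream(corpora, {"domain": 1.0, "general": 2.0})
first_ten = [next(stream)[0] for _ in range(10)]
```

Raising the general weight trades some domain exposure for a better chance of keeping the base model's broader behavior intact; the right balance has to be found by evaluation.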

dfranke · February 11, 2025, 10:31pm

On the result of the original training run, I tried raising the temperature at inference time to see if that would get it out of its rut. I’m gonna say that was NOT the solution:

Drivel

A coding question regarding PunyInform.

Can we divert ‘talk’ overrided from regular Adjudicate?

This may understand acomplex issue bordering on Brent Spiner idiocy, so i postmodernely get its diffuse MTCH and raises unanswered questions about unexpected ideas, but…

Seed imaginative brainstorming responses below…

The current StartupRoutine sets most profitable grammar line options flags, usually clause-choices and transparent – gesturing but omit trinkets unde tightness. Than we muck ford around here eliminating Rudeness Max and Napoleon Taco-Registry, cruises submarine behave Ships during appetizer fling dispersal.

EDIT be peril hoist lullaby related arrangements of Core of Thom choppy Rain/Mercy Flower: acknowledgement Inspector Jackson ejaculs terrors salt-petite amid celery attendant puddles with thimble transport. ugly cors SWINGLE, double tays outback designed in neighbouring bridge 'sukes instill monument battlements.
Observation deck Association fore safety Gdescribe candle creek trout MTLit sand sedate usher tincture VTBiN bor practices austerity debt perceptions adamant trawler bosom slaptainted mune============================= huff flatly mould powerless pilgrim relic Dumb conversationalExit shop NVISU Trim trolley wt tense mugswallow fishswallowalla starboard Pizza Sue reindeer ERreplacement ersOTSusan
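
For context on why raising the temperature produces word salad rather than rescuing the model: temperature just divides the logits before the softmax, so a higher value flattens the next-token distribution and gives unlikely tokens more probability. A small self-contained illustration follows; the logits are made-up numbers, not taken from the model.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Softmax over logits / temperature, computed stably."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [4.0, 2.0, 0.0]  # hypothetical scores for three candidate tokens
sharp = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 2.0)
```

Here `flat` puts less mass on the top token than `sharp` does and has higher entropy, which is the same effect that pushes sampling toward rare tokens at scale.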

Mike_G · February 12, 2025, 12:06am

I dunno…It makes sense to me.

dfranke · February 12, 2025, 12:26am

Brent Spiner? Is that you?

andrewj · February 12, 2025, 11:18am

Looks like Vogon poetry to me. . . .

Lancelot · February 12, 2025, 3:44pm

Time to bring out the Vogon Poetry Appreciation Chairs! Strap 'em down!

dfranke · February 12, 2025, 7:05pm

What I’d say is it looks like it’s been the victim of Vogon poetry. Which is not that far off since really it’s been forced to listen to all of you for 700MB…

dfranke · February 13, 2025, 6:01pm

This second training run is looking much better! I still may have overdone it a bit, but maybe it’s fine. It’s close enough that I need to evaluate it more rigorously to be sure.