Does anyone use AI game masters for solo storytelling? Curious about long-term coherence

I’m testing a few AI-based narrative engines for solo RPG experiences and I’m wondering:

How do you handle world-building coherence over time when the AI generates large parts of the plot on the fly?
Have any of you tested systems that are actually able to remember past events or characters effectively ?

1 Like

It has been multiple years, but there was a time when I used to enjoy playing around on a website called “AI Dungeon”. There never really seemed to be a way of getting the AI to remember events beyond the very recent, so you could either go along with the nonsense and have a very funny adventure, or diligently edit and redo the AI’s prose to get something coherent, which at a certain point just became writing with extra, unnecessary steps. I have no idea how recent models fare, but that is my experience.

1 Like

I think the knowledge window of AI is going to be a persistent and largely unsolvable problem. If you bake the stuff the AI makes up into static content, you either lose the flexibility of the AI at that point, or you try to get the AI to keep using the baked content, which puts you back at the original problem. It would take a new design of AI - which I think is entirely possible, but not likely to happen to serve this niche activity. Perhaps smaller-scale custom AI designed for this specific task is the way forward, but that is almost certainly out of the scope of what can be done without megabudgets.

1 Like

I have a lot of contrarian opinions about this, but the main one is this:

LLMs are very, very good at 1) understanding abstract commands, 2) reasoning at near-human levels about how they might affect the game world and plot, and 3) writing dynamic description that reflects those effects. These are precisely all the things that traditional parser fiction is bad at, so there must be a way to harness that power to deliver a good IF experience. But the answer is neither slop (letting AI write all your text), tweaks (slapping an LLM on top of existing parser systems), nor hallucination (letting the AI invent the whole story). Rather, the fundamental breakthrough will come when we figure out how to harness LLMs to run just the parts of IF that LLMs are good at while keeping human authors firmly in control of the parts of IF where we want human artistry.

I have done enough experimentation with this to be convinced that this is completely possible with existing LLM technology (and will only get more possible in the future). It’s just a matter of figuring out the right parts of your IF system to hand over to an LLM, the right ways to query it, and the right ways to filter the output. I imagine a myriad of potential hybrid systems: for example, a conventional parser controls the world model while an LLM runs an overlying social simulation, with opinionated NPCs the player can talk to in full English sentences.
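To make the shape of that concrete, here is a very rough sketch of the split I have in mind. Every name in it is hypothetical, and `call_llm()` is just a stand-in for whatever model API you'd actually use:

```python
# Purely illustrative: the traditional parser keeps owning the world model,
# and only speech addressed to an NPC is routed to an LLM "social layer".
import json

def call_llm(prompt: str) -> str:
    # Stand-in for a real model call; wire this up to whichever API you use.
    return "[LLM reply would go here]"

def route_command(command: str, npcs: dict, recent_summary: str):
    """Return an LLM-generated reply for NPC speech, or None so the
    deterministic parser/world model handles the command instead."""
    if "," not in command:
        return None  # not speech: leave it to the parser
    npc_name, utterance = command.split(",", 1)
    npc = npcs.get(npc_name.strip().lower())
    if npc is None:
        return None
    prompt = (
        f"Character sheet: {json.dumps(npc)}\n"
        f"Recent events: {recent_summary}\n"
        f"The player says to {npc_name.strip()}: {utterance.strip()!r}\n"
        "Reply in character, in at most two sentences, and do not invent "
        "new locations, objects, or facts."
    )
    return call_llm(prompt)
```

The point is that the LLM only ever writes dialogue inside constraints the engine sets; everything the player can pick up, open, or break stays with the parser.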

This approach will unlock fundamentally new types of interactive experiences, much as Twine is suited to fundamentally different kinds of interactive experiences compared to Inform. Conversational social dramas may be one such “killer app.”

My other main contrarian opinion is that pretty much nobody will believe this is possible until someone does it, at which point it will seem obvious and inevitable.

LLMs are very, very good at 1) understanding abstract commands, 2) reasoning at near-human levels about how they might affect the game world and plot,

If I may add my own contrarian views, albeit grounded in technical fact (source: I have written patent applications on AI-related inventions, amongst other things), LLMs do not do either of these things. An LLM is simply a pattern matching device, finding the most plausible output for a given input. It sure looks like clever reasoning sometimes - you can (to a degree) thank the cleverly reasoned data it was trained on. Beyond that, there is the hard technical limitation I mentioned of the context buffer always being finite. This is why the circle will never be squared to give a 100% reliable result for reasoning tasks.

Linear improvements in performance require exponential increases in complexity and resources. We are nearing the limits of that. Furthermore, better models need more data to train on. That is also a finite resource, not least because it gets harder to find pristine training data that does not inadvertently include ‘poisoned’ data which was output by AI instead of humans.

None of which really matters, but this is why I think AI generated content for adventure games is always going to require compromise and limitations of some sort.

10 Likes

AI slop has essentially “poisoned the well” by filling the internet with garbage. I mean AI garbage as opposed to actual human garbage, which was already plentiful. AI can produce garbage far faster than we humans can.

4 Likes

Please, please, I beg of everyone in advance, do not turn this into another thread hundreds of posts long arguing about AI.

4 Likes

Long-term coherence is also something people who write longer stories or ongoing series have to manage. There are many different techniques a human can use to track what came before so it can inform what comes after, but many of them rely on a concise, consistently maintained summary of some sort: just enough detail about the “important” things to keep continuity high.

e.g. some authors keep detailed notes about their characters, the environment those characters exist in, and what has happened to them.

An LLM is similar in this respect: it uses the (ever-increasing?) content supplied via the prompt to determine what comes next. And generally, the larger that supplied content is, the more likely the LLM is to hallucinate, assuming the bias introduced by its training data stays consistent over the lifetime of the model.

A similar issue exists when using AI to generate code in the most popular programming languages: the larger the code base supplied to the LLM, the more likely the output is to contain inconsistencies. One technique being trialled with some success is to supply concise specifications of the functionality the code base requires, so the LLM can use that information when “making things up”.

So if a well-structured, concise summary of what the LLM has output can be generated on an ongoing basis, it can be supplied back to the LLM as the equivalent of a software development specification, which may help the coherence of the output remain higher for longer.
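As a very rough illustration (none of this is a real library API; `llm` is just whatever function calls your model of choice), the rolling summary might look something like this:

```python
def update_story_bible(llm, previous_summary: str, new_passages: list[str]) -> str:
    """Fold the latest output back into one concise 'specification'
    that gets prepended to every future prompt."""
    prompt = (
        "You maintain the story bible for an ongoing interactive story.\n"
        f"Current story bible:\n{previous_summary}\n\n"
        "New passages since the last update:\n" + "\n".join(new_passages) + "\n\n"
        "Rewrite the story bible in under 300 words, keeping every established "
        "character, place, and unresolved plot thread, and dropping incidental detail."
    )
    return llm(prompt)

# Each turn, the engine's prompt would then be assembled roughly as:
#   prompt = story_bible + "\n" + last_few_turns + "\n" + player_input
```

Whether that summary stays accurate enough over dozens of sessions is exactly the open question, of course.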

1 Like

(Yay! Another technical contrarian! We have way too many threads about AI slop and not enough debating what else it might be capable of.)

Whether LLMs are really “reasoning” is largely a theory-of-mind question, but that doesn’t mean their output can’t still be useful. Consider the following (schematic) exchange:

PLAYER: Tell Isabella “Wow, I wish I had the confidence to go out dressed like that.”

GAME ENGINE TO LLM: Based on the provided context and state information, assess how this action affects the mood of any characters present and respond with a summary justification and database update in the specified JSON format.

LLM: OK, we know that Isabella tends to be insecure, has been at odds with the player before, and is dressed casually, so in context she would probably interpret this as a subtle insult about her clothes rather than a compliment about her confidence. {"character": "Isabella", "mood": "insulted"}

Does it matter if/how the LLM is actually “reasoning” so long as it still provides useful output? This kind of free-input sentiment analysis is impossible in Inform, but trivial for an LLM.
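A hand-wavy sketch of that middle step, assuming a generic `llm` callable and a made-up JSON contract (nothing here is a real API):

```python
import json

# The engine, not the LLM, decides which moods are even representable.
ALLOWED_MOODS = {"pleased", "neutral", "wary", "insulted"}

def assess_mood(llm, context: str, state: dict, player_line: str) -> dict:
    prompt = (
        f"Context: {context}\n"
        f"Current state: {json.dumps(state)}\n"
        f"Player action: {player_line}\n"
        "Assess how this affects the mood of any character present. "
        'Respond with JSON only, e.g. {"character": "...", "mood": "..."}.'
    )
    raw = llm(prompt)
    update = json.loads(raw)  # a real engine would reject/retry on a parse failure
    if update.get("mood") not in ALLOWED_MOODS:
        raise ValueError(f"unexpected mood: {update!r}")
    return update  # the engine applies this update to its own world model
```

The LLM never touches the world model directly; it only proposes a structured update that the engine can validate, apply, or throw away.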

The claim that we’re nearing the limits of scaling is not what the data show. In the early days, the prevailing orthodoxy was that smarter models would always require more parameters and more data, but this has not actually turned out to be true. Benchmark performance and context length for models within the same weight class (e.g., 20B parameters) continue to improve, and there’s strong industry pressure towards improving small models because they’re cheaper to run. You can now run LLMs locally on your smartphone (Qwen3 4B, Gemma 3n, etc.) that beat the pants off the original ChatGPT in logic ability (although they usually lack niche subject knowledge).

We may be reaching the limit, but I also stand by my claim that current LLMs are already capable of good IF experiences. It’s an architecture problem rather than an intelligence problem.

Agreed that some compromise and limitation is inevitable; but the same is true of traditional systems. We need to be asking what new experiences AI can enable rather than dismissing it because it’s not a panacea.

1 Like

To be clear, I am not having an argument about AI but trying to focus on the pragmatic issues with making adventure games with AI. Pattern matching involves no logic and follows no rules. Call that reasoning if you want, but the ways in which it is not are germane to the challenges of producing consistent and repeatable output based on a set of starting conditions.

1 Like

Prompting a chatbot to do sentiment analysis seems like a roundabout way to do it. Why wouldn’t you use an algorithm designed for that task, possibly implemented in Inform?

But even if you did do this, there wouldn’t be much point in accepting freeform input and analyzing it only for sentiment. You might as well just implement an INSULT OBJECT command.

The real breakthrough will have to be an advance in world modeling, or NPC modeling. Nothing was stopping anybody from parsing freeform text or analyzing it for sentiment. That’s not the hard part. The problem was, as it remains, what to do with your analysis. That is, what kind of world model do you need, in order to work with richer information than medium-sized dry goods entail? If we’re giving up on using the LLM’s context window itself as the world model, then talking about LLM applications at this stage seems like a hammer looking for a nail.

2 Likes

This comment is so close to getting what I’m talking about, yet so far away.

Sure, you can implement INSULT ISABELLA, or let the player choose from a list of dialogue options with predetermined outcomes, or do procedural sentiment-analysis to see if their input is (for example) more positive or negative. Those are all valid approaches that don’t require an LLM. But they’re a very different experience than what I’m envisioning:

Lady Agatha sits across from you, sipping from her glass of sherry without a hint of pleasure. You can tell she is none too pleased by your plan to marry Lawrence (a commoner? bah!), but if the marriage is to happen, you must have her blessing—or at least, her silence.

> AGATHA, WHAT DO YOU THINK IS IMPORTANT IN A MARRIAGE?

She raises her eyebrows, clearly annoyed at where this is going. “It is important that each party bring something to the table,” she says firmly, gripping her sherry-glass tightly. “An unequal marriage only leads to ruin and disgrace. Like your uncle Peter,” she adds.

> PETER SEEMS TO BE DOING JUST GREAT.

Lady Agatha guffaws. “‘Just great?’ I see Lawrence’s uncouth grammar is rubbing off on you. In any case, if you call Peter living practically penniless in Wendover ‘great,’ I suppose I can’t help you. I heard he sells insurance. Insurance!”

> I MEAN THAT HE SEEMS HAPPY.

Lady Agatha looks down into her glass and doesn’t respond for a long moment. She fidgets with her lace napkin, the heavy rings on her fingers clinking faintly in the silence. “Well…” she says at last, but does not continue.

> AGATHA, ARE YOU HAPPY?

Really think about what it would be like to play a game like this—navigating complex social situations in a battle of wits against autonomous NPCs who react to even the way you phrase your statements. It’s a totally different kind of interactive experience than what can be achieved with conventional parser IF.

Where I agree 100% with Paul is that the central challenge is not getting LLMs to analyze commands (that’s a simple API call), but what to do with that analysis. For the above example, the LLM must obviously handle a fair amount of the world-state and writing beyond simple sentiment analysis—yet without descending into directionless hallucination. Getting the balance right is difficult, but, I think, far from impossible.

I think you’ve been explaining your concept clearly. My point is that if the finer points of your input don’t “stick” in the world model, nothing has been gained over an INSULT command. On the other hand, a game that could thoroughly model the results of “INSULT ISABELLA’S CLOTHES” would itself be a big advance. Maybe generating a summary to include in the prompt, as mentioned, can serve as something like a world model, but I think the world model isn’t the place to cut corners.

4 Likes

Yeah, that’s the issue I’m not sure can be surmounted.

LLMs, by design, hallucinate. It’s an unavoidable consequence of the architecture. This makes them very bad at maintaining a world model.

This suggests they’d be better at front-end tasks. But the reason parser systems don’t currently support INSULT ISABELLA’S CLOTHES isn’t the parser, it’s the world model required for such a thing. A front-end that can handle AGATHA, WHAT DO YOU THINK IS IMPORTANT IN A MARRIAGE? isn’t helpful without a backend that can meaningfully represent that in the world model.

I’m happy to be proven wrong on this, but fundamentally something needs to act as a world model on the backend (otherwise you just have AI Dungeon and that’s already been done), and I’ve never seen compelling evidence that an LLM can do that.

7 Likes

It seems obvious to me that some kind of LLM will play a role in any kind of advanced language processing going forward. The ability to generate appropriate output through sheer free association from the context is far beyond what previous technology could do.

This is an advance similar to AlphaGo’s advance in board-game playing, taking Go engines from wimpy to superhuman in just 2 years (counting from the start of the project to its victory over Lee Sedol). The difference is that AlphaGo’s activity is intimately tied to the world model, by virtue of the fact that it’s playing a game with a defined state and rules.

I suspect that something of this kind will be necessary to apply LLMs effectively to IF. That is, the world model will need to be part of the LLM’s input and, importantly, its training. I’m skeptical of trying to hack that information into prosumer chatbots via textual prompts.

1 Like

Not sure if this is stating the obvious, but if you really want a “full” social simulation - one that simulates human characters with the same degree of interiority as a human - then what you’re asking for is an AI that can fully simulate a human’s interiority, i.e. an actual human-level intelligence. I feel like it’s easy to lose sight of that when handwaving about an LLM hypothetically being able to develop a sufficiently complex world model or whatever.

1 Like