arXiv: Can Language Models Serve as Text-Based World Simulators?

Figured this might interest people who are using (or considering using) LLMs for IF purposes.

Tl;dr from the Abstract:

We test GPT-4 on this dataset and find that, despite its impressive performance, it is still an unreliable world simulator without further innovations.

Link to paper.

7 Likes

Nice idea, but the results are hardly surprising. LLMs suffer badly from loss of context, and accurate state is precisely what a world model requires.

LLMs (and others) are pre-trained on a giant pile of static information. If you then ask them to model dynamic information, you’re using the wrong tool for the job.

6 Likes

I’ll keep quiet about LLMs and AI, lest someone who shares a name with an Admiral get strange ideas, leading straight into another long round of identifying and dealing with incompatibilities. Suffice it to say that if we abstract the “giant pile of static information” as a “world model”, perhaps (VERY perhaps) a real IF NL could even become feasible.

Best regards from Italy,
dott. Piergiorgio.

1 Like

There have been some interesting experiments in the “AI writing tools” world with summarization loops, where the LLM is asked (behind the scenes) to generate summaries of its own output and keep certain state-tracking documents up to date. These are then appended to subsequent queries to improve self-consistency. This is still not a perfect solution (and I think current LLM development is more focused on the “ideal” approach of improving the LLM’s comprehension of a single unsummarized input stream), but the continued improvement in context-consistency in, e.g., GPT-4o suggests that this is an area in which LLMs can continue to advance.
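To make the summarization loop concrete, here is a minimal sketch of a single turn. It is not taken from any particular tool: `run_turn`, `fake_generate`, and `fake_summarize` are hypothetical names, and the two stand-in functions only mark where real LLM calls would go.

```python
from typing import Callable, Tuple

def run_turn(
    player_input: str,
    state_doc: str,
    generate: Callable[[str], str],
    summarize: Callable[[str, str], str],
) -> Tuple[str, str]:
    """One turn of a summarization loop: answer, then refresh the state document."""
    # The state document travels with every query so the model sees the tracked facts.
    prompt = f"Player: {player_input}\n\nKnown state:\n{state_doc}\n\nNarrator:"
    reply = generate(prompt)
    # A second, behind-the-scenes call folds the new reply into the state document.
    new_state = summarize(state_doc, reply)
    return reply, new_state

# Stand-ins for real LLM calls (assumptions, for illustration only).
def fake_generate(prompt: str) -> str:
    return "You pick up the brass key."

def fake_summarize(state_doc: str, reply: str) -> str:
    return (state_doc + "\n- " + reply).strip()

reply, state = run_turn(
    "take key", "- The player is in the hall.", fake_generate, fake_summarize
)
print(state)
```

In a real setup the summarizer would condense rather than just append, otherwise the state document grows as fast as the transcript does.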

For decades (!) there has been a pattern where newbies will pop up saying, “AI will solve all our parsing problems!”, and then slowly realize why AI does not, in fact, solve all our parsing problems. I still don’t think LLMs have much to offer traditional parser IF, but I’m starting to believe in the feasibility of a wholly different sort of parser IF where the author defines the geography, story, puzzles, characters, etc. in fairly granular detail, but the actual physical interaction and mechanical prose is simulated (hallucinated?) by an LLM.

A few tools (like AI Dungeon) have started to go down this path by maintaining basic human-generated worldbuilding reference books, but without a really rigorous system of state-tracking this still rapidly descends into stream-of-consciousness weirdness. But what if, rather than just trying to provide loose guardrails for creative hallucination, there was a whole system of prompts, state-tracking feedback loops, handwritten prose, and even bespoke summarization and revision bots all laser-focused on keeping the world and story in lockstep with the author’s original intentions? In other words, what if instead of improving the LLM’s creativity, we focused on limiting that creativity to act as directly as possible as a simple linguistic bridge between the author’s intentions and the player’s experience?

(For example, one part of this system might be a secondary revision loop that takes the LLM’s initial draft output and screens it for consistency with the state model, aggressively revising or excising any references to anything - objects, events, whatever - that aren’t explicitly mentioned in the world model and plot databases.)
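A very rough sketch of what such a screening pass could look like, assuming the world model is just a set of known nouns. The names `mentioned_things` and `screen_draft` are made up for illustration, and the "noun after an article" regex is a deliberately naive stand-in for whatever entity extraction (possibly another LLM call) a real system would use.

```python
import re
from typing import Set

# Naive assumption: anything right after "the", "a", or "an" is a thing being referenced.
ARTICLE_NOUN = re.compile(r"\b(?:the|a|an)\s+([a-z]+)")

def mentioned_things(sentence: str) -> Set[str]:
    """Return the (naively detected) things a sentence refers to."""
    return set(ARTICLE_NOUN.findall(sentence.lower()))

def screen_draft(draft: str, world_model: Set[str]) -> str:
    """Drop every sentence that mentions something absent from the world model."""
    sentences = re.split(r"(?<=[.!?])\s+", draft.strip())
    kept = [s for s in sentences if mentioned_things(s) <= world_model]
    return " ".join(kept)

world = {"door", "lamp", "rug"}
draft = "You open the door. A dragon sleeps on the rug. The lamp flickers."
print(screen_draft(draft, world))
```

Here the dragon sentence is excised because "dragon" is not in the world model. Excising whole sentences is the bluntest option; the revision loop described above would more likely hand the offending sentence back to the LLM to be rewritten.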

The types of IF experiences well-suited to this sort of system would naturally be different than those well-suited to traditional parser IF, but I think there are some. For example, LLMs are particularly well-suited to “type anything”-style conversation with NPCs, if given enough guardrails for consistency.

1 Like

Thank you for your insights. It is my understanding that LLMs have a “context” which consists of your original prompt plus whatever the model has emitted so far. This context has a size limit and, after bits have “scrolled off”, they are no longer part of the working context. So world facts “drop out”.
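The "scrolling off" effect can be illustrated with a toy sliding window. This is only a sketch of the general idea: `visible_context` is a hypothetical helper, and counting whitespace-separated words is a crude assumption standing in for a real tokenizer.

```python
from typing import List

def visible_context(history: List[str], max_tokens: int) -> List[str]:
    """Keep the most recent messages that fit the budget; older ones scroll off."""
    kept: List[str] = []
    used = 0
    for message in reversed(history):
        cost = len(message.split())  # crude stand-in for a real token count
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))

history = [
    "The innkeeper is named Mara.",
    "You walk north into the forest.",
    "A wolf howls somewhere nearby.",
]
window = visible_context(history, max_tokens=12)
print(window)
```

With a budget of 12 words only the last two messages fit, so the innkeeper's name is no longer something the model can see.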

For example, I tried making plots with AI, but the names and details of characters in the story get “forgotten”, and this is quite a problem because the context is quite small. I understand some of the newer LLMs have bigger contexts. Let’s hope so. But even then, I can’t see it being capable of holding even a short story.

This is correct, except that for a raw API call, the context can technically be whatever you like. The idea of the summarization technique is to use a separate AI session to identify and condense the most important facts from the output down into summary blocks that can fit within the context window. Then, rather than giving the LLM the entire session as context, the system pulls the most relevant summary blocks (by some algorithm) and gives those as context.
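As one possible reading of "by some algorithm", here is a sketch that ranks summary blocks by plain word overlap with the player's input. This is an assumption chosen for simplicity, not what any particular tool does (real systems tend to use embeddings or similar), and `tokenize` and `pick_blocks` are hypothetical names.

```python
import re
from typing import List, Set

def tokenize(text: str) -> Set[str]:
    """Lowercase a string and return its set of words."""
    return set(re.findall(r"[a-z']+", text.lower()))

def pick_blocks(query: str, blocks: List[str], k: int) -> List[str]:
    """Return the k summary blocks sharing the most words with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        blocks,
        key=lambda block: len(query_words & tokenize(block)),
        reverse=True,
    )
    return ranked[:k]

blocks = [
    "Mara is the innkeeper and owes the player a favor.",
    "The forest north of town is home to wolves.",
    "The brass key opens the cellar door.",
]
context = pick_blocks("ask Mara about the cellar key", blocks, k=2)
print(context)
```

The key and Mara blocks win, and the forest block stays out of the prompt. Note that common words like "the" count toward the score here, so a real version would at least want a stopword list.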

This is a bit of a hack, and less necessary when using high-end models like GPT-4o or Claude 3 that have context windows the length of a Salinger novel. However, summarization and revision loops can still be helpful in directing the LLM’s attention, having it automatically identify and correct its own inconsistencies, etc. There are a lot of interesting pre- and post-processing techniques that have yet to be employed in IF.

With respect to LLMs interoperating with simulated storyworld environments, I’d like to share a hyperlink to an unfolding discussion: https://github.com/WICG/proposals/issues/168.

There, we’re talking about how to bridge AI assistants, e.g., LLMs, with visuospatial content such as: (1) charts, diagrams, and 3D illustrations, (2) CAD/CAE applications, (3) computer simulations, and (4) interactive fiction and videogames. The approaches being considered involve knowledge graphs.

I am hoping that solutions for enabling accessibility, Q&A, and dialogue about charts, diagrams, and 3D illustrations will generalize to those other interesting domains.

So, when a player in a room puts an item onto a table there, should they later return to that room and ask the interoperating LLM about where that item was, the LLM would be able to respond that it was where they put it, on that table (unless some other agent moved it).
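As a toy illustration of the table example, a knowledge graph can be as small as a set of (subject, relation, object) triples. This is only a sketch under my own assumptions, not the design under discussion in the linked issue: the `WorldGraph` class, its methods, and the relation names are all hypothetical.

```python
from typing import Optional, Set, Tuple

Triple = Tuple[str, str, str]
PLACEMENTS = {"on", "in"}  # assumed relations that describe where a thing is

class WorldGraph:
    """A tiny knowledge graph holding (subject, relation, object) triples."""

    def __init__(self) -> None:
        self.triples: Set[Triple] = set()

    def move(self, item: str, relation: str, target: str) -> None:
        # An item sits in exactly one place, so drop its old placement first.
        self.triples = {
            t for t in self.triples
            if not (t[0] == item and t[1] in PLACEMENTS)
        }
        self.triples.add((item, relation, target))

    def where_is(self, item: str) -> Optional[str]:
        # Produce a plain sentence that could be handed to the LLM as context.
        for subject, relation, target in self.triples:
            if subject == item and relation in PLACEMENTS:
                return f"The {item} is {relation} the {target}."
        return None

world = WorldGraph()
world.move("table", "in", "kitchen")
world.move("lamp", "on", "table")      # the player puts the lamp on the table
before = world.where_is("lamp")
world.move("lamp", "in", "cellar")     # some other agent moves it later
after = world.where_is("lamp")
print(before, after)
```

The point is that the LLM never has to remember the lamp: the graph is the source of truth, and the model only has to phrase the answer.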