Story Development with Claude: A Methodology for Authored Interactive Fiction

When I worked at Textfyre, Michael Gentry and I constructed MS Word documents that contained all aspects of the story Jack Toresal and The Secret Letter. Jon Ingold used the same template for The Shadow in the Cathedral. I am adapting that template to the way I develop stories in Sharpee, using Claude as a code generator. This document describes the strict guardrails I use to prevent slipping into AI-generated prose, puzzle mechanics, plot, theme, or even grammar and punctuation.

The Problem

Large language models are trained to be helpful. When you ask one to implement a scene in an interactive fiction game, its instinct is to fill in every gap — invent a room description, draft NPC dialogue, design a puzzle, name the tavern. The result is technically functional and creatively hollow. The AI writes competent prose that sounds like no one in particular, designs puzzles with no authorial intent behind them, and builds a world that feels generated rather than authored.

Interactive fiction is an authored medium. The text is the game. Every room description sets tone. Every line of dialogue reveals character. Every puzzle encodes the author’s understanding of how the player thinks. These are not implementation details to be delegated — they are the work itself.

The solution is not to stop using AI for interactive fiction. It’s to draw a hard line between creative authority and technical implementation, and to enforce that line through process.

Creative Boundary Constraints

The methodology enforces a set of non-negotiable constraints on what the AI may produce. In practice, these are codified in a CLAUDE.md configuration file that the model reads at the start of every session.

  1. No generated prose. All player-facing text is the author’s responsibility.

  2. No narrative suggestions. Story direction, plot, and theme come from the author, not the model.

  3. No puzzle design. Puzzle mechanics are authorial craft, not implementation logic.

  4. No dialogue. Every character’s voice belongs to the author.

  5. No world-building. The fictional world is defined in specification documents, not improvised by the model.

  6. No unspecified implementation. If it is not in the spec, it is not built.

  7. Placeholder discipline. When player-facing text is needed but not yet written, the model inserts a TODO marker and moves on.
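As a rough sketch (illustrative wording only, not a copy of my actual file), a CLAUDE.md encoding these constraints might look something like this:

```markdown
# CLAUDE.md: creative boundary constraints (illustrative sketch)

You are a code generator for a Sharpee interactive fiction project.
You implement specifications. You do not author anything the player will read.

1. Never write player-facing prose, dialogue, names, or descriptions.
2. Never suggest plot, theme, puzzles, or world-building details.
3. Implement only what the specification documents describe.
4. Where the spec provides text, reproduce it verbatim, including punctuation.
5. Where the spec is silent, insert a `TODO: author text` placeholder,
   flag the gap, and keep going. Do not improvise.
```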

Why strict boundaries matter

There is a practical reason and an ethical one.

Practically, the most dangerous thing an AI can do to a creative project is produce work that is almost good enough. Near-adequate generated text is harder to replace than a TODO placeholder. It creates inertia — the project feels further along than it is, and the author begins editing AI prose instead of writing their own. Over time, the authorial voice drifts toward something generic and indistinct. A placeholder is honest. Generated prose obscures a gap behind competent but unauthored text.

Ethically, every commercial large language model was trained on copyrighted creative work — novels, short stories, screenplays, game scripts, and other narrative texts produced by working authors. When an LLM generates prose, dialogue, or world-building, it is drawing on statistical patterns learned from that intellectual property. Using AI-generated narrative in a published work means the creative substance of the project is derived, however indirectly, from the uncredited and uncompensated labor of other writers. Constraining the AI to technical implementation — code structure, state management, engine integration — avoids this problem entirely. The author’s words are the author’s own. The AI contributes engineering, not narrative.

The Development Cycle

The workflow follows a strict cycle. The author drives every creative decision. Claude handles the technical implementation.

1. Story

The author defines the story: world, characters, themes, tone, and arc. This is pure creative work. Claude is not involved.

2. Chapters / Scenes

The author breaks the story into implementable units — chapters, scenes, or locations. This decomposition is itself a creative decision (pacing, structure, what the player experiences and in what order). Claude is not involved.

3. Specification

This is where the method lives or dies. The author writes detailed specs for each unit. A scene spec should include everything Claude needs to implement it without inventing anything:

  • Room/location descriptions — The actual text the player will see.

  • Objects and scenery — What’s in the space, what can be examined, what can be taken or used.

  • NPCs and dialogue — Who’s present, what they say, how they respond to player actions.

  • Puzzle mechanics — Trigger conditions, solution steps, failure states, hints.

  • State tracking — What variables change, what flags get set, what consequences carry forward.

  • Exits and connections — Where the player can go from here.

The spec is the contract. If it’s not in the spec, Claude doesn’t implement it — it leaves a TODO. It may point out empty slots in the template, but it will not point out missing design elements.

Writing specs at this level of detail takes real effort. But here’s the thing: you’d have to make all these decisions anyway. The spec just forces you to make them before implementation rather than discovering mid-coding that you haven’t figured out what the blacksmith says when you show him the amulet.
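For illustration only (a hypothetical skeleton, not one of my actual templates), a scene spec might be laid out like this:

```markdown
# Scene: [name]

## Room description
[The exact text the player sees on entering. Final, authored prose.]

## Objects and scenery
- [object]: examine text; whether it can be taken; what it is used for

## NPCs and dialogue
- [character]: who is present, what they say, how they respond to player actions

## Puzzle mechanics
- Trigger conditions, solution steps, failure states, hints

## State tracking
- Variables changed, flags set, consequences that carry forward

## Exits and connections
- north: [destination]; any conditions on the exit
```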

4. Implementation Plan

Claude reads the specs and proposes an implementation plan:

  • What files need to be created or modified.

  • What engine features and data structures will be used.

  • How state tracking will work technically.

  • What order to build things in.

  • What’s fully specified vs. what has gaps.

The plan is presented to the author for review. Claude does not write code until the plan is approved.

5. Review

The author reviews the implementation plan:

  • Does the plan correctly interpret the spec?

  • Are the technical choices sound?

  • Are there gaps Claude identified that need spec work before proceeding?

  • Does the plan’s scope match what the author intended?

The author approves, revises, or sends Claude back to re-read the specs.

6. Execute

Claude implements the approved plan. During implementation:

  • Code follows the spec exactly. Where the spec provides text, that text is used verbatim.

  • Where text is referenced but not yet written, Claude uses 'TODO: author text' placeholders.

  • Where implementation reveals a spec gap (e.g., “what happens if the player tries to go north here?”), Claude flags it and keeps going.

  • Claude does not improvise. If something isn’t covered, it’s a placeholder, not an invention.
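As an illustration of what that placeholder discipline can look like in code, here is a hedged TypeScript sketch. The shapes and field names are hypothetical; Sharpee's actual API may look quite different:

```typescript
// Hypothetical shapes, for illustration only; not Sharpee's real API.
interface RoomSpec {
  id: string;
  description: string;              // always authored text, never generated
  exits: Record<string, string>;    // direction -> destination room id
}

// Text the spec provides is used verbatim; anything the spec does not
// provide becomes an explicit TODO placeholder rather than invented prose.
const smithy: RoomSpec = {
  id: "smithy",
  description: "TODO: author text (smithy description not yet written)",
  exits: {
    north: "market-square",
    // TODO: spec gap - what happens if the player tries to go east here?
  },
};
```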

7. Repeat

Return to step 2 or 3 for the next unit. Each cycle produces implemented, tested code that faithfully represents the author’s specifications — with honest gaps where specs are still needed.

The Gap Conversation

The key to this workflow is what happens when Claude hits a gap. Instead of filling it in, Claude stops and says exactly what’s needed. Some real examples from my projects:

  • “There’s no scene spec for the throne room. I need: room description, exits, interactive objects, and NPC placement.”

  • “The puzzle mechanics for the finale aren’t specified. I need: trigger conditions, solution steps, and failure states.”

  • “The character’s dialogue for this encounter isn’t written. I need their lines before I can implement the conversation.”

This creates an iterative loop. You write, Claude implements, Claude tells you what’s missing, you fill it in, repeat. Nothing is generated. Everything is authored. The AI becomes a very fast, very literal collaborator that keeps asking you the right questions.

Spec Document Hierarchy

I’ve found it helps to organize specs at these levels, with higher levels informing lower ones:

| Level | Contents | Example |
| --- | --- | --- |
| World | Setting, magic systems, physics, rules | How the magic system works |
| Characters | Personality, relationships, knowledge, voice | Why the merchant distrusts strangers |
| Chapters/Scenes | Descriptions, atmosphere, objects, NPCs, events | The market square at dawn |
| Mechanics | Puzzles, choices, branches, consequences, state | Trust meter thresholds |
| Implementation | Technical patterns, data structures, engine APIs | Conversation tree format |

A scene spec can reference character profiles and world rules rather than restating them. This keeps individual specs focused while maintaining consistency.
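To give a flavour of the Mechanics level, here is a small, purely illustrative TypeScript sketch of the kind of state a "trust meter thresholds" entry might pin down. The names and numbers are invented for the example and are not Sharpee's API:

```typescript
// Purely illustrative: a trust meter whose thresholds come from the spec.
const TRUST_THRESHOLDS = {
  wary: 3,      // at or above this, the merchant will trade
  trusting: 7,  // at or above this, the merchant shares what he knows
} as const;

let merchantTrust = 0; // the merchant distrusts strangers by default

// Spec-driven state change: showing the amulet raises trust by 2.
function onShowAmulet(): void {
  merchantTrust += 2;
}

function merchantWillTrade(): boolean {
  return merchantTrust >= TRUST_THRESHOLDS.wary;
}
```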

What This Gets You

This methodology works because it plays to the strengths of both author and AI:

The author provides creative vision, voice, world-building, puzzle design, and narrative craft — things that require artistic intent and cannot be meaningfully delegated to a model.

The AI provides technical implementation, code structure, engine integration, state management, and the ability to turn specifications into working software quickly — things that are genuinely well-suited to an AI coding assistant.

The boundary between these roles is the specification document. Everything above the spec is the author’s domain. Everything below it is the AI’s. The spec itself is written by the author and interpreted — never extended — by the AI.

The result is interactive fiction that is fully authored — every word, every puzzle, every story beat chosen by a human — but implemented at a pace that would be difficult to achieve solo.

Practical Notes

A few things I’ve learned along the way:

  • Session length matters. AI context windows are finite. When you're implementing scenes with lots of authored-text boundaries, keep the sessions shorter. For pure technical work (engine features, refactoring, fixing tests), you can let them run longer. The creative constraints are what degrade first.

  • Correct early and often. When Claude slips and invents something, call it out immediately. A quick reference to the violated constraint is enough. The correction reinforces the boundary for the rest of the session.

  • Write session summaries. At the end of each working session, have Claude document what was implemented, what gaps were found, and what’s next. This lets you start fresh sessions without losing progress.

  • The spec effort is real. This methodology front-loads the creative work. Writing detailed specs before implementation is harder than winging it. But the specs serve double duty as design documents, and you end up with a much more coherent game.

  • The AI will still try to help. Models are trained to be helpful. Even with explicit constraints, there is pressure to fill gaps. The boundary constraints and the process together create enough friction to keep things on track, but vigilance is part of the workflow.

Looking for Feedback

I’m curious what the community thinks about this approach:

  • Does the spec-driven cycle make sense for how you think about IF development?

  • Are there creative decisions I'm placing on the wrong side of the line?

  • Has anyone else found workflows that preserve authorial voice while using AI tools?

  • What am I missing?

This is a living methodology — I’m still refining it with every project. Happy to answer questions about how it works in practice.


This sounds interesting, but I’m having a hard time visualizing what this sort of spec would look like. It sounds like (at least for me, someone who’s better at coding than writing) coming up with the spec would be the vast majority of the work, and the LLM wouldn’t really save me much time at all—for the implementation phase, babysitting an LLM (the “vigilance” you mentioned) sounds less fun to me than just writing the code. I doubt I’m the only person who has more fun writing code than reading it!

Could you share some example specs that have this necessary level of detail, and maybe give a ballpark estimate of how much time/energy Claude saves you on implementing them?


I think most IF authors are going to fall into the "code and write together" group. What I'm suggesting is a different (but proven with Textfyre) approach. Other authors have written transcripts before coding their games. This is essentially the same thing, but more document-driven and less playthrough-driven.

That's not really the part I expect anyone to cheer on. The part I want feedback on is the guardrails. Am I missing anything that might leave something in front of the player that is AI-generated and not mine (or another IF author's)?

And honestly, managing Claude isn't that hard, and the warning about its helpfulness is just a requirement. With the CLAUDE.md I have set up, Claude will adhere to the separation of Author and Code Generator 99% of the time.

Lastly: I'm doing this so I can tell stories faster than I could with other platforms. If no one else ever uses Sharpee, I'm still satisfied with its creation and my intended use case.

David, thanks a lot for writing this up! An interesting project. As Daniel suggested, it would be great to see an example.

Since the idea is to keep the implementation details away from the author, what are your experiences with getting back on track if the context gets polluted, e.g. because the AI simply didn't obey the rules, or because of bogus instructions from the authors themselves? And in my own experience, it is a volatile step when you need to move to a new chat because the context size got too large.

This is going to be one of the things I can do and others may choose not to. I pay for the Claude Pro Max $230/month subscription. I get 200k context windows with Opus 4.6. This is absolutely a difference-maker when generating code. The acceleration of Sharpee's completion was very clear from December to now. If you're using a free or $20/month subscription, you're going to get a smaller context window, and session iterations will be smaller and more frequent.

This is actually why designing the story in a template is important. You can leverage each session effectively.

I'll provide examples as I get further down the dog-fooding path. The templates are attached:

  • appendix.md (212 Bytes)

  • chapter.md (1.5 KB)

  • puzzle.md (763 Bytes)

  • technical-notes.md (605 Bytes)

  • world-reference.md (3.0 KB)

This is disputed. Many would say that GenAI plagiarises code just as much as it does prose.

This is the part that puzzles me a bit about your intended use case. It sounds like writing the spec to a sufficient level of detail is probably nearly as much effort as just writing the code; on top of that you have the effort of guiding Claude in writing the code, and then you have to go back and fill in any placeholder text you missed. I can see how this might appeal to someone who is more confident with writing a detailed specification than writing code (although they'd still need to be able to check Claude's work), but that clearly isn't you. Have you measured how long it takes you to implement a section of a game this way versus doing the same in an established authoring system without the AI assistance?


The volume of public domain code (especially Typescript) gives me comfort on this point.

The resulting story from Sharpee will also be something I prefer and I think building story-specific clients will be vastly easier.

Like I’ve said in the past: Inform 6/7 and TADS never made me comfortable. I definitely was one of the people that got lost in the declarative nature of Inform 7. “Kinds” drove me nuts.

By designing by template and leveraging a GenAI-friendly toolset, I get to an end point much faster than I otherwise would. And the client (a Tauri cross-platform installer) affords a very simple delivery mechanism.

I need to answer this another way:

There is always a “translation” from the author’s intent to code. I built Sharpee to simplify that translation. Instead of worrying about Inform 7 paradigms, I can focus on IF Storytelling paradigms.

I can write a set piece of rooms, puzzles, dialogue and I literally don’t care about the implementation in code. I will also implement the tests to make sure the implementation matches my storytelling intent.

I 100% understand that there are people who believe this translation from intent to code is just as important for the “art of IF” as the writing and the story, and I see that point well-enough. I’m just choosing to move forward without that distinction.

I’m not sure I exactly remember the details, but I believe Zarf made automation tools to build Hadean Lands. Is this not the same thing?

You’re responding to a point I didn’t make here. This is a separate debate, but my point was actually: to me it sounds like the end result of your efforts is that you can now do in X hours something that could already be done in X hours? It’s possible I’m wrong and your workflow is genuinely way faster, but for someone who is a competent programmer I don’t feel like the actual implementation is the main time sink in writing IF.

Of course, if you really dislike working in Inform or TADS or any of the other existing platforms and wanted to make your own anyway, you’re welcome to build whatever workflow you like. I remember you saying in the past that you hoped to write a game the size of Curses; do you think you’re likely to do so with your new system?

FWIW, in case anyone else wants to try it for less, GPT-{5.2,5.3}-Codex have 272k context windows and can be used on the $20/mo Copilot plan.


If I can create IF stories without any concern for code, yes. I will add that something as big as Curses is no longer on my bucket list. My WIPs are short and emotion-driven. None of my current ideas are remotely treasure hunts.

Here’s what you can’t see (yet). I design a set piece. Maybe 5 rooms, all the text, all the puzzle logic. Claude will write that code in literally 30 seconds, maybe 60 and test it based on my exit criteria.

I think it will be a fun exercise to do a Speed-IF demonstration on Zoom. I'll get the madlib suggestions, write the spec, generate the story, and test and iterate for an hour. The resulting story will be orders of magnitude better implemented and will likely contain all of the madlib items, with multiple endings. Iterations inspired by each compile will only add to the story.

I'm waiting for OpenAI to go bankrupt and for someone else to buy them out. When Altman is gone, I'll start looking at the GPT models again.

I think this approach can only be attractive to non-programmers. I’ve been a programmer all my adult life, and have learnt Dialog and Inform-6 in recent years, so the part Claude would do here (step 4) is the part an experienced programmer can already do (and would rather do) themselves.

The danger for non-programmers, or people for whom programming does not come naturally, will be an inability to validate the code created by the LLM, except by a crude “does it work?” testing method. But I guess that would be acceptable for many people.


Claude can also write transcript tests and walkthrough transcript tests that validate the play-testing. This is a critical authoring step, and it's been thoroughly designed for a non-programmer.

See: sharpee/stories/dungeo/walkthroughs at main · ChicagoDave/sharpee · GitHub
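For a rough idea of the shape (an illustrative sketch, not the actual format in that repository), a keyword-based walkthrough test might look something like this in TypeScript:

```typescript
// Illustrative sketch of a keyword-based walkthrough test; not the actual
// Sharpee format. Each step sends one command and asserts that certain
// keywords appear somewhere in the game's response.
interface WalkthroughStep {
  command: string;
  expectKeywords: string[];
}

const openingWalkthrough: WalkthroughStep[] = [
  { command: "look", expectKeywords: ["market", "square"] },
  { command: "north", expectKeywords: ["smithy"] },
  { command: "show amulet to blacksmith", expectKeywords: ["blacksmith"] },
];

// `send` is whatever function feeds a command to the game and returns its text.
function runWalkthrough(send: (cmd: string) => string): void {
  for (const step of openingWalkthrough) {
    const output = send(step.command).toLowerCase();
    for (const kw of step.expectKeywords) {
      if (!output.includes(kw.toLowerCase())) {
        throw new Error(`"${step.command}" is missing keyword "${kw}"`);
      }
    }
  }
}
```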

There are already tools that can check a correct transcript exactly against the output produced by the game to make sure no library message formatting etc. has been accidentally messed up. Your transcript tests only check for the presence of certain keywords in the response to each command. Is this a limitation of using Claude or could it check the full text?

This sounds like a conversation about “Why did you bother doing any of this?”

I've answered that question. It's mostly about just wanting to go through the process of building a parser-based platform, drawing on my own development and architecture experience from being a consultant for most of my career.

We’ll see how it goes writing stories and then I’ll have more to say on its success from my own perspective.

And pointing out that “there are existing tools that do that” is not really relevant to Sharpee because Sharpee is new, in Typescript, and can do nearly everything differently and in many cases faster and better.


This is super interesting. Thank you for sharing.

Building an existing diff tool into your workflow seems like it should be possible even when the platform is new. This was a question about the technology - is the diff being carried out by Claude directly, and if so is that why it doesn’t check the full text? (Presumably that’s more expensive? I don’t know enough of the details to know if that’s how this works.) Could it just call out to an external diff program instead and check the response?

You asked “what am I missing?” and I was curious whether you’d missed a simpler way to do something that you wanted to do, that’s all.