Audio IF

Hey All–
I’ve been polling everybody I know and everyone I meet (mercy, I’m annoying) about why they don’t play IF, or if they do, why they don’t like parser IF, and what I hear time and time again are these two things (eliminating the people who just think games and fiction are a waste of time):
1.) I don’t want to type.
2.) I don’t want to read a game, because I have trouble reading (dyslexia, eyesight issues, lack of time, not a good reader, etc). There are A LOT of people who love fiction and games, who should definitely be our people, who listen to books all the time, but who simply don’t find reading enjoyable for whatever reason.

Which all got me thinking and asking people: what if you could play IF the way you listen to an audiobook, vocally controlling user input? And there was a very positive response to this, with people saying that they could play on their commutes this way, or during any of the times that they listen to podcasts or audiobooks. But they all said that very good voice acting was an important component of their willingness to try such a thing. And that sort of eliminates robotic Siri/Alexa-type screen reader programs, which I think are not thrilling narrators. I did find this topic discussing something kind of like this, but don’t know if that went anywhere. There’s a program called Perplexity that’s been used to write some games, and its mission is to understand user language so as to take typing out of the parser experience. As I recall, the games were hard to play and the program needed some work (or maybe it was just the games and the program is fine). And I found this topic about an audio interpreter, but it didn’t seem to gain any traction here.

Has anyone tried making IF with real voice acting, combined with the ability to vocally direct the user’s commands? I wonder if such an adaptation would bring more people into the fold and perhaps get paying customers. Thoughts?


This appears to be out of date, but it leads to a couple Git repositories.

This would be a lot easier to do in a choice game. Many Twine flavors allow per-passage sound, so you could record the text and number the choices, provided you could fudge it so that a keypress or saying a number activated each narrated choice.
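As a rough sketch of the "say or press a number" idea (plain Python rather than any Twine story format, and every name here is made up for illustration), the same lookup can serve both keypresses and whatever text a speech recognizer hands back. The homophone entries are only a guess at common misrecognitions.

```python
# Hypothetical mapping from spoken number words (plus a few guessed
# homophones a recognizer might return) to choice numbers.
NUMBER_WORDS = {
    "one": 1, "won": 1,
    "two": 2, "to": 2, "too": 2,
    "three": 3,
    "four": 4, "for": 4,
    "five": 5,
}


def numbered_lines(choices):
    """Return the lines a narrator would record, e.g. '1. Open the door'."""
    return [f"{n}. {text}" for n, text in enumerate(choices, start=1)]


def pick_choice(heard, choices):
    """Return the choice matching a typed digit or spoken number, else None."""
    token = heard.strip().lower()
    if token.isdecimal():
        index = int(token)
    else:
        index = NUMBER_WORDS.get(token)
    if index is None or not 1 <= index <= len(choices):
        return None  # caller can simply repeat the choices
    return choices[index - 1]
```

Returning None instead of raising lets the game replay the numbered options when it hears something it cannot place.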


I briefly explored the subject, first in the context of young kids who just don’t know how to read, and second for commutes.

Some takes:

  • good actors are expensive
  • having multiple actors is very expensive
  • speech-to-text is pretty good for classic verb+noun, the problem is when you need to correct what you entered. And if it didn’t understand you the first time, you’re in for a ride.
  • people don’t actually like speaking to the machine for a long period of time, because the more you speak, the more you risk hitting the problem just above, which causes stress.
  • people don’t like talking to the machine if they are not alone (especially when playing a game, where you tend to say very awkward things!)

I gotta say, your idea is amazing! There is a quite good example of a story being told only with narration in a game called Killer Frequency. It’s a “slasher” 80’s style story where you work as a radio station host and you have to handle 911 as the killer is on the loose. You don’t even see the killer (or the scenarios in which the chasing sequences are happening), and the voices carry the story in such a way that creates tension and suspense with only voice acting.

You can give it a try by “parsing” the answers with simpler commands and yes/no answers to start. I think an “assistant” that says things like “Maybe we should do x or y” would let the player pick and have some agency without having to say extremely complex things for the game to understand.



Can confirm how expensive it is. This is also something that novice devs often overlook, because Uncle Bob would only pay $50 for narrating an entire book, so why would they pay more (this general sentiment is extremely common).

I was doing a project where “only” around 300 words were needed, for 2 actors, with a budget of about the same size (roughly the union minimum, priced per word!). I spent 4 weeks just finding people who would agree to my “indie” budget (that was all I could muster, but I understood their decisions), and more trying to convince people to audition. Ninety-something percent never replied, and some even sent me on my merry way in a not-so-friendly email (they should’ve looked at Fiverr for inspiration).

This is why it mattered to spend all the production budget on only 2 voice actors: VO work requires more than just reading some lines. Acting is where it’s at. And note that these actors were at the bottom of the chain (semi-professionals). The biggest issue was finding voices that matched the needs of their roles (you simply can’t expect certain people, say with a high-pitched voice, to be able to pull off a deep and rich one).

The ones who do this for a living start around $1-2K minimum, do takes by the hour, and are paid by the hour. No revisions, nothing, because that time isn’t paid. It can become astronomical real fast.

Worth also noting that the absence of good voice acting sticks out like a sore thumb, and can turn players off pretty quickly. Which is why so much effort was put into finding the right talent (within the restrictions), because the VO was supposed to sell the product (it was that important, and that was just a tactical game, not even IF).

Also, I forgot to mention that the practical side of recording the lines is often overlooked: do you have equipment to record, and/or does your actor have a studio? All of this affects the final quality of their voice, which also needs to be processed, preferably by an audio engineer.


Yes. But for a test case using an existing game, surely there’s SOMEONE here who could do a pretty good job and who might donate their time. I mean, there can’t be all these IF nerds without at least a few of them also being theater nerds.

Well, what about a command word that prompts the program to enter something? Like, you might be muttering and thinking aloud and then you’d say COMMAND; GO NORTH. That way it would only enter what you directed it to. Not sure how to deal with correcting bad commands.
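To make the COMMAND idea concrete, here is a minimal sketch that assumes some speech-to-text step has already turned the audio into a string. The function name and the rule "only the words after the last COMMAND count" are my own assumptions, not anything an existing interpreter does. That rule is one possible, untested answer to the correction problem: if you misspeak, you just say COMMAND again and start over.

```python
import re

WAKE_WORD = "command"  # hypothetical trigger word


def extract_command(utterance, wake_word=WAKE_WORD):
    """Return the words spoken after the last wake word, or None.

    Anything said before the wake word (muttering, thinking aloud)
    is ignored, so only deliberate input reaches the game.
    """
    words = re.findall(r"[a-z']+", utterance.lower())
    if wake_word not in words:
        return None
    # Index of the last occurrence, so a restated command wins.
    last = len(words) - 1 - words[::-1].index(wake_word)
    command = " ".join(words[last + 1:])
    return command or None
```

For example, "hmm, maybe the cellar... COMMAND; GO NORTH" yields "go north", while a sentence that never contains the wake word yields None and is never sent to the parser.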

Of course, I have absolutely no idea how one would go about this. But I wonder if we picked a well-known and good game, and all threw our weight behind it and made such a thing, if it might not get some attention. Or, of course, it might not move the needle for IF at all and it’s all just a big waste of time. But it just drives me nuts that I think IF could go more places if we met the needs of more users, so that’s why I’ve been harassing everyone about what could be done.


I think it would require a major redesign from the traditional parser paradigm. Most parser transcripts are filled with ‘You can’t go that way’ and ‘I can’t see any such thing’. Hearing that for the 15th time while working on a nasty puzzle would be really off-putting.

So I wonder if a more story-focused game with fewer hard edges would work better. Mirror and Queen, My Angel, and Laid Off from the Synesthesia Factory all do that. All three abandon traditional error responses and just progress the game if the character messes up.

Another option could be drastically increasing the output-to-input ratio, so there is a lot of text for each command. That leads more naturally towards a choice-based game, though.

I do agree with others that I’m loath to use vocal commands in public or around others (because I don’t want to annoy people).


If nothing else, you could always hit up sites such as Casting Call Club. Some would literally kill for an opportunity like this. Just don’t expect much.

  1. People don’t want to read
  2. People don’t want to type

Actually they do! This is a myth. And especially on commutes. All of 'em texting away like crazy.

So what am I saying…

My theory is that it’s to do with the ratio of interaction to response. A short turnaround, like messaging, gets people interested. Long walls of text don’t.

And pictures. And sounds. But that’s another direction.

To be specific, I don’t think you need voice input. Choice + click + minimum parser is the answer to input (as covered in Mike Russo’s article).

For output, voices would indeed be nice. For some time now, I’ve been keeping an eye on the evolution of TTS. To get anything at all hopeful, you need some kind of speech markup (e.g. SSML). But even then, current offerings are poor.
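For anyone who hasn’t seen speech markup, here is a small sketch of what generating SSML for a passage might look like. The `speak`, `s`, `emphasis` and `break` elements come from the W3C SSML spec; the helper function, its parameters and the sample text are invented for illustration. A strict document would also carry version and namespace attributes on `speak`, which I’ve left out, and how well any given TTS engine honors `emphasis` varies.

```python
import xml.etree.ElementTree as ET


def to_ssml(sentences, pause_ms=400):
    """Build an SSML string from (text, emphasized) pairs.

    Each sentence goes in its own <s> element, followed by a
    <break> of pause_ms milliseconds.
    """
    speak = ET.Element("speak")
    for text, emphasized in sentences:
        sentence = ET.SubElement(speak, "s")
        if emphasized:
            strong = ET.SubElement(sentence, "emphasis", level="strong")
            strong.text = text
        else:
            sentence.text = text
        ET.SubElement(speak, "break", time=f"{pause_ms}ms")
    return ET.tostring(speak, encoding="unicode")


ssml = to_ssml([
    ("You are in a dark cellar.", False),
    ("Something moves.", True),
])
```

Building the markup with an XML library instead of string concatenation means characters such as `&` or `<` in the story text get escaped automatically.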

But now we have AI TTS. Been looking into this as something perhaps good enough to use. Right now, I’m seeing voices that sound good, but without enough control over emphasis, emotion or intonation. Hopefully, that sort of thing can be added to the AI. One day.

One day.


This is what people are telling me: that they don’t want to do this for fun. And they are telling me this over and over and over, in all age groups. Obviously your sample set may vary.


Have you thought of adapting something like or as an experiment (with author permission)? Or writing and implementing a similar game yourself. Each one has just a single command, and is small enough that the voice acting would be something like 15 minutes or less of work. The limited verb set should make it easier for voice commands to be recognized, too.


Yes, this would be the way to go for a trial. But, er, I’m not thinking of doing it because I don’t know anything about how to do it. I was kind of hoping someone smart would want to try it. I’d certainly be willing to do the type of grunt work that I’m capable of doing, or of supporting it financially, etc. I think it’s an idea worth batting around some more, for sure. I just can’t accept that all these readers and gamers, people who should want to play IF, are not playing it, and if it’s because we’re not meeting them where they are, that’s a pity.


Admittedly I have considered doing this, but what I had in mind was an audio-only, realtime shoot-and-loot game, similar to the Toby Accessibility Mod for Doom 2. This probably reveals a lot of my gamedev biases; I hadn’t even considered voice acting quality being an obstacle…


I had a thought about this the other day. I was thinking about taking a long-running television show with a fixed cast, like Law & Order or Grey’s Anatomy, and using the literal hundreds of hours of audio dialogue by the same actors in many varying situations and rearranging it to form a completely different narrative. Like using voiced lines from The Simpsons to tell a hard-hitting horror story, for example. You could then use this voiced narrative you arranged for your game.

It’s definitely a transformative work, making it fall squarely under fair use. I don’t think you could go commercial, obviously, but that isn’t the intent much of the time.

I have more thoughts on this, but I’ll have to circle back around.


I stumbled across an authoring system called StoryHarp on IFWiki. There is or was a version that supported voice interaction, and apparently a game was written with it. Just mentioning it in case it might be relevant/useful to see what else has been attempted before.


This is an interesting subject for me, because while I am a typer and a traditional computer user, I also take pleasure in spoken language and in the act of expressing meaningful information using correct language. Honestly, I occasionally take enough pleasure in the narrative continuity of the dialog conceit of the parser to actually type out longer form commands instead of abbreviations, although most of the time the effort to do so outweighs the pleasure in the narrative expression of my idea. I think speaking naturally and being easily understood would be close to an ideal, but maybe that’s so personal as to not be applicable for many other typical parser IF players.

Yet, as @AmandaB indicates in the OP, outside of the typical IF playerbase, people often at least say they might enjoy input by speech in a narrative game. As others have said here and elsewhere, being recognized by the speech input system can be difficult. I learned how to speak in a specific way to be recognized by a specific phone system, and it was annoying because I felt like I had to adopt the persona of an American military deskman in order to get anywhere. But that old telephone-based system is probably inferior to the consumer audio gadgets and run-of-the-mill speech software and audio assistants that consumers have these days.

  • good actors are expensive
  • having multiple actors is very expensive

But synthesized voices have come a very long way. Microsoft Word has several very nice English computer voices that can read your documents; you can turn texts into workable audiobooks for free by copying them into Word. I’d certainly be willing to try playing an IF game that way.

What somebody should do, for an experiment, is rig up a screen reader/speech-to-text solution for both input and output, and then record all the input and output audio as some kind of podcast IF “let’s play.” So, it’s Microsoft Bob reading the text of Uncle Zebulon’s Will or whatever, and then it’s you or me IFer saying the commands, for an entire play-through. That might be fun and illuminating.


I think the answer to “I don’t want to type or read” already happened and it was called graphic, point-and-click adventures. Would an audio-only game be more enticing to really anybody than a slightly verbose graphical adventure or RPG? (Assuming for the sake of argument that the hypothetical game is playable non-visually.)


What I’m hearing is that some people want written stories, and they want to interact with them. They just prefer having them read aloud. I know quite a few people who are huge consumers of audiobooks, because they can cook dinner or drive or make pottery while they get their story, or because their dyslexia makes reading a chore. Is this enough people to make a difference? Maybe.


So, I would argue that the key point to this is being hands-free. Folks want to be able to drive, cook, clean, work, whatever, without pulling out their phone to tap an option or, worse, text.

It doesn’t need to be a faithful recreation of a text parser and, quite frankly, the complexity of that sort of object-driven world model would quickly run into problems with incorrect voice recognition, as others have pointed out.

So, don’t do that. Meaning, narrow down your verb set to directions (NSEW+UD), look, and, when options become available, either 1, 2, 3 or A, B, C, whichever is less likely to be confused by voice recognition. Take the world model from parsers and the streamlined choices of Twine and mash them up. Drop the inventory and objects altogether. Being presented with various options will be the most reminiscent of CYOA, which is the most IF-like thing many in this audience are familiar with, while adding the exploratory layer of walking around and seeing the world around them.
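Here is a minimal sketch of that mashup, assuming the voice layer delivers one recognized word at a time. The `Room` structure, the `step` function and the sample rooms are all hypothetical. Anything unrecognized or impossible just re-narrates the current room rather than producing a parser-style error.

```python
from dataclasses import dataclass, field

DIRECTIONS = {"north", "south", "east", "west", "up", "down"}


@dataclass
class Room:
    description: str
    exits: dict = field(default_factory=dict)    # direction -> room name
    options: list = field(default_factory=list)  # (label, result text) pairs


def describe(room):
    """Room text followed by its numbered options, as one narration."""
    parts = [room.description]
    parts += [f"{n}: {label}." for n, (label, _) in enumerate(room.options, start=1)]
    return " ".join(parts)


def step(rooms, here, word):
    """Apply one spoken word; return (room name, text to narrate)."""
    room = rooms[here]
    word = word.strip().lower()
    if word in DIRECTIONS and word in room.exits:
        target = room.exits[word]
        return target, describe(rooms[target])
    if word.isdecimal() and 1 <= int(word) <= len(room.options):
        return here, room.options[int(word) - 1][1]
    # "look", a blocked direction, or noise: just narrate the room again.
    return here, describe(room)


rooms = {
    "bunk": Room("You wake alone in the bunk room.", {"west": "airlock"},
                 [("Check the radio", "Only static.")]),
    "airlock": Room("The airlock hums.", {"east": "bunk"}),
}
```

There is no inventory and no object model here on purpose: the whole vocabulary is the six directions, "look", and the option numbers.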

Peeps want to have an interactive experience during their commute, maybe trying to figure out why they, as a member of the first Mars base, woke up alone and must discover what happened. Or perhaps they want to simply romance their High School Volleyball Coach. I suspect most don’t want to tackle Hadean Lands in their head during their commute.

The game should be endeavoring to be touch free. If the airlock is west and you say “west,” it should have you suit up, cycle the lock, and step out onto the Martian plain, all in one cinematic description. Someone merging into rush hour traffic doesn’t want to figure out all of that in incremental steps.
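One way to picture this, as a sketch only: each exit carries the list of steps the game performs on the player’s behalf, and moving stitches them into a single sentence. The `EXITS` table, the `go` function and the wording of the steps are made up for the example.

```python
# Hypothetical exit table: (room, direction) -> destination plus the
# actions the game carries out automatically when the player moves.
EXITS = {
    ("base", "west"): {
        "to": "plain",
        "implied": ["You suit up", "cycle the airlock",
                    "step out onto the Martian plain"],
    },
}


def join_steps(steps):
    """Turn a list of steps into one sentence."""
    if len(steps) == 1:
        return steps[0] + "."
    return ", ".join(steps[:-1]) + ", and " + steps[-1] + "."


def go(here, direction, exits=EXITS):
    """Move in one turn, narrating every implied action along the way."""
    exit_info = exits.get((here, direction))
    if exit_info is None:
        return here, ""  # nothing that way; the caller decides what to say
    return exit_info["to"], join_steps(exit_info["implied"])
```

Saying "west" in the base then produces one line of narration covering the suit, the lock and the walk outside, with no intermediate commands.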

This shouldn’t take more focus than listening to a podcast.

Let them drive.