Speech I/O Interactive Fiction With an Eye Toward Augmented Reality

“Natural” IF Interfaces May Be Closer Than You Think

Speech is natural between people, and it may now be close to being natural with computers too. Creating a natural flow between the player(s) and the story combines cutting-edge embedded technology with computer design principles stretching back nearly 70 years.

Early mainframe systems connected “dumb” terminals to a large central server whose only task was to process requests that users entered through the stripped-down interface of a screen and keyboard.

Computing power has ebbed and flowed from the mainframe to the “edge” clients over the years. Sometimes (as in the case of a ‘local’ terminal on a Linux PC) the two reside on the same machine. The screens today are larger and full color but for large-scale processing we still access mainframes from “dumbed-down” computers.

Applying the proven “dedicated” compute model to IF may be the best method for achieving interactions with IF that are nearly as natural as talking with a friend. It may be cheaper to build a server/terminal IF system besides.

IBM broke with the ‘traditional’ integrated PC designs of the day with the compute unit separate from the I/O, specifically the keyboard and mouse.

The IF design likewise calls for separating I/O from processing, and I was delighted to find that the separation actually turns out to be cheaper to build!

The Dedicated Voice Keyboard

Consider this embedded platform device, the ESP32-based Adafruit ‘HUZZAH32’, as the base voice I/O platform:

This little board clocks in at around $20.00 and comes complete with battery interface and charging circuit, capacitive touch, and plenty of I/O including I2S, Wi-Fi, etc. It is, in short, all you need to interface with a microphone array and stream the input to an (optionally) dedicated server or a PC running an IF virtual server. This board features “I/O a-plenty” and is a great development platform for new ideas along the building path.

Meanwhile, new microphone technology is cheap, compact, and requires only simple circuitry. For these “few bucks” they’re accurate enough to detect a gunshot and its direction from hundreds of meters away.

Imagine, then, this little device housed in a brushed metal case with a slim battery and a printed microphone array based on these I2S microphone breakout boards. The device streams the voice input to the central server. What you have, then, is a universal speech HID that fits nicely in the palm of your hand.
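
To make the "device streams the voice input to the central server" step a bit more concrete, here is a minimal sketch of how the audio could be cut into framed packets and reassembled on the server. Everything specific here is my assumption, not part of the design above: 16 kHz 16-bit mono PCM, 20 ms frames, and a made-up header of a sequence number plus a payload length. It is written in Python for readability; the device side would more likely be C or MicroPython.

```python
import struct

SAMPLE_RATE_HZ = 16_000   # assumed sample rate
BYTES_PER_SAMPLE = 2      # assumed 16-bit mono PCM
FRAME_MS = 20             # assumed frame duration
FRAME_BYTES = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * FRAME_MS // 1000

# Hypothetical header: sequence number (uint32) + payload length (uint16),
# both big-endian.
HEADER = struct.Struct("!IH")


def pack_frame(seq: int, pcm: bytes) -> bytes:
    """Prefix one chunk of raw PCM with its sequence number and length."""
    return HEADER.pack(seq & 0xFFFFFFFF, len(pcm)) + pcm


def chunk_audio(pcm: bytes, frame_bytes: int = FRAME_BYTES):
    """Yield framed packets for a block of captured audio."""
    for seq, start in enumerate(range(0, len(pcm), frame_bytes)):
        yield pack_frame(seq, pcm[start:start + frame_bytes])


def unpack_frames(buffer: bytes):
    """Split received bytes into (seq, pcm) pairs.

    Returns the complete frames plus any trailing bytes of an incomplete
    frame, which the caller keeps and prepends to the next read.
    """
    frames = []
    offset = 0
    while len(buffer) - offset >= HEADER.size:
        seq, length = HEADER.unpack_from(buffer, offset)
        end = offset + HEADER.size + length
        if end > len(buffer):
            break  # payload not fully received yet
        frames.append((seq, buffer[offset + HEADER.size:end]))
        offset = end
    return frames, buffer[offset:]
```

The sequence number lets the server notice dropped or reordered packets if the stream goes over UDP; over TCP the length prefix alone would be enough.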

The server may also be embedded. Since the server acts as a dedicated voice processor and IF interpreter, we don’t care about anything but its core functionality. Since we’re not tangling with I/O on the server, we can get away with a cheap Orange Pi Zero Plus 2 or similar, costing around $25.00.

We don’t even need a dedicated embedded server; a virtual machine running on a PC will do nicely. You get the speech processing power of the PC interfaced with your portable “speech box.”

We tell the server about the mic array’s placement through the PulseAudio module module-echo-cancel for excellent noise-cancelling audio processing. Quality voice input, I (re)discovered, matters more for good results than the recognition software.

The dedicated Speech/IF server runs only the following:

  1. Speech processing daemon (PocketSphinx or TensorFlow)
  2. IF web interpreter
  3. proxy server

The proxy server serves an ‘open’ Wi-Fi network and directs the player’s browser straight to the IF web parser on connection. This is similar to the TOS page you first see when you connect to Starbucks Wi-Fi, and it makes connecting to the game from your phone effortless.
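
Here is a rough sketch of the redirect half of that idea: any HTTP request that is not already addressed to the game gets bounced to the interpreter's page. The address, port and /play path are placeholders I made up, and a real captive portal also needs the access point's DNS/DHCP set up so that requests reach this server in the first place, which is not shown.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical address of the IF web interpreter on the game's Wi-Fi network.
GAME_HOST = "192.168.4.1:8080"
GAME_URL = f"http://{GAME_HOST}/play"


def portal_response(host_header: str):
    """Decide how to answer a request based on its Host header.

    Requests for any other site get a 302 redirect to the game; requests
    already addressed to the game host return None (pass through).
    """
    if host_header == GAME_HOST:
        return None
    return 302, {"Location": GAME_URL, "Cache-Control": "no-store"}


class PortalHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        answer = portal_response(self.headers.get("Host", ""))
        if answer is None:
            # Placeholder: a real setup would hand off to the interpreter here.
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.end_headers()
            self.wfile.write(b"IF interpreter goes here\n")
            return
        status, headers = answer
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()


def serve(port: int = 8080):
    """Run the portal until interrupted."""
    HTTPServer(("0.0.0.0", port), PortalHandler).serve_forever()
```

Calling serve() on the game server would start it; phones that probe for connectivity over plain HTTP would then land on the game page.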

What the architecture really affords is opening the experience to several people around the table. Each person has their phone stood in front of where they’re seated. They can pass the speech device among them at will. This would be especially true if the device also contained a camera whereby players could look at augmented reality images either in their book or on the map spread across the table.

I’m going to take off from here using my book-based IF concept.

Print and Play

So now everyone has their well-typeset book with tabs (created in LaTeX and possibly ‘psnup’ for signatures), their phone, and a map in the middle of the table. Each player’s phone displays the parser view and their books are chock-full of all kinds of information (and parser output).

We bundle the PDF for the player to print ahead of the game, “cheap and simple.”

I’ll use ‘Game of Thrones’ for this example. Now, I’ve never seen GoT except for the clip where the little guy declares a duel with his accuser by “blood rights of his house” or whatever.

Okay, so here’s the map spread across the dining room table. You can use a marker to designate where you (think) you are:

The book contains an inset map outlining the interior of the court chamber. You witness now the proceeding and decide the accused’s guilt:

say, “down with the little guy!”
“Order! Order!” exclaims the judge, “we shall have silence! One more utterance and I shall have you thrown in the dungeon!”

Now, maybe you pipe up again and get thrown in the cooler. Maybe the guard winks at you and the game takes on a whole other genre, if you’re into that sort of thing. Who knows?

Okay, back to the main map, your posse decides to turn North and you see this:

So you explore around (possibly with a camera module in the speech device and augmented reality enabled with JavaScript on the server). A transparent overlay tells you where you are, followed by a typographic dagger marking objects if any are to be found. Descriptions for all of this are found in each player’s book. Examining an object, for example, may render a 3D model that you can turn around and look at to your heart’s content.

Many other opportunities may be had, like a D&D-like setting where the DM gets a supplemental printed manual. The mix of player dynamics is wide, and all this may be done through speech alone.

So there you go, what do you think?

By the way, special thanks to Hanon Ondricek for getting me thinking in this direction.


my 2c;

I think speechIO could be a thing insofar as TTS and STT (Speech to text). But it would still nevertheless be a human interface to an IF system.

Interestingly, this sort of thing could reboot interest in classic parser-if technology, except this time around using a speech interface rather than typing words.

However, the real problem lies behind the speech interface in the semantic realm.

For some time, I’ve been looking into the problem of knowledge modelling; the problem of how to represent knowledge on a computer in an abstract way.

At this point in time we have an AI movement which is making the claim that DNNs (Deep Neural Nets) are on the verge of representing information as well as recognising it.

Except I’ve not seen any evidence in DNNs for:

  • knowledge representation & memory
  • thinking & reasoning
  • scaling (ie learning on the back of other learning)

What they are good for is recognition:

  • handwriting
  • pictures of cats
  • spoken words into text

Those things are great, and definitely useful, but there will be no progress on AI until bits on the first list are tackled.

I personally believe there is something significant missing from the current generation of AI and DNNs. I’ve spoken to several companies who claim that they’re doing the TTS and STT bit first, and that the “brain between the ears” will (somehow) materialise later once DNNs are scaled up, etc.

Usually, these companies are, in reality, planning to ride the AI hype and sell out before the over-promises are discovered. Some are even successful.

The same happened last time around in the 80’s. Apparently back then AI was going to revolutionise factory robots and take people’s jobs. Sound familiar?

Anyway, I actually do think the first list above is possible, but at its core it is a symbolic problem, not a statistical one, although stats are useful as well.

Back in the 70’s the physical symbol system hypothesis was proposed by Allen Newell and Herbert Simon. It states that:

A physical symbol system has the necessary and sufficient means for general intelligent action.

I think there’s some truth in this, although research in symbolic systems wound up too much like programming languages. Instead what’s needed is a much more organic direction.

I’ve been working on some ideas in this direction, and I think early experiments might be appropriate for a game system as a first step. Game systems are a restricted domain, so it makes things a lot easier.

And even if ideas don’t work out, it might make for a fun game still.

I should start a new thread but I’m lazy. Note that I call only for a modernization of the interface, though I find your response intriguing, not least because I have experience with neural networks. I’ve never thought about probability vectors for speech recognition. This is a fascinating idea; vectors “cut through” tens of thousands of word possibilities by taking the speech input in context.

Because I didn’t read your idea carefully enough the following is a vectorized IF framework. I leave it here as I feel it’s good stuff. Please know that the method you cite for speech recognition is, once again, brilliant.

If we agree that we must define a conceptual framework for the system, may I humbly suggest the Umpire model from wargaming, where the Umpire role is taken on by the computer. The ‘red team’ is you and the ‘adversary’ blue team is the environment:

My second tenet is that (at least until our model becomes sophisticated) our story’s foundation is open-ended. There are no hard-coded storylines. When the player “wakes up in the morning” in this environment neither he nor the author has any idea how the story goes. We say, “here’s a sandbox, go play.”

To further use the wargame analogy (I hate war, btw.) we create a textual environment that is visualized thus (I use physical space here, but the same may be applied to ‘conversation landscapes,’ etc.):


Super basic.

So now we divide the landscape into squares. We then load up a corpus and teach the parser all about sand, hills, creeks, etc. in general. The overall area is characterized by a large corpus of area geology, chemistry, physics, and human descriptions inherently specific to that area.

If we’ve done our modelling correctly the computer now roughly:

  • knows the overall geology of the sand including “how it got there”
  • knows the topography of the area
  • understands various characteristics like moisture density, color, texture, etc.

I visualize IF authors spending their initial efforts not manually writing descriptions but rather scouring for source material to build a large corpus.

Once compiled, we run a word-embedding analysis such as word2vec or GloVe (global vectors) over the corpus to build relationships in its domain space.
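
For anyone who hasn't played with word vectors, here is a toy stand-in for that step. To be clear, this is not word2vec or GloVe: it just counts which words appear near each other and compares the counts with cosine similarity, which is the intuition those tools build on. The five-sentence corpus and the window size are invented for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt

# Tiny made-up corpus; a real one would be the bulk source text for the area.
CORPUS = [
    "dry sand drifts across dry dunes",
    "wind shapes sand dunes",
    "wind shapes dry dunes",
    "cold water fills the creek",
    "the creek carries cold water",
]


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][words[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


vectors = cooccurrence_vectors(CORPUS)
sand_vs_dunes = cosine(vectors["sand"], vectors["dunes"])
sand_vs_creek = cosine(vectors["sand"], vectors["creek"])
```

On this corpus "sand" ends up closer to "dunes" than to "creek" simply because they keep the same company. A trained embedding model does the same job far better on a large corpus.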

The landscape is now a word distribution probability vector mesh which may now be roughly visualized like this:

From here we either a) manually list objects for each square or b) have the system generate the squares’ objects for us. The latter is similar to computer generated trees and ground cover in a flight simulator.
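
Option b) can be sketched the way flight simulators tend to do it: seed a random generator from the square's coordinates so the same square always grows the same objects without storing them. The terrain names and object tables below are hypothetical, and picking uniformly from a table is the simplest possible rule; the vector model could weight the choices instead.

```python
import random

# Hypothetical terrain types and object tables; a real game would derive
# these from its own world data or corpus.
OBJECT_TABLES = {
    "forest": ["fir tree", "fern", "fallen log", "mushroom"],
    "dune": ["driftwood", "beach grass", "shell"],
    "creek": ["smooth stone", "reed", "minnow"],
}


def objects_for_square(world_seed: int, x: int, y: int, terrain: str, count: int = 3):
    """Return the same pseudo-random objects for a square on every visit."""
    # A string seed is hashed deterministically, so results repeat across runs.
    rng = random.Random(f"{world_seed}:{x}:{y}")
    return [rng.choice(OBJECT_TABLES[terrain]) for _ in range(count)]
```

Changing world_seed gives a whole new landscape, while any one world stays stable between sessions.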

If we populate the physical space matrix manually we could build a corpus that consists of a code for each square and all the descriptive text within that square. The global vectorization will associate the facets of each location through an analysis of the text. Consider this screen capture depicting relationships GloVe found between cities and zip codes by parsing a large body of raw text:


With more magic we ought to be able to react reasonably intelligently when the player interacts with a tree:

x tree
[tree --> fir tree --> bark --> branches --> fir needles…all the mechanics of a fir tree]

The fir tree stands 30 meters tall with brown, wrinkled bark…

cut tree
[cut --> slice --> gash --> ooze…all the mechanics of a cut fir tree]

The fir tree now has a gash exposing the whitish inner wood and is oozing a sticky, amber pitch.

And so on.

This is roughly as far as my thinking takes me so far. At the outer reaches of my insight I see rooms not so much as descriptions but as a large body of text that describes all facets of that location in bulk pulled from various sources.

This way the player’s input is parsed by parts of speech and run through the vector to see what they’re talking about. From here a supervised ANN and sentence constructor issues a response. When the player types:

I don’t know what to do

The vector will most accurately localize ‘help’ and issue the appropriate response.
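
A crude sketch of that "localize help" step, assuming we keep a few example phrases per intent and pick whichever is closest to what the player said. I'm using plain bag-of-words cosine similarity as a stand-in for the trained vector model, and the intents and example phrases are invented; a real embedding would also catch paraphrases that share no words with the examples.

```python
from collections import Counter
from math import sqrt

# Hypothetical intents with a few example phrasings each.
INTENT_EXAMPLES = {
    "help": ["i don't know what to do", "what should i do now", "i am stuck", "help me"],
    "look": ["look around", "what do i see", "describe this place"],
    "inventory": ["what am i carrying", "show my inventory"],
}


def bag(text):
    """Lowercased bag of words, with curly apostrophes normalised."""
    return Counter(text.lower().replace("\u2019", "'").split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def classify(utterance):
    """Return (intent, score) for the closest example, or (None, 0.0)."""
    query = bag(utterance)
    best_intent, best_score = None, 0.0
    for intent, examples in INTENT_EXAMPLES.items():
        for example in examples:
            score = cosine(query, bag(example))
            if score > best_score:
                best_intent, best_score = intent, score
    return best_intent, best_score
```

With these examples, classify("I don't know what to do") lands on "help", and input that matches nothing comes back as (None, 0.0) so the game can fall back to a default reply.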

Notice we are vectoring the player’s input in real time against the neural vector model.

Output may consist of the most closely matched excerpt of text either from the original (curated) environment corpus (with consideration to scope) or a separate “response” corpus.
I don’t yet know how we define ‘goals’ for the game but an idea might be to train the supervised ANN to experience ‘pleasure’ when it outputs prose that helps the player meet the goal.

So perhaps this is a start. If you think I’ve gone astray with this rough basis I’d love to hear your thoughts.

Addendum: should this idea of laying down an open-ended base framework (like wargames do) take hold we shall call them, “Exploregames.” Yeah, you like that? Pretty clever, aye?

What we’re talking about now is an interactive simulation as opposed to pure interactive fiction.

A statistical model would work well to populate a simulation. Your “Umpire” is the simulator.

I’d like to add two ideas:

  1. Upper Ontology
  2. Story

Upper Ontology

This is the backbone of your information model, whether it’s statistical or symbolic. You need a fundamental set of semantic references on which to build and to exchange information.

Otherwise you can never represent anything because you haven’t defined what you’re talking about.

I’ve always thought the very base ontology has to be created manually. After that it can be added to by automation, learning and pattern processing. In this respect the system is not entirely the product of corpus digests, but also a “kick-started” manual core.

When you ask “Animal, vegetable or mineral?” You’re citing a base ontology, albeit a very small one.

The question is, what’s the minimum size needed to bootstrap.
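
As a sketch of what a hand-made core might look like at its very smallest, here is an "is-a" table built around animal/vegetable/mineral, plus a function for bolting learned concepts onto it later. The concepts and their placement are illustrative assumptions only; a usable upper ontology would need relations beyond is-a.

```python
# Hypothetical hand-written core: each concept maps to its parent ("is-a").
IS_A = {
    "animal": "thing",
    "vegetable": "thing",
    "mineral": "thing",
    "tree": "vegetable",
    "fir tree": "tree",
    "sand": "mineral",
    "guard": "animal",
}


def ancestors(concept):
    """Walk up the is-a chain, nearest parent first."""
    chain = []
    while concept in IS_A:
        concept = IS_A[concept]
        chain.append(concept)
    return chain


def is_a(concept, category):
    """True if `concept` is `category` or descends from it."""
    return concept == category or category in ancestors(concept)


def add_concept(concept, parent):
    """Attach a new (e.g. machine-learned) concept under an existing one."""
    if parent != "thing" and parent not in IS_A:
        raise KeyError(f"unknown parent concept: {parent}")
    if is_a(parent, concept):
        raise ValueError("adding this link would create a cycle")
    IS_A[concept] = parent
```

For example, ancestors("fir tree") gives ["tree", "vegetable", "thing"], and after add_concept("granite", "mineral") the system can answer the "animal, vegetable or mineral?" question for granite without any further hand coding.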

Story & Purpose

This is more of an oblique point and not directly relevant to the functioning of a simulation, but;

Without a story, there is no entertainment, and without entertainment, the system has no purpose.

Let’s say you’ve built an IF system with a truly awesome world model. So rich that the system “knows” exactly what you could do, and how to model the outcome. How every object in the world could be used and manipulated, even to the extent that you could combine objects and they would work together without special programming or additional coding.

It would be great, except for one problem:

There would be no gameplay, or at best a simulation of boring real-life drudgery that would be pointless to play.

I think it’s always a good test to ask, “where does the game come from?” in an IF system.

It would be nice if the system could somehow generate stories, but to me that sounds too much like thinking and if it could do that then we’re better off making killer robots!


Your observations warrant a new thread–I started with a proposed method for importing an Upper Ontology into an IF engine.