Microsoft Bot Framework and Dialogue-based Interactive Fiction

Hello. I would like to share an idea: the Microsoft Bot Framework could be useful for building interactive fiction experiences.

The Bot Framework provides developers with a means of creating dialog systems across channels and devices. Supported channels include, but are not limited to: Microsoft Teams, Direct Line, Web Chat, Skype, Email, Facebook, Slack, Kik, Telegram, Line, GroupMe, Twilio, Alexa Skills, Google Actions, Google Hangouts, WebEx, WhatsApp, Zoom, RingCentral, and Cortana.
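As a rough sketch of what this looks like from a developer's perspective, here is a minimal, channel-agnostic message handler using the Bot Framework's Python SDK (botbuilder-core). `run_turn` is a hypothetical stand-in for an interactive fiction engine, and the usual hosting boilerplate from the SDK samples is omitted.

```python
from botbuilder.core import ActivityHandler, TurnContext

def run_turn(player_input: str) -> str:
    # Hypothetical stand-in for an interactive fiction engine: it would take
    # the player's command and return the next passage of story text.
    return f"You said: {player_input}"

class InteractiveFictionBot(ActivityHandler):
    # The same handler serves every configured channel (Teams, Web Chat,
    # Slack, and so on); the framework normalizes messages into activities.
    async def on_message_activity(self, turn_context: TurnContext):
        story_text = run_turn(turn_context.activity.text)
        await turn_context.send_activity(story_text)
```

The point is that the handler itself never needs to know which channel the player is using; the framework handles that routing.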

Players could play using PCs, mobile devices, and smart speakers, participating in text-based or spoken-language dialog.

We can envision audio-based user experiences with ambient music, sound effects, and conversational user interfaces involving interactions with both narrators and non-player characters. We can also envision user experiences where players engage in natural-language dialogs with player-characters who can report observations and happenings.

To achieve these things, the interactive fiction community might need some development tools atop the Bot Framework, e.g., authoring tools and software libraries.

I am excited about these ideas and I hope that they are interesting to you as well. Do you have any thoughts, comments, questions, or suggestions with respect to new tools for interactive fiction along these lines?


Hi, Adam. Thanks for the interesting suggestion.

In the interest of starting the conversation, I guess my first question is as much for the community as it is for you: how viable would it be to create a Glk API interpreter for the Microsoft Bot Framework? Unless I’m badly misunderstanding something, this would at least provide a healthy baseline for those authoring tools and software libraries, since existing virtual-machine games could be run directly through the Bot Framework.

Introduction

Looking at the Glk specification 0.7.5, it appears that developers could, in the near term, bridge Glk and the Bot Framework for text-based and hypertext-based scenarios. Many Bot Framework channels are text-based or hypertext-based, and, after some software development, existing and upcoming interactive fiction games could work across all of these channels.
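To make that concrete, here is a rough, hypothetical Python sketch of the per-player glue between a Bot Framework endpoint and a Glk interpreter, assuming the interpreter (e.g., glulxe) is built against RemGlk, Andrew Plotkin's Glk library that exchanges JSON over stdin/stdout. The event fields, the `extract_story_text` helper, and the single-line JSON reading are simplifications rather than a working bridge.

```python
import asyncio
import json

def extract_story_text(update: dict) -> str:
    # Placeholder: a real bridge would walk RemGlk's window-content
    # structures and flatten them into plain text for the channel.
    return json.dumps(update)

class RemGlkSession:
    """One interpreter process per player; the interpreter (e.g., glulxe
    built against RemGlk) exchanges JSON events and updates over stdio."""

    def __init__(self, command=("glulxe", "story.ulx")):
        self._command = command
        self._proc = None
        self._generation = 0
        self._window_id = None  # would be taken from the first update

    async def start(self) -> str:
        self._proc = await asyncio.create_subprocess_exec(
            *self._command,
            stdin=asyncio.subprocess.PIPE,
            stdout=asyncio.subprocess.PIPE,
        )
        return await self._read_update()

    async def send_line(self, text: str) -> str:
        # Modeled on RemGlk's line-input event; the exact fields would need
        # checking against a real RemGlk build.
        event = {"type": "line", "gen": self._generation,
                 "window": self._window_id, "value": text}
        self._proc.stdin.write((json.dumps(event) + "\n").encode("utf-8"))
        await self._proc.stdin.drain()
        return await self._read_update()

    async def _read_update(self) -> str:
        # Simplification: RemGlk's JSON output can span multiple lines, so a
        # real bridge would need a proper streaming JSON reader here.
        raw = await self._proc.stdout.readline()
        update = json.loads(raw)
        self._generation = update.get("gen", self._generation)
        return extract_story_text(update)
```

A bot's message handler would then own one such session per player and relay text back and forth.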

In the longer term, developers could use the Bot Framework to enable (a) multi-channel, (b) smart-speaker, and (c) video-calling-based games. By multi-channel games, I mean games where players could pause play on one channel, e.g., a smartphone, and resume play on another, e.g., a PC or a smart speaker – effectively playing across devices. This would be possible with server-side and/or cloud-based software engines. By smart-speaker games, I mean using the Bot Framework to deliver user experiences on Siri, Alexa, Google Assistant, Cortana, et al. By video-calling-based games, I mean using the Bot Framework to deliver audiovisual user experiences across channels like WebRTC, Zoom, WebEx, or Skype.
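To illustrate the pause-here, resume-there idea, here is a hedged sketch using the Python SDK's state classes. One caveat: Bot Framework user state is keyed by channel id as well as user id, so genuinely resuming across channels would additionally require linking the player's identities to one account. `play_turn` is a hypothetical engine hook, and `MemoryStorage` is only for local experimentation.

```python
from botbuilder.core import ActivityHandler, MemoryStorage, TurnContext, UserState

def play_turn(snapshot, player_input):
    # Hypothetical engine hook: restore a serialized story state, advance it
    # one turn, and return (new_snapshot, text_to_show_the_player).
    return snapshot or {}, f"(engine output for: {player_input})"

# MemoryStorage is only for local experimentation; durable cross-device play
# would use one of the storage providers in the botbuilder-azure package.
USER_STATE = UserState(MemoryStorage())

class CrossChannelStoryBot(ActivityHandler):
    def __init__(self):
        # One saved-game slot per user; note the per-channel keying caveat
        # mentioned above.
        self._save_accessor = USER_STATE.create_property("SavedGame")

    async def on_message_activity(self, turn_context: TurnContext):
        saved = await self._save_accessor.get(turn_context, dict)
        snapshot, reply = play_turn(saved.get("snapshot"),
                                    turn_context.activity.text)
        saved["snapshot"] = snapshot
        await self._save_accessor.set(turn_context, saved)
        await turn_context.send_activity(reply)
        await USER_STATE.save_changes(turn_context)
```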

In this regard, exciting possibilities for the future of the Glk specification include enabling speech-recognition and natural-language-understanding scenarios.

Speech Recognition

With respect to speech recognition, there is an existing standard: the Speech Recognition Grammar Specification (SRGS). A speech recognition grammar defines the subset of natural-language utterances that a machine expects at a given moment. Dynamic grammars would be a particularly relevant feature for interactive fiction games: as contexts change, e.g., as players move between rooms, so too could the grammars describing the currently expected natural language. Perhaps Glk (version next) could manage dynamic SRGS grammars – in a platform-independent manner – enabling speech recognition and conversational user interfaces for interactive fiction games.
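To illustrate the dynamic-grammar idea, here is a small Python sketch: a helper that regenerates a minimal SRGS XML grammar (utterances of the form "verb [the] object") from whatever is visible in the current room. How such a grammar would be handed to a given channel's recognizer is left open, and the grammar itself is deliberately tiny.

```python
from xml.sax.saxutils import escape

def room_grammar(objects, verbs):
    """Build a minimal SRGS XML grammar for the player's current room.

    The grammar accepts utterances of the form "<verb> [the] <object>".
    This only sketches the dynamic-grammar idea; a real integration would
    regenerate and reload the grammar whenever the context changes.
    """
    def one_of(words):
        items = "".join(f"<item>{escape(w)}</item>" for w in words)
        return f"<one-of>{items}</one-of>"

    return (
        '<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0" '
        'xml:lang="en-US" root="command">'
        '<rule id="command" scope="public">'
        f"{one_of(verbs)}"
        '<item repeat="0-1">the</item>'
        f"{one_of(objects)}"
        "</rule>"
        "</grammar>"
    )

# Example: regenerate the grammar whenever the player changes rooms.
print(room_grammar(["lamp", "mailbox", "leaflet"],
                   ["take", "open", "read", "examine"]))
```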

Natural-language Understanding

With respect to natural-language understanding, there are also, more recently, services such as Microsoft’s LUIS. Services like LUIS can generalize across paraphrases: utterances like “pick up the lamp”, “get the lamp”, “grab the lamp”, and so forth, can all be resolved to the same intent and entities, so the game receives the same data structure regardless of the phrasing.
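As a hedged sketch of how a game might consume this, the following calls the LUIS v3 prediction REST endpoint with Python's requests library. The endpoint, app id, and key are placeholders for a particular Azure resource, and any intent names (e.g., a "TakeObject" intent) would be whatever that LUIS app defines.

```python
import requests

# Placeholders for a particular Azure LUIS resource and app.
ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com"
APP_ID = "<app-id>"
KEY = "<prediction-key>"

def interpret(utterance: str) -> dict:
    # URL shape per the LUIS v3 prediction REST API.
    url = f"{ENDPOINT}/luis/prediction/v3.0/apps/{APP_ID}/slots/production/predict"
    response = requests.get(url, params={"subscription-key": KEY,
                                         "query": utterance})
    response.raise_for_status()
    prediction = response.json()["prediction"]
    # Paraphrases such as "get the lamp" and "pick up the lamp" should all
    # surface the same top intent (e.g., "TakeObject") and entities.
    return {"intent": prediction["topIntent"],
            "entities": prediction["entities"]}
```

The game engine would then dispatch on the returned intent rather than on the literal wording of the command.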

Conclusion

It appears that (1) developers can bridge Glk and the Bot Framework in the near term, delivering games to text-based and hypertext-based channels, and that (2) speech-recognition (SRGS) and natural-language-understanding (LUIS) capabilities are on the horizon, pointing towards (a) multi-channel, (b) smart-speaker, and (c) video-calling-based games.


On these topics, I also recently found state-of-the-art speech-to-text and text-to-speech software that I would like to recommend exploring. It is called Coqui and it is available on GitHub.

Instead of grammar-based recognition hints (SRGS), Coqui’s speech-recognition components appear to use hot-words: specific keywords can be marked as contextually more or less likely to occur. Hot-word hints for words that are more likely to occur could, for example, map to the visible things in a room and to the relevant verbs or actions.
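If I am reading the Coqui STT Python API correctly (the package installs as `stt`), a room-sensitive use of hot-words might look roughly like this. The model and scorer filenames are placeholders for Coqui's released English models, and the boost value is an arbitrary example that would need tuning.

```python
import numpy as np
from stt import Model  # Coqui STT Python bindings ("pip install stt")

# Placeholder filenames for Coqui's released English acoustic model and scorer.
model = Model("english.tflite")
model.enableExternalScorer("english.scorer")

def set_room_context(visible_objects, verbs, boost=7.5):
    # Re-bias recognition as the player moves between rooms: hot-words make
    # the named keywords more likely to be recognized.
    model.clearHotWords()
    for word in list(visible_objects) + list(verbs):
        model.addHotWord(word, boost)

set_room_context(["lamp", "mailbox", "leaflet"], ["take", "open", "read"])

# Audio must be 16 kHz, 16-bit mono PCM; silence here just shows the call shape.
audio = np.zeros(16000, dtype=np.int16)
print(model.stt(audio))
```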

Coqui’s speech-synthesis components are similarly impressive, providing fine-tunable, prosodic, affective, and expressive speech synthesis. With these components, developers can also clone the voices of human voice actors.
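On the synthesis side, a hedged sketch with the Coqui TTS Python package (installs as `TTS`) might look like the following. The model names are examples of Coqui's released models, and `actor_sample.wav` is a placeholder for a short, consented recording of a voice actor.

```python
from TTS.api import TTS  # Coqui TTS ("pip install TTS")

# Example released model for plain narration; others can be listed with
# TTS.list_models().
narrator = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")
narrator.tts_to_file(text="West of House. You are standing in an open field.",
                     file_path="narration.wav")

# Voice cloning with the multilingual YourTTS model; actor_sample.wav is a
# placeholder reference recording.
cloner = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts")
cloner.tts_to_file(text="You hear footsteps behind you.",
                   speaker_wav="actor_sample.wav",
                   language="en",
                   file_path="npc_line.wav")
```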

In the near future, it should be possible to get state-of-the-art speech-recognition and speech-synthesis components, e.g., Coqui, interoperating with both interactive fiction engines and multi-channel frameworks like the Bot Framework towards building (a) multi-channel, (b) smart-speaker, and (c) video-calling-based games.


Truly inspiring techniques. This could really up the game for interactive fiction.

Thank you for sharing this!

Thanks for this. I was hoping to try some TTS, but there are no pre-made models that I could find. I am hoping to hear how natural it sounds.

Looks like I’m going to have to install the code and get it working, unless there’s a demo somewhere that lets you enter some sample text.