AI voice cloning in games

Vocaloid and similar software take vocal fragments recorded from real people and synthesize them into speech or song. You can tune these voice banks to sound like the original person again, or make them do something that people can’t truly do.

The most famous Vocaloid is Hatsune Miku, and she’s voiced by Fujita Saki. The voice actor is quite fond of the character she lent her voice to, and the music that came out has been very creative. As an aside, many music producers in Japan today came from the Vocaloid space.

The more relevant software would be Voiceroid, which approximates spoken Japanese quite well, and CeVIO Studio AI, where voice banks can be used for English-language speech. But many people who use these programs treat them as a stylistic TTS, adding flourishes no human being would ever use (everyone has more or less agreed that Zundamon should always end their speech with -nanoda).

Now, there are several games that use this software:

The former is a Touhou fangame, a community that has traditionally been using a TTS. The latter is a bunch of characters from the Voiceroid family. These have not been controversial as far as I know.


I say all this to say that the technology and the intention of Vocaloids and friends are slightly different. Vocaloid and Voiceroid users are often tuning their voicebanks to do their let’s plays of Minecraft or make songs.

From my understanding, vocal cloning as it is used in entertainment industries is just copying the voice of a celebrity as-is.

This has different implications: if Chuck Norris were a voicebank established in the same way as Vocaloids were, his voice would be used as the vocals of a song or a sketch comedy. In other words, as part of original content by the author. The cloning stuff as it stands now may be closer to deepfakes, creating things he could say.

There is no reason to believe that the practitioners of vocal cloning technology are going to be malicious forever, of course. I can see a future where the companies create good contracts with actors for vocal cloning, and the users are still responsible for their actions, the same way users of Vocaloid and similar software are.


As for whether it’s useful for interactive fiction, I think we have so many varying opinions on audiovisual elements in IF in the first place that even if the “AI stigma” fades away, people will still feel like they need to grumble about it.

I’m also not sure that adding voice acting would be a bonus for a lot of IF games (imagine Eat Me voice acted). A game that’s going to have voice acting should consider how voice acting will elevate the script.

This feels like an extension of the much older, more fundamental question of “is it okay to get something for free from a friend instead of paying a professional to do it?” To which those working professionals will frequently say “no, this devalues our craft/industry!” (or at least they do in photography circles, where I dabble). But I’m not sure if bringing this strictly commercial angle into the IF sphere is a can of worms we really want to be opening.

So, for some context: I’ve spent a decent chunk of my adult life in proximity to community theater. I’m not an actor (I like to help out backstage when I have the time and energy), but my wife is and our social circle is full of other theater people.

Acting is a funny art form in that it’s really hard to do by yourself. I’m not talking about taking classes and learning, but about actually participating in the art itself - typically you need to audition and get cast before you can even start! And because of that, every community theater group I know is absolutely slammed come audition time, because people really love the art of acting and want to get a chance to do it. These aren’t paid opportunities either! People are lining up around the block for a chance to give up their free time for the next several months (some of them with a long commute) just for the love of it.

This is partially a result of knowing the right people, but in my social circle I know plenty of very talented actors who would jump at the chance to record lines for a project like this, because it would mean doing something they love. I’d be sad if small projects like this turned to AI because that’s one less venue for people like my friends to enjoy their hobby.

(There has been a lot of discussion the last few years about fairly paying artists in the face of AI but it’s mostly been about visual art and music. That doesn’t map 1:1 onto acting, especially for free projects. People should be paid fairly for their labor but the existence of community theater makes what people consider “fair” into a different beast.)

Just a few months ago, in July, SAG-AFTRA approved a new Interactive Media Agreement, thus ending a strike by videogame voice and motion-capture actors. Unsurprisingly, AI was the central issue in that negotiation.

The upshot is that AI voice work is part of the industry now, and it needs to be used in a way that respects the performer.

I haven’t read the entire agreement, but summaries (here, here) say:

Objectively Identifiable: Digital Replicas must be “objectively identifiable” as the performer, including in the role of any in-game characters.
Usage Reporting: Further, in addition to other requirements, covered game producers will need to provide usage reports detailing how the Digital Replicas have been used in their game.
Consent: Performers must consent to the use of their Digital Replica, and producers must provide a “reasonably specific description” of how they intend to use the Digital Replicas.

Also, replicas used for “real-time generation” get a higher payment scale.

I like to think I contributed in a small way to the terms. A couple of years ago I was part of an informal meetup of game developers and SAG-AFTRA people, and I got to describe the specific ways in which AI voice generation could benefit IF and narrative games.

I focused on the idea of line variations. It’s completely normal in IF to write code like

say "You [one of]say[or]recite[or]invoke[or]speak[at random] [the noun][one of]. Nothing happens[or], but nothing results[or]. There is no effect[or], but there is no effect[at random]."

(A line from Hadean Lands. Note that “the noun” is the name of a magic formula, like “the Chi Binding” or “the symmetric sequence”)

This is, like, table stakes for text games. It’s two minutes of effort and provides a pleasant bit of variation for a failure message that might otherwise get boring. (The player will see this one a lot.)

But of course in a voice-acted game, you’ve just multiplied your budget by eight, and that’s ignoring the “noun” substitution – which is effectively impossible; the game has about 25 formulas.
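To put rough numbers on that, here is a minimal Python sketch that enumerates what the quoted template can produce. It is only an illustration, not how the game works internally: the two formula names are the ones mentioned above, the count of 25 is the post’s “about 25”, and reading “eight” as four verb fragments plus four ending fragments is an assumption on my part.

```python
from itertools import product

# The four verb choices and four endings from the quoted Hadean Lands line.
VERBS = ["say", "recite", "invoke", "speak"]
ENDINGS = [
    ". Nothing happens.",
    ", but nothing results.",
    ". There is no effect.",
    ", but there is no effect.",
]

# Only two formula names appear in the post; the total is given as "about 25".
FORMULAS = ["the Chi Binding", "the symmetric sequence"]
ASSUMED_FORMULA_COUNT = 25  # assumption taken from "about 25 formulas"


def all_lines(verbs, formulas, endings):
    """Return every full sentence the template can produce."""
    return [
        f"You {verb} {formula}{ending}"
        for verb, formula, ending in product(verbs, formulas, endings)
    ]


lines = all_lines(VERBS, FORMULAS, ENDINGS)

# Separate fragments to record if verbs and endings are spliced together.
fragments = len(VERBS) + len(ENDINGS)
# Full sentences per formula if every combination is read as one take.
per_formula = len(VERBS) * len(ENDINGS)
# Full sentences across the whole (assumed) formula list.
full_game = per_formula * ASSUMED_FORMULA_COUNT

print(fragments, per_formula, full_game)
```

Under those assumptions, one throwaway failure message turns into hundreds of distinct full-sentence readings once the formula name is included.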

Anybody working on a voice-acted game just knows that they have to work around this stuff. It’s not limited to IF games either. You can imagine playing a stealth action game like Dishonored where the guards have barks like “Suspect spotted in [street name], wearing [clothing article]. [Count] guard[s] down.” Or “Hey, [he’s / she’s] up there! On top of the [phone booth / streetcar / telephone pole]! With the [weapon]!”
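As a rough sketch of what filling such a bark looks like in code, here is the first example as a Python function. The street names, clothing items, and guard limit are hypothetical placeholders (the post names the slots, not their contents), and this is not how any particular game implements its barks.

```python
from itertools import product

# Hypothetical slot values, invented purely for illustration.
STREETS = ["Mill Street", "Harbor Row"]
CLOTHING = ["a grey coat", "a red scarf"]
MAX_GUARDS_DOWN = 3


def guard_bark(street: str, clothing: str, guards_down: int) -> str:
    """Fill the first example bark, handling the "guard[s]" plural."""
    plural = "" if guards_down == 1 else "s"
    return (
        f"Suspect spotted in {street}, wearing {clothing}. "
        f"{guards_down} guard{plural} down."
    )


# Every distinct sentence that would need its own recording in advance.
all_barks = [
    guard_bark(street, clothing, count)
    for street, clothing, count in product(
        STREETS, CLOTHING, range(1, MAX_GUARDS_DOWN + 1)
    )
]

print(len(all_barks))
print(guard_bark("Mill Street", "a grey coat", 1))
```

Even with only two or three values per slot, the number of distinct sentences is the product of the slot sizes, which is why pre-recording every combination gets expensive so quickly.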

This is absolutely within the range of AI voice tech, and it should be covered by the SAG-AFTRA agreement as well. (Under “real-time generation”.)

The thing about this kind of variation is that you could start with a real line reading. You’re adjusting word choice, not inventing dialogue wholesale. I think there’s a lot of potential for even this limited kind of text generator.

AI voice cloning isn’t inherently unethical, especially when obtained with clear and informed consent. Of course, it’s worth recognizing that voice isn’t just another asset. It’s personal identity. If you do this, I would just say make sure your friends know exactly what their voice may say or do, and that they can withdraw consent later. Think of it less like using spell check and more like borrowing their face for a character. Voice cloning isn’t new ethically. It’s just newly democratized.

Yep. Computers calculating pi to a trillion digits didn’t put a legion of mathematicians out of work. Similarly, AI voice “acting” for IFs with high textual variability doesn’t replace professionals – it opens up the possibility of a job that no one would ever get humans to do.

But I agree with those who’ve suggested that the outcome, at present, would probably not have a long shelf life. Once the novelty of being able to do it at all wears off, the quality is likely to be pretty slop-level.

Obviously the people doing computations by hand (the original meaning of “computer”) were put out of work. Pretty much every time computers become better than humans at something, a profession is gone, with the notable exceptions of professional chess players and stock traders.

AI voices are already making it tougher for actors to find audiobook work. Even with better contracts, it only means a few famous actors get paid for their AI clones, while the lesser known ones are still out of a job.

This doesn’t necessarily mean that it is wrong to have an AI voice your interactive fiction, if you can’t do it yourself, can’t afford to hire somebody, and you’re sure it will make your game better.

Not the work I mentioned, of calculating pi to a trillion digits – that was a “job” that was never feasible on a human scale, any more than the Panama Canal or Three Gorges Dam would ever have gotten built by mass manual labor. Automation advances destroy some existing jobs, and they also open up possibilities that would never have been in reach without machine assistance.

I recognize that using those kinds of examples in the context of IF is a little grandiloquent…but the reality is, as zarf pointed out, a text IF with commonplace levels of text variation could simply never be voice acted by humans at a cost-effective level. If AI gets better at voice generation (getting beyond the kind of uncanny valley slop I see in most AI-generated online ads/video) having the whole thing voice-acted will become possible for the first time ever.

It’s not unreasonable to want to boycott all AI in the name of protecting creative jobs. But it’s worth being clear that if we did that successfully, we’d also be foreclosing the possibility of some work that’s beyond human capacity. The answer to someone like OP (assuming their games have a textual variation level closer to Hadean Lands than e.g. a Telltale game or average VN) could never actually be “hire humans to do the whole thing.” It would (without AI) either be “radically simplify your dialogue to make this cost-effective” or “sorry, that’s an unrealistic goal.”

Plus there’s a big difference between a billion-dollar studio that can afford a few million for voice acting but decides it would rather pocket those millions, and a solitary writer making an IF in their spare time with zero budget and no idea where to find volunteer talent. I would assume IF devs are mostly closer to the latter.

Admittedly, it would be great if it were easier for those writing stuff for the fun of it and those doing narration/voice work/stage performance for the fun of it to find one another. But I think the collaboration coordination problem is mostly separate from the AI unemployment problem. When creatives use AI for things outside their wheelhouse, for which they lack a collaborator and have no budget to hire anyone, that doesn’t really contribute to either problem.

I agree, but I also really think it’s worth sticking to zarf’s point here. The difference between big studio and indie dev, or paid and volunteer voice work, is basically irrelevant for many of our games – because the combinatorial complexity of those games’ text makes it realistically impossible for anyone (rich/poor, contractor/volunteer) to do a human-read audio version of them in the first place.

I write in ChoiceScript, not parser games, but that doesn’t make VO as simple as reading out a bunch of text blocks. My first game was hundreds of thousands of words long, and had a lot of in-line variation throughout, based on things like the character’s social class, gender, relationship with the other characters, and past actions.

My sequel-in-progress doubles down (God help me) on both the length and the complexity, with lots of variation based on who’s accompanying you in a given scene. (Including plenty of cases where whoever happens to be there delivers the same line – which is a little more efficient in terms of text, but might easily quintuple the voice acting demand for the scene.)

Those games can be screen-read by machine, without (at present) much in the way of feeling or nuance or vocal variation. To get them narrated by humans (without dumping the variety) would take way more hours than anyone could ever justify spending on voice acting. Even if you just focused on the dialogue, and cut features like the reader’s ability to input a name of their choice.

Obviously not all text-based IF has that kind of variability…but a whole lot of it crosses the complexity threshold where human voice acting would be prohibitively time-expensive (especially if you had to justify the cost against the expected income from your audience…but honestly, I don’t think we’re in remotely reasonable labor-of-love volunteer territory for much of it either).

And that’s relevant to @jkj_yuio’s original question. If we take AI voice generation off the table, we’re not (in these cases) providing work for human voice actors; we’re just enforcing the practical impossibility of a voiced version.

I’m just not sure how much any of this matters when the current state of the technology is that it’s pretty mediocre, at least at a level that a hobbyist has access to. It’s not adding anything but novelty value, and that will rapidly be lost if more people do it.

We can already see this with AI art—it’s not leveling the playing field with games that can afford (whether with time + skill or money) handcrafted art, it’s just creating a bunch of games with kind of mediocre art where a lot of players are like “I would enjoy this more if the mediocre and kind of ill-fitting illustrations weren’t there.”
