So this is a bit of an AI question. Sorry, I know a lot of people don’t want to talk about it, but here goes.
Recently, I’ve been thinking of adding voices to my games. These days, AI voices work by “cloning” people’s voices. Surprisingly, several of my friends have enthusiastically offered to let me clone their voices expressly for this purpose.
Now, this is interesting because I can make whatever dialogue I want (and edit it). Also, as you can imagine, AI voice is somewhat limited outside of the spoken word. Say I need a blood-curdling scream. Well, I can just strangle the original source and record it! The same goes for laughing, coughing, all that stuff.
You might be wondering why I don’t just get them to sit down and read the lines. Well, the answer is (1) they haven’t got the time and (2) dialogue changes. It’s mostly because of (2) that I’m interested in this idea.
What do you think about this? Is this a massive AI overreach, or is it like using grammar and spell check, which is OK?
Personally, I can’t see the ethical objection here. Or what am I missing?
If you’re trying to gauge potential reactions to the use of AI-generated audio, my assumption would be that most people will see it as similar to AI art from an ethical point of view. I.e., you could hire a human voice actor to do this, so choosing to get your speech recordings from genAI instead is buying into the same paying-big-tech-rather-than-creative-individuals problem that is one of the significant objections to other forms of genAI use.
The fact that you’re basing the generated audio on the voice of a willing participant is probably irrelevant, because I can’t imagine that any of the speech generation models being used to do this would work properly without being trained on a huge corpus of existing audio recordings, most of which was probably produced by people who didn’t intend for their voices to be used in this way.
Of course, you’re at liberty to do what you want, but I’d expect most people to see this as much more like art generation than translation or spellcheck.
Agreed - it’s fundamentally a scope and scheduling issue (something we’re all familiar with). Every project has the tradeoff between cool stuff, time, and logistics, and doing VO the old-fashioned way requires you to lock your script earlier than, say, the night before your deadline. One of the nice things about working in digital text is that you can leave it that late at all (and in fact my own track record of getting my IF projects done much before then is poor), but it’s something I know I’d have to sacrifice if I ever moved into multimedia or physical media. This is, at its heart, a project management concern.
The question of whether AI voice cloning is ethical is a separate one. My first question is: do your friends have acting experience (voice-over or otherwise)? Acting is in and of itself an art, one to which the actor brings at least as much as the director does. In addition to the ethical concerns @jwalrus brought up above re: training, you’re also cutting an artist out of the equation and replacing them with a computer, which probably won’t sit well with an audience that doesn’t like that happening for text. And frankly, even if your friends gave consent, if they’re not actors themselves I wouldn’t expect that to have much sway over audience opinion.
(And actors love to act, so if you can build time into your schedule, getting good voiceovers might be easier than you expect.)
Do you feel like your friends will be okay with you having their voice to use however you like, forever? Will they get approval over the lines you publish with their voice? Credit?
While voice cloning gives you certain freedoms, it doesn’t come for free.
There are a lot of good opinions here already. Thanks, everyone, for your views on this.
Yeah, I think there are many ways to look at this, and I do see the “big tech” vs. artist argument. However, I keep coming back to thinking that most of this is a small-time exercise that isn’t cheating anyone out of their dues, because if it wasn’t being done on the super-cheap, it wouldn’t really happen at all.
Adding voice is certainly a new idea for me, although I did do a game some years ago with non-AI voice. It was rather hit and miss, to say the least. We’re finally seeing some (almost) decent TTS.
The point about the original speaker being OK with the lines is mostly covered; in this case, they will probably be beta testers anyway. But it’s a good point to raise in general, because I would feel uneasy publishing lines that they might not approve of, or had not even seen.
It’s certainly the case that my source voices are not voice actors. But an important aspect is that, while people are willing to let me use their voices, they are not offering to take a day off work to sit down and record lines. So the chance of having them act the lines themselves is slim at best.
I’m on the fence, personally. My take has traditionally been “focus on what it’s doing, not how it’s doing it”, because legislating about specific technologies is hard, and now it’s also hard to find a translator or spellchecker without the latest fad stapled onto the back. And in terms of what it’s doing, well—text-to-speech has been accepted in the IF community for ages. More of us use screen readers than in most other gaming communities, because we’re so focused on text over visuals.
So, is voice cloning fundamentally different from text-to-speech or vocaloid? If it’s done with the actors’ consent, I’m inclined to say no—turning text into speech is one of those things we’re long accustomed to using technology for, like spellchecking and translating words and short phrases. This is just a new way to do that.
But then, will the community here like it? That’s a different question, and I expect the answer will be “no”. The common sentiment here is very anti-AI and has been for a while.
I largely agree with Adam, Encorm, and Daniel. That said, I do appreciate that you care enough to ask what people think first. For whatever that is worth.
Unless I’ve severely misunderstood something, vocaloid doesn’t use human vocal samples, and is a purely synthesized approximation of a voice from first principles?
I think, if we set aside the general ethical concerns about AI and assume you’re using a model that hasn’t been trained on stolen data, then it really depends on whether they are granting you the right to use their voice for just one project or forever, and for which kind of text (I guess they’d like to know if it’s used for, e.g., something NSFW), and whether everything is clear to them.
It seems more “OK” for small “indie” projects especially, and less so if you’re a big corporation. I believe it’s the “forever” part and the like that really angered professional voice actors?
(I guess that line of thinking could apply to AI art or text or code, now that I think about it. But personally I think there are too many ethical and societal issues anyway, be it for art or voice or whatever.)
I think so—I’m trying to say, if we focus on the purpose of the system instead of its implementation (which is my general approach to AI stuff here), TTS, vocaloid, and voice cloning all take text and turn it into audio.
I’ve had good experiences hiring amateur voice actors from CastingCallClub for non-IF projects.
It’s a bit tilted toward fan dubs, but if you offer a decent rate you’ll probably get plenty of people auditioning. This is more expensive than buying AI voice credits, but probably just a few times more.
When I used it there was no subscription fee for the site.
(1) they haven’t got the time and (2) dialogue changes. It’s mostly because of (2) that I’m interested in this idea.
Unlike with recruiting friends, amateur voice actors will usually read several takes of each line. They typically include a few retakes in their rate, so you can request small changes at any time. Then you edit it all together.
In fact, it might take more time for you to edit these takes together than it will for them to read the material. That was the real time sink for me.
Getting away from “is this ethically permissible” for a moment, I’d like to question what this adds, artistically. AI voice acting is pretty blah; like AI anything else, it’s never going to give you a line read that’s interesting or surprising. You may be excited by the novelty of having your lines read aloud, you may have noticed that “fully voice acted!” is a selling point for games outside the IF space, but what artistic horizons are the world’s most predictable line reads really opening up for you? Is your audience actually going to enjoy this?
While I think I get you, I don’t see these as equivalent at all. My screen reader is not just a handy utility; it’s my entire access to the computer. Without it, there would be no games for me to play. But as crucial as it is for my access to others’ art, it’s not the art itself. I’ve just started playing Disco Elysium, and it would be a far emptier experience to have a synthetic voice read out all of that wonderfully acted dialog. Of course, generative TTS is of far higher quality than a vocal tract model designed for embedded systems from the 90s, but as pointed out in the OP, human acting is better and more versatile than generative TTS.
I hope I’ve not come off as prickly; I’ve lurked in the many AI threads on here but haven’t participated, for various reasons. It’s just that, as an amateur voice actor, an author, and a screen reader user, I’m a little miffed by this comparison. I for one would love to offer my skills to anyone’s project.
Oh, I don’t at all mean to say screen readers are equivalent to AI-generated voice acting! And I’m sorry if it came off that way.
I’m trying to say that the IF community has historically been more interested in text-to-speech technology than in things like LLMs, specifically because it’s so useful for accessibility. So I predict AI-generated voice acting will be received more like AI-powered translators and spell checkers (where the consensus seems to be “it was better without LLMs, but using it shouldn’t disqualify people from comps”) than AI-powered text and image generation (where the consensus seems to be “we don’t want that anywhere near our community, and banning it will make everything better”). Text-to-speech is fundamentally something we’re familiar with, even if this is a drastically different usage of it, while LLMs for text generation have no real precedent. I could have my old espeak program do voice acting for me if I wanted, even if it wouldn’t be any good at it.
But I could be totally wrong about that. Maybe the community as a whole will love this, maybe they’ll hate it.
Could the voices be optionally turned off in the game? I’ve played an IF game in the past where voices read through the multi-character dialogue. Slowly. Way slower than I wanted to read it. I was not happy. It felt like the audio version of timed text. So yes, I’d want this to be an option I could turn off if it was implemented.
If it’s a choice between hiring actual voice actors and paying one of the big tech companies for cutting-edge speech generation, then I would prefer real voice acting.
If it’s a choice between a text-only presentation that requires me to use my screen reader and a handful of volunteers feeding their voices into one of the free speech generation models, I’ll take the AI voice.
I’m not sure the IF community’s historical interest in and acceptance of TTS technology necessarily transfers onto something like ElevenLabs, even setting aside the technical differences between the two. Admittedly I’m not active in any IF communities at present, and I definitely haven’t played even a fraction of the IF out there, but I don’t recall ever having played a game where the TTS was some kind of independent experience of its own, the way reading with one’s eyes or ears can be very different experiences. It’s a really convenient feature when it’s included, often not as author-designed self-voicing tech, but as adapted output intended for third-party software like a screen reader or a terp talking through an OS speech driver. Which, personally, is the way I like it, if we’re just talking about reading on-screen text. Others might have different preferences, or use TTS in different contexts. I’d absolutely be down to see more experimentation with TTS in games, especially given how easy it is to do in the browser. I hope that acceptance you mention encourages and drives more authors to be creative with their TTS.
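(Just to illustrate how low that barrier is, here’s a minimal sketch using the browser’s built-in Web Speech API; the function name and the sample line are only for the example, not from any particular game or library.)

```ts
// Minimal sketch: have the browser self-voice a line of game text.
// Assumes a browser that implements the Web Speech API (window.speechSynthesis).
function speakLine(text: string, rate = 1.0): void {
  const utterance = new SpeechSynthesisUtterance(text);
  utterance.rate = rate;           // let players speed it up or slow it down
  window.speechSynthesis.cancel(); // stop any line that's still playing
  window.speechSynthesis.speak(utterance);
}

speakLine("You are standing in an open field west of a white house.");
```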
I just don’t think this is it, though. I guess there’s a conversation to be had about the intersection between voice acting as an art and spoken dialog as an accessibility solution, but I can’t forget that we’re talking about art. It feels like a category mistake to call a human a text-to-speech engine, even if technically any audiobook narrator, or voice actor, or parent reading a bedtime story is converting text into speech. If voice acting and TTS are different things, I would not use TTS for the voice acting in my game. Some do as a deliberate artistic choice, and maybe the generative equivalent of that will be received well, but I think this is closer to LLM prose and AI-generated art than it is to folks dropping links to professional VAs and asking how to indicate bold text to a screen reader.
Somewhat off topic, and maybe a bit beyond scope, but even something like Vocaloid (mentioned above) is more like a musical instrument than those things that are like Instagram filters for your voice. You have to compose for it, lyrically and melodically, using an understanding of music gained through experience with music. I haven’t used it in any of my compositions yet, but I’ve hung out with artists who do, and it doesn’t look like a very automated process at all.