Hacker News
by minimaxir 996 days ago
This article only covers the musical aspects of AI voice cloning, but there's another dynamic to AI voice cloning that's more complicated: replacing general voice actors in movies/video games/anime (example: https://www.axios.com/2023/07/24/ai-voice-actors-victoria-at... )

Unlike musicians who can't be replaced without significant postprocessing, have enough money to not be impacted by competition, and have legal muscle, voice over artists:

- Can be reproduced with good-enough results from out-of-the-box voice cloning settings on ElevenLabs or an open source equivalent (Bark, VALL-E X)

- Are already underpaid for their work as-is

- Have no legal ownership of their voice since they are contractors, and their voicework is owned by their clients, who may not be as incentivised to protect the VO.

I want to write a blog post about it but I suspect most people on Hacker News won't be interested in a treatise on the cultural impacts of the voicework in Persona 5 and Genshin Impact.

12 comments

What I find interesting is that eventually these companies will hire some college kids who need a couple thousand bucks and a free pizza. Have them read the right scripts. Sign the right 'give everything away' contract and just forever use their voice. Or do it sneakily. Have a voice assistant and in your ToS 'we can use a copy of your voice for anything'.

The existing voice actors will be just out of work. There will be a small cadre of groups that want real voice. But for some projects that will not be that important.

It's going to get crazy.

They don't need that - they already have enough data to generate plausibly human voices that don't sound like anyone in particular.

Voice cloning is a special case, these models are equally good at making new voices.

I’ve found it’s not actually as easy to get this stuff to sound different to the specific someone it’s trained on.
Don't expect that to last more than a year or two, assuming it's even still a problem for the best voice-generation AIs. Generating high-quality samples at all is the hard problem; generating specific high-quality samples is, by comparison, a lot easier.

Remember when Stable Diffusion was released a year ago and one of the big artist copes was "sure, it can generate random images, but it'll never be able to generate the same character repeatedly!" They were already wrong because Textual Inversion and DreamBooth were already published, and soon enough, ported to SD and now people could dump out thousands of images of the same character in the same consistent style etc (and did).

The issue is more that I can’t get the equivalent of a slider control to adjust one or more properties of the voice from the AI in real time. Take a vocal fry slider, to use an example of something most people can do deliberately when they want to. The currently available models are pre-trained to sound like the average/median of one specific person (or character). I imagine tools will improve to control and customise the training of the models, but I don’t see a clear path from the current model architecture to one where I can freely control the stylistic expression of the vocal output without loading in a completely different set of model data trained for that new desired output.
No, that's easy. We had the equivalent of that in GANs many years ago. If you've never seen GAN editing, here's a quick video: https://www.youtube.com/watch?v=Z1-3JKDh0nI (Background: https://gwern.net/face#reversing-stylegan-to-control-modify-... ) You just classify the latents and then you can edit it. These days, with pretrained models like CLIP, you don't necessarily even need a latent space: you can take a model which has been trained on sound/text descriptions, like AudioCLIP, prompt it with a text like "vocal fry", and then the generated samples are subtly skewed to try to maximize similarity with "vocal fry". You put a slider on that for how much weight/skewing it does, and now you have a slider control to adjust properties of the voice from the AI. If something like this doesn't exist, it's obvious how to do it. (Even the realtime problem is being solved by figuring out how to train diffusion models to do a GAN-like single pass: https://arxiv.org/abs/2309.06380 )
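To make the "classify the latents, then put a slider on it" idea concrete, here is a rough sketch in plain Python. It assumes you already have latent vectors labelled as having or lacking some attribute (say, vocal fry), and it swaps a trained classifier for the simplest possible stand-in: the difference between the two group means. The function names and toy numbers are made up for illustration and are not any real model's API.

```python
import math

def attribute_direction(with_attr, without_attr):
    """Unit vector pointing from the 'without' latents toward the 'with' latents.

    Difference of class means: a crude stand-in for a trained latent classifier.
    """
    dim = len(with_attr[0])
    mean_with = [sum(v[i] for v in with_attr) / len(with_attr) for i in range(dim)]
    mean_without = [sum(v[i] for v in without_attr) / len(without_attr) for i in range(dim)]
    diff = [a - b for a, b in zip(mean_with, mean_without)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0:
        raise ValueError("both groups have the same mean; no direction to edit along")
    return [d / norm for d in diff]

def apply_slider(latent, direction, amount):
    """Move a latent along the attribute direction; 'amount' is the slider value."""
    return [x + amount * d for x, d in zip(latent, direction)]

def attribute_score(latent, direction):
    """Projection of a latent onto the attribute direction (higher = more of it)."""
    return sum(x * d for x, d in zip(latent, direction))

# Toy 3-dimensional latents (hypothetical data, not from a real voice model).
fry_latents = [[1.0, 0.2, 0.0], [0.8, -0.1, 0.1]]
clean_latents = [[-1.0, 0.1, 0.0], [-0.9, 0.0, -0.1]]
direction = attribute_direction(fry_latents, clean_latents)

voice = [0.0, 0.5, 0.5]
for amount in (0.0, 0.5, 1.0):
    edited = apply_slider(voice, direction, amount)
    print(amount, round(attribute_score(edited, direction), 3))
```

In a real system the edited latent would be fed back through the generator to synthesize audio. The text-prompt variant described above would instead replace `attribute_score` with a similarity score from a model trained on sound/text pairs and weight that score by the slider value.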
I have said this will initially be sold as a feature on things like Audiobooks.

Pick your book, pick your reader and away it goes. The Diary of Anne Frank read by Gilbert Gottfried.

Not sure if your hypothetical was meant to be a reference to the absolutely hilarious classic “Gilbert Gottfried reads 50 Shades of Gray”, but it has me wondering how much of the inherent comedy comes from “the voice” and how much comes from the idea that the man himself sat down and recorded those lines.

https://youtu.be/XkLqAlIETkA (Extremely NSFW without headphones)

> wondering how much of the inherent comedy comes from “the voice” and how much comes from the idea that the man himself sat down and recorded those lines

For me it came from the voice; I hadn't heard of Gilbert Gottfried as a specific person until I read this discussion. The reaction faces of the women listeners were also amusing.

I still like getting surprised when a new or unorthodox narrator knocks it out of the park but I’d really enjoy a “salvage this purchase” exit hatch with an AI voice alternative. I’d even pay a buck or two on top of an existing purchase to automatically fix a bad narration.

Head over to Audible reviews, some books are widely considered to be great books as written but the audiobook is reviewed as one to be avoided because it was recorded poorly, the narrator paced it wrong, they had an annoying voice, they couldn’t do a voice of the opposite gender, whatever.

Plus it seems like a great accessibility feature. Many books are recorded for the vision impaired community by volunteers and that’s admirable, but some of the AI today does a much better job.

These are some very fair points. There was one book 'Electron Fire' all about the creation of the transistor, I think. I say that because never have I heard a more unenthused narrator. Makes Henry Kissinger sound like a dramatic actor.

Any AI voice could save that one. Any of them! Heck, the original voice on the 1984 Macintosh could do better.

Recent voice models by OpenAI, Meta, and ElevenLabs all state upfront they work with paid professional voice actors, so this space will get interesting fast.
Mozilla has a voice data project where people already do it for free(dom) ;)

https://commonvoice.mozilla.org/en

HN isn't the only community to write for. While most people here seem to be unsympathetic to such job concerns, unconventional articles do hit the front page from time to time.

I'd like to read it, in any case.

The get-rich-at-any-cost types like to post on these articles at a higher rate, I think. When you read a larger and broader range of HN posts, you see that a substantial part of the population here has concerns about this.
+1, I would also like to read it
I would as well. It isn't that I'm unsympathetic, it is just that we haven't outlawed technology that put others out of work, and I'm curious why we would decide as a society this time should be different. If there are good reasons I want to know.
Putting people out of work is one thing, that's bad enough and societies should take care to guide change and support those affected.

The danger behind AI and other manipulative technology is that it erodes trust. We already have serious issues with trust in media, and not just the obvious cases of Russian/Chinese propaganda, but also stuff like kids getting anorexia from extremely photoshopped advertising.

Add AI on top and no one can be certain about anything anymore. Say someone distributes a fake "recording" of the US President calling for glassing Moscow, or the Serbian President declaring war on Kosovo? That has the potential to actually cost lives on a massive scale.

Yeah, all that is bad, but those consequences are already here aren't they? Restricting further research just means it will be done in clandestine government labs like chemical or biological weapons except with equipment costs orders of magnitude lower. I can imagine policies that would save the jobs of voice actors, but none that would prevent the wave of deepfake propaganda that is coming.
Voices are uncopyrightable, but impersonation isn't legal (see Midler v. Ford, for a notable case), so I don't think the situation is totally clear.
> voice actors are fearing that the ability for generative AI to replicate their voices may cost them work

I'm not sure how to feel about that. I'm against the idea that some people "deserve" to be paid for being lucky enough to be born with an interesting voice.

On the other hand, the world has always worked like that. And, say, a hard-working farmer or doctor was also lucky to be born with the traits necessary to make a living, while others weren't.

Voice acting is more than just talking into a microphone. It's a skill not limited to the quality of voice.
A lot of skills are not simple, but computers have taken them over anyway. For example, financial bookkeeping is not just writing and storing the books; it's a professional skill with many tricks to learn. However, databases and spreadsheets have taken over the major part of those jobs. The same could be said about programmers who learned the skill of programming in assembly language. Or performing -- vinyl records and CDs have largely taken over from orchestras and traveling musicians.

I would vote for it only if it somehow encouraged voice actors to experiment and create new interesting styles. Kinda like patents were designed to do -- encourage inventors (although recently it became controversial in IT world).

Yes, everyone has a voice. The number of people who can convincingly act with said voice is remarkably small, and doing so requires a good deal of innate ability or training, generally both.
You could have made that argument more effectively in the past when voice actors had to be able to mimic multiple voices (Dan Castellaneta, Mel Blanc, etc.). Nowadays, we're seeing more and more shows where the voices of the characters are just... the normal voice of the voice actor.

Of course it's not totally devoid of skill, you need to be able to emote, inflect, and convey emotion, but the bar is far far lower.

I'd argue that emoting is a more important skill than switching between multiple character voices, though the latter helps you get more jobs.
> that some people "deserve" being paid for being lucky born with an interesting voice

The majority of success is attained like this though. Athletes paid for being born strong, tall, and fast, models paid for being pretty, rich families being paid for being born rich, smart people being paid for being born smart, or hardworking, etc. It's the most dominant factor everywhere.

It's always funny to me when people cite old American case law and try to wrangle their heads around how that can apply to a situation which the case's participants couldn't have possibly imagined. Shouldn't the correct way to do this be new legislation being created after consulting interest groups to answer the modern problems which exist due to modern realities, like what the EU is doing? It seems much more sensible of an approach instead of wondering how a 15th century ruling's ruler would have applied his thinking about something they couldn't even dream of.
Interest groups == lobbyists in this case. Which might explain some of the American hesitation.
Well yes, you need to ask representatives of the people that will be impacted by a law what the impact will be, assess expert opinions, etc. Lobbying isn't only the American political bribery system; there are legitimate reasons behind it.
Of course! And that those with the deepest pockets are able to afford to have the most convincing folks spend the most time waiting for an opening in the various Representatives' calendars is not surprising, and only natural.

That it often results in them getting an equivalent mindshare (or more) of the Representatives' views is also not surprising, and only natural.

It doesn't inspire warm fuzzies in those too busy working to survive though.

Your government class didn’t cover common law versus case law?
You probably mean common law, also sometimes known as case law, vs civil law which traces its origins to the Napoleonic civil code, and which is used in all of the world outside of the former British colonies.

My law classes did cover common law, yes, but not favourably (can you guess I come from a civil law country?). Sounds like a system that made sense in 15th century Britain, but is quite the complex beast with many issues nowadays when it doesn't need to be.

However that still doesn't answer my original question, why is there no new legislation to cover the newly existing scenarios talked about? It seems to me that even the UK does that at least for some things, and they're the original common law country.

> Sounds like a system that made sense in 15th century Britain

Eleventh.

As long as they don’t claim the voice is the original actor (misspell the name perhaps, or the Hollywood classic ‘based on’), they won’t be impersonating, no?
The Ford ad didn't say it was Midler, they just implied it by using her song with a soundalike. There was another similar case with a parody ruled as impersonation. I don't think there's good precedent for exactly where that line is drawn.
Fascinating, thanks!
Interesting note: many Vocaloids (most notably Hatsune Miku) are sampled from voice actors rather than singers.

Singers didn't want software clones, but voice actors are fair game.

I have a different take on this.

AI voice is cheaper, but it's also a more boring and generic performance. There is zero progress made towards any sort of creative AI that produces good unique work.

The market for this then is small businesses who can't afford a professional voice actor. AI is opening up new markets, not killing the jobs of the truly talented.

This is the case for all generative "art." The people at the high end will still get paid well. The people who specialize in more utilitarian or low budget tasks in higher volume will take the biggest hit. Nobody who'd planned on hiring Morgan Freeman to do a voice over will be tempted to use AI Morgan Freeman instead.
The MVP might have the free "good enough" AI voiceover, it takes less money to bootstrap a new product that way.

The real product would have a real voice over actor paid for with VC money.

>There is zero progress made towards any sort of creative AI that produces good unique work.

It's only been a year. Give it some time and I'm sure AI will have much better results. Right now, you can get some of that unique work by finetuning the AI off of a person's existing portfolio.

I am interested! You should write about what you find interesting; never worry if it will interest a particular group.
It saddens me because of how much impact they had on my family as we played through the story line in Genshin and immersed ourselves in the world. At some point we met a few of the voice actors at a convention and they were like stars to us, while I'm sure their circumstances are as you describe.
I'd be interested.

Most likely you'd see a lot of people saying that somehow getting rid of voice actors is good for "progress". Whatever that means.

Random aside: someone really needs to make a Hacker News that focuses more on game development and other arts, so blog posts like the one you're talking about would have a proper community to discuss them with.

Replacing voice actors with text-to-speech is good because it lets you do things voice actors can't:

* Create dynamic new voice lines at runtime, for example game characters reacting to new situations.

* Operate at a scale that's infeasible for humans, for example turning every ebook into an audiobook.

Which are, in my view, really minor advantages when compared to the disadvantages. Not only in terms of putting people out of work, but in terms of increasing the artifice of the world around us and decreasing its humanity.
"putting people out of work" by automating jobs is also a good thing.

The amount of stuff humans can accomplish is strongly limited by the supply of workers. Automating one job frees them up to do other things.

> "putting people out of work" by automating jobs is also a good thing

Unless you're one of the people out of work. And even if you don't care anything about them, if there's enough of them then the resulting unrest will be your problem anyway.

There's little to nothing more important to the happiness of humanity than increased productivity per capita. That sounds crazy but when you think about it it's true.
Well, this is a very one sided view on the world I'd say. From personal experience, I can surely tell you that I was much happier in countries where productivity was lower. The people there are just so much more pure of heart.
It's fine to visit, but in every measure of happiness people in poor countries are more lonely and report worse life satisfaction.
> That sounds crazy but when you think about it it's true.

I've thought a lot about it, and I don't think it's true.

> and their voicework is owned by their clients who may not be as incentivised in protecting the VO.

The work product produced by their voice for fulfilling the contract is owned. No corp owns someone else's voice.

Property is a bundle of rights, and often hard to pin down. In the case of voices, if a company owns enough of your data to train a good simulacrum, and they have the right to do it, then they kind of do own your voice -- or more precisely, a damn good substitute.
Case in point, Luke Skywalker / Darth Vader in the D+ series: https://www.vanityfair.com/hollywood/2022/09/darth-vaders-vo...

> Belyaev is a 29-year-old synthetic-speech artist at the Ukrainian start-up Respeecher, which uses archival recordings and a proprietary A.I. algorithm to create new dialogue with the voices of performers from long ago. The company worked with Lucasfilm to generate the voice of a young Luke Skywalker for Disney+’s The Book of Boba Fett, and the recent Obi-Wan Kenobi series tasked them with making Darth Vader sound like James Earl Jones’s dark side villain from 45 years ago, now that Jones’s voice has altered with age and he has stepped back from the role.

Copyright is complex. And artists' rights are outside of copyright, in some respects. An example: in the past, painters have had their works bought and then hung in unfavourable conditions, or in places that reflect poorly upon the work of art.

Artists have sued, and won, to have artwork moved, shown differently, or force-sold back to the artist.

Now, everything you say is copyright... you. At least in my legal jurisdiction! Even my image is, in Quebec! Yes, that includes if you take my picture outside.

So what of one's voice? And if you don't have a real agreement, to use that voice in any way desired. And then you use that voice to.. I don't know, advocate for terrorists or something weird.

What then?

I don't think it's completely clearcut, and I think there will be changes, decisions on this going down the road.

We've seen plenty of examples of famous people suing companies for using their likeness in ads as if they are promoting a product. Tom Hanks' name is currently in the news for this.

If a company edits an actor's previously recorded dialog in a way that makes them sound in favor of terrorism, in an attempt to make people think the actor said the words, we have issues on so many levels. If the dialog is chopped/re-edited to use as dialog for the same character in later works, then I really don't have issues with it.

I pay little attention to SAG contracts, but after the Writers Guild strike, I'd be expecting SAG to follow suit with major asks to protect its members from AI if they have not already covered it.
The current standard is the NAVA AI Rider: https://navavoices.org/2023/01/23/artificial-intelligence-ri...

NAVA also has guidelines for protection against AI abuse: https://navavoices.org/synth-ai/

Thanks. I have recently been asked by a couple of acquaintances who have done a few character voices in the past what I thought about AI and what can really be done with it. Because of their infrequent performances, they aren't union members, but I'll pass along these links.
> Artists have sued, and won, to have artwork moved, shown differently, or force-sold back to the artist.

That seems insane to me. Do you have specific examples?

https://en.wikipedia.org/wiki/Moral_rights

"Independent of the author's economic rights, and even after the transfer of the said rights, the author shall have the right to claim authorship of the work and to object to any distortion, modification of, or other derogatory action in relation to the said work, which would be prejudicial to the author's honor or reputation."

https://en.wikipedia.org/wiki/Authors%27_rights

"The authors of dramatic works (plays, etc.) also have the right to authorize the public performance of their works (Article 11, Berne Convention)."

"The protection of the moral rights of an author is based on the view that a creative work is in some way an expression of the author's personality: the moral rights are therefore personal to the author and cannot be transferred to another person except by testament when the author dies."

"“Author” is used in a very wide sense, and includes composers, artists, sculptors and even architects"

Architects can deny changes in interior design: Lighting, artwork, etc., long after the building is finished. Just a few days ago I talked with a theater director: The author of the original work has the right to deny a production, for whatever reason, e.g. if they don't like the nose of an actor.

I bet my voice is mine under most jurisdictions (and I mean most; the Berne convention has been signed by 181 countries), even if I signed a contract that gives you wide permission to use it. And if I didn't, you can't use it outside of the very narrow scope of the work I produced for you. Even if you simply want to reuse an existing recording in another context.

They don't own the voice, but they own the vocal performance, which ends up being a meaningless legal distinction in practice.

It's one reason why VAs rarely take fan requests for a character they voice.

If they are using their real voice, then they kind of screwed themselves. If they are performing a character voice, then at least they only lose out on that kind of work.

I'm guessing contracts will need to be updated to say that an AI-made version of a character's voice can't be used by a completely different production to claim the actor is attached for publicity purposes.

No one owns a voice at the moment. There is no mechanism in the US to own a voice, even your own.
A person's voice is effectively owned by the corresponding person through right of publicity, which includes voice depending on jurisdiction.

California, for example:

"Any person who knowingly uses another’s name, voice, signature, photograph, or likeness, in any manner, on or in products, merchandise, or goods, or for purposes of advertising or selling, or soliciting purchases of, products, merchandise, goods or services, without such person’s prior consent, or, in the case of a minor, the prior consent of his parent or legal guardian, shall be liable for any damages sustained by the person or persons injured as a result thereof."

https://leginfo.legislature.ca.gov/faces/codes_displaySectio....

Voices can sound very similar, they're far from unique. Clearly if you say or somehow strongly imply that a voice belongs to a specific person then that is protected. But what if you use someone's voice, someone not especially well known, and don't make any claims about where it comes from?
It's still not necessarily legal just because you can get away with it.
I don't think it's that clear at all. You own your "likeness", but the limits of what that means are highly untested. Of the similar examples that have been tested in court thus far, the Midler v. Ford case is the closest, but the court specifically called out the fact that as a singer her voice is a distinctive part of her identity, and so it is protected.

https://en.wikipedia.org/wiki/Midler_v._Ford_Motor_Co.

<raises hand> I am
Please do. Some of us critique capitalism
It's sad if the only way voice actors are going to be able to make a living is by doing stuff like Critical Role on Youtube. I love Critical Role but it likely wouldn't be the same if those guys hadn't spent years honing their craft. Watching people play RPGs online has replaced a lot of my streaming viewing now, but the market is much smaller and I imagine it can only sustain a much smaller pool of creatives than the current voice over market can.