| HN Mirror

by fotcorn 950 days ago

The recently released XTTS-v2 model[0] from coqui.ai is coming very close to what ElevenLabs[1] can do. It runs reasonably fast on a recent GPU, and should also work on CPU. Requires a 3 second (!) clip of the voice you want to clone. License does not allow commercial use.

0: https://huggingface.co/coqui/XTTS-v2

1: https://elevenlabs.io/
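
For readers who want to try this, here is a minimal sketch, not an official recipe. It assumes the Coqui `TTS` Python package is installed and uses the model name and `tts_to_file` call shown on the model card [0]. The helper names `clip_duration_seconds` and `clone_voice`, and the `MIN_REFERENCE_SECONDS` threshold, are hypothetical; the 6 second value follows the figure a commenter quotes from the model card further down the thread, not the 3 seconds mentioned above.

```python
import wave

# Assumption: minimum reference length, taken from the "6 second clip"
# figure quoted from the model card in this thread.
MIN_REFERENCE_SECONDS = 6.0


def clip_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def clone_voice(text: str, speaker_wav: str,
                out_path: str = "output.wav", language: str = "en") -> str:
    """Synthesize `text` in the voice of `speaker_wav` (hypothetical helper)."""
    duration = clip_duration_seconds(speaker_wav)
    if duration < MIN_REFERENCE_SECONDS:
        raise ValueError(
            f"reference clip is {duration:.1f}s, "
            f"expected at least {MIN_REFERENCE_SECONDS:.0f}s"
        )
    # Third-party dependency, imported lazily so the duration check
    # above works even when the package is not installed.
    from TTS.api import TTS

    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
    tts.tts_to_file(text=text, speaker_wav=speaker_wav,
                    language=language, file_path=out_path)
    return out_path
```

Calling `clone_voice("Hello there.", "reference.wav")` would download the model on first use and write `output.wav`. Note the non-commercial license mentioned above.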

1 comments

alpaca128 950 days ago

> Requires a 3 second (!) clip of the voice you want to clone.

Sure, if you want a guaranteed uncanny valley experience. There is no way a few seconds are enough to cover all the ways a specific person pronounces things. A person's voice is much more than just the pitch and with a 3 second sample anyone who knows them will be able to tell something's off within 3 seconds.

chankstein38 950 days ago

I know the parent comment said 3 seconds, but for what it's worth, the actual huggingface page says "a 6 second clip" which, admittedly, is still fairly hard to believe but I guess twice as believable as a 3 second clip.

Workaccount2 950 days ago

>anyone who knows them will be able to tell something's off within 3 seconds.

Unfortunately, the targets of these scammers are usually senile old people. I'm incredibly worried about my parents, especially my bleeding-heart mother.

JaDogg 950 days ago

Control her finances. Take away her passwords.

dataangel 950 days ago

No, I suspect 3 seconds is often enough. It's not learning how you pronounce each word; it's seeing some pronunciations you use and, from that, guessing how you would pronounce other things based on how those pronunciations are clustered in its training data. In other words, if in that 3 seconds it hears you say y'all, it has a pretty good shot at inferring how you say a lot of other things.

alpaca128 950 days ago

People speak differently in different situations, based on their mood and who they are talking to. They may change their voice when quoting another person, they may alter their accent when talking to a stranger, and there are many other details you don't even think about because you're used to them, but they'll immediately sound wrong if they're not replicated correctly.

Apple's announced voice imitation feature requires a 15 minute sample if I remember correctly, and you can bet it would be shorter if they were satisfied with the results.

> if in that 3 seconds it hears you say y'all

I can guarantee that no 3 seconds of text of your choosing will ever be enough to reliably differentiate between all existing dialects and accents, let alone differences between individual people.

There's a reason all the decent fake voice clips on the web are based on public figures with hours of training material - and even with those you can tell within seconds that it's fake, even if you don't know how you noticed it.

sangnoir 950 days ago

> even with those you can tell within seconds that it's fake, even if you don't know how you noticed it.

Salesmen and scammers have many techniques to get people to act against their better judgement. Urgency is one: imagine a late-night voice call, seemingly from a loved one, saying "I'm stranded, my battery is about to die. Please send the money now -" accompanied by loud environmental noises to mask any artifacts.

lioeters 950 days ago

> it hears you say y'all

The accuracy or realism of the emulated voice probably depends on how "rich" the recorded material is within those 3 seconds? I mean, if it includes a single long syllable, or some phrase that contains a variety of vowels and consonants.

davidmurdoch 950 days ago

Do you know what an uncanny valley is?

Diris 950 days ago

They also have code to fine-tune the model.
