Hacker News
by belugacat 945 days ago
Right now this requires API tokens and depends on third-party companies that will cut off your access if they decide they don’t like you.

The moment these models can run locally on the kind of cheap hardware that phone scam operations use will be the real Pandora’s box moment. (I give it 3-5 years or so.)


The recently released XTTS-v2 model[0] from coqui.ai is coming very close to what ElevenLabs[1] can do. It runs reasonably fast on a recent GPU, and should also work on CPU. Requires a 3 second (!) clip of the voice you want to clone. License does not allow commercial use.

0: https://huggingface.co/coqui/XTTS-v2

1: https://elevenlabs.io/

> Requires a 3 second (!) clip of the voice you want to clone.

Sure, if you want a guaranteed uncanny valley experience. There is no way a few seconds are enough to cover all the ways a specific person pronounces things. A person's voice is much more than just its pitch, and with a 3-second sample anyone who knows them will be able to tell something's off within 3 seconds.

I know the parent comment said 3 seconds, but for what it's worth, the actual Hugging Face page says "a 6 second clip" which, admittedly, is still fairly hard to believe, but I guess twice as believable as a 3-second clip.

> anyone who knows them will be able to tell something's off within 3 seconds.

Unfortunately, the targets of these scammers are usually senile old people. I'm incredibly worried about my parents, especially my bleeding-heart mother.

Control her finances. Take away her passwords.

No, I suspect 3 seconds is often enough. It's not learning how you pronounce each word; it's hearing some pronunciations you use and from that guessing how you would pronounce other things, based on how those pronunciations are clustered in its training data. In other words, if in that 3 seconds it hears you say y'all, it has a pretty good shot at inferring how you say a lot of other things.

People speak differently in different situations, based on their mood and who they are talking to. They may change their voice when quoting another person, they may alter their accent when talking to a stranger, and there are many other details you don't even think about because you're used to them, but which immediately sound wrong if they're not replicated correctly.

Apple's announced voice imitation feature requires a 15-minute sample if I remember correctly, and you can bet it would be shorter if they were satisfied with the results.

> if in that 3 seconds it hears you say y'all

I can guarantee that no 3 seconds of text of your choosing will ever be enough to reliably differentiate between all existing dialects and accents, let alone differences between individual people.

There's a reason all the decent fake voice clips on the web are based on public figures with hours of training material - and even with those you can tell within seconds that it's fake, even if you don't know how you noticed it.

> even with those you can tell within seconds that it's fake, even if you don't know how you noticed it.

Salesmen and scammers have many techniques to get people to act against their better judgement. Urgency is one: imagine a late-night voice call seemingly from a loved one saying "I'm stranded, my battery is about to die. Please send the money now -" accompanied by loud environmental noises to mask any artifacts.

> it hears you say y'all

The accuracy or realism of the emulated voice probably depends on how "rich" the recorded material is within those 3 seconds, doesn't it? I mean, whether it contains just a single long syllable or a phrase with a variety of vowels and consonants.

Do you know what an uncanny valley is?

They also have code to fine-tune the model.

Could work for spear-phishing, or impersonating a widely known trusted figure. I can't really see it working for cold calls that pretend to be someone the victim knows (like the terrifying ransom calls), since those operations work at a huge scale and expect most people not to even pick up a "scam likely" call. Even if model tuning is free and instant, just having to find a voice clip of the person before each unanswered automated call would tank the number of calls they're able to make.

Though, for the same reason websites always attribute data breaches to a "highly sophisticated targeted attack", I imagine there will be some unevidenced claims that this is what scammers did to them - people don't want to have been fooled by something simple.

I just can't get excited about most trending submissions here on HN, and it has been for the exact same reason over the last 2 years. This current advancement of AI doesn't feel like such a "new frontier" as other advancements in tech did in their early-adopter phase. The internet (the closest tech invention of similar magnitude imo) had at least a decade-long "wild west" before all the big players we know entrenched themselves with monopolies and legislation.

With AI we've barely begun and yet the cards are already dealt.

I kinda wonder how close you could get with Core ML on iOS. Apple already ships some iffy voice-cloning software.