Hacker News
by dataangel 945 days ago
No, I suspect 3 seconds is often enough. It's not learning how you pronounce each word; it's hearing some of the pronunciations you use and, from those, guessing how you would pronounce other things based on how those pronunciations cluster in its training data. In other words, if in those 3 seconds it hears you say "y'all", it has a pretty good shot at inferring how you say a lot of other things.
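A toy sketch of that idea, not how any real voice-cloning model works: the cluster names, feature names, and numbers below are all made up for illustration. It picks the accent cluster closest to the few features it actually heard, then fills in the unheard features from that cluster.

```python
from math import dist

# Hypothetical accent clusters "learned" from training data (made-up values).
# Each maps a pronunciation feature to a strength between 0 and 1.
CLUSTERS = {
    "southern_us": {"yall": 0.9, "pin_pen_merger": 0.8, "r_dropping": 0.2},
    "new_england": {"yall": 0.1, "pin_pen_merger": 0.1, "r_dropping": 0.8},
    "midwest": {"yall": 0.2, "pin_pen_merger": 0.3, "r_dropping": 0.1},
}


def infer_profile(observed):
    """Return the nearest cluster and a full profile guessed from it."""
    keys = sorted(observed)

    def distance(name):
        # Compare only on the features that were actually heard.
        centroid = CLUSTERS[name]
        return dist([observed[k] for k in keys], [centroid[k] for k in keys])

    best = min(CLUSTERS, key=distance)
    profile = dict(CLUSTERS[best])  # start from the cluster's typical values
    profile.update(observed)        # keep what was actually observed
    return best, profile


# Three seconds of audio that happen to contain a clear "y'all".
best, profile = infer_profile({"yall": 1.0})
print(best, profile)
```

The guess is only as good as the clusters: one heard feature pins down a group, not an individual speaker.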
3 comments

People speak differently in different situations, depending on their mood and who they are talking to. They may change their voice when quoting another person, they may alter their accent when talking to a stranger, and there are many other details you don't even think about because you're used to them, but that will immediately sound wrong if they aren't replicated correctly.

Apple's announced voice imitation feature requires a 15-minute sample, if I remember correctly, and you can bet it would be shorter if they were satisfied with the results.

> if in that 3 seconds it hears you say y'all

I can guarantee that no 3 seconds of text of your choosing will ever be enough to reliably differentiate between all existing dialects and accents, let alone differences between individual people.

There's a reason all the decent fake voice clips on the web are based on public figures with hours of training material - and even with those you can tell within seconds that it's fake, even if you don't know how you noticed it.

> even with those you can tell within seconds that it's fake, even if you don't know how you noticed it.

Salesmen and scammers have many techniques to get people to act against their better judgement. Urgency is one: imagine a late-night voice call, seemingly from a loved one, saying "I'm stranded, my battery is about to die. Please send the money now -" accompanied by loud environmental noises to mask any artifacts.

> it hears you say y'all

The accuracy or realism of the emulated voice probably depends on how "rich" the recorded material is within those 3 seconds, i.e. whether it contains just a single long syllable or a phrase with a variety of vowels and consonants.

Do you know what the uncanny valley is?