by dataangel
945 days ago
No, I suspect 3 seconds is often enough. It's not learning how you pronounce each word; it's hearing some of the pronunciations you use and, from those, guessing how you would pronounce other things based on how those pronunciations cluster in its training data. In other words, if in that 3 seconds it hears you say y'all, it has a pretty good shot at inferring how you say a lot of other things.

Apple's announced voice imitation feature requires a 15-minute sample, if I remember correctly, and you can bet it would be shorter if they were satisfied with the results.
> if in that 3 seconds it hears you say y'all
I can guarantee that no 3 seconds of text of your choosing will ever be enough to reliably differentiate between all existing dialects and accents, let alone differences between individual people.
There's a reason all the decent fake voice clips on the web are based on public figures with hours of training material - and even with those you can tell within seconds that it's fake, even if you don't know how you noticed it.