by pain_perdu 3396 days ago
How close (# years?) are we to being able to replicate the voice of any given individual with sufficient samples of their voiceprint?
4 comments

It's hard to say! We don't know exactly how many parameters or minutes of audio are needed to fully describe someone's voice and speaking patterns. Maybe one or two, maybe much more.
Do you expect derivatives of this to surpass the effort by Adobe with VoCo? From my untrained perspective they appear quite similar in functionality.
I don't quite know what VoCo does, but it seems like a concatenative system that they've tuned a huge amount. I'm a little skeptical that it works as well and as reliably in real life as it does in demos. But, even so, parametric models tend to be much smaller in size and more flexible, so there may be applications where WaveNet-style systems are applicable in ways concatenative systems can't handle (high-quality on-device TTS, emotive TTS, speaker synthesis for new unheard speakers, etc).
A simpler problem could be to identify someone based on voice. Is that problem already solved? And can we use this to solve the problem of generating someone's voice?
That has been possible for years, and is even a typical student assignment in speech processing courses. A quick search gave this example course at Cornell:

http://people.ece.cornell.edu/land/courses/ece5760/FinalProj...

"My voice is my passport."
Sneakers (1992): My Voice Is My Passport

https://m.youtube.com/watch?v=-zVgWpVXb64

Verify me.
aahahahaha !
Afaik VoCo isn't creating anything from thin air. Instead it scans the available voice data (it reportedly needs a sample of about 20 mins of a person speaking) and copies fragments of it in a specific order to create a sentence.
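
To make the "copy fragments in a specific order" idea concrete, here is a toy sketch of concatenative synthesis in Python. It is not a description of how VoCo works internally. The word-level inventory, the sample values, and the `synthesize` function are all made up for illustration, and real systems typically work with much smaller units (phones or diphones) and smooth the joins between them.

```python
from typing import Dict, List

# Hypothetical inventory: each word maps to a fragment of recorded samples.
# The numbers are placeholders, not real audio.
Inventory = Dict[str, List[float]]


def synthesize(sentence: str, inventory: Inventory) -> List[float]:
    """Concatenate stored fragments in the order the words appear."""
    output: List[float] = []
    for word in sentence.lower().split():
        if word not in inventory:
            # A purely concatenative system cannot produce a unit
            # that was never recorded.
            raise KeyError(f"no recorded fragment for {word!r}")
        output.extend(inventory[word])
    return output


inventory: Inventory = {
    "my": [0.1, 0.2],
    "voice": [0.3, 0.4, 0.5],
    "is": [0.6],
    "passport": [0.7, 0.8],
}

samples = synthesize("My voice is my passport", inventory)
```

Asking this sketch for a word that is missing from the inventory raises a `KeyError`, which is the limitation mentioned upthread: a concatenative system can only rearrange what it has recordings of, while a parametric model generates the waveform instead.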