


	by pain_perdu 3396 days ago
	How close (# years?) are we to being able to replicate the voice of any given individual, given sufficient samples of their voiceprint?

4 comments

PieSquared 3396 days ago

It's hard to say! We don't know exactly how many parameters or minutes of audio are needed to fully describe someone's voice and speaking patterns. Maybe one or two, maybe much more.

hulahoof 3396 days ago

Do you expect derivatives of this to surpass Adobe's effort with VoCo? From my untrained perspective, the two appear quite similar in functionality.

PieSquared 3396 days ago

I don't quite know what VoCo does, but it seems like a concatenative system that they've tuned a huge amount. I'm a little skeptical that it works as well and as reliably in real life as it does in demos. Even so, parametric models tend to be much smaller and more flexible, so there may be applications where WaveNet-style systems work and concatenative systems can't (high-quality on-device TTS, emotive TTS, speaker synthesis for new, unheard speakers, etc.).

amelius 3396 days ago

A simpler problem could be to identify someone based on voice. Is that problem already solved? And can we use this to solve the problem of generating someone's voice?

alephnil 3396 days ago

That has been possible for years, and is even a typical student assignment in speech processing courses. A quick search turned up this example from a course at Cornell:

http://people.ece.cornell.edu/land/courses/ece5760/FinalProj...
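To make the identification idea concrete, here is a minimal sketch, not the method used in the linked course project. It assumes each recording has already been reduced to a fixed-length feature vector (how that is done is left out), and the names `enroll` and `identify` are hypothetical. A speaker's profile is the average of their enrollment vectors, and a new recording is assigned to the profile with the highest cosine similarity.

```python
import math
from typing import Dict, List

Vector = List[float]  # assumed: a precomputed, fixed-length voice feature vector


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def enroll(samples: List[Vector]) -> Vector:
    """Build a speaker profile by averaging that speaker's feature vectors."""
    count = len(samples)
    return [sum(column) / count for column in zip(*samples)]


def identify(features: Vector, profiles: Dict[str, Vector]) -> str:
    """Return the name of the enrolled speaker whose profile is most similar."""
    return max(profiles, key=lambda name: cosine(features, profiles[name]))


# Toy, made-up feature values purely for illustration.
profiles = {
    "alice": enroll([[1.0, 0.0], [0.9, 0.1]]),
    "bob": enroll([[0.0, 1.0], [0.1, 0.9]]),
}
print(identify([0.8, 0.2], profiles))
```

Real systems differ mainly in how the feature vectors are computed and in rejecting speakers who were never enrolled; the nearest-profile step is the part sketched here.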

qq66 3396 days ago

"My voice is my passport."

saycheese 3396 days ago

Sneakers (1992): My Voice Is My Passport

https://m.youtube.com/watch?v=-zVgWpVXb64

Verify me.

aahahahaha !

now? https://www.youtube.com/watch?v=XfcqBElF0ZI

M4v3R 3396 days ago

AFAIK VoCo isn't creating anything from thin air. Instead, it scans the available voice data (it reportedly needs a sample of about 20 minutes of a person speaking) and copies fragments of it in a specific order to create a sentence.
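The copy-fragments-in-order idea can be sketched in a few lines. This is a toy illustration of concatenative synthesis in general, not Adobe's implementation: the word-level inventory, the sample values, and the function name `synthesize` are all assumptions made for the example, and a real system would work with much smaller units and smooth the joins.

```python
from typing import Dict, List

Fragment = List[float]  # raw audio samples cut from a recording of the speaker


def synthesize(words: List[str], inventory: Dict[str, Fragment]) -> Fragment:
    """Concatenate stored fragments in the order the words appear."""
    output: Fragment = []
    for word in words:
        if word not in inventory:
            # Nothing is generated from scratch: a missing unit is a hard failure.
            raise KeyError(f"no recorded fragment for {word!r}")
        output.extend(inventory[word])
    return output


# Hypothetical inventory with made-up sample values.
inventory: Dict[str, Fragment] = {
    "my": [0.1, 0.2],
    "voice": [0.3],
    "is": [0.0],
    "passport": [0.4, 0.5],
}

sentence = synthesize(["my", "voice", "is", "my", "passport"], inventory)
print(len(sentence), "samples")
```

The sketch also shows the limitation mentioned upthread: the output can only contain material that exists in the recorded data, which is why such a system needs a sizeable sample of the speaker.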
