by mrmaximus
3392 days ago
Interesting. This is not TTS as we are accustomed to it: they are replicating a specific person's voice with TTS. Listen to the ground-truth recordings at the bottom and then the synthesized versions above. "Fake News" is about to get a lot more compelling when you can make anyone say anything, as long as you have some previous recordings of their voice.
That's not what this is doing. They're simply resynthesizing exactly what the person said, in the same voice. It's essentially cheating because they can use the real person's inflection. Generating correct inflection is the hardest part of speech synthesis because doing it perfectly requires a complete understanding of the meaning of the text.
The top two are representative of what it sounds like when doing true text-to-speech. The middle five are just resynthesis of a clip saying the exact same thing. And even in that case, it doesn't always sound good: the fourth one is practically unintelligible. But it's interesting because it demonstrates an upper bound on the quality of voice synthesis possible with their system, given perfect inflection as input.
To clarify, this is cool work, the real-time aspect sounds great, and I'm sure it will lead to even more impressive results in the future. But I don't want people to think that all of the clips on this page represent their current text-to-speech quality.