
by modeless 3401 days ago

> you can make anyone say anything as long as you have some previous recordings of their voice.

That's not what this is doing. They're simply resynthesizing exactly what the person said, in the same voice. It's essentially cheating because they can use the real person's inflection. Generating correct inflection is the hardest part of speech synthesis because doing it perfectly requires a complete understanding of the meaning of the text.

The top two are representative of what it sounds like when doing true text to speech. The middle five are just resynthesis of a clip saying the exact same thing. And even in that case, it doesn't always sound good. The fourth one is practically unintelligible. But it's interesting because it demonstrates an upper bound on the quality of the voice synthesis possible with their system given perfect inflection as input.

To clarify, this is cool work, the real-time aspect sounds great, and I'm sure it will lead to even more impressive results in the future. But I don't want people to think that all of the clips on this page represent their current text-to-speech quality.

4 comments

PieSquared 3401 days ago

Thank you for clarifying this! We tried fairly hard to make this clear, because as you say, the hard part is generating inflection and duration that sound natural. There's still a ton of work left to do in this direction – we're clearly nowhere near being able to generate human-level speech.

Our work is meant to make working with TTS easier for deep learning researchers by describing a complete system that can be trained entirely from data, and to demonstrate that neural vocoder substitutes can actually be deployed to streaming production servers. Future work (both by us and hopefully other groups) will make further progress on inflection synthesis!


mrmaximus 3400 days ago

My "Fake News" comment aside, I think what y'all are doing could be transformational for many reasons. Imagine a scenario where a person loses a loved one, and similar technology is able to allow them to "have conversations" with the deceased as a form of healing and closure. Not to mention, this could add a personal touch to assistant bots that will make them a pleasure to use.


mrmaximus 3401 days ago

>The top two are representative of what it sounds like when doing true text to speech. The middle five are just resynthesis of a clip saying the exact same thing.

Gotcha, now I understand.


phkahler 3400 days ago

>> They're simply resynthesizing exactly what the person said, in the same voice. It's essentially cheating because they can use the real person's inflection.

Yes, but imagine being able to take sound from one person and inflection from another. If you want to fake someone saying something, you don't need to do pure TTS; a human can be used to fake another person's inflections.


mrmaximus 3401 days ago

Based upon what little is posted there, I thought they were taking the original recording, then training the model on that recording against the text of the recording... reproducing the recording. I would think the next step is to sample enough audio and text to be able to produce entirely new outputs. It should in theory even be able to learn when/where/how to use inflection.
