by pippin 2587 days ago
Hi HN, I'm one of the creators of this project at Dessa.

If you haven't listened to it, we just released a longer clip of the RealTalk model[1]. In my opinion it's even more compelling than these clips.

One of the fascinating parts of building this has been the questions we received while showing it to people. I'll note a few specifically:

"What is the difference between this and a real voice?"

"Can I learn to discern fakes over time?"

"Would we relate differently to a generative voice model posthumously, compared to current media forms like videos?"

These aren't questions that we necessarily have answers to yet, but they're important discussions to have.

[1] https://www.youtube.com/watch?v=DWK_iYBl8cA&feature=youtu.be


All of these generated works are very impressive.

I think it is completely irresponsible to advance the state of the art in this field without simultaneously developing techniques to demonstrate that the generated work is artificial.

Please develop validation tests while developing your generative techniques.

Haven't read what they're doing, but chances are they're using a generative adversarial network (GAN).

The job of the adversarial network (the discriminator) is to tell real from fake. The job of the generator network is to fool the adversarial network. Both are trained in tandem.
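
To make the "trained in tandem" part concrete, here is a toy sketch in plain Python. Everything in it is an assumption for illustration: the data is one-dimensional, the generator is a linear map G(z) = a*z + b, the adversary is a logistic unit D(x) = sigmoid(w*x + c), and the name train_toy_gan is made up. It says nothing about what RealTalk actually does, and toy setups like this can oscillate rather than settle.

```python
import math
import random

def sigmoid(t):
    # Numerically safe logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def train_toy_gan(steps=2000, lr=0.01, real_mean=4.0, seed=0):
    """Alternate one adversary update and one generator update per step."""
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator parameters (hypothetical toy model)
    w, c = 0.1, 0.0  # adversary parameters (hypothetical toy model)
    for _ in range(steps):
        x_real = rng.gauss(real_mean, 1.0)  # a "real" sample
        z = rng.gauss(0.0, 1.0)             # noise fed to the generator
        x_fake = a * z + b                  # a "fake" sample

        # Adversary step: gradient ascent on log D(real) + log(1 - D(fake)).
        d_real = sigmoid(w * x_real + c)
        d_fake = sigmoid(w * x_fake + c)
        w += lr * ((1.0 - d_real) * x_real - d_fake * x_fake)
        c += lr * ((1.0 - d_real) - d_fake)

        # Generator step: gradient ascent on log D(fake), i.e. try to fool
        # the adversary that was just updated.
        d_fake = sigmoid(w * x_fake + c)
        grad_x = (1.0 - d_fake) * w  # d/dx of log D(x) at x_fake
        a += lr * grad_x * z
        b += lr * grad_x
    return (a, b), (w, c)
```

Calling train_toy_gan() returns the final generator and adversary parameters. The point is only the alternating structure: each network's update uses the other's current state.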

One could imagine training another adversarial network that isn't used to train the generator itself, and so would pick up on nuances that the original adversary misses. Anyone could do that; I don't think it's the authors' responsibility.
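
A rough sketch of that idea, again in plain Python and under loud assumptions: sample_fake stands in for an already-trained, frozen generator, the "audio" is a single number, and the detector is plain logistic regression. The real and fake distributions here are deliberately easy to separate, so this only shows the workflow of training a detector after the fact, not how hard the real problem is.

```python
import math
import random

def sigmoid(t):
    # Numerically safe logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def sample_real(rng):
    # Hypothetical "real" data: one feature drawn from N(4, 1).
    return rng.gauss(4.0, 1.0)

def sample_fake(rng):
    # Hypothetical frozen generator: one feature drawn from N(0, 1).
    return rng.gauss(0.0, 1.0)

def train_detector(steps=5000, lr=0.05, seed=1):
    """Fit an independent real-vs-fake classifier with SGD on log-loss.

    The generator is never updated, so the detector is not part of its
    training loop.
    """
    rng = random.Random(seed)
    w, c = 0.0, 0.0
    for _ in range(steps):
        if rng.random() < 0.5:
            x, y = sample_real(rng), 1.0
        else:
            x, y = sample_fake(rng), 0.0
        p = sigmoid(w * x + c)   # predicted probability that x is real
        w += lr * (y - p) * x    # log-likelihood gradient w.r.t. w
        c += lr * (y - p)        # log-likelihood gradient w.r.t. c
    return w, c

def accuracy(w, c, n=2000, seed=2):
    # Score the detector on fresh samples it did not train on.
    rng = random.Random(seed)
    correct = 0
    for _ in range(n):
        correct += sigmoid(w * sample_real(rng) + c) > 0.5
        correct += sigmoid(w * sample_fake(rng) + c) <= 0.5
    return correct / (2 * n)
```

Usage would be w, c = train_detector() followed by accuracy(w, c). Because the generator never saw this detector during training, it had no chance to adapt to it, which is the property the comment above is relying on.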

Somewhat related:

https://keenlab.tencent.com/en/2019/03/29/Tencent-Keen-Secur...

Doubt it.

Generative-adversarial models have had a lot of success in image generation; however, the same cannot be said for speech synthesis.

Unless they have figured out a new technique, they are probably using Tacotron 2 (https://ai.googleblog.com/2017/12/tacotron-2-generating-huma...). Google's Tacotron 2 already achieved near-human-parity TTS quality, as measured by mean opinion score (MOS), without adversarial training.
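
For anyone unfamiliar with MOS: listeners rate each clip on a 1 to 5 scale and the ratings are averaged, usually reported with a confidence interval. A minimal sketch, assuming a normal-approximation 95% interval; the function name is made up and the ratings in the usage note are invented for illustration, not taken from any paper.

```python
import math
import statistics

def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mos = statistics.mean(ratings)
    # Normal approximation: 1.96 standard errors on either side of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width
```

For example, mean_opinion_score([5, 4, 4, 5, 3]) averages five made-up listener ratings. Comparing the synthetic system's MOS with the MOS of real recordings rated under the same conditions is what a "human parity" claim rests on.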