

	by dheera 2202 days ago
	Which generator works the best, qualitatively? I come from a vision/ML background but haven't played with speech at all, so it's completely new to me, and I'm wondering what the state of the art is. I've been wanting to create a TTS of myself so I can take phone calls using headphones and type back what I want to say, so that I don't have to yell private information out loud in public locations. Would be nice if during non-COVID times I could sit in a train seat and take phone calls completely silently.


audiohermit 2202 days ago

Much of the work in speech synthesis has been about closing the gap in vocoders, which take a generated spectrogram and output a waveform. There's a clear gap between practical online implementations and computational behemoths like WaveNet. As you implied, it's hard to judge quantitatively which result is better, so papers usually rely on listener surveys.

Here's a recent work that has a good comparison of some vocoders: https://wavenode-example.github.io/

Edit: WaveRNN struck a good balance for me in the past but is not shown in the link. Tons of new work coming out though!
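
On the survey point: papers typically ask listeners to rate clips on a 1 to 5 scale and report the average per system as a mean opinion score (MOS), usually with a confidence interval. Here's a rough sketch of that aggregation, not taken from any of the papers above. The function name, the two "systems", and the ratings are all made up for illustration, and the interval uses a simple normal approximation.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mean, 95% CI half-width) for a list of 1-5 listener ratings.

    Hypothetical helper for illustration; uses a normal approximation.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.mean(ratings)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Made-up ratings for two hypothetical vocoders (not real survey data).
ratings_a = [4, 5, 4, 4, 3, 5, 4, 4]
ratings_b = [3, 3, 4, 2, 3, 4, 3, 3]

for name, ratings in (("system A", ratings_a), ("system B", ratings_b)):
    mos, ci = mean_opinion_score(ratings)
    print(f"{name}: MOS {mos:.2f} +/- {ci:.2f}")
```

If the two intervals overlap heavily, the survey probably can't tell the systems apart, which is part of why these comparisons are hard to read.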


sdenton4 2202 days ago

WaveRNN (and even slimmer versions, like LPCNet) are great, and run at a tiny fraction of the compute of the original WaveNet. Pruning is also a good way to reduce model sizes.
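
For anyone new to pruning: one common flavor is magnitude pruning, where you zero out the weights with the smallest absolute values and keep the rest. This is a toy sketch on a flat list of made-up weights, assuming a hypothetical `magnitude_prune` helper. It isn't what any particular vocoder paper does; real setups usually prune gradually during training and need sparse kernels to actually get the speedup.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of a flat weight list.

    Hypothetical helper for illustration only.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute weight.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]


# Made-up weights; prune half of them.
weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, 0.4, -0.1]
pruned = magnitude_prune(weights, 0.5)
print(pruned)
```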

I'm not sure what's up with the WaveGLOW (17.1M) example in the linked wavenode comparison... The base WaveGLOW sounds reasonable, though. They're also using all female voices, which strikes me as dodgy; pitch tracking for lower male voices is often harder to get right, and a bunch of comparisons that never get into harder cases or failure modes makes it seem like they're covering something up.

(I've run into a bunch of comparisons for papers in the past where they clearly just did a bad job of implementing the prior art. There should be a special circle of hell...)


audiohermit 2201 days ago

Agreed. I didn't have a better comparison at hand.

I'm looking at you, GAN papers.
