by bmc7505 3404 days ago
Hi Andrew, congratulations on your result! A few questions, feel free to answer one or any. How close do you think you are to having fully end-to-end models for speech? Are you optimistic we can get speech synthesis to run on mobile devices in the near future? Do the inference optimizations (particularly sample embedding and layer inference) generalize well to other architectures, like speech recognition? It seems that if these models are going to run offline in realtime on mobile devices, we will need to have specialized hardware, but maybe we can squeeze enough performance out of mobile CPUs to get a highly optimized version to work. Thanks!

Thank you!

For fully end-to-end models, it's hard to say exactly. The Char2Wav paper demonstrates that there is hypothetically an architecture and a set of weights that can do synthesis end-to-end, but we cannot yet train such a system. On Reddit, one of the Char2Wav authors comments that they tried training it directly and didn't get great results, and at SVAIL we've also had some trouble doing so. I think it is very likely going to happen in the next several months or year, but we don't yet know exactly what needs to happen in order to get it to work.

As for inference, some of the inference optimizations do generalize. In fact, the GPU optimizations (persistent kernels) were originally developed by our systems team and published in the Persistent RNN [0] paper. (This is a really powerful technique that CUDA makes very hard to implement, and I have a massive amount of respect for the folks who managed to make it work!) Persistent RNNs make training at close-to-peak FLOPs with very low batch sizes plausible, and make GPU WaveNet inference plausible. At the moment, our CPU kernels are much more promising, but we don't know whether that will stay the case. For mobile, I think it is possible to get the current systems to work on fairly powerful mobile CPUs with a good deal more work on optimization and low-level assembly, but we haven't done it yet, so time will tell.

[0] https://svail.github.io/persistent_rnns/ and http://jmlr.org/proceedings/papers/v48/diamos16.pdf

>> Are you optimistic we can get speech synthesis to run on mobile devices in the near future?

You mean high quality, right? Speech synthesis that is understandable and runs on cheap hardware has been around for decades. Speech recognition has also been around for a long time, but there's a huge difference in usability between "pretty good recognition" and "pretty good synthesis". One is useful, the other not so much.