|
|
|
|
|
by jfsantos
3405 days ago
|
|
Hi, I'm one of the authors. In broad lines, we pretrained one model (the "Reader") to learn to read text and output vocoder variables, and another model (SampleRNN) to go from these vocoder variables to an audio waveform. Then, we finetuned both models together to be able to go from text to speech, end-to-end. The "end product" is a text-to-speech system, but without the need of having to extract tons of hand-engineered features from the text to be able to generate speech. We also expect that with more training this will be able to overcome the usual vocoder speech "unnaturalness" issues. |
|
Also, I notice that many of the result clips trail off in volume. Is that a processing error or intentional in how the clips are edited?