Hacker News new | ask | show | jobs
by jfsantos 3405 days ago
Hi, I'm one of the authors. In broad lines, we pretrained one model (the "Reader") to learn to read text and output vocoder variables, and another model (SampleRNN) to go from these vocoder variables to an audio waveform. Then, we finetuned both models together to be able to go from text to speech, end-to-end. The "end product" is a text-to-speech system, but without the need of having to extract tons of hand-engineered features from the text to be able to generate speech. We also expect that with more training this will be able to overcome the usual vocoder speech "unnaturalness" issues.
1 comments

Can you comment on what's happening in this sample (result???) clip? http://www.josesotelo.com/speechsynthesis/files/wav/blizzard...

Also, I notice that many of the result clips trail off in volume. Is that a processing error or intentional in how the clips are edited?

I think the model just got tired of reading text and decided to mock us :) Just kidding. The attention mechanism got stuck somehow for this sample. This does not happen very often, though. It's important to note the samples we posted were not cherry-picked: they are just the first 10 sentences from our test set.

Regarding the truncation at the end, that was a bug in our sampling code that we just fixed. We will update the samples soon!

Is there any way to artificially induce that failure? I'm an artist and I've been trying to get a handle on ML stuff, and being able to feed speech through this to give it the flat affect of the phoneme-mode samples, or insert attention failures at specific points, would be extremely useful for a number of projects I have in mind.