Hacker News
by fastoptimizer 3571 days ago
Do they say how long generation takes?

Is this insanely slow to train but extremely fast at generation?

4 comments

"After training, we can sample the network to generate synthetic utterances. At each step during sampling a value is drawn from the probability distribution computed by the network. This value is then fed back into the input and a new prediction for the next step is made. Building up samples one step at a time like this is computationally expensive, but we have found it essential for generating complex, realistic-sounding audio."

So it looks like generation is a slow process.

Training is relatively fast: thanks to parallelism and masking, you don't have to sample during training. Generation is different, because sampling is a sequential process. They talk about this a bit in the earlier PixelCNN and PixelRNN papers.
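To make the training/generation asymmetry concrete, here is a minimal toy sketch in Python. It is not the paper's model: `predict_distribution` is a made-up stand-in for the network, and `RECEPTIVE_FIELD` and `NUM_LEVELS` are hypothetical values picked for illustration. The point is only the data dependency: during training every context comes from the known waveform, while during generation each step needs the value drawn at the previous step.

```python
import math
import random

RECEPTIVE_FIELD = 4  # hypothetical context length (assumption)
NUM_LEVELS = 8       # hypothetical number of quantized amplitude levels (assumption)


def predict_distribution(context):
    """Toy stand-in for the network: map past samples to a categorical
    distribution over the next quantized sample."""
    center = sum(context) / len(context)
    logits = [-abs(level - center) for level in range(NUM_LEVELS)]
    peak = max(logits)
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    return [w / total for w in weights]


def training_loss(waveform):
    """Teacher forcing: every context is taken from the known waveform, so
    the per-step terms are independent and could be computed in parallel."""
    padded = [0] * RECEPTIVE_FIELD + list(waveform)
    contexts = [padded[t:t + RECEPTIVE_FIELD] for t in range(len(waveform))]
    losses = [-math.log(predict_distribution(c)[x])
              for c, x in zip(contexts, waveform)]
    return sum(losses) / len(losses)


def generate(num_samples, rng):
    """Sampling: step t needs the value drawn at step t-1, so this loop
    cannot be parallelized across time."""
    history = [0] * RECEPTIVE_FIELD
    for _ in range(num_samples):
        probs = predict_distribution(history[-RECEPTIVE_FIELD:])
        value = rng.choices(range(NUM_LEVELS), weights=probs)[0]
        history.append(value)  # fed back in as input for the next step
    return history[RECEPTIVE_FIELD:]
```

In `training_loss` the list of contexts is built up front, which is what lets a real implementation batch all time steps in one pass. In `generate` there is no way around running one prediction per output sample.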
According to third-hand reports I've heard (apply copious amounts of salt), it may take 1 hour of CPU time to generate 1 second of speech.
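For a sense of scale, here is a back-of-the-envelope sketch of what that rumor would imply per sample. The 1 hour per second figure is the unverified report above, and the 16 kHz sample rate is my assumption, not something stated in this thread.

```python
def cpu_seconds_per_sample(cpu_seconds_per_audio_second, sample_rate_hz):
    """CPU time spent on each generated audio sample."""
    return cpu_seconds_per_audio_second / sample_rate_hz


RUMORED_CPU_SECONDS = 60 * 60  # rumored: 1 hour of CPU per second of speech
ASSUMED_SAMPLE_RATE_HZ = 16_000  # assumption: 16 kHz audio

per_sample = cpu_seconds_per_sample(RUMORED_CPU_SECONDS, ASSUMED_SAMPLE_RATE_HZ)
slowdown_vs_realtime = RUMORED_CPU_SECONDS / 1  # CPU seconds per audio second
```

Under those assumptions, `per_sample` works out to 0.225 CPU seconds per sample and generation is 3600 times slower than real time.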
I was wondering the same. They don't say how long it took or on what kind of system. Even for a first beta, that would give us a ballpark idea of how slow it is. It's clearly slow, and since they're holding back exactly how slow, it's probably bad.