"After training, we can sample the network to generate synthetic utterances. At each step during sampling a value is drawn from the probability distribution computed by the network. This value is then fed back into the input and a new prediction for the next step is made. Building up samples one step at a time like this is computationally expensive, but we have found it essential for generating complex, realistic-sounding audio."
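Training is relatively fast: thanks to masking, the network can predict every time step in parallel from the known training data, so you never have to sample during training. Generation is different, because sampling is inherently sequential and each new value depends on the ones drawn before it. They talk about this a bit in the earlier PixelCNN and PixelRNN papers.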
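To make the difference concrete, here is a minimal toy sketch of the two modes. It is not the paper's model: `toy_predict`, `NUM_LEVELS`, and `RECEPTIVE` are hypothetical stand-ins I made up for illustration, and the "network" is just a fixed function of the recent context.

```python
import math
import random

NUM_LEVELS = 8   # hypothetical number of quantized amplitude levels
RECEPTIVE = 4    # hypothetical context length the toy model looks at


def toy_predict(context):
    """Stand-in for the network: return a distribution over the next value.

    This is NOT the real model, just a deterministic function of the context.
    """
    center = (sum(context) / len(context)) if context else NUM_LEVELS / 2
    logits = [-abs(level - center) for level in range(NUM_LEVELS)]
    peak = max(logits)
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(num_steps, rng):
    """Sequential sampling: each step needs the values drawn at earlier steps."""
    samples = []
    for _ in range(num_steps):
        probs = toy_predict(samples[-RECEPTIVE:])
        value = rng.choices(range(NUM_LEVELS), weights=probs)[0]
        samples.append(value)  # fed back in as input for the next step
    return samples


def training_predictions(known_sequence):
    """Training: every context comes from known data, so nothing is sampled.

    Each entry is independent of the others and could be computed in parallel.
    """
    return [
        toy_predict(known_sequence[max(0, t - RECEPTIVE):t])
        for t in range(len(known_sequence))
    ]


audio = generate(16, random.Random(0))
preds = training_predictions(audio)
```

The loop in `generate` cannot be parallelized across time steps, because step t has no input until step t-1 has been sampled. In `training_predictions` the whole sequence already exists, so all the contexts are available up front.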
I was wondering the same. They don't mention how long generation took or on what kind of system. Even for a first beta, that would give us a ballpark idea of how slow it is. It's clearly slow, and since they're holding back exactly how slow, it's probably bad.
So it looks like generation is a slow process.