by syllogism
3578 days ago
Not really. They're training directly on the waveform, so the model can learn intonation. They just need to train on longer samples, and perhaps augment their linguistic representation with some extra discourse analysis. A big problem with generating prosody has always been that our theories of it don't predict people's behaviour very well. It's also very expensive to get people to annotate prosody accurately under any given theory. Predicting the raw audio directly sidesteps both problems: the "theory" of prosody can be left latent, rather than specified explicitly.