by espadrine
3571 days ago
I wouldn't. The results they offer are excellent, but what they still lack to reach human level is correct intonation, which requires an accurate understanding of the material. I expect that is still at least ten years in the future.
A big problem with generating prosody has always been that our theories of it don't predict people's behaviour very well. It's also very expensive to get people to annotate prosody accurately, whichever theory you use.
Predicting the raw audio directly cuts out this problem. The "theory" of prosody can be left latent, rather than specified explicitly.