by petrochukm
2580 days ago
For context, it's important to know that these are probably cherry-picked samples. The authors make no mention of attempting to select the samples randomly. For as long as text-to-speech has existed, there have been impressive demos backed by cherry-picking. The 3 Dessa team members probably did not create anything innovative in 3 months of work. Rayhane Mamah, one of the Dessa team members, had previously published a Tacotron 2 (Google's 2017 research) implementation (https://github.com/Rayhane-mamah/Tacotron-2) that has noise/distortion and intonation/prosody issues similar to those of their "RealTalk model". Following on from the above, Google's TTS research had already demonstrated human parity, as measured by MOS (mean opinion score), in early 2018. That research was deployed as Google Duplex in mid 2018. Google's TTS research also showed the deficiencies of this technology. Short of the invention of AGI, TTS models do not understand the underlying text; therefore, they will be unable to do more "complex things with intonation/prosody". Furthermore, the models suffer from overfitting: performance degrades significantly on text not typically seen in the training data.