by truthexposer 3341 days ago
I believe the voices sound robotic because of how little audio the system needs to generate a "usable" voice.

Speech models usually use triphones, and covering all of them turns out to take a huge amount of audio. That makes this particularly impressive, given how little data they need.
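
To put a rough number on the triphone point, here is a back-of-the-envelope sketch. The inventory size of 40 phonemes is my own assumption (roughly what is usually quoted for English), and the function names are made up for illustration. Many of these combinations never occur in real speech, so treat the result as an upper bound, not a measurement.

```python
# Rough upper bound on how many triphones a recording set would need to cover.
# A triphone is a phoneme together with its left and right neighbours,
# so an inventory of n phonemes gives at most n * n * n combinations.

def triphone_count(num_phonemes: int) -> int:
    """Upper bound on distinct triphones for a given phoneme inventory."""
    return num_phonemes ** 3


def recordings_needed(num_phonemes: int, examples_per_triphone: int) -> int:
    """Total triphone instances needed if each one is heard a fixed number of times."""
    return triphone_count(num_phonemes) * examples_per_triphone


if __name__ == "__main__":
    n = 40  # assumed inventory size, not a figure from any specific system
    print(triphone_count(n))
    print(recordings_needed(n, examples_per_triphone=10))
```

Even before asking for several examples of each one, the count grows with the cube of the inventory size, which is why full coverage gets expensive so quickly.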

Google used their own datasets, which are most likely massive.