by thorum 902 days ago
My experience with other tools like xtts is that you really need a studio-quality voice sample to get the best results.

The most obvious problem to my ears is the syllable timing and inflection of the generated speech, and, intuitively, this doesn’t seem like a recording quality issue. It’s as if it did a mostly credible job of emulating the speaker trying to talk like a robot.
The biggest trip-up is the pronunciation of "prototypically", and you had "typically" in your original. Maybe it's overfitting to a stilted "proto-typically"? You could try a different, less similar sentence.
That might be the next big contribution: doing well at perceptually capturing the features of a not-so-good recording, for example one made with a webcam-style microphone.