Hacker News new | ask | show | jobs
by QuercusMax 229 days ago
Aren't we talking about the auditory quality of the generated vocals? I'm don't understand how you could possibly think the textual training data could possibly impact the perceived vocal strain (which are actually just artifacts) of the generated vocals.
1 comments

Don't they have models that do text-to-speech and maybe even audio/speech-to-text? If so, there is surely text in the datasets, otherwise I'm not sure how they'd accomplish something like that.