| HN Mirror



	by QuercusMax 229 days ago
	Aren't we talking about the auditory quality of the generated vocals? I don't understand how you could think the textual training data could impact the perceived vocal strain (which is actually just an artifact) of the generated vocals.

1 comment

embedding-shape 229 days ago

Don't they have models that do text-to-speech and maybe even audio/speech-to-text? If so, there is surely text in the datasets; otherwise I'm not sure how they'd accomplish something like that.
