| HN Mirror

	by vagabund 870 days ago
	They're generating the audio. They use a series of techniques to automatically generate metadata for the speech samples in LibriSpeech (accent, recording quality, pitch, speed, gender), then use an LLM to format these tags into comprehensive natural-language descriptions, which leads to a more tunable model at inference time. This metadata generation pipeline is the key insight: it's what speech datasets were missing, unlike e.g. image datasets, which have obviously seen more rapid success.
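	To make the tag-then-describe idea concrete, here's a rough sketch of what such a pipeline could look like. This is not the authors' code: every name, field, and threshold below is made up for illustration, the measurements are assumed to come from upstream tools that aren't shown, and the LLM step is reduced to just building the prompt it would receive.

```python
from dataclasses import dataclass


@dataclass
class SampleMeasurements:
    """Raw per-utterance measurements (all field names are hypothetical)."""
    mean_pitch_hz: float
    words_per_second: float
    snr_db: float
    gender: str
    accent: str


def bin_value(value, thresholds, labels):
    """Map a number to a label; thresholds are ascending upper bounds."""
    for upper, label in zip(thresholds, labels):
        if value < upper:
            return label
    # Value is at or above every threshold: use the last label.
    return labels[-1]


def tag_sample(m):
    """Turn measurements into discrete keyword tags (thresholds are invented)."""
    return {
        "accent": m.accent,
        "gender": m.gender,
        "pitch": bin_value(m.mean_pitch_hz, [140.0, 220.0],
                           ["low-pitched", "medium-pitched", "high-pitched"]),
        "speed": bin_value(m.words_per_second, [2.0, 3.5],
                           ["slowly", "at a moderate pace", "quickly"]),
        "recording quality": bin_value(m.snr_db, [15.0, 30.0],
                                       ["noisy", "fairly clean", "very clean"]),
    }


def build_description_prompt(tags):
    """Build the text an LLM would be asked to rewrite as one fluent sentence."""
    tag_lines = "\n".join(f"- {name}: {value}" for name, value in tags.items())
    return (
        "Write one natural sentence describing a speech recording "
        "with these attributes:\n" + tag_lines
    )


sample = SampleMeasurements(mean_pitch_hz=105.0, words_per_second=4.1,
                            snr_db=32.0, gender="male", accent="Irish")
tags = tag_sample(sample)
print(build_description_prompt(tags))
```

	The resulting description would be paired with the audio as its text label, which is what would let you steer those attributes with a plain-language prompt at inference time.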