by vagabund
870 days ago
They're generating the audio. They use a series of techniques to automatically generate metadata for the speech samples in LibriSpeech (accent, recording quality, pitch, speed, gender), then use an LLM to turn those tags into comprehensive natural-language descriptions, which makes the model more tunable at inference time. This metadata generation pipeline is the key insight: it supplies what speech datasets were missing compared with, e.g., image datasets, which have obviously seen more rapid success.
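To make the tag-then-describe idea concrete, here is a rough sketch of the two stages as I read them. Everything in it is an assumption for illustration, not the authors' code: the `Utterance` fields, the `bucket`, `tags_for`, and `build_prompt` helpers, the bin thresholds, and the prompt wording are all made up, and the actual LLM call is left out.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """Hypothetical per-sample measurements; field names are illustrative."""
    duration_s: float
    n_words: int
    mean_pitch_hz: float
    snr_db: float
    gender: str
    accent: str


def bucket(value, edges, labels):
    """Return the label of the first bin whose upper edge exceeds value.

    `labels` has one more entry than `edges`; the last label is the overflow bin.
    """
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]


def tags_for(u):
    """Turn raw measurements into coarse keyword tags (thresholds are invented)."""
    words_per_second = u.n_words / u.duration_s
    return {
        "gender": u.gender,
        "accent": u.accent,
        "pitch": bucket(u.mean_pitch_hz, [140.0, 220.0],
                        ["low-pitched", "medium-pitched", "high-pitched"]),
        "speed": bucket(words_per_second, [2.0, 3.5],
                        ["slowly", "at a moderate pace", "quickly"]),
        "recording quality": bucket(u.snr_db, [15.0, 30.0],
                                    ["noisy", "fairly clean", "very clean"]),
    }


def build_prompt(tags):
    """Format the tags into an instruction that an LLM would expand into a description."""
    listing = ", ".join(f"{name}: {value}" for name, value in tags.items())
    return ("Write one sentence describing a speech recording "
            "with these attributes: " + listing)


sample = Utterance(duration_s=4.0, n_words=12, mean_pitch_hz=250.0,
                   snr_db=35.0, gender="female", accent="Irish")
print(build_prompt(tags_for(sample)))
```

The point of binning first is that the LLM only has to rephrase a handful of discrete keywords, so the resulting descriptions stay tied to measurable properties of the audio.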