Hacker News new | ask | show | jobs
by vvolhejn 245 days ago
Author here. Speech-to-text is more or less solved, it's easy to automatically get captions including precise timestamps. For training Moshi, Kyutai's audio LLM, my colleagues used whisper-timestamped to transcribe 7 million hours of audio.

See Section 4.2 in the Moshi paper: https://arxiv.org/pdf/2410.00037

1 comments

Sweet!