by vvolhejn
245 days ago
Author here. Speech-to-text is more or less solved: it's easy to automatically get captions with precise timestamps. For training Moshi, Kyutai's audio LLM, my colleagues used whisper-timestamped to transcribe 7 million hours of audio. See Section 4.2 of the Moshi paper: https://arxiv.org/pdf/2410.00037