HN Mirror


	by vvolhejn 245 days ago
	Author here. Speech-to-text is more or less solved: it's easy to automatically get captions, including precise timestamps. For training Moshi, Kyutai's audio LLM, my colleagues used whisper-timestamped to transcribe 7 million hours of audio. See Section 4.2 in the Moshi paper: https://arxiv.org/pdf/2410.00037

1 comment

Sweet!