by vvolhejn
245 days ago
Author here. I think it's more of a capability issue than a safety issue. Since learning from audio is still harder than learning from text, audio models don't generalize as well. To compensate, audio models rely on combining information from text and audio (a single model consumes and produces both text and audio tokens), and the audio tokens basically end up acting as an integrated speech-to-text/text-to-speech. This reflects my colleagues' experience working on Moshi, and it seems to be the case for other models too; see the Conclusion section. Part of the reason can also be synthetic data: if you fine-tune on data generated from text via a text-to-speech system, the tone of voice doesn't carry any information, so the model learns to ignore it.