by mightytravels 936 days ago
I like how easy it is to get going, but you need to download about 20GB, and S2ST (speech-to-speech translation) needs 40GB of GPU RAM!

It runs but any audio input (you will need to provide wav not mp3's) I tried (tried 20s/40s/300s) I get just one short sentence returned in target language that seems not related at all to my audio input (i.e. Tous les humains sont créés égaux).

It seems like some default text, yet it runs at full GPU load for 10 minutes. There are tons of bug reports on GitHub as well.

Text translation works, but I'm not sure what the model's context length is. It seems short at first glance (I haven't looked into it).

Oh, and why is Whisper a dependency? It seems not needed if FB has their own model?


Hello, I work on seamless.

> It runs but any audio input (you will need to provide wav not mp3's) I tried (tried 20s/40s/300s) I get just one short sentence returned in target language that seems not related at all to my audio input (i.e. Tous les humains sont créés égaux).

You might want to open an issue on GitHub for that one. The model is made to work on short utterances; if you have a long speech, you'll want to segment it first. I tried "tous les humains sont créés égaux" on the demo, https://seamless.metademolab.com/expressive (which runs the same code as in the repo), and the output was correct. Maybe something is going wrong in the conversion of the input audio?
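For the segmentation step, a minimal sketch of fixed-length chunking with Python's standard `wave` module might look like the following. It is an illustration only, not code from the seamless repo: the `split_wav` and `describe_wav` helpers, the 10-second default, and the output file names are all assumptions. Splitting on silence or with a voice-activity detector would likely give cleaner utterance boundaries than cutting at fixed offsets.

```python
import wave
from pathlib import Path


def describe_wav(src):
    """Return (channels, sample_width_bytes, sample_rate_hz, duration_seconds)."""
    with wave.open(str(src), "rb") as reader:
        rate = reader.getframerate()
        return (
            reader.getnchannels(),
            reader.getsampwidth(),
            rate,
            reader.getnframes() / rate,
        )


def split_wav(src, out_dir, chunk_seconds=10.0):
    """Split a PCM WAV file into consecutive fixed-length chunks.

    Hypothetical helper: cuts at fixed offsets, so a word may be split
    across two chunks. The last chunk may be shorter than chunk_seconds.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    with wave.open(str(src), "rb") as reader:
        params = reader.getparams()
        frames_per_chunk = int(params.framerate * chunk_seconds)
        if frames_per_chunk <= 0:
            raise ValueError("chunk_seconds is too small")
        index = 0
        while True:
            frames = reader.readframes(frames_per_chunk)
            if not frames:
                break
            path = out_dir / f"chunk_{index:04d}.wav"
            with wave.open(str(path), "wb") as writer:
                # Keep the source format; only the length changes.
                writer.setnchannels(params.nchannels)
                writer.setsampwidth(params.sampwidth)
                writer.setframerate(params.framerate)
                writer.writeframes(frames)
            paths.append(path)
            index += 1
    return paths
```

Calling `describe_wav("input.wav")` before splitting is a cheap way to check whether the conversion produced the channel count and sample rate you expected; each path returned by `split_wav("input.wav", "chunks")` can then be passed to the model separately.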

> Oh, and why is Whisper a dependency? It seems not needed if FB has their own model?

Whisper is a dependency as it's used as a baseline for evaluation. You can check out the paper for explanations.

I tried audio as short as 10s and it still returns something random. How short does the audio need to be? Text works fine, but I can't get audio-to-audio to work.