by mightytravels
936 days ago
I like how easy it is to get going, but you need to download about 20GB and s2st needs 40GB of GPU RAM! It runs, but for any audio input I tried (20s/40s/300s; you need to provide WAV, not MP3) I get just one short sentence back in the target language that seems unrelated to my audio input (i.e. Tous les humains sont créés égaux). It seems like some default text, yet it runs on full GPU for 10 minutes. Tons of bug reports on GitHub as well. Text translation works, but I'm not sure what the model's context length is. It seems short at first glance (haven't looked into it). Oh, and why is Whisper a dependency? Seems not needed if FB has their own model?
> It runs but any audio input (you will need to provide wav not mp3's) I tried (tried 20s/40s/300s) I get just one short sentence returned in target language that seems not related at all to my audio input (i.e. Tous les humains sont créés égaux).
You might want to open an issue on GitHub for that one. The model is made to work on short utterances, so if you have a long speech, you'll want to segment it first. I've tried "tous les humains sont créés égaux" on the demo: https://seamless.metademolab.com/expressive (which runs the same code as in the repo) and the output was correct. Maybe something is going wrong in the conversion of the input audio?
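To make the "segment it first" suggestion concrete, here is a rough sketch of cutting a long WAV into short chunks with nothing but Python's standard `wave` module. This is not code from the repo: `split_wav`, the `chunk_NNNN.wav` naming, and the 10-second default are my own placeholders, and I'm assuming uncompressed PCM WAV input.

```python
import wave
from pathlib import Path


def split_wav(src: str, out_dir: str, chunk_seconds: float = 10.0) -> list[str]:
    """Split a PCM WAV file into fixed-length chunks and return their paths.

    Hypothetical helper, not part of any library. Fixed-length cuts can
    land mid-word; splitting on silence would suit real speech better.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        frames_per_chunk = int(params.framerate * chunk_seconds)
        if frames_per_chunk <= 0:
            raise ValueError("chunk_seconds is too small")
        index = 0
        while True:
            # readframes returns fewer frames (or b"") at the end of the file
            frames = reader.readframes(frames_per_chunk)
            if not frames:
                break
            path = out / f"chunk_{index:04d}.wav"
            with wave.open(str(path), "wb") as writer:
                # Keep the source format so each chunk is a valid WAV on its own
                writer.setnchannels(params.nchannels)
                writer.setsampwidth(params.sampwidth)
                writer.setframerate(params.framerate)
                writer.writeframes(frames)
            paths.append(str(path))
            index += 1
    return paths
```

You would then run each chunk through the model separately and join the outputs. The sketch does not resample or downmix, so if the model expects a particular sample rate or mono input, that conversion still has to happen beforehand.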
> Oh, and why is Whisper a dependency? Seems not needed if FB has their own model?
Whisper is a dependency as it's used as a baseline for evaluation. You can check out the paper for explanations.