by refulgentis
889 days ago
If the author is around: amazing work!!! Multimodal from scratch :) I'm curious whether you have the test clip you used. I got to the end and was like, "wait... is that a good result? The words are completely different!" Then I re-read it a couple of times, scanning carefully for references to what the audio is. This quote[^1] makes me think the sample is music, which would explain why the end result is good: it's trying to describe a sound file of just music, not a sound file that is a spoken-word version of the "ground truth".
[^1] "For dataset, I chose MusicCaps. I did not see any convenient links to download processed/segmented audio files, so I wrote a small script to download the Youtube videos."
|
MusicCaps [1] is a dataset that pairs music audio clips with natural-language descriptions of each clip. The result is good, imo, because the trained model generated a description that shares features with the ground truth, even though the wording differs.
[1] https://huggingface.co/datasets/google/MusicCaps
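To make "shares features with the ground truth" a bit more concrete, here is a rough sketch of one way to compare two captions by their descriptive words instead of by exact wording. This is only an illustration and not how the author evaluated anything: the stopword list, the function names, and both captions are made up for the example, and Jaccard overlap is just one simple choice of measure.

```python
import re

# Hypothetical stopword list, just large enough for this illustration.
STOPWORDS = {"a", "an", "the", "and", "with", "of", "is", "in", "this", "it", "to", "over"}


def content_words(caption):
    """Lowercase the caption and return its non-stopword tokens as a set."""
    tokens = re.findall(r"[a-z]+", caption.lower())
    return {t for t in tokens if t not in STOPWORDS}


def feature_overlap(generated, reference):
    """Jaccard overlap of content words between two captions (0.0 to 1.0)."""
    gen = content_words(generated)
    ref = content_words(reference)
    if not gen and not ref:
        return 0.0
    return len(gen & ref) / len(gen | ref)


# Made-up captions, not taken from MusicCaps or from the post.
reference = "A slow acoustic guitar melody with soft female vocals"
generated = "Soft female vocals over a gentle acoustic guitar"

shared = content_words(generated) & content_words(reference)
score = feature_overlap(generated, reference)
print(sorted(shared), score)
```

The two captions share no long phrase and their word order is completely different, yet most of the descriptive words (instrument, vocal type, mood) overlap, which is the sense in which a generated description can be "good" for a music clip.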