by refulgentis 889 days ago
If author is around: amazing work!!! Multimodal from scratch :)

I'm curious whether you still have the test clip you used. I got to the end and was like "wait... is that a good result? The words are completely different!"

Then I re-read it a couple of times, scanning carefully for references to what the audio is.

This quote[^1] makes me think the sample is music, as that would explain why the end result is good -- it's trying to describe a sound file of just music, not a sound file that is a spoken word version of the "ground truth":

[^1] "For dataset, I chose MusicCaps. I did not see any convenient links to download processed/segmented audio files, so I wrote a small script to download the Youtube videos."

1 comment

Thanks for reading, and yes, you are right: the input audio files are clips of music.

MusicCaps [1] is a dataset of music clips paired with natural-language descriptions of each clip. The reason the result is good, imo, is that the trained model was able to generate a description that shares features with the ground truth, even though the wording is different.

[1] https://huggingface.co/datasets/google/MusicCaps
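
To make the "shares features with the ground truth" point concrete, here is a rough sketch of one way to check it: compare the content words of the two captions instead of their exact wording. This is not how the article evaluates its model. The function names, the stop-word list, and both example captions are made up for illustration, and plain word overlap is only a crude stand-in for a real captioning metric.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "the", "of", "with", "is", "in", "this", "to", "it"}


def content_words(caption: str) -> set[str]:
    """Lowercase the caption and return its alphabetic tokens minus stop words."""
    tokens = re.findall(r"[a-z]+", caption.lower())
    return {t for t in tokens if t not in STOP_WORDS}


def shared_features(generated: str, reference: str) -> set[str]:
    """Content words that appear in both captions."""
    return content_words(generated) & content_words(reference)


def jaccard(generated: str, reference: str) -> float:
    """Jaccard similarity of the two content-word sets (0.0 if both are empty)."""
    a, b = content_words(generated), content_words(reference)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up captions, not taken from MusicCaps or from the article.
reference = "A slow piano melody with soft female vocals"
generated = "Soft vocals sing over a slow, mellow piano"

print(sorted(shared_features(generated, reference)))
# ['piano', 'slow', 'soft', 'vocals']
```

The two sentences read quite differently, yet four of the reference's six content words show up in the generated caption, which is the sense in which a result like this can still be "good".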