by ppymou 882 days ago
Thanks for reading, and yes, you are right: the input audio consists of music clips.

MusicCaps [1] is a dataset of music clips paired with natural-language descriptions of each clip. The reason the result is good, imo, is that the trained model was able to generate a description that contains features of the ground truth.

[1] https://huggingface.co/datasets/google/MusicCaps
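
To make "contains features of the ground truth" a bit more concrete, here is a rough sketch of one way to check it: count how many content words of the reference caption also show up in the generated one. This is only an illustration, not how the model was evaluated; the `feature_recall` helper, the stopword list, and both captions are made up for the example and are not taken from MusicCaps.

```python
import re

# Hypothetical stopword list, just enough for this illustration.
STOPWORDS = {"a", "an", "the", "and", "of", "with", "is", "in", "on", "to", "this"}


def content_words(text: str) -> set:
    """Lowercase the text and return its set of non-stopword tokens."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS}


def feature_recall(generated: str, reference: str) -> float:
    """Fraction of the reference caption's content words found in the generated caption."""
    ref = content_words(reference)
    if not ref:
        return 0.0  # nothing to match against
    return len(ref & content_words(generated)) / len(ref)


# Made-up captions (assumption: not real MusicCaps entries).
reference = "A slow acoustic guitar melody with soft female vocals"
generated = "Soft female vocals over a slow guitar"

print(f"feature recall: {feature_recall(generated, reference):.2f}")
```

Word overlap is crude (it misses synonyms like "calm" vs. "mellow"), so treat it as a sanity check rather than a real metric.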