| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by ppymou 929 days ago

Thanks for reading and yes you are right, the input audios are clips of music;

MusicCaps [1] is a dataset containing pairs of music audio and natural language description of the clip; the reason why the result is good imo is because the trained model was able to generate a description with features of the ground truth

[1] https://huggingface.co/datasets/google/MusicCaps