by ppymou
882 days ago
Thanks for reading, and yes, you are right: the input audios are clips of music. MusicCaps [1] is a dataset of music audio clips, each paired with a natural-language description of the clip. IMO the result is good because the trained model was able to generate a description that includes features of the ground truth.

[1] https://huggingface.co/datasets/google/MusicCaps