
How would you use spectrogram diffs for training?

I'm not sure what the useful "subpatterns" of sound would be. In language modeling there are word-based and character-based models. Given enough text, an RNN can be trained on either, and I'm not sure which approach is better. For music, the closest equivalent of a word is (probably) a chord, and the closest equivalent of a character is (probably) a single note, but perhaps it should be something like a harmonic. I don't know.
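To make the word vs. character analogy concrete, here is a rough sketch of the two tokenizations on a toy chord progression. Everything in it is an assumption for illustration: the progression, the `chord_tokens` and `note_tokens` helpers, and the `|` separator are made up, and this is not how any particular model encodes music.

```python
# Hypothetical toy progression: each chord is a tuple of MIDI note numbers
# (C major, F major, G major, C major).
progression = [(60, 64, 67), (65, 69, 72), (67, 71, 74), (60, 64, 67)]

def chord_tokens(chords):
    """'Word-level' view: one token per chord."""
    return ["-".join(str(n) for n in chord) for chord in chords]

def note_tokens(chords, sep="|"):
    """'Character-level' view: one token per note, plus a chord separator."""
    tokens = []
    for chord in chords:
        tokens.extend(str(n) for n in chord)
        tokens.append(sep)
    return tokens

words = chord_tokens(progression)
chars = note_tokens(progression)

# Sequence length and vocabulary size for each granularity.
print(len(words), len(set(words)))
print(len(chars), len(set(chars)))
```

The chord-level sequence is shorter, but every distinct chord needs its own vocabulary entry. The note-level sequence is longer, but its vocabulary is bounded by the number of pitches plus the separator.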

Unlike a face, music is a sequence of sounds. It's closer to video than to an image. So we need to chop it up and encode each chunk.
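A minimal sketch of "chop it up and encode each chunk", assuming a plain short-time Fourier transform as the encoder (my assumption, not something from the paper). The frame length, hop size, and the synthetic 440 Hz test tone are arbitrary illustrative choices, and `frame_signal` and `encode_frames` are hypothetical helper names. The last line shows one possible reading of "spectrogram diffs": the change between consecutive frames.

```python
import numpy as np

def frame_signal(signal, frame_len=1024, hop=512):
    """Chop a 1-D signal into overlapping frames, one frame per row."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]

def encode_frames(frames):
    """Encode each frame as a log-magnitude spectrum (Hann-windowed FFT)."""
    window = np.hanning(frames.shape[1])
    spectrum = np.abs(np.fft.rfft(frames * window, axis=1))
    return np.log1p(spectrum)

# Synthetic input: one second of a 440 Hz sine at 16 kHz (illustrative only).
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 440.0 * t)

frames = frame_signal(signal)      # shape: (n_frames, frame_len)
spec = encode_frames(frames)       # shape: (n_frames, frame_len // 2 + 1)
spec_diff = np.diff(spec, axis=0)  # frame-to-frame change in the spectrogram

print(frames.shape, spec.shape, spec_diff.shape)
```

Each row of `spec` is one encoded chunk, so the result can be fed to a sequence model one row at a time. For a steady tone like this one, `spec_diff` stays close to zero; it only becomes informative when the sound changes.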

Ultimately, I believe that we just need a lot of data. Given enough data, we can train a model that is large enough to learn everything it needs in an end-to-end fashion. The primary achievement of the GPT-2 paper is training a big model on lots of data. In this work, it appears they only used a couple of the available MIDI datasets for training, which is probably not enough. Training on all available audio recordings (either raw or converted to a symbolic format) would probably be a game changer.