|
|
|
|
|
by pdxww
2609 days ago
|
|
Can we diff spectrograms to define the "distance" between two chunks of sound and use this measure to guide the ML learning process? Would it help to decompose sound into subpatterns with Fourier transform? Afaik, there is a similar technique for recognizing faces: a face picture is mapped to a "face vector". Yet this technique doesn't need the notion of "sequence of faces" to train the model. Can we use it to get "sound vectors"? |
|
I'm not sure what would be useful "subpatterns" of sound. In language modeling, there are word based, and character based models. Given enough text, an RNN can be trained on either, and I'm not sure which approach is better. For music the closest equivalent of a word is (probably) a chord, and the closest equivalent of a character is (probably) a single note, but perhaps it should be something like a harmonic, I don't know.
Unlike faces, music is a sequence (of sounds). It's closer to video than to an image. So we need to chop it up and to encode each chunk.
Ultimately, I believe that we just need a lot of data. Given enough data, we can train a model which is large enough to learn everything it needs in the end to end fashion. Primary achievement of GPT-2 paper is training a big model on lots of data. In this work, it appears they only used a couple of available midi datasets for training, which is probably not enough. Training on all available audio recordings (either raw, or converted to symbolic format) would probably be a game changer.