Hacker News new | ask | show | jobs
by pdxww 2610 days ago
That's fine. The goal is to map "similar" 1 second chunks to similar vectors. I'm sure this can be done and uniqueness of sound won't be a problem.
1 comments

Sure, we can probably find a way to map two similar chunks to two similar vectors. However, with 1:1 mapping the resulting vectors will be just as unique. That's a problem, because, if you recall, we want to predict the next unit of music based on the units the model has seen so far. Training a model for this task requires showing it sequences of encoded units of music (vectors), where we must have many examples of how a particular vector follows a combination of particular vectors. If most of our vectors are unique, we won't have enough examples to train the model. For example, showing the model multiple examples of a phrase "I'm going to [some verb]", it will eventually learn that "to" after "I'm going" is quite likely, that a verb is more likely after "to" than an adjective, etc. This wouldn't have happened if the model saw 'going' or 'to' only once during training.
Can we diff spectrograms to define the "distance" between two chunks of sound and use this measure to guide the ML learning process?

Would it help to decompose sound into subpatterns with Fourier transform?

Afaik, there is a similar technique for recognizing faces: a face picture is mapped to a "face vector". Yet this technique doesn't need the notion of "sequence of faces" to train the model. Can we use it to get "sound vectors"?

How would you use spectrogram diffs for training?

I'm not sure what would be useful "subpatterns" of sound. In language modeling, there are word based, and character based models. Given enough text, an RNN can be trained on either, and I'm not sure which approach is better. For music the closest equivalent of a word is (probably) a chord, and the closest equivalent of a character is (probably) a single note, but perhaps it should be something like a harmonic, I don't know.

Unlike faces, music is a sequence (of sounds). It's closer to video than to an image. So we need to chop it up and to encode each chunk.

Ultimately, I believe that we just need a lot of data. Given enough data, we can train a model which is large enough to learn everything it needs in the end to end fashion. Primary achievement of GPT-2 paper is training a big model on lots of data. In this work, it appears they only used a couple of available midi datasets for training, which is probably not enough. Training on all available audio recordings (either raw, or converted to symbolic format) would probably be a game changer.