|
We can think of an ML model that takes 1 second of sound as input and produces a fixed-length vector that describes this sound:

S[0..n] = the raw input, 48000 bytes per second of sound
F[1][k..k+48000] -> [0..255], maps 1 second of sound to a "sound vector".
F[2][k..k+96000] -> ..., same, but takes 2 seconds of sound as input

Now, instead of the raw input S, we can use the sequences F[1], F[2], etc. Supposedly, F[10] would detect patterns that change every 10 seconds. It's common in soundtracks to have a background "mood" melody that changes a bit every 10-15 seconds, then a louder and faster melody that changes every 5 seconds, and so on, up to very frequent patterns like F[0.2], as used in drum'n'bass and electronic music in general. This is how people compose music, I guess. Most electronic music can be decomposed into 5-6 patterns that repeat with almost mathematical precision. The artist only randomly changes the params of each layer during the soundtrack, e.g. layer #3, with a period of 7 seconds, slightly changes frequency for the next 20 seconds, etc. Masterpieces have the same multilayered structure, except that those subpatterns are more complex. |
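To make the F[1], F[2], ... idea concrete, here is a rough Python sketch of building one feature sequence per time scale from the raw bytes S. Everything in it is an assumption for illustration: `toy_encoder` is a hypothetical stand-in for the learned model (it only computes crude statistics, not a real "sound vector"), the names `feature_sequence` and `multiscale` are made up, and the windows are stepped without overlap, which the description above does not specify.

```python
from typing import Callable, Dict, List, Optional, Sequence

SAMPLE_RATE = 48000  # bytes per second of sound, as in S above

Encoder = Callable[[Sequence[int]], List[int]]


def toy_encoder(window: Sequence[int]) -> List[int]:
    """Hypothetical placeholder for the learned model F.

    Maps a window of byte samples (0..255) to a short vector of
    values in 0..255. A real model would be trained, not hand-written.
    """
    n = len(window)
    mean = sum(window) // n
    # Mean absolute deviation from the mean: a crude loudness measure.
    spread = sum(abs(x - mean) for x in window) // n
    return [mean, spread, min(window), max(window)]


def feature_sequence(
    samples: Sequence[int],
    seconds: float,
    encoder: Encoder = toy_encoder,
    hop_seconds: Optional[float] = None,
) -> List[List[int]]:
    """Build F[seconds]: encode each window of `seconds` seconds.

    Assumption: windows step by `hop_seconds`, defaulting to the
    window length (no overlap).
    """
    size = round(seconds * SAMPLE_RATE)
    hop = round((hop_seconds or seconds) * SAMPLE_RATE)
    return [
        encoder(samples[k:k + size])
        for k in range(0, len(samples) - size + 1, hop)
    ]


def multiscale(
    samples: Sequence[int],
    scales: Sequence[float] = (0.2, 1, 2, 10),
) -> Dict[float, List[List[int]]]:
    """Return one feature sequence per time scale, keyed by seconds."""
    return {s: feature_sequence(samples, s) for s in scales}


if __name__ == "__main__":
    silence = [128] * (4 * SAMPLE_RATE)  # 4 seconds of constant signal
    features = multiscale(silence, scales=(1, 2))
    # 4 windows of 1 s and 2 windows of 2 s
    print(len(features[1]), len(features[2]))  # prints: 4 2
```

Each scale yields a much shorter sequence than S, so a downstream model would see a few vectors per second instead of 48000 raw bytes.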
You mean like an autoencoder?
OK, assuming we have those sequences (F[1], F[2], F[10], etc.), how would you combine them to train the model?