|
|
|
|
|
by pdxww
2616 days ago
|
|
To illustrate more this idea, let's use soundtrack v=negh-3hi1vE on youtube. Such soundtracks consist of multiple more or less repeating patterns. The period of each pattern is different: some background pattern that sets the mood of the music may have a long period of tens of seconds. The primary pattern that's playing right now has a short period of 0.25 seconds, plays for a few seconds and then fades off. The idea is to split the soundtrack into 10 second chunks and map each chunk to a vector of a fixed size, say 128. The same thing we do with words. Now we have a sequence of shape (?, 128) that can be theoretically fed into a music generator and as long as we can map such vectors back to 10 second sound chunks, we can generate music. Then we introduce a similar sequence that splits the soundtrack into 5 second chunks. Then another sequence for 2.5 seconds chunks and so on. Now we have multiple sequences that we can feed to the generator. Currently we take 1/48000th second slices and map them to vectors, but that's about as good as trying to generate meaningful text by drawing it pixel by pixel (which we can surely do and the model will have 250 billion weights and take 2 million years to train on commodity hardware). |
|