| HN Mirror



	by pdxww 2616 days ago
	To illustrate this idea further, let's use the soundtrack v=negh-3hi1vE on YouTube. Such soundtracks consist of multiple, more or less repeating patterns, each with a different period: a background pattern that sets the mood of the music may have a long period of tens of seconds, while the primary pattern that's playing right now has a short period of 0.25 seconds, plays for a few seconds and then fades out. The idea is to split the soundtrack into 10-second chunks and map each chunk to a vector of a fixed size, say 128, the same way we do with words. Now we have a sequence of shape (?, 128) that can, in theory, be fed into a music generator, and as long as we can map such vectors back to 10-second sound chunks, we can generate music. Then we introduce a similar sequence that splits the soundtrack into 5-second chunks, then another for 2.5-second chunks, and so on. Now we have multiple sequences that we can feed to the generator. Currently we take 1/48000th-of-a-second slices and map them to vectors, but that's about as good as trying to generate meaningful text by drawing it pixel by pixel (which we can surely do, but the model will have 250 billion weights and take 2 million years to train on commodity hardware).
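
A rough sketch of the multi-scale chunking described above, in case it helps make the idea concrete. The function names `split_into_chunks` and `multi_scale_chunks` are made up for illustration, and dropping the trailing partial chunk is an assumption; the post does not say how to handle it.

```python
from typing import Dict, List, Sequence

SAMPLE_RATE = 48_000  # samples per second, the rate mentioned in the post


def split_into_chunks(
    samples: Sequence[float],
    chunk_seconds: float,
    sample_rate: int = SAMPLE_RATE,
) -> List[Sequence[float]]:
    """Split samples into non-overlapping chunks of chunk_seconds each.

    Assumption: a trailing partial chunk is dropped.
    """
    chunk_len = int(round(chunk_seconds * sample_rate))
    if chunk_len <= 0:
        raise ValueError("a chunk must contain at least one sample")
    n_chunks = len(samples) // chunk_len
    return [samples[i * chunk_len:(i + 1) * chunk_len] for i in range(n_chunks)]


def multi_scale_chunks(
    samples: Sequence[float],
    scales: Sequence[float] = (10.0, 5.0, 2.5),
    sample_rate: int = SAMPLE_RATE,
) -> Dict[float, List[Sequence[float]]]:
    """Build one chunk sequence per time scale (in seconds)."""
    return {s: split_into_chunks(samples, s, sample_rate) for s in scales}


if __name__ == "__main__":
    # 30 seconds of silence as a stand-in for a real soundtrack.
    track = [0.0] * (30 * SAMPLE_RATE)
    for scale, chunks in multi_scale_chunks(track).items():
        print(scale, len(chunks), len(chunks[0]))
```

Each chunk in each sequence would then be mapped to a fixed-size vector, giving one (?, 128) sequence per scale.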

1 comment

p1esk 2616 days ago

How would you map these chunks to vectors?

pdxww 2616 days ago

The same way we map words or entire pictures to vectors. We'd have another ML model that takes 1 second of sound as input (48000 1-byte numbers) and produces a vector of, say, 128 float32 numbers that "describes" this 1 second of sound.
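
To make the input and output shapes concrete, here is a minimal sketch of the interface such a model would have: 48000 one-byte samples in, 128 floats out. This is not a learned model. `embed_chunk` is a hypothetical, hand-crafted stand-in (per-frame RMS energy), and it assumes the 1-byte samples are unsigned 8-bit values centered at 128; a real encoder would be trained.

```python
import math
from typing import List, Sequence

SAMPLE_RATE = 48_000  # samples in one second of sound
EMBED_DIM = 128       # size of the output vector


def embed_chunk(samples: Sequence[int], dim: int = EMBED_DIM) -> List[float]:
    """Map a chunk of unsigned 8-bit samples to a vector of `dim` floats.

    Stand-in for a learned encoder: the chunk is cut into `dim` equal
    frames and each frame is summarized by its RMS amplitude.
    """
    if len(samples) == 0 or len(samples) % dim != 0:
        raise ValueError("chunk length must be a positive multiple of dim")
    frame_len = len(samples) // dim
    vector = []
    for i in range(dim):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        # Center unsigned 8-bit samples around zero and scale to [-1, 1).
        energy = sum(((s - 128) / 128.0) ** 2 for s in frame) / frame_len
        vector.append(math.sqrt(energy))
    return vector


if __name__ == "__main__":
    # One second of a synthetic 440 Hz tone as unsigned 8-bit samples.
    tone = [
        int(128 + 100 * math.sin(2 * math.pi * 440 * t / SAMPLE_RATE))
        for t in range(SAMPLE_RATE)
    ]
    vec = embed_chunk(tone)
    print(len(vec))
```

The point is only the signature: any encoder with this shape, learned or not, could be dropped into the chunk-to-vector step.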

p1esk 2615 days ago

What would be an equivalent of a word for music?

pdxww 2615 days ago

1 second of sound. Or a few seconds of sound.

p1esk 2615 days ago

This would rule out common mapping methods such as word2vec, because unlike words, the vast majority of 1-second chunks of audio would be unique (or would repeat only within a single recording).

pdxww 2615 days ago

That's fine. The goal is to map "similar" 1-second chunks to similar vectors. I'm sure this can be done, and the uniqueness of individual chunks won't be a problem.
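
One common way to say what "similar vectors" means is cosine similarity, sketched below. The thread does not name a metric, so choosing cosine similarity here is an assumption, as is returning 0.0 for an all-zero vector.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors.

    Values near 1.0 mean the vectors point the same way, 0.0 means they
    are orthogonal. Assumption: an all-zero vector yields 0.0.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy 3-dimensional vectors standing in for 128-dimensional chunk vectors.
    print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
    print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))
```

Under this metric, two chunks never need to be identical; their vectors only need to point in nearly the same direction.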
