| HN Mirror


	by pdxww 2615 days ago
Impressive. Would this model benefit from something like "dilated attention"? Instead of feeding it raw sound samples, we could split the input into 16 sec, 8 sec, 4 sec and so on slices, assign each slice a "sound vector" serving as a short description of that slice and let the generator take those sound vectors as input. This should, supposedly, give it global consistency in the output.

Now an unpopular opinion. I'm not an ML expert, so take my words with reasonable skepticism. This fancy GPT2 model diagram can impress the uninitiated, but we are initiated, right? There is really no science there and it's still the good old number grinder: an input of fixed size is passed through a big random pile of matrix multiplications and sigmoids and yields a fixed size output. We could technically replace this nice-looking GPT2 model with a flat stack of matmuls and tanhs with a ton of weights and, given enough powerful GPUs (that would cost tens of millions), train that model and get the same result. It just wouldn't give the impression of science.

How are these GPT2 models designed? By somewhat random experiments with the model structure. The key here is the GPU datacenter that can quickly evaluate the model on a huge dataset. The breakthrough would be achieving the same quality with very few weights.
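For concreteness, the "flat stack of matmuls and tanhs" can be written down in a few lines. This is only a sketch in plain Python: the layer widths, the seed and the function names are made up for illustration, the weights are random and untrained, and nothing here shows that such a stack would actually match GPT2.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Return a random weight matrix with n_out rows and n_in columns."""
    scale = 1.0 / math.sqrt(n_in)
    return [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]


def forward(layers, x):
    """Pass a fixed-size input through each matmul followed by tanh."""
    for weights in layers:
        x = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]
    return x


rng = random.Random(0)      # fixed seed so the run is repeatable
sizes = [16, 32, 32, 8]     # hypothetical layer widths, chosen only for illustration
layers = [make_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

output = forward(layers, [0.5] * sizes[0])
n_weights = sum(len(w) * len(w[0]) for w in layers)
print(len(output), n_weights)   # fixed-size output and the total weight count
```

The only knobs are the number of layers and their widths, which is the point of the comment: everything else is weights.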

1 comments

p1esk 2614 days ago

> Instead of feeding it raw sound samples, we could split the input into 16 sec, 8 sec, 4 sec and so on slices, assign each slice a "sound vector" serving as a short description of that slice and let the generator take those sound vectors as input.

I didn’t quite get it. How would you feed this variable-sized input?

pdxww 2614 days ago

To illustrate this idea further, take the soundtrack v=negh-3hi1vE on YouTube. Such soundtracks consist of multiple more or less repeating patterns. The period of each pattern is different: some background pattern that sets the mood of the music may have a long period of tens of seconds. The primary pattern that's playing right now has a short period of 0.25 seconds, plays for a few seconds and then fades out.

The idea is to split the soundtrack into 10 second chunks and map each chunk to a vector of a fixed size, say 128. This is the same thing we do with words. Now we have a sequence of shape (?, 128) that can in theory be fed into a music generator, and as long as we can map such vectors back to 10 second sound chunks, we can generate music. Then we introduce a similar sequence that splits the soundtrack into 5 second chunks. Then another sequence for 2.5 second chunks and so on. Now we have multiple sequences that we can feed to the generator.

Currently we take 1/48000 second slices and map them to vectors, but that's about as good as trying to generate meaningful text by drawing it pixel by pixel (which we can surely do, and the model will have 250 billion weights and take 2 million years to train on commodity hardware).
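A rough sketch of the chunking step, in plain Python. The function name and the 30 second silent track are stand-ins invented for the example, and dropping a short tail is an assumption; the mapping from a chunk to a 128-number vector is not shown.

```python
def split_into_chunks(samples, sample_rate, chunk_seconds):
    """Split samples into consecutive non-overlapping chunks; drop a short tail."""
    size = int(sample_rate * chunk_seconds)
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]


SAMPLE_RATE = 48000                  # samples per second, as in the comment
track = bytes(SAMPLE_RATE * 30)      # 30 seconds of silence as a stand-in track

for seconds in (10, 5, 2.5):
    chunks = split_into_chunks(track, SAMPLE_RATE, seconds)
    # After encoding each chunk to 128 numbers the shape would be (len(chunks), 128).
    print(seconds, len(chunks), len(chunks[0]))
```

For this 30 second track the three sequences have 3, 6 and 12 entries, compared with 1,440,000 raw samples.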

p1esk 2614 days ago

How would you map these chunks to vectors?

pdxww 2614 days ago

The same way we map words to vectors or entire pictures to vectors. We'll have another ML model that takes 1 second of sound as input (48000 1-byte numbers) and produces a vector of, say, 128 float32 numbers that "describes" this 1 second of sound.
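To make the input and output shapes concrete, here is a sketch in which an untrained random projection stands in for that learned model. Everything named here (make_encoder, encode, the seed, the toy input length) is hypothetical; a real encoder would be trained and would take 48000 bytes.

```python
import math
import random


def make_encoder(input_len, dim=128, seed=0):
    """Build a fixed random projection from input_len bytes to dim floats."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(input_len)
    weights = [[rng.gauss(0.0, scale) for _ in range(input_len)] for _ in range(dim)]

    def encode(chunk):
        if len(chunk) != input_len:
            raise ValueError("chunk must have exactly input_len samples")
        # Centre 8-bit samples (0..255) around zero and scale to roughly -1..1.
        x = [(b - 128) / 128.0 for b in chunk]
        return [sum(w * v for w, v in zip(row, x)) for row in weights]

    return encode


# Toy size so the pure-Python example runs quickly; real input would be 48000 bytes.
INPUT_LEN = 480
encode = make_encoder(INPUT_LEN)
second_of_sound = bytes((i * 7) % 256 for i in range(INPUT_LEN))
vector = encode(second_of_sound)
print(len(vector))   # one fixed-size "sound vector" per chunk
```

Running encode over every chunk of a track gives the (?, 128) sequence described earlier in the thread.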

p1esk 2613 days ago

What would be an equivalent of a word for music?

pdxww 2613 days ago

1 second of sound. Or a few seconds of sound.

uuwp 2614 days ago

The same way we feed a variable-size sequence of characters or sound samples into this RNN. Instead of raw samples at the 16 kHz rate, we'll have one sequence of 1 sample per second, another sequence of 1 sample per 0.5 seconds and so on. We can go as far as 1 sample per 1/48000 sec, but I don't think it's practical (although this is what these music generators do).

p1esk 2614 days ago

What do you mean by “sample” when you say “sequence of 1 sample per second”?

pdxww 2614 days ago

We can think of an ML model that takes 1 second of sound as input and produces a vector of fixed length that describes this sound:

  S[0..n] = the raw input, 48000 bytes per second of sound
  F[1][k..k+48000] -> [0..255], maps 1 second of sound to a "sound vector".
  F[2][k..k+96000] -> ..., same, but takes 2 seconds of sound as input

Now instead of the raw input S, we can use the sequences F[1], F[2], etc. Supposedly, F[10] would detect patterns that change every 10 seconds. It's common in soundtracks to have some background "mood" melody that changes a bit every 10-15 seconds, then a louder, faster melody that changes every 5 seconds and so on, down to some very frequent patterns like F[0.2] that are used in drum'n'bass or electronic music in general.

This is how music is composed by people, I guess. Most electronic music can be decomposed into 5-6 patterns that repeat with almost mathematical precision. The artist only randomly changes the params of each layer during the soundtrack, e.g. layer #3 with a period of 7 seconds slightly changes frequency for the next 20 seconds, etc.

Masterpieces have the same multilayered structure, except that those subpatterns are more complex.

p1esk 2613 days ago

> We can think of an ML model that takes 1 second of sound as input and produces a vector of fixed length that describes this sound

You mean like an autoencoder?

Ok, assuming we have those sequences (F1, F2, F10, etc), how would you combine them to train the model?

pdxww 2613 days ago

I'm not an ML guy, so I can't say whether this is an autoencoder.

We can combine multiple sequences in any way we want. Obviously, we can come up with some nice-looking "tower of LSTMs" where each level of that tower processes the corresponding F[i] sequence: sequence F1 goes to level T1, which is a bunch of LSTMs; then F2 and the output of T1 go to T2 and so on. The only things that I think matter are (1) feeding all these sequences to the model and (2) having enough weights in the model. And obviously a big GPU farm to run experiments.
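A sketch of how such a tower could be wired, with several assumptions stated up front: a plain tanh recurrent cell is swapped in for the LSTM to keep the example dependency-free, the weights are random and untrained, the scales are F1, F2, F4 so that each divides the next, and aligning T1's output with F2 by taking the state at the end of each coarser window is a guess, not something stated in the thread. All names and sizes are hypothetical.

```python
import math
import random


class TanhCell:
    """A plain tanh recurrent cell, used here as a simpler stand-in for an LSTM."""

    def __init__(self, n_in, n_hidden, rng):
        scale = 1.0 / math.sqrt(n_in + n_hidden)
        self.n_hidden = n_hidden
        self.w = [[rng.uniform(-scale, scale) for _ in range(n_in + n_hidden)]
                  for _ in range(n_hidden)]

    def run(self, inputs):
        """Return the hidden state after each input vector in the sequence."""
        h = [0.0] * self.n_hidden
        states = []
        for x in inputs:
            xh = list(x) + h
            h = [math.tanh(sum(w * v for w, v in zip(row, xh))) for row in self.w]
            states.append(h)
        return states


def tower(sequences, strides, dim, n_hidden, rng):
    """Level i reads sequence i concatenated with the level below, downsampled."""
    below = None
    for i, seq in enumerate(sequences):
        if below is None:
            inputs, n_in = seq, dim
        else:
            step = strides[i] // strides[i - 1]
            # Take the lower level's state at the end of each coarser window.
            picked = [below[(j + 1) * step - 1] for j in range(len(seq))]
            inputs = [list(x) + p for x, p in zip(seq, picked)]
            n_in = dim + n_hidden
        below = TanhCell(n_in, n_hidden, rng).run(inputs)
    return below


rng = random.Random(0)
DIM, HIDDEN = 4, 6          # toy sizes; the thread suggests 128-number vectors
strides = [1, 2, 4]         # seconds covered by each vector in F1, F2, F4
total_seconds = 8
sequences = [[[rng.random() for _ in range(DIM)] for _ in range(total_seconds // s)]
             for s in strides]
top = tower(sequences, strides, DIM, HIDDEN, rng)
print(len(top), len(top[0]))   # one state per top-level step, HIDDEN numbers each
```

The top level emits one state per 4 second window here; a generator would still need a decoder on top, which this sketch leaves out.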
