by pdxww
2615 days ago
Impressive. Would this model benefit from something like "dilated attention"? Instead of feeding it raw sound samples, we could split the input into slices of 16 sec, 8 sec, 4 sec and so on, assign each slice a "sound vector" that serves as a short description of that slice, and let the generator take those sound vectors as input. This should, in theory, give the output global consistency.

Now an unpopular opinion. I'm not an ML expert, so take my words with reasonable skepticism. This fancy GPT2 model diagram can impress the uninitiated, but we are initiated, right? There is really no science there and it's still the good old number grinder: an input of fixed size is passed thru a big random pile of matrix multiplications and sigmoids and yields a fixed-size output. We could technically replace this nice-looking GPT2 model with a flat stack of matmuls and tanhs with a ton of weights and, given enough powerful GPUs (that would cost tens of millions), train that model and get the same result. It just wouldn't give the impression of science.

How are these GPT2 models designed? By somewhat random experiments with the model structure. The key here is the GPU datacenter that can quickly evaluate the model on a huge dataset. The breakthru would be achieving the same quality with very few weights.
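To make the slicing idea from the start of this comment a bit more concrete, here is a rough sketch of one way it could be read. Everything in it is an assumption for illustration: the names `sound_vector` and `multiscale_context` are made up, the "sound vector" is just average-pooled magnitude standing in for whatever learned encoder would really be used, and the slices are taken from the end of the audio seen so far.

```python
import numpy as np

def sound_vector(audio_slice, dim=8):
    """Hypothetical stand-in for a learned encoder.

    Splits the slice into `dim` nearly equal chunks and returns the
    mean absolute amplitude of each chunk.
    """
    chunks = np.array_split(np.abs(audio_slice), dim)
    return np.array([chunk.mean() for chunk in chunks])

def multiscale_context(samples, sample_rate, windows_sec=(16, 8, 4), dim=8):
    """Return one fixed-size vector per time scale.

    Each window covers the last `sec` seconds of `samples`
    (or all of it, if less audio is available).
    """
    vectors = []
    for sec in windows_sec:
        n = sec * sample_rate
        tail = samples[-n:]
        vectors.append(sound_vector(tail, dim))
    return np.stack(vectors)  # shape: (len(windows_sec), dim)

# Toy data: 20 seconds of a sine wave at an artificially low sample rate.
rate = 100
audio = np.sin(np.linspace(0, 200 * np.pi, 20 * rate))

context = multiscale_context(audio, rate)
short_context = multiscale_context(audio[: 5 * rate], rate)
print(context.shape, short_context.shape)
```

In this sketch the result has the same shape whatever the length of the audio, since each time scale is squeezed into `dim` numbers. Whether vectors like these would actually help the generator stay consistent is exactly the open question.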
|
I didn’t quite get it. How would you feed it this variable-sized input?