by p1esk
2611 days ago
> We can think of a ML model that takes 1 second of sound as input and produces a vector of fixed length that describes this sound

You mean like an autoencoder? OK, assuming we have those sequences (F1, F2, F10, etc.), how would you combine them to train the model?
|
We can combine multiple sequences in any way we want. For example, we can come up with a nice-looking "tower of LSTMs" where each level of the tower processes the corresponding F[i] sequence: sequence F1 goes to level T1, which is a bunch of LSTMs; then F2 and the output of T1 go to T2, and so on. The only things that I think matter are (1) feeding all these sequences to the model and (2) having enough weights in the model. Plus, of course, a big GPU farm to run experiments.
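
To make the wiring concrete, here is a rough forward-pass-only sketch of such a tower in plain Python. Everything in it is an assumption for illustration: the names `LSTMLevel` and `run_tower` are made up, each level is a single LSTM layer with random untrained weights, and "the output of T1" is taken to be T1's final hidden state, appended to every step of F2. Other choices (per-step outputs, several LSTMs per level) would fit the description just as well.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class LSTMLevel:
    """One tower level: a single LSTM layer with random, untrained weights."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        n_in = input_size + hidden_size
        # One weight matrix and bias per gate:
        # i = input gate, f = forget gate, g = cell candidate, o = output gate.
        self.w = {g: [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                      for _ in range(hidden_size)] for g in "ifgo"}
        self.b = {g: [0.0] * hidden_size for g in "ifgo"}

    def _gate(self, name, z):
        # Affine map of the concatenated [input, previous hidden] vector.
        return [sum(w * v for w, v in zip(row, z)) + b
                for row, b in zip(self.w[name], self.b[name])]

    def forward(self, sequence):
        """Run over a list of input vectors; return the final hidden state."""
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        for x in sequence:
            z = list(x) + h
            i = [sigmoid(v) for v in self._gate("i", z)]
            f = [sigmoid(v) for v in self._gate("f", z)]
            g = [math.tanh(v) for v in self._gate("g", z)]
            o = [sigmoid(v) for v in self._gate("o", z)]
            c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
            h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
        return h


def run_tower(sequences, hidden_size=8, seed=0):
    """Feed F1 to T1, then F2 plus T1's output to T2, and so on."""
    rng = random.Random(seed)
    summary = []  # output of the previous level (empty for T1)
    for seq in sequences:
        input_size = len(seq[0]) + len(summary)
        level = LSTMLevel(input_size, hidden_size, rng)
        # Append the previous level's output to every step of this sequence.
        summary = level.forward([list(x) + summary for x in seq])
    return summary


if __name__ == "__main__":
    data_rng = random.Random(1)
    # Made-up stand-ins for F1, F2, F10: sequences of 4-dim vectors
    # that get shorter as the time scale gets coarser.
    f1 = [[data_rng.random() for _ in range(4)] for _ in range(20)]
    f2 = [[data_rng.random() for _ in range(4)] for _ in range(10)]
    f10 = [[data_rng.random() for _ in range(4)] for _ in range(2)]
    print(run_tower([f1, f2, f10]))
```

Because each level only receives a fixed-size vector from the level below, the sequences do not have to have the same length. Training is left out entirely; in practice you would build this in a framework with autograd and put whatever loss you care about on top of the last level.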