by hadsed
2490 days ago
If you think a longer context length might be helpful, consider stacking convolutions to give higher units a bigger receptive field, or try a convolutional LSTM. If that helps and you have a further argument for why an even larger context window would be helpful, then perhaps try attention, in which case a Transformer would be reasonable. But your stacked conv net would be the fastest and most obvious thing that should work (with the caveat that I know nothing else about your data and its characteristics, which is a really big caveat). Consider looking at your errors and judging whether they stem from things your current model doesn't do well but that Transformers do, i.e., correlating two steps in a sequence that are a large number of time steps apart. Attention is basically a memory module, so if you don't need that, it's just a waste of compute resources.
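To put rough numbers on the "stack convolutions for a bigger receptive field" suggestion, here is a minimal sketch that computes how many input time steps one output unit can see after a stack of 1D conv layers. The helper name `receptive_field` and the layer configurations are made up for illustration, and dilation is an extra knob I'm assuming you might want to try; it isn't something the advice above depends on.

```python
def receptive_field(layers):
    """Receptive field (in input time steps) of a stack of 1D conv layers.

    `layers` is a list of (kernel_size, stride, dilation) tuples, ordered
    from the input side to the output side. Hypothetical helper, not part
    of any library.
    """
    rf = 1    # a single input step sees only itself
    jump = 1  # spacing, in input steps, between adjacent units at this depth
    for kernel_size, stride, dilation in layers:
        # Each layer widens the view by (k - 1) * dilation units of the
        # layer below, and those units are `jump` input steps apart.
        rf += (kernel_size - 1) * dilation * jump
        jump *= stride
    return rf


# Four plain kernel-3 layers: the receptive field grows linearly with depth.
plain = [(3, 1, 1)] * 4
# Same depth with doubling dilation: it grows roughly exponentially.
dilated = [(3, 1, d) for d in (1, 2, 4, 8)]

print(receptive_field(plain))    # 9
print(receptive_field(dilated))  # 31
```

If the dependencies you care about are further apart than what a reasonably deep stack can cover, that would be the kind of argument for trying attention.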
> Attention is basically a memory module, so if you don't need that it's just a waste of compute resources.
But aren't CNNs also a kind of memory module (i.e., they memorize what leopard skin looks like)? I guess attention is a more sophisticated kind of memory, "more dynamic" so to speak.
Anyway, I'm glad to hear that a Transformer architecture isn't totally stupid for my task. I will look up the literature; there seems to be a bit on this matter.
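For what it's worth, here is one way to see the "more dynamic" part: a minimal NumPy sketch of scaled dot-product self-attention, assuming no learned projections, no masking, and a single head (so it is a simplification, not a full Transformer layer). The mixing weights are recomputed from the input each time, whereas a conv kernel's weights are fixed after training. The names `softmax`, `attention`, `T`, and `d` are just illustrative.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention over a single sequence.

    q, k, v: arrays of shape (T, d). Returns the output of shape (T, d)
    and the (T, T) weight matrix.
    """
    scores = q @ k.T / np.sqrt(k.shape[-1])  # similarity of every step to every other step
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
T, d = 6, 4  # sequence length and feature size, chosen arbitrarily
x = rng.normal(size=(T, d))

# Self-attention: the sequence queries itself, so step 0 can read from
# step 5 directly, however far apart they are.
out, weights = attention(x, x, x)
print(out.shape, weights.shape)
```

Feeding in a different sequence gives a different `weights` matrix, which is the "lookup" behaviour a fixed convolution filter doesn't have.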