|
|
|
|
|
by gwern
2699 days ago
|
|
Convolutions in a hierarchy of layers, especially with dilated convolutions, provide long-range connections between inputs (handwavily logarithmic). In an RNN, they are separated by however many steps in a linear way, so gradients more easily vanish. Some paper which I do not recall examined them side by side and found that RNNs quickly forget inputs, even with LSTMs, and this means their theoretically unlimited long-range connections between inputs via their hidden state don't wind up being that useful. |
|
So I don't think RNNs are fundamentally more myopic than CNNs (just that there may be practical advantages to using the latter)
Hierarchical RNNs, Clockwork RNNs and Hierarchical Multiscale RNNs and probably others are doing things of this nature.