by visarga
975 days ago
Vanishing gradients were an issue for non-residual deep networks and vanilla RNNs. The long-context memory issues, though, are along the sequence dimension, not network depth. The problem could be some kind of instability in attention as it scales above 10k tokens. A recent paper suggests the attention mechanism needs a default value (a "sink"), and that its absence produces instability. https://arxiv.org/abs/2309.17453 Another paper says the middle of the context is lossy, while the beginning and end are better attended.
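To make the "default value" idea concrete, here is a rough sketch, not the linked paper's implementation. Plain softmax forces the weights to sum to 1, so even when no key matches the query the attention has to land somewhere. Adding one extra slot with a fixed logit gives it a place to go. The names `softmax_with_sink` and `sink_logit` are made up for illustration, and a fixed zero logit is just one possible choice of sink.

```python
import math

def softmax(scores):
    # Standard softmax: the weights always sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_with_sink(scores, sink_logit=0.0):
    # Hypothetical variant: one extra "sink" slot competes in the
    # normalization, but its weight is discarded afterwards, so the
    # weights on the real tokens can sum to less than 1.
    m = max(max(scores), sink_logit)
    exps = [math.exp(s - m) for s in scores]
    sink = math.exp(sink_logit - m)
    total = sum(exps) + sink
    return [e / total for e in exps]

# A query that matches none of the keys well (all logits strongly negative).
scores = [-8.0, -9.0, -7.5]
plain = softmax(scores)
sunk = softmax_with_sink(scores)

print("plain:", [round(w, 4) for w in plain], "sum =", round(sum(plain), 4))
print("sink: ", [round(w, 4) for w in sunk], "sum =", round(sum(sunk), 4))
```

With plain softmax the three poorly matching tokens still split all of the attention between them. With the sink, almost all of the mass goes to the default slot and the real tokens get close to nothing.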