Hacker News new | ask | show | jobs
by visarga 975 days ago
Vanishing gradient was an issue for non-residual deep networks and vanilla RNNs. While the long context memory issues are along sequence dimension, not network depth.

The problem could be some kind of instability of attention as it scales above 10k tokens. A recent paper suggests attention mechanism needs a default value (a "sink"), and its absence produces instability.

https://arxiv.org/abs/2309.17453

Another paper says the middle part is lossy while the beginning and end are better attended.

2 comments

For anyone who's curious, the paper in question, entitled, "Lost in the Middle: How Language Models Use Long Contexts" (https://arxiv.org/abs/2307.03172)
That's a really recent paper. Do you actually keep up to date with everything? How do you find the time?
Just reading a couple papers every day, the most interesting ones, and following up on reddit and twitter to get notified what people are talking about. And I am directly interested in long-context LLMs for a work related task.

I have also been dabbling with neural nets (pre-transformer), especially LSTM which have a "residual" connection, the one I was mentioning. That makes gradients better behaved. Schmidhuber tech.

Not to denigrate the person you’re responding to, but to add some context: That paper got a decent amount of attention already. Probably one of the more notable in the literature over the last month. Plus compared to the past year everything is slow now.