


	by kristopolous 975 days ago
	I'm still very much learning this stuff, but I wonder if that's related to the vanishing gradient problem, which seems to be a fundamental aspect of these types of approaches. (Please don't assume that's correct) https://en.wikipedia.org/wiki/Vanishing_gradient_problem

2 comments

visarga 975 days ago

Vanishing gradients were an issue for non-residual deep networks and vanilla RNNs, whereas the long-context memory issues are along the sequence dimension, not network depth.

The problem could be some kind of instability of attention as it scales above 10k tokens. A recent paper suggests the attention mechanism needs a default value (a "sink"), and that its absence produces instability.

https://arxiv.org/abs/2309.17453
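A toy sketch of why a "sink" might matter, under stated assumptions: softmax weights must sum to 1, so when no key matches the query, the attention mass still has to land somewhere. The function `softmax_with_sink` and the fixed `sink_score` of 0.0 are hypothetical illustrations, not the method from the linked paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_with_sink(scores, sink_score=0.0):
    """Softmax with one extra 'sink' slot (hypothetical helper).

    Returns (sink_weight, weights_over_real_keys). The sink gives the
    query a default place to put attention mass.
    """
    weights = softmax([sink_score] + list(scores))
    return weights[0], weights[1:]

# Assumed example: a query that matches none of the keys well.
scores = [-8.0, -9.0, -8.5]

plain = softmax(scores)                    # forced to spread mass over poor matches
sink_w, rest = softmax_with_sink(scores)   # most mass goes to the sink instead

print("plain:", [round(p, 3) for p in plain])
print("sink weight:", round(sink_w, 4), "mass on real keys:", round(sum(rest), 4))
```

Without the sink, the least-bad key still receives about half of the attention even though every score is low. With the sink, nearly all of the mass is absorbed by the default slot.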

Another paper says the middle part is lossy while the beginning and end are better attended.


sandkoan 975 days ago

For anyone who's curious, the paper in question is "Lost in the Middle: How Language Models Use Long Contexts" (https://arxiv.org/abs/2307.03172).


kristopolous 975 days ago

That's a really recent paper. Do you actually keep up to date with everything? How do you find the time?


visarga 975 days ago

Just reading a couple of papers every day, the most interesting ones, and following reddit and twitter to see what people are talking about. And I am directly interested in long-context LLMs for a work-related task.

I have also been dabbling with neural nets (pre-transformer), especially LSTMs, which have a "residual" connection, the one I was mentioning. That makes gradients better behaved. Schmidhuber tech.
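A rough scalar sketch of the "better behaved gradients" point, assuming made-up numbers: in a vanilla RNN the gradient through time is a product of (recurrent weight × tanh slope) factors, while along the LSTM cell state's direct path it is a product of forget-gate values, which can sit close to 1. The constants below (0.9, 0.6, 0.99) are assumptions chosen for illustration, and real networks use matrices and input-dependent gates.

```python
def vanilla_rnn_gradient(steps, recurrent_weight=0.9, tanh_slope=0.6):
    """Gradient magnitude after backpropagating `steps` time steps.

    Each step multiplies by recurrent_weight * tanh'(pre-activation);
    tanh_slope stands in for a typical derivative value (assumed).
    """
    grad = 1.0
    for _ in range(steps):
        grad *= recurrent_weight * tanh_slope
    return grad

def lstm_cell_gradient(steps, forget_gate=0.99):
    """Gradient along the direct cell-state path of an LSTM.

    The cell update c_t = f_t * c_{t-1} + i_t * g_t is additive, so the
    direct-path derivative per step is just the forget gate f_t
    (held constant here for simplicity).
    """
    grad = 1.0
    for _ in range(steps):
        grad *= forget_gate
    return grad

for steps in (10, 50, 100):
    print(steps, vanilla_rnn_gradient(steps), lstm_cell_gradient(steps))
```

With these assumed values the vanilla RNN gradient is effectively zero after 100 steps, while the cell-state path retains roughly a third of its magnitude.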


totoglazer 975 days ago

Not to denigrate the person you’re responding to, but to add some context: that paper got a decent amount of attention already. It's probably one of the more notable in the literature over the last month. Plus, compared to the past year, everything is slow now.


amelius 975 days ago

Regarding the vanishing gradient problem, has anyone tried to train using only a randomly chosen set of independent parameters in each iteration? (Updating only the weights in a small random independent set).
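One possible reading of the idea, sketched on a toy problem: at each iteration compute the full gradient but apply it only to a random subset of `k` parameters. This resembles randomized coordinate descent. The helper names, the quadratic loss, and the interpretation of "independent set" as a plain random subset are all assumptions, and nothing here shows whether it helps with vanishing gradients.

```python
import random

def quadratic_grads(weights, targets):
    """Gradient of sum((w - t)^2) with respect to each weight."""
    return [2.0 * (w - t) for w, t in zip(weights, targets)]

def masked_sgd_step(weights, grads, lr, k, rng):
    """Apply an SGD update to k randomly chosen parameters only.

    Returns the new weight list and the indices that were updated.
    """
    chosen = rng.sample(range(len(weights)), k)
    updated = list(weights)
    for i in chosen:
        updated[i] -= lr * grads[i]
    return updated, chosen

rng = random.Random(0)            # fixed seed so the run is repeatable
targets = [1.0, -2.0, 0.5, 3.0]   # toy optimum (assumed)
weights = [0.0] * len(targets)

for _ in range(200):
    grads = quadratic_grads(weights, targets)
    weights, _ = masked_sgd_step(weights, grads, lr=0.1, k=2, rng=rng)

print([round(w, 4) for w in weights])
```

On this toy loss the weights still converge to the targets, just more slowly than when every parameter is updated at each step.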


jdthedisciple 975 days ago

Are you referring to Regularization?

https://www.kaggle.com/code/sid321axn/regularization-techniq...
