| HN Mirror



	by lumost 1062 days ago
	This is a super interesting paper, but in my opinion they do not complete their core claim of offering a generalized mathematical structure for attention/recurrence. The specific structure they propose is very interesting and demonstrably efficient computationally - however they do not show that this approach produces similar accuracy as large LLMs. I'm anxiously awaiting the follow-up where someone tries spending 1MM+ on demonstrating this approach's effectiveness in a large language model context.


whimsicalism 1062 days ago

> however they do not show that this approach produces similar accuracy as large LLMs.

I think they have demonstrated their case pretty well, unless there is some serious degradation of the scaling - 7B is pretty big.


turingfeel 1062 days ago

Interestingly, I did see this tweet [0] mentioning a phase shift that occurs in transformers at exactly the scale RetNet stopped at. Probably simply coincidental but I was previously unaware of this phenomenon at such a scale in transformers.

[0] https://twitter.com/gordic_aleksa/status/1682479676910870529


whimsicalism 1062 days ago

Tim Dettmers is such a resource, cheers for this
