by lumost 1062 days ago
This is a super interesting paper, but in my opinion they do not complete their core claim of offering a generalized mathematical structure for attention/recurrence. The specific structure they propose is very interesting and demonstrably efficient computationally - however they do not show that this approach produces similar accuracy as large LLMs.

I’m anxiously awaiting the follow-up where someone tries spending 1MM+ on demonstrating this approach's effectiveness in a large language model context.

1 comments

> however they do not show that this approach produces similar accuracy as large LLMs.

I think they have demonstrated their case pretty well, unless there is some serious degradation in scaling beyond that point; 7B is pretty big.

Interestingly, I did see this tweet [0] mentioning a phase shift that occurs in transformers at exactly the scale RetNet stopped at. It is probably just a coincidence, but I was previously unaware of this phenomenon at such a scale in transformers.

[0] https://twitter.com/gordic_aleksa/status/1682479676910870529

Tim Dettmers is such a resource, cheers for this.