by lumost
1062 days ago
This is a super interesting paper, but in my opinion they do not fully support their core claim of offering a generalized mathematical structure for attention/recurrence. The specific structure they propose is very interesting and demonstrably efficient computationally - however, they do not show that this approach achieves accuracy similar to that of large LLMs. I'm anxiously awaiting the follow-up where someone tries spending 1MM+ on demonstrating this approach's effectiveness in a large language model context.
I think they have demonstrated their case pretty well, unless there is some serious degradation in scaling beyond this point - 7B is pretty big.