|
|
|
|
|
by logicchains
130 days ago
|
|
It can't be successful at that any more than 1+1 can equal 3. Fundamentally, if every token wants to be able to look at every previous token without loss of information, it must be O(n^2); N tokens looking at N tokens is quadratic. Any sub-quadratic attention must hence necessarily lose some information and be unable to support perfect recall on longer sequences. |
|
Convolving two arrays can be done perfectly accurately in O(n log n), despite every element being combined with every other element.
Or consider the even more basic sum of products a[i] * b[j] for all possible i, j:
This can be computed in linear time as sum(a) * sum(b).Your logic that 'the result contains terms of all pairs, therefore the algorithm must be quadratic' simply doesn't hold.