Hacker News
by impossiblefork, 504 days ago
They also use some kind of factorized attention that somehow leads to compression of tokens (I still haven't read their papers, so I can't be clearer than this).