Hacker News
by impossiblefork 504 days ago
They also use some kind of factorized attention that somehow leads to compression of tokens (I still haven't read their papers, so I can't be clearer than this).