by toxik 1199 days ago
You get this issue even without position embeddings. Attention computes an inner product between each pair of input tokens, so the cost is N^2 x E for N tokens of embedding size E. Squares grow really fast.
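
To put rough numbers on that, here is a minimal sketch in plain Python. The function names and E = 64 are made up for illustration, and it only covers the pairwise score step, not a full attention layer (no softmax, no value projection).

```python
def attention_scores(tokens):
    """Return the N x N matrix of inner products between every pair of token vectors."""
    return [[sum(a * b for a, b in zip(q, k)) for k in tokens] for q in tokens]


def score_multiplies(n, e):
    """Multiplications for the N x N score matrix: one length-E dot product per pair."""
    return n * n * e


# Toy input: N = 3 tokens, E = 2 dimensions (hypothetical values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = attention_scores(tokens)  # 3 x 3 matrix, so 9 dot products

# Doubling N quadruples the work; E = 64 is an assumed embedding size.
for n in (1_000, 2_000, 4_000):
    print(n, score_multiplies(n, 64))
```

Each doubling of the sequence length quadruples the multiply count, while doubling E would only double it.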