by toxik 1199 days ago
You get this issue even without position embeddings. Attention computes an inner product between every pair of input tokens, so the cost is on the order of N^2 x E, where N is the number of tokens and E is the embedding size. Squares grow really fast.
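To make the N^2 x E point concrete, here's a rough sketch in plain Python. The function names and the embedding size of 64 are just made up for illustration, and it only counts the multiply-adds for the pairwise score matrix, not a full attention layer.

```python
def attention_score_cost(n_tokens: int, embed_dim: int) -> int:
    """Multiply-adds needed to form all pairwise token inner products."""
    # One inner product of length embed_dim for each of the
    # n_tokens * n_tokens (query, key) pairs.
    return n_tokens * n_tokens * embed_dim


def attention_scores(tokens):
    """Naive pairwise inner products; tokens is a list of equal-length vectors."""
    return [[sum(a * b for a, b in zip(q, k)) for k in tokens] for q in tokens]


if __name__ == "__main__":
    E = 64  # assumed embedding size, chosen only for illustration
    for n in (1_000, 2_000, 4_000):
        print(n, attention_score_cost(n, E))
```

Doubling N quadruples the count, while doubling E only doubles it, which is why long inputs hurt much more than wide embeddings.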