Hacker News
by valine, 385 days ago
Attention computes a weighted average of all previous latents (plus the current one). So yes, a new token is the input to the forward pass, but once it has passed through an attention head, its representation contains a little bit of every previous latent.
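To make the "weighted average" point concrete, here is a rough sketch of a single causal attention head in plain Python. It is a toy under stated assumptions, not how any particular model implements it: the query, key, and value projections are taken to be the identity (real heads use learned matrices), there is no residual connection, and the names `causal_attention` and `latents` are made up for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(latents):
    """Toy single-head causal attention.

    Assumption: Q, K, and V projections are the identity, so each
    latent is used directly as its own query, key, and value.
    """
    d = len(latents[0])
    outputs, all_weights = [], []
    for t, query in enumerate(latents):
        # Causal mask: position t only sees positions 0..t.
        visible = latents[: t + 1]
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in visible
        ]
        weights = softmax(scores)
        # The output is a weighted average of the visible latents.
        mixed = [
            sum(w * v[i] for w, v in zip(weights, visible))
            for i in range(d)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Three hypothetical 2-dimensional latents, one per token position.
latents = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_attention(latents)

# The last position puts a nonzero weight on every position it can see.
print(weights[-1])
print(outputs[-1])
```

Because softmax never produces an exact zero, the output at the newest position mixes in some amount of every earlier latent, which is the sense in which it "contains a little bit" of each of them.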