Hacker News
by causal 762 days ago
So most of my understanding comes from this series, particularly the last two videos: https://www.3blue1brown.com/topics/neural-networks

Essentially each token of a text is mapped to a point in a high-dimensional vector space (an embedding) where position represents meaning, and LLMs predict the next token by updating the last token's vector with context from all the tokens before it. Attention heads are basically a way of scoring which prior tokens are most relevant and shifting the last token's point in vector space accordingly.
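
To make that concrete, here is a rough sketch of a single attention head acting on the last token only, in NumPy. Everything in it is an assumption for illustration: the name attend_last_token, the tiny dimensions, and the random matrices Wq, Wk and Wv standing in for learned weights. It follows the standard scaled dot-product form, in which the last token also attends to itself, and it leaves out multiple heads, layer norm and the MLP layers.

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend_last_token(X, Wq, Wk, Wv):
    """One attention head, applied to the last token only (illustrative).

    X is an (n_tokens, d) array of token vectors. Returns the relevance
    weights over all tokens and the adjusted vector for the last token.
    """
    q = X[-1] @ Wq                         # query: what the last token is looking for
    K = X @ Wk                             # keys: what each token offers
    V = X @ Wv                             # values: what each token contributes
    scores = K @ q / np.sqrt(q.shape[0])   # scaled dot-product relevance scores
    weights = softmax(scores)              # non-negative, sums to 1
    updated = X[-1] + weights @ V          # nudge the last token's point
    return weights, updated

# Random stand-ins for learned embeddings and weight matrices.
rng = np.random.default_rng(0)
n_tokens, d = 4, 8
X = rng.normal(size=(n_tokens, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

weights, updated = attend_last_token(X, Wq, Wk, Wv)
print(weights)        # one relevance weight per token
print(updated.shape)  # same dimensionality as the input vector
```

The weights are the "which tokens matter" part, and the addition at the end is the "adjust the point" part. In a real model the adjusted vector then passes through many more layers before it gets turned into next-token probabilities.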