by ma2rten 983 days ago
Attention takes in all tokens in the sequence and outputs a new representation of the current token in context. Each layer of the transformer adds more context to the token.
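To make that a bit more concrete, here is a rough sketch of one scaled dot-product attention step in plain Python. The names `softmax` and `attend` and the toy embeddings are made up for illustration, and it assumes identity projections; a real transformer uses learned query/key/value projections and several heads.

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Return a weighted average of `values` and the weights used.

    Each weight reflects how well the corresponding key matches `query`.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return context, weights

# Toy 2-d embeddings for a 3-token sequence (made-up numbers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Assumption: queries, keys and values are the raw embeddings themselves.
# The current (last) token looks at every token in the sequence.
context, weights = attend(tokens[-1], tokens, tokens)
```

Here `context` is the new "in context" representation of the last token: a mix of all token vectors, weighted by how strongly the last token attends to each of them.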

I haven't read this explanation in detail, and although it has some nice animations, I wouldn't go to FT for explanations of machine learning concepts. Here are two well-known explanations that might be better:

http://jalammar.github.io/illustrated-transformer/

http://nlp.seas.harvard.edu/annotated-transformer/

2 comments

So is it analogous to how a CNN starts with fragments of images and further up the chain assembles these into objects?
Yes, I think that is a reasonable way to think about it. However, with the language modeling objective the model predicts the next token, and because of the residual connections the output of every intermediate layer lives in the same space. So it might be more accurate to say that each layer produces an increasingly accurate representation of the next token.
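A toy sketch of the "same space" point, with everything in it invented for illustration (the two `layers`, the `unembed` matrix and the 3-token vocabulary are hypothetical, not taken from any real model): because each layer only adds to the running vector, the state after any layer can be passed through the same output projection to get next-token scores.

```python
def add(a, b):
    # Element-wise vector addition: this is the residual connection.
    return [x + y for x, y in zip(a, b)]

def matvec(matrix, vector):
    # One score per row of the matrix (here: one per vocabulary entry).
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

# Hypothetical stand-ins for transformer layers: each maps a 2-d vector
# to a 2-d update.
layers = [
    lambda h: [0.5 * h[1], 0.0],
    lambda h: [0.0, 0.5 * h[0]],
]

# Hypothetical output projection for a 3-token vocabulary.
unembed = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

h = [1.0, 2.0]          # made-up embedding of the current token
states = [h]
for layer in layers:
    h = add(h, layer(h))  # residual update keeps h in the same space
    states.append(h)

# The same projection can score the next token after every layer.
logits_per_layer = [matvec(unembed, s) for s in states]
```

In a trained model the hope is that the scores from later layers are closer to the final prediction than those from earlier layers; this toy example only shows that the shapes line up.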
thanks a lot, looking at the links right now and I think they go more in depth :)