HN Mirror


by ma2rten 983 days ago

Attention takes in all tokens in the sequence and outputs a new representation of the current token in context. Each layer of the transformer adds more context to the token.
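To make that concrete, here is a minimal NumPy sketch of single-head self-attention. It is only an illustration: the names (`self_attention`, `W_q`, `W_k`, `W_v`), the sizes, and the random weights are my own assumptions, and it leaves out multiple heads, causal masking, and the output projection that a real transformer layer has.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention (illustrative only).

    X has shape (seq_len, d_model): one row per token.
    Returns the new token representations and the attention weights.
    """
    Q = X @ W_q  # what each token is looking for
    K = X @ W_k  # what each token offers to be matched against
    V = X @ W_v  # the content that gets mixed together
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights

# Hypothetical toy sizes and random weights, just to show the shapes.
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape)      # one new vector per token
print(weights.shape)  # one weight per (token, other token) pair
```

Row `i` of `out` is the new representation of token `i`: a weighted mix of the value vectors of all tokens in the sequence, with the weights taken from row `i` of `weights`.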

I haven't read this explanation in detail, and although they have some nice animations, I wouldn't go to the FT for explanations of machine learning concepts. Here are two well-known explanations that might be better:

http://jalammar.github.io/illustrated-transformer/

http://nlp.seas.harvard.edu/annotated-transformer/

2 comments

lawlessone 983 days ago

So is it analogous to how a CNN starts with fragments of images and further up the chain assembles these into objects?


ma2rten 982 days ago

Yes, I think that is a reasonable way to think about it. However, with the language modeling objective the model predicts the next token, and because of the residual connections the output of each intermediate layer is in the same space. So maybe it would be more accurate to say that each layer produces an increasingly accurate representation of the next token.
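A rough sketch of the "same space" point, under heavy assumptions: `layer_update` is a made-up stand-in for the attention and MLP sublayers, the weights are random, and layer normalization is omitted. It only shows that, because each layer adds its update to the running vector, the same output projection (`W_unembed`, a hypothetical name) can be applied after any layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size, n_layers = 8, 5, 3  # hypothetical toy sizes

def layer_update(h, W):
    # Stand-in for a real attention + MLP sublayer (assumption, not a real layer).
    return np.tanh(h @ W)

W_layers = [rng.normal(scale=0.5, size=(d_model, d_model)) for _ in range(n_layers)]
W_unembed = rng.normal(size=(d_model, vocab_size))  # maps hidden state to vocab logits

h = rng.normal(size=d_model)  # representation of the current token
per_layer_logits = []
for W in W_layers:
    # Residual connection: the layer adds to h instead of replacing it,
    # so h keeps the same shape and lives in the same space at every depth.
    h = h + layer_update(h, W)
    # Because of that, the same projection can be applied after every layer.
    per_layer_logits.append(h @ W_unembed)

for depth, logits in enumerate(per_layer_logits, start=1):
    print(depth, logits.shape)
```

With random weights the logits mean nothing; in a trained model the idea is that these intermediate readouts get closer to the final next-token prediction as depth increases.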


jacomoRodriguez 982 days ago

thanks a lot, looking at the links right now and I think they go more in depth :)
