Skimming it, there are a few things about this explanation that rub me just slightly the wrong way.

1. Calling the input token sequence a "command". It probably only makes sense to think of this as a "command" on a model that's been fine-tuned to treat it as such.

2. Skipping over BPE as part of tokenization - but almost every transformer explainer does this, I guess.

3. Describing transformers as using a "word embedding". I'm actually not aware of any transformers that use actual word embeddings, except the ones that incidentally fall out of other tokenization approaches sometimes.

4. Describing positional embeddings as multiplicative. They are generally (and very counterintuitively to me, but nevertheless) additive with token embeddings.

5. "what attention does is it moves the words in a sentence (or piece of text) closer in the word embedding" No, that's just incorrect.

6. You don't actually need a softmax layer at the end, since here they're just picking the top token and they can just do that pre-softmax since it won't change. It's also weird how they talked about this here when the most prominent use of softmax in transformers is actually in the attention component.

7. Really shortchanges the feedforward component. It may be simple, but it's really important to making the whole thing work.

8. Nothing about the residual
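
To make points 4 and 6 concrete, here is a minimal sketch in plain Python. The toy embeddings and logits are made up for illustration, and the helper names (`sinusoidal_position`, `add_position`, `softmax`, `argmax`) are mine rather than anything from the article. It assumes the sin/cos scheme from the original transformer paper for the position vectors; learned position embeddings are added the same way.

```python
import math

def sinusoidal_position(pos, d_model):
    """Sin/cos position vector: sin on even dimensions, cos on odd ones."""
    vec = []
    for i in range(d_model):
        angle = pos / (10000 ** (2 * (i // 2) / d_model))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec

def add_position(token_vecs):
    """Point 4: the position vector is ADDED elementwise to each token embedding."""
    d_model = len(token_vecs[0])
    return [
        [t + p for t, p in zip(tok, sinusoidal_position(pos, d_model))]
        for pos, tok in enumerate(token_vecs)
    ]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(xs):
    """Index of the largest element."""
    return max(range(len(xs)), key=lambda i: xs[i])

# Toy token embeddings for a two-token sequence (illustrative values only).
token_vecs = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
with_pos = add_position(token_vecs)

# Point 6: softmax is monotonic, so the greedy pick is the same before and after it.
logits = [2.0, -1.0, 3.5, 0.25]
greedy_from_logits = argmax(logits)
greedy_from_probs = argmax(softmax(logits))
print(greedy_from_logits, greedy_from_probs)  # both are index 2
```

The softmax only matters at the output if you actually need probabilities, e.g. for sampling or for computing a loss.
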
Worth noting that rotary position embeddings, used in many recent architectures (LLaMA, GPT-NeoX, ...), are very similar to the original sin/cos position embedding in the transformer paper, but they are applied by complex multiplication instead of addition.
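
A rough sketch of what "complex multiplication" means here, in plain Python. The function names `rotary` and `dot` and the toy vectors are mine, and this version pairs adjacent dimensions into complex numbers; real implementations differ in how they pair dimensions and apply this to the query and key vectors inside attention, so treat it as an illustration rather than any library's actual code.

```python
import cmath

def rotary(vec, pos, base=10000.0):
    """Rotate each (even, odd) pair of vec by a position-dependent angle.

    Each pair is treated as one complex number and multiplied by
    exp(i * pos * theta_k), using the same frequency schedule as the
    sin/cos position embedding.
    """
    d = len(vec)  # assumed to be even
    out = []
    for k in range(d // 2):
        theta = pos * base ** (-2 * k / d)
        z = complex(vec[2 * k], vec[2 * k + 1]) * cmath.exp(1j * theta)
        out.extend([z.real, z.imag])
    return out

def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors (illustrative values only).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# The attention score depends only on the relative offset between positions:
# positions (3, 1) and (7, 5) both have offset 2.
score_a = dot(rotary(q, 3), rotary(k, 1))
score_b = dot(rotary(q, 7), rotary(k, 5))
print(abs(score_a - score_b) < 1e-9)
```

Because the rotation is a multiplication by a unit-magnitude complex number, it leaves vector norms unchanged, and the query-key dot product ends up depending only on the difference between the two positions.
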