| HN Mirror

Sure, knowing the basics of LLM math is necessary. But it's also _enough_ to know this math to fully grasp the code. There are only 4 concepts - attention, feed-forward net, RMS-normalization and rotary embeddings - organized into a clear structure.

Now compare it to the Hugginface implementation [1]. In addition to the aforementioned concepts, you need to understand the hierarchy of `PreTrainedModel`s, 3 types of attention, 3 types of rotary embeddings, HF's definition of attention mask (which is not the same as mask you read about in transformer tutorials), several types of cache class, dozens of flags to control things like output format or serialization, etc.

It's not that Meta's implementation is good and HF's implementation is bad - they pursue different goals in their own optimal way. But if you just want to learn how the model works, Meta's code base is great.

[1]: https://github.com/huggingface/transformers/blob/main/src/tr...