Why didn't this author compare Llama 3 with GLM 5.2 (released 1 week ago) which is a more standard attention based LLM? To compare 2 separate families of LLMs and then pointing out that they are different is not a surprising result and detracts from the point the author is trying to make.
If you look at it, the diagrams are very similar, but the main differences are that the feedforward is replaced with a MoE (router to multiple feedforwards) and the model has a different attention implementation.
https://sebastianraschka.com/llm-architecture-gallery/?compa...
If you look at it, the diagrams are very similar, but the main differences are that the feedforward is replaced with a MoE (router to multiple feedforwards) and the model has a different attention implementation.