


	by imtringued 294 days ago
	Personally, I would rather recommend that people just look at these architectural diagrams [0] and try to understand them. The caveat is that they do not show how attention works. For that you need to understand softmax(QK^T / sqrt(d_k))V, and that multi-head attention is this same operation repeated several times in parallel, once per head. GQA, MQA, etc. just mess around with sharing K and V across groups of query heads in clever ways. [0] https://huggingface.co/blog/vtabbott/mixtral
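
	For anyone who wants the formula in runnable form, here is a minimal NumPy sketch of the attention operation and the K/V sharing mentioned above. It is an illustration under simplifying assumptions, not how any particular model implements it: there are no learned projections, no causal mask, and no batching, and the names `attention` and `grouped_query_attention` are made up for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max along the axis before exponentiating, for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.
    # q: (..., seq_q, d); k and v: (..., seq_k, d). Leading axes are heads.
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ v


def grouped_query_attention(q, k, v):
    # q: (n_q_heads, seq, d); k and v: (n_kv_heads, seq, d).
    # n_kv_heads == n_q_heads gives plain multi-head attention,
    # n_kv_heads == 1 gives multi-query attention (MQA),
    # anything in between gives grouped-query attention (GQA).
    n_q_heads, n_kv_heads = q.shape[0], k.shape[0]
    assert n_q_heads % n_kv_heads == 0, "query heads must split evenly into groups"
    group = n_q_heads // n_kv_heads
    # Each K/V head is reused by `group` consecutive query heads.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    return attention(q, k, v)


rng = np.random.default_rng(0)
seq, d = 5, 8
q = rng.standard_normal((4, seq, d))    # 4 query heads
k4 = rng.standard_normal((4, seq, d))   # one K/V head per query head
v4 = rng.standard_normal((4, seq, d))

mha = grouped_query_attention(q, k4, v4)          # 4 K/V heads
gqa = grouped_query_attention(q, k4[:2], v4[:2])  # 2 K/V heads, 2 query heads each
mqa = grouped_query_attention(q, k4[:1], v4[:1])  # 1 K/V head shared by all
print(mha.shape, gqa.shape, mqa.shape)
```

	The output shape is the same in all three cases. What changes is how many distinct K and V heads have to be computed and kept in the KV cache, which is the point of sharing them.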