

	by woadwarrior01 197 days ago
	It's just a long-winded way of saying "tied embeddings"[1]. IIRC, GPT-2, BERT, Gemma 2, Gemma 3, some of the smaller Qwen models, and many other architectures use weight-tied input/output embeddings. [1]: https://arxiv.org/abs/1608.05859
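	For anyone unfamiliar with the term, here is a rough pure-Python sketch of what tying means: one matrix serves as both the input embedding table and the output projection. The class name `TiedEmbeddingLM`, the helper `untied_param_count`, and the toy sizes are made up for illustration, and none of this is taken from the models listed above. Real implementations share a tensor inside a deep learning framework instead.

```python
import random


def make_matrix(rows, cols, seed=0):
    """Return a rows x cols list-of-lists with small random values."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


class TiedEmbeddingLM:
    """Hypothetical toy model: only the embedding and output layers, tied."""

    def __init__(self, vocab_size, d_model, seed=0):
        self.vocab_size = vocab_size
        self.d_model = d_model
        # One shared matrix E of shape (vocab_size, d_model).
        self.embedding = make_matrix(vocab_size, d_model, seed)
        # Tying: the output projection is the same object, not a copy.
        self.output_weights = self.embedding

    def embed(self, token_id):
        # Input side: look up row token_id of E.
        return self.embedding[token_id]

    def logits(self, hidden):
        # Output side: dot the hidden state with every row of E,
        # giving one score per vocabulary entry.
        return [sum(w * h for w, h in zip(row, hidden))
                for row in self.output_weights]

    def num_embedding_params(self):
        # Tied: E is counted once.
        return self.vocab_size * self.d_model


def untied_param_count(vocab_size, d_model):
    # Untied: separate input and output matrices of the same shape.
    return 2 * vocab_size * d_model


model = TiedEmbeddingLM(vocab_size=10, d_model=4)
scores = model.logits(model.embed(3))
print(len(scores), model.num_embedding_params(), untied_param_count(10, 4))
```

	Because both names point at the same matrix, an update made through one is visible through the other, and the embedding-related parameter count is half of what the untied version would need.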