HN Mirror



	by yorwba 546 days ago
	Llama and Mistral are decoder-only models; there is no encoder you could put a head on. You could put it on the decoder instead, but then you run into a problem: in the causal language-modeling setting the model was trained for, every token can attend only to preceding tokens and is blind to subsequent ones.
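
To make the "blind to subsequent ones" point concrete, here is a minimal sketch of causal masking in plain Python. It is a toy, not how Llama or Mistral are implemented: the "embeddings" are single floats, queries, keys and values are all the same vector, and the names `causal_mask` and `causal_self_attention` are made up for illustration.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(x):
    """Toy single-head attention over scalar inputs, with q = k = v = x (an assumption)."""
    n = len(x)
    mask = causal_mask(n)
    out = []
    for i in range(n):
        # Position i only sees itself and earlier positions.
        visible = [j for j in range(n) if mask[i][j]]
        weights = softmax([x[i] * x[j] for j in visible])
        out.append(sum(w * x[j] for w, j in zip(weights, visible)))
    return out

a = causal_self_attention([0.5, -1.0, 2.0, 0.3])
b = causal_self_attention([0.5, -1.0, 2.0, 9.9])  # only the last token differs

# The first three outputs are identical: they never saw the last token.
print(a[:3] == b[:3])
```

So a head attached to an early position in the decoder gets a representation that carries no information about anything that comes later in the sequence; only the final position has seen the whole input.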