by yorwba 546 days ago
Llama and Mistral are decoder-only models; there is no encoder you could put a head on.

You could put it on the decoder instead, but then you run into a limitation of the causal language-modeling setup the model was trained for: every token can attend only to preceding tokens and is blind to subsequent ones.
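
To make the masking concrete, here is a minimal sketch in plain Python. It is not how Llama or Mistral implement attention; the helper names `causal_mask` and `visible_positions` and the example sentence are made up for illustration. It only shows which positions each token is allowed to look at under a causal mask.

```python
def causal_mask(n):
    """Return an n x n table where mask[i][j] is True if position i may attend to position j."""
    # Causal rule: a position may look at itself and at earlier positions only.
    return [[j <= i for j in range(n)] for i in range(n)]


def visible_positions(mask, i):
    """List the positions that position i is allowed to attend to."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


# Hypothetical word-level "tokens", chosen only to illustrate the point.
tokens = ["The", "movie", "was", "not", "good"]
mask = causal_mask(len(tokens))

for i, tok in enumerate(tokens):
    seen = [tokens[j] for j in visible_positions(mask, i)]
    print(f"{tok!r} can attend to {seen}")
```

In this sketch only the last position sees the whole input. A head reading the representation of "was" has no access to "not good", which is the blindness described above.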