| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by querez 547 days ago

Two questions:

1) Going by the Runtime vs GLUE graph, the ModernBERT-Base is roughly as fast as BERT-BAse. Given its architecture (especially Alternating Attention), I'm curious why the model not considerably faster than its predecessor. Any insight you could share on that?

2) Most modern LLMs are Encoder+Decoder model. Why not chop of the decoder of one of these (e.g. a small Llama or Mistral or other liberally-licensed model) and train a short head on top?

5 comments

bclavie 547 days ago

Hey, Ben here, one of the paper's core author authors. The responses you got were mostly spot on.

For (1), it's because BERT has both noticeably fewer parameters, and we're comparing at short context length (in the interest of providing a broader comparison), so local attention is a lot impactful than it is at the longer context lengths.

For (2), most LLMs are actually decoder-only, so there is no "encoder" here. But also, there's not a lot of LLMs in the ±100M parameter range in the first place!

link

cubie 547 days ago

Beyond what the others have said about 1) ModernBERT-base being 149M parameters vs BERT-base's 110M and 2) most LLMs being decoder-only models, also consider that alternating attention (local vs global) only starts helping once you're processing longer texts. With short texts, local attention is equivalent to global attention. I'm not sure what length was used in the picture, but GLUE is mostly pretty short text.

link

janalsncm 547 days ago

On your second point, most modern LLMs are decoder only. And as for why adding a classification head isn’t optimal, the decoders you’re referring to have 10x the parameters, and aren’t trained on encoder-type tasks like MLM. So there’s no advantage on any dimension really.

link

yorwba 547 days ago

Llama and Mistral are decoder-only models; there is no encoder you could put a head on.

You could put it on the decoder instead, but then you have the problem that in the causal language-modeling setting that the model was trained for, every token can only attend to preceding tokens and is blind to subsequent ones.

link

spott 547 days ago

ModernBERT-Base is larger than BERT-Base by 39M parameters.

link