|
|
|
|
|
by querez
547 days ago
|
|
Two questions: 1) Going by the Runtime vs GLUE graph, the ModernBERT-Base is roughly as fast as BERT-BAse. Given its architecture (especially Alternating Attention), I'm curious why the model not considerably faster than its predecessor. Any insight you could share on that? 2) Most modern LLMs are Encoder+Decoder model. Why not chop of the decoder of one of these (e.g. a small Llama or Mistral or other liberally-licensed model) and train a short head on top? |
|
For (1), it's because BERT has both noticeably fewer parameters, and we're comparing at short context length (in the interest of providing a broader comparison), so local attention is a lot impactful than it is at the longer context lengths.
For (2), most LLMs are actually decoder-only, so there is no "encoder" here. But also, there's not a lot of LLMs in the ±100M parameter range in the first place!