Hacker News
by deepsquirrelnet 544 days ago
I read your paper this morning and am just thrilled with the work. Love the added local attention layers. I’ve experimented with them for years (lucidrains repo), and was always surprised they didn’t go further. Inference speeds are awesome on this model. Scrapping NSP, awesome. Increased masking, awesome. RoPE and longer context, again, bravo. There are so many great incremental improvements learned over the years, and you guys made so many good decisions here.

I’d love to distill a “ModernTinyBERT”, but it seems a bit more complex with the interleaved layers.

1 comment

> I’d love to distill a “ModernTinyBERT”

That’s a question I’m interested in as well! DistilBERT and friends have been terribly useful at the edge. I wonder if/when we may see something similar for ModernBERT.