Hacker News
by notpublic 481 days ago
please do explain why
1 comment

tl;dr: the base ModernBERT was trained with code in mind, unlike most encoder-only models (so I'm assuming it was also trained on JSON/YAML objects), and it includes a custom tokenizer to support that. That's why I mentioned that indentation is important: different levels of indentation map to different single tokens.

This is mostly theoretical and would need a deeper dive to confirm.
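
To make the indentation point concrete, here is a minimal toy sketch of a tokenizer whose vocabulary has dedicated tokens for runs of spaces. Everything in it is an assumption for illustration: the `<sp2>`/`<sp4>`/`<sp8>` token names, the run lengths, and the `toy_tokenize` helper are hypothetical and are not taken from the real ModernBERT vocabulary. Confirming the actual behaviour would mean loading the real tokenizer and inspecting its output on indented text.

```python
import re

# Hypothetical vocabulary: runs of 8, 4, and 2 spaces each get their own token.
# This is a toy stand-in, NOT the real ModernBERT vocabulary.
WHITESPACE_TOKENS = {" " * n: f"<sp{n}>" for n in (8, 4, 2)}


def toy_tokenize(line: str) -> list[str]:
    """Tokenize leading indentation greedily, then split the rest into words and punctuation."""
    tokens = []
    stripped = line.lstrip(" ")
    indent = len(line) - len(stripped)
    # Greedy longest-match over the leading spaces.
    while indent > 0:
        for run in sorted(WHITESPACE_TOKENS, key=len, reverse=True):
            if len(run) <= indent:
                tokens.append(WHITESPACE_TOKENS[run])
                indent -= len(run)
                break
        else:
            # No dedicated run fits (a single leftover space): emit it as-is.
            tokens.append(" ")
            indent -= 1
    tokens.extend(re.findall(r"\w+|[^\w\s]", stripped))
    return tokens


# The same key at different nesting depths starts with a different first token.
for yaml_line in ["key: 1", "  key: 1", "    key: 1"]:
    print(repr(yaml_line), toy_tokenize(yaml_line))
```

In this toy setup, the nesting depth of a YAML line is visible in its first token alone, which is the property the comment speculates about. Whether the real tokenizer splits indentation this way is exactly the part that still needs checking.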