by sabaimran
314 days ago
Super excited to see these released! Major points of interest for me:

- In the "Main capabilities evaluations" section, the 120b model outperforms o3-mini and approaches o4 on most evals. The 20b model is also decent, passing o3-mini on one of the tasks.
- AIME 2025 is nearly saturated with large CoT.
- CBRN threat levels are roughly on par with other SOTA open-source models. Plus, the models demonstrated good refusals even after adversarial fine-tuning.
- It's interesting to me how much of the safety benchmarking runs on trust, since the methodology can't be published too openly due to counterparty risk.

Model cards with some of my annotations: https://openpaper.ai/paper/share/7137e6a8-b6ff-4293-a3ce-68b...