Major points of interest for me:
- In the "Main capabilities evaluations" section, the 120b model outperforms o3-mini and approaches o4 on most evals. The 20b model is also decent, passing o3-mini on one of the tasks.
- AIME 2025 is nearly saturated when a large CoT budget is used.
- CBRN threat levels are roughly on par with other SOTA open-source models. Plus, the model demonstrated good refusals even after adversarial fine-tuning.
- Interesting to me how much of the safety benchmarking runs on trust, since the methodology can't be published too openly due to counterparty risk.
Model cards with some of my annotations: https://openpaper.ai/paper/share/7137e6a8-b6ff-4293-a3ce-68b...