Larger model, better benchmarks. Bigger bomb, more yield. Any benchmarks where we...

omcnoe · 2026-04-07T20:52:40 1775595160

Yes, e.g. the BrowseComp benchmark on page 192.

Mythos preview reaches higher accuracy while using fewer tokens than any previous Claude model. However, this incredibly strong result was presented only for BrowseComp (a somewhat odd benchmark about searching the internet for hard-to-find information) and not for the other benchmarks, which suggests it likely does not hold for them.

neolefty · 2026-04-07T22:35:52 1775601352

Also https://arcprize.org/arc-agi/3 — scored (at least in part?) based on power used.