the agentic benchmarks for 3.1 indicate Gemini has caught up. the gains are big ...

kakugawa · 2026-02-19T21:33:10 1771536790

In mid-2024, Anthropic made the deliberate decision to stop chasing benchmarks and focus on practical value. There was a lot of skepticism at the time, but it's proven to be a prescient decision.

girvo · 2026-02-19T21:30:30 1771536630

Benchmarks are basically straight up meaningless at this point in my experience. If they mattered and were the whole story, those Chinese open models would be stomping the competition right now. Instead they're merely decent when you use them in anger for real work.

I'll withhold judgement until I've tried to use it.

phatfish · 2026-02-20T10:41:04 1771584064

Does anyone know what this "APEX-Agents benchmark for long time horizon investment banking, consulting and legal work" actually evaluates?

That sounds so broad that creating a meaningful benchmark is probably as difficult as creating an AI that actually "solves" those domains.

avereveard · 2026-02-19T23:16:55 1771543015

What's your opinion of glm5, if you've had a chance to use it?

girvo · 2026-02-20T01:28:44 1771550924

I haven’t yet, though I will be this weekend!

metadat · 2026-02-19T22:17:17 1771539437

Ranking Codex 5.2 ahead of plain 5.2 doesn't make sense. Codex is expressly designed for coding tasks. Not systems design, not problem analysis, and definitely not banking, but actually solving specific programming tasks (and it's very, very good at this). GPT 5.2 (non-codex) is better in every other way.

nl · 2026-02-19T23:00:13 1771542013

Codex has been post-trained for coding, including agentic coding tasks.

It's certainly not impossible that the better long-horizon agentic performance in Codex overcomes any deficiencies in outright banking knowledge that Codex 5.2 has vs plain 5.2.

306bobby · 2026-02-19T22:31:09 1771540269

It could be problem-specific. There are certain non-programming things that opus seems better than sonnet at as well.

306bobby · 2026-02-19T22:32:08 1771540328

Swapped sonnet and opus on my last reply, oops

blueaquilae · 2026-02-19T21:26:52 1771536412

Marketing team agrees with benchmark score...

HardCodedBias · 2026-02-19T20:28:38 1771532918

LOL come on man.

Let's give it a couple of days since no one believes anything from benchmarks, especially from the Gemini team (or Meta).

If we see on HN that people are willingly switching their coding environments, we'll know "hot damn, they cooked"; otherwise this is another whiff by Google.

drivebyhooting · 2026-02-19T22:27:49 1771540069

You can’t put Gemini and Meta in the same sentence. Llama 4 was DOA, and Meta has given up on frontier models. Internally they’re using Claude.

not_ai · 2026-02-19T23:38:50 1771544330

After spending all that money and firing a bunch of people? Is the new group doing anything at this point?

dekhn · 2026-02-20T00:16:19 1771546579

They are busy demonstrating that Mark Zuckerberg has no sense at all.