Numbers for SWE-bench Verified, Aider Polyglot, cost per million output tokens, ...

anotherpaulg · on April 14, 2025

I just finished updating the aider polyglot leaderboard [0] with GPT-4.1, mini and nano. My results basically agree with OpenAI's published numbers.

Results, with other models for comparison:

    Model                         Score   Cost

    Gemini 2.5 Pro Preview 03-25   72.9%  $ 6.32
    claude-3-7-sonnet-20250219     64.9%  $36.83
    o3-mini (high)                 60.4%  $18.16
    Grok 3 Beta                    53.3%  $11.03
  * gpt-4.1                        52.4%  $ 9.86
    Grok 3 Mini Beta (high)        49.3%  $ 0.73
  * gpt-4.1-mini                   32.4%  $ 1.99
    gpt-4o-2024-11-20              18.2%  $ 6.74
  * gpt-4.1-nano                    8.9%  $ 0.43

Aider v0.82.0 is also out with support for these new models [1]. Aider wrote 92% of the code in this release, a tie with v0.78.0 from 3 weeks ago.

[0] https://aider.chat/docs/leaderboards/

[1] https://aider.chat/HISTORY.html
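
One way to read the score and cost columns together is to ask which models are not beaten on both at once. The sketch below is only an illustration: it copies the numbers from the table above and assumes the Cost column is the total dollar cost of a benchmark run. The function name `pareto_frontier` is made up for this example and is not part of aider.

```python
# (model, score_pct, cost_usd) copied from the table above.
RESULTS = [
    ("Gemini 2.5 Pro Preview 03-25", 72.9, 6.32),
    ("claude-3-7-sonnet-20250219", 64.9, 36.83),
    ("o3-mini (high)", 60.4, 18.16),
    ("Grok 3 Beta", 53.3, 11.03),
    ("gpt-4.1", 52.4, 9.86),
    ("Grok 3 Mini Beta (high)", 49.3, 0.73),
    ("gpt-4.1-mini", 32.4, 1.99),
    ("gpt-4o-2024-11-20", 18.2, 6.74),
    ("gpt-4.1-nano", 8.9, 0.43),
]


def pareto_frontier(results):
    """Return the entries that no other entry dominates.

    An entry is dominated when some other entry scores at least as high
    and costs no more, and is strictly better on at least one of the two.
    """
    frontier = []
    for name, score, cost in results:
        dominated = any(
            s >= score and c <= cost and (s > score or c < cost)
            for _, s, c in results
        )
        if not dominated:
            frontier.append((name, score, cost))
    return frontier


if __name__ == "__main__":
    for name, score, cost in pareto_frontier(RESULTS):
        print(f"{name:30s} {score:5.1f}%  ${cost:6.2f}")
```

On these numbers only three models survive: Gemini 2.5 Pro Preview 03-25, Grok 3 Mini Beta (high), and gpt-4.1-nano. Every other row is matched or beaten on both score and cost by one of them.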

pzo · on April 15, 2025

Did you benchmark the combo DeepSeek R1 + DeepSeek V3 (0324)? There is a combo in 3rd place, DeepSeek R1 + claude-3-5-sonnet-20241022, and the new V3 also beats Claude 3.5, so in theory R1 + V3 should even reach 2nd place. Just curious whether that would be the case.

purplerabbit · on April 15, 2025

What model are you personally using in your aider coding? :)

anotherpaulg · on April 15, 2025

Mostly Gemini 2.5 Pro lately.

I get asked this often enough that I have a FAQ entry with automatically updating statistics [0].

  Model               Tokens     Pct

  Gemini 2.5 Pro   4,027,983   88.1%
  Sonnet 3.7         518,708   11.3%
  gpt-4.1-mini        11,775    0.3%
  gpt-4.1             10,687    0.2%

[0] https://aider.chat/docs/faq.html#what-llms-do-you-use-to-bui...

jsnell · on April 14, 2025

https://aider.chat/docs/leaderboards/ shows 73% rather than 69% for Gemini 2.5 Pro?

Looks like they also added the cost of the benchmark run to the leaderboard, which is quite cool. Cost per output token is no longer representative of the actual cost when the number of tokens can vary by an order of magnitude for the same problem just based on how many thinking tokens the model is told to use.
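
To make the point about thinking tokens concrete, here is a rough sketch. The price and token counts are invented for illustration, and it assumes thinking tokens are billed at the same rate as output tokens, which may not hold for every provider. `run_cost_usd` is a hypothetical helper, not part of any benchmark tooling.

```python
def run_cost_usd(answer_tokens, thinking_tokens, usd_per_million_output):
    """Cost of one response, billing thinking tokens as output (assumption)."""
    billed_tokens = answer_tokens + thinking_tokens
    return billed_tokens * usd_per_million_output / 1_000_000


# Hypothetical numbers: same list price and the same 1,000-token answer,
# but the second configuration spends 9,000 extra tokens thinking.
PRICE_PER_MILLION = 10.0
low_effort = run_cost_usd(1_000, 0, PRICE_PER_MILLION)
high_effort = run_cost_usd(1_000, 9_000, PRICE_PER_MILLION)

if __name__ == "__main__":
    print(f"low effort:  ${low_effort:.4f}")
    print(f"high effort: ${high_effort:.4f}")
    print(f"ratio:       {high_effort / low_effort:.1f}x")
```

Both runs have the same cost per output token, yet the second costs ten times as much for the same problem. That gap is what a per-run cost column captures and a per-token price does not.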

anotherpaulg · on April 14, 2025

Aider author here.

Based on some DMs with the Gemini team, they weren't aware that aider supports a "diff-fenced" edit format, or that it is specifically tuned to work well with Gemini models. So they didn't think to try it when they ran the aider benchmarks internally.

Beyond that, I spend significant energy tuning aider to work well with top models. That is in fact the entire reason for aider's benchmark suite: to quantitatively measure and improve how well aider works with LLMs.

Aider makes various adjustments to how it prompts and interacts with most every top model, to provide the very best possible AI coding results.

BonoboIO · on April 14, 2025

Thank you for providing such amazing tools for us. Aider is a godsend for getting an overview when working with a large codebase.

modeless · on April 14, 2025

Thanks, that's interesting info. It seems to me that such tuning, while making Aider more useful, and making the benchmark useful in the specific context of deciding which model to use in Aider itself, reduces the value of the benchmark in evaluating overall model quality for use in other tools or contexts, as people use it for today. Models that get more tuning will outperform models that get less tuning, and existing models will have an advantage over new ones by virtue of already being tuned.

jmtulloss · on April 15, 2025

I think you could argue the other side too... All of these models do better and worse with subtly different prompting that is non-obvious and unintuitive. Anybody using different models for "real work" is going to be tuning their prompts specifically to a model. Aider (without inside knowledge) can't possibly max out a given model's ability, but it can provide a reasonable approximation of what somebody can achieve with some effort.

modeless · on April 14, 2025

There are different scores reported by Google for "diff" and "whole" modes, and the others were "diff" so I chose the "diff" score. Hard to make a real apples-to-apples comparison.

jsnell · on April 14, 2025

The 73% on the current leaderboard is using "diff", not "whole". (Well, diff-fenced, but the difference is just the location of the filename.)
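
For anyone who hasn't seen the two formats, here is a small sketch of the difference. It is based on my reading of aider's edit-format documentation, so treat the exact markers and layout as an assumption. `render_edit` is a hypothetical helper written for this example, not an aider function, and the file path and code lines are placeholders.

```python
FENCE = "`" * 3  # built at runtime to avoid a literal fence inside this example


def render_edit(path, search, replace, filename_inside_fence):
    """Render one search/replace edit in a "diff"-like or "diff-fenced"-like layout."""
    block = [
        "<<<<<<< SEARCH",
        search,
        "=======",
        replace,
        ">>>>>>> REPLACE",
    ]
    if filename_inside_fence:
        # "diff-fenced": the file path is the first line inside the fence.
        lines = [FENCE, path, *block, FENCE]
    else:
        # "diff": the file path comes just before the opening fence.
        lines = [path, FENCE, *block, FENCE]
    return "\n".join(lines)


if __name__ == "__main__":
    args = ("app.py", "from flask import Flask", "import math\nfrom flask import Flask")
    print(render_edit(*args, filename_inside_fence=False))
    print()
    print(render_edit(*args, filename_inside_fence=True))
```

The two outputs contain exactly the same lines. Only the position of the file path relative to the opening fence changes.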

modeless · on April 14, 2025

Huh, seems like Aider made a special mode specifically for Gemini[1] some time after Google's announcement blog post with official performance numbers. Still not sure it makes sense to quote that new score next to the others. In any case Gemini's 69% is the top score even without a special mode.

[1] https://aider.chat/docs/more/edit-formats.html#diff-fenced:~...

jsnell · on April 14, 2025

The mode wasn't added after the announcement; Aider has had it for almost a year: https://aider.chat/HISTORY.html#aider-v0320

This benchmark has an authoritative source of results (the leaderboard), so it seems obvious that it's the number that should be used.

modeless · on April 14, 2025

OK but it was still added specifically to improve Gemini and nobody else on the leaderboard uses it. Google themselves do not use it when they benchmark their own models against others. They use the regular diff mode that everyone else uses. https://blog.google/technology/google-deepmind/gemini-model-...

tcdent · on April 14, 2025

They just pick the best performer out of the built-in modes they offer.

Interesting data point about the model's behavior, but even more so it's a recommendation of which way to configure the model for optimal performance.

I do consider this to be an apples-to-apples benchmark since they're evaluating real world performance.

meetpateltech · on April 14, 2025

Yes, it is available in Cursor[1] and Windsurf[2] as well.

[1] https://twitter.com/cursor_ai/status/1911835651810738406

[2] https://twitter.com/windsurf_ai/status/1911833698825286142

cellwebb · on April 14, 2025

And free on windsurf for a week! Vibe time.

tomjen3 · on April 14, 2025

It's available for free in Windsurf so you can try it out there.

Edit: Now also in Cursor

ilrwbwrkhv · on April 15, 2025

Yup GPT 4.1 isn't good at all compared to the others. I tried a bunch of different scenarios; for me the winners are:

Deepseek for general chat and research

Claude 3.7 for coding

Gemini 2.5 Pro experimental for deep research

In terms of price Deepseek is still absolutely fire!

OpenAI is in trouble honestly.

torginus · on April 15, 2025

One task I do is I feed the models the text of entire books, and ask them various questions about it ('what happened in Chapter 4', 'what did character X do in the book' etc.).

GPT 4.1 is the first model that has provided a human-quality answer to these questions. It seems to be the first model that can follow plotlines, and character motivations accurately.

I'd say since text processing is a very important use case for LLMs, that's quite noteworthy.

soheil · on April 14, 2025

Yes on both Cursor and Windsurf.

https://twitter.com/cursor_ai/status/1911835651810738406