It's really interesting how much the AI harness seems to matter. Going from 48% via Google's official results to 65% is a huge jump. I feel like I'm constantly seeing results that compare models and rarely seeing results that compare harnesses.
Is there a leaderboard out there comparing harness results using the same models?
Maybe the future isn't a human-like centralized intelligence but an octopus-like decentralized intelligence, where more focus is placed on making the harness itself "smart".
Not really. Anthropic, for example, sells both the harness and the models as a unified kit via Claude Code, so it is in their best interest to make sure both parts work as well as possible, including by running reinforcement learning on previous usage to improve new models.
It's not true that anyone can write a good harness. The LLM providers have information, such as prompts, that they can RL-train on and that someone writing their own harness would not have. A good proprietary harness is therefore a moat.
Because it's a way to make more money in the future. I feel like you're not really getting the difference between what a business does for profit and its technical decisions.
In my local tests over the past few months, all on the same local model, I’ve found Claude Code to be way better than OpenCode, and OpenCode to be better than Codex.
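For anyone who wants to make this kind of same-model comparison less anecdotal, here is a minimal sketch of how a harness leaderboard could be tabulated. Everything in it is an assumption for illustration: the record layout, the names `harness_a`, `harness_b`, and `local-model`, and the pass/fail values are made-up placeholders, not real benchmark results.

```python
from collections import defaultdict

# Hypothetical placeholder records: (harness, model, task_id, passed).
# These are NOT real results; replace them with your own run logs.
runs = [
    ("harness_a", "local-model", "t1", True),
    ("harness_a", "local-model", "t2", True),
    ("harness_a", "local-model", "t3", False),
    ("harness_b", "local-model", "t1", True),
    ("harness_b", "local-model", "t2", False),
    ("harness_b", "local-model", "t3", False),
]


def pass_rates(records, model):
    """Return {harness: fraction of tasks passed}, keeping only one fixed model."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for harness, run_model, _task_id, ok in records:
        if run_model != model:
            continue  # hold the model constant so only the harness varies
        total[harness] += 1
        passed[harness] += int(ok)
    return {harness: passed[harness] / total[harness] for harness in total}


def leaderboard(records, model):
    """Sort harnesses by pass rate, best first, for the given model."""
    rates = pass_rates(records, model)
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


for harness, rate in leaderboard(runs, "local-model"):
    print(f"{harness}: {rate:.0%}")
```

The point of filtering on a single model is that any remaining difference in pass rate is attributable to the harness (or to run-to-run noise), which is the comparison the usual model leaderboards don't show.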