An interesting benchmark has been floating around recently.

Opus 4.6 scored 77% inside Claude Code. The exact same model scored 93% inside Cursor. Same model, different result. The only thing that changed was the environment around it.

That environment has a name. It's called a harness. And if you're deploying AI agents without understanding what a harness is, I'd say you're flying blind.

What a Harness Is And Why It Matters

An AI model, at its core, does only one thing: take text in and produce text out. That's it. Left to its own devices, it cannot read your files, run commands, edit code, or touch your database. It generates text; that's the whole job.

So, how does Claude Code rewrite a codebase? How does an agent book a meeting or update a CRM?

Tool calls. The model outputs a piece of syntax — essentially "run this command" — and then stops. The harness, a piece of software running around the model, picks that up, executes the command, takes the result, adds it back to the conversation history, and sends everything back to the model to continue. That loop — model asks, harness executes, result feeds back — runs hundreds of times every time you use any agentic tool.

Stanford researcher Mihail Eric made this concrete with an article that circulated widely this year. His argument: the core of Claude Code is not magic. It is 200 lines of Python. Three tools — read file, list files, edit file — a system prompt, and a loop. That is the whole architecture.
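That loop is easy to sketch. Below is a minimal, runnable version in Python. To keep it self-contained, `call_model` is a canned stub standing in for a real model API call — a real harness would send the history to a provider there — and the three tool names mirror the ones in Eric's description.

```python
import os

# Three tools, a dispatch table, and a loop: the skeleton of a harness.

def read_file(path):
    with open(path) as f:
        return f.read()

def list_files(path="."):
    return "\n".join(sorted(os.listdir(path)))

def edit_file(path, content):
    with open(path, "w") as f:
        f.write(content)
    return f"wrote {len(content)} chars to {path}"

TOOLS = {"read_file": read_file, "list_files": list_files, "edit_file": edit_file}

def call_model(history):
    # Stub standing in for a real model API. Its canned policy: write a
    # file, read it back, then finish with the last tool result.
    n_tool_results = sum(1 for m in history if m["role"] == "tool")
    if n_tool_results == 0:
        return {"tool": "edit_file", "args": {"path": "note.txt", "content": "hello"}}
    if n_tool_results == 1:
        return {"tool": "read_file", "args": {"path": "note.txt"}}
    return {"done": True, "answer": history[-1]["content"]}

def run_agent(task, max_steps=10):
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(history)                       # model asks
        if reply.get("done"):
            return reply["answer"]
        result = TOOLS[reply["tool"]](**reply["args"])    # harness executes
        history.append({"role": "tool", "content": result})  # result feeds back
    return "step limit reached"
```

Everything interesting — which tools exist, how they're described, what goes into `history` and what gets trimmed out — lives in the harness, not the model.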

What Cursor did was spend thousands of engineering hours on those prompts and tool descriptions. They have people whose entire job is to update the system prompt every time a new model ships — testing obsessively, adjusting descriptions, steering the model away from bad habits. That investment is visible in the benchmark: 16 percentage points on the same model.

However, Anthropic's own engineering team found something that should make every builder uncomfortable. Harness assumptions go stale as models improve. Context anxiety that required full resets in Sonnet 4.5 simply disappeared in Opus 4.5. If you over-engineer control flow, the next model update breaks your system. Manus refactored their harness five times in six months. LangChain rebuilds its harness four times a year. Vercel removed 80% of its agents' tools, and performance went up.

  • The harness is the product. When your agent drifts off-task, stops following instructions mid-workflow, or makes redundant tool calls, that is a harness problem.

  • More context makes models dumber. When Sonnet 4.6 crosses roughly 50,000-100,000 tokens in its context window, accuracy drops sharply. Stuffing your entire codebase in is not the solution; it is the problem. Good harnesses feed the model what it needs when it needs it.

  • Your AGENTS.md file is a harness input. What it does is front-load context so the model doesn't spend tool calls discovering it. The fewer unnecessary tool calls your agent makes, the faster and more reliably it works. The context you give it upfront is context it doesn't have to find.
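The last two bullets are both about the same harness responsibility: controlling what reaches the model's context window. A minimal sketch of each, under assumed names — front-loading an AGENTS.md file into the system prompt, and trimming history to a rough token budget. The 4-characters-per-token ratio is a crude approximation, not a real tokenizer.

```python
import os

def build_system_prompt(base_prompt, repo_root="."):
    # Front-load project context so the model doesn't spend tool calls
    # rediscovering it.
    path = os.path.join(repo_root, "AGENTS.md")
    if os.path.exists(path):
        with open(path) as f:
            return base_prompt + "\n\n# Project notes\n" + f.read()
    return base_prompt

def trim_history(messages, budget_tokens=50_000):
    # Keep the most recent messages that fit the budget; drop the oldest.
    # 50k tokens is where the article says accuracy starts dropping.
    kept, used = [], 0
    for msg in reversed(messages):
        cost = len(msg["content"]) // 4  # crude token estimate
        if used + cost > budget_tokens:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))
```

Real harnesses do smarter things than drop the oldest messages — summarizing old turns, re-injecting key facts — but the principle is the same: the harness decides what the model sees.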

What This Means For You

If you are deploying AI agents inside your organization, the first question worth asking is not "Which model?" It is "What is our harness?"

Most teams don't have a real answer. They have a prompt, maybe a framework, and a hope that the model figures the rest out. That holds for demos. It does not hold across long workflows, multiple users, and real-world edge cases.

The gap between teams with mature harnesses and teams without one is still wide open. The companies that close it first will have agents running reliably when everyone else is debugging why theirs stopped at step 80.

Clutch. Just launched.

OpenClaw made it easy to get an agent running. Clutch makes it safe to run that agent at work.

Secure multi-agent deployment, built for teams that need more than a single-machine setup. We just launched.

  • Stanford's 2026 AI Index just dropped, and here's what it says.
    Coding benchmark performance went from 60% to near 100% of the human baseline in a single year. Generative AI reached 53% global population adoption faster than the PC or the internet. And the US-China gap on model performance has narrowed to 2.7 percentage points. The US ranked 24th globally in per-capita AI adoption, and only 33% of Americans expect AI to make their jobs better.

  • PwC surveyed 1,200 executives across 25 industries. Here’s what they found.
    The leaders are using AI to pursue new revenue and reinvent business models instead of just cutting costs. They are nearly twice as likely to run agents in autonomous, self-optimising modes. The gap is widening. PwC is explicit: 2026 is the year the divide between leaders and laggards becomes durable rather than correctable.

  • Revolut just launched an AI assistant for 13 million UK customers.
    The assistant, called AIR, replaces Revolut's entire menu navigation with a conversational interface covering spending, investments, subscriptions, and card controls. Underneath it is PRAGMA, a foundation model Revolut built internally on data from 70 million users, handling fraud detection, credit scoring, and customer support. 80% of support tickets now resolve without a human.

A lot of organizations are frustrated that their AI agents aren't living up to expectations. The model does what they ask in a demo, then falls apart in production after twenty minutes of real work.

Every single time, when we dig in, it's the same thing. They picked a model. They wrote a prompt. They shipped. Nobody had built the harness.

The benchmark I opened with is the clearest illustration I've seen of why this matters. You can get a 16-point performance improvement on the same model just by improving the environment around it. Not a new model. Not a bigger context window. Just better infrastructure.

Most organizations are leaving that on the table. The ones that won't are the ones investing in the boring work of building, testing, breaking, and rebuilding the layer the model runs inside.

Haroon

P.S. If you're starting to think seriously about harness infrastructure for agents running at team scale, Clutch is worth a look.
