Why isn't AI getting uniformly cheaper?

Commodity AI is getting cheaper, but frontier AI is not. On GitHub Copilot, output from a lightweight model costs about $1.25 per million tokens, while the most capable model costs $30.00, a ratio of 24:1 within a single vendor. GitHub's shift from request-based billing to per-token metering in June 2026 makes this divergence directly visible for the first time. The subsidy that previously masked the real cost of frontier inference is ending.

What did Anthropic's Project Deal experiment reveal about AI model quality?

In December 2025, Anthropic ran a classifieds-style marketplace for 69 employees where AI agents handled all buying and selling autonomously. Agents running on the frontier model (Claude Opus 4.5) earned sellers roughly $2.68 more per item and saved buyers about $2.45 per purchase compared to agents running on the lightweight model (Claude Haiku 4.5). Critically, participants with weaker models reported similar satisfaction and did not realise their agents had underperformed.

How does the AI cost gap affect businesses and governments?

Organisations paying for frontier AI access get measurably better output on complex tasks — better analysis, better code, and better judgement in ambiguous situations. Those using cheaper tiers get output that feels equally capable but consistently falls short on harder problems. Public sector organisations, typically more budget-constrained and slower to adopt, face this gap most acutely. The difference does not show up in the experience of using the tool; it shows up downstream in the quality of decisions that follow.

How should organisations choose between AI model tiers?

Build the workflow first, prove where the model adds genuine value, and then assess what can step down a tier without affecting outcomes. Use smaller or local models for drafting, summarising, reformatting, and routine code generation. Reserve frontier models for complex reasoning, ambiguous judgement, and high-stakes analysis. For many tasks, the bottleneck is not the model itself but the retrieval architecture, prompt design, or surrounding process. A cheaper model in a well-designed workflow often outperforms a frontier model in a poor one.

The Quiet Cost of Better AI

Oliver Nowak
May 21
5 min read

The AI cost curve is diverging, not falling uniformly, and most people can't tell which side of the gap they're on.

GitHub recently announced a significant shift in how Copilot is billed. From June 1, 2026, the platform is moving away from request-based "premium request" multipliers and switching to usage-based billing measured directly in tokens consumed. Base plan prices are unchanged, but what those plans buy you now depends on the per-token rate of whichever model you choose.

That rate varies enormously across the model lineup. Output from Anthropic's lightweight Claude Haiku 4.5 costs $5.00 per million tokens through Copilot, while Opus 4.7 sits at $25.00 per million, a fivefold gap within a single vendor on a single platform. Look at OpenAI's lineup and the spread widens further: GPT-5.4 nano lands at $1.25 per million output tokens, while GPT-5.5 reaches $30.00. The ratio between the cheapest and most capable model on the same platform is now 5 to 1 within Anthropic's lineup, and 24 to 1 when comparing the smallest OpenAI model to the largest.
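The spread can be checked directly from the four output prices quoted above. The sketch below is only an illustration: the 2 million token workload is a made-up figure, and input-token charges are ignored because the article quotes output prices only.

```python
# Output prices in USD per million tokens, as quoted in this article.
OUTPUT_PRICE_PER_MILLION = {
    "Claude Haiku 4.5": 5.00,
    "Claude Opus 4.7": 25.00,
    "GPT-5.4 nano": 1.25,
    "GPT-5.5": 30.00,
}


def price_ratio(expensive: str, cheap: str) -> float:
    """Return how many times more the expensive model costs per output token."""
    return OUTPUT_PRICE_PER_MILLION[expensive] / OUTPUT_PRICE_PER_MILLION[cheap]


def output_cost(model: str, output_tokens: int) -> float:
    """Return the output-token cost in USD for a given number of tokens."""
    return OUTPUT_PRICE_PER_MILLION[model] * output_tokens / 1_000_000


anthropic_ratio = price_ratio("Claude Opus 4.7", "Claude Haiku 4.5")
openai_ratio = price_ratio("GPT-5.5", "GPT-5.4 nano")

# Hypothetical workload (an assumption, not a figure from GitHub):
# 2 million output tokens in a billing period.
HYPOTHETICAL_TOKENS = 2_000_000

for model in OUTPUT_PRICE_PER_MILLION:
    print(f"{model}: ${output_cost(model, HYPOTHETICAL_TOKENS):.2f}")
print(f"Anthropic spread: {anthropic_ratio:.0f}x")
print(f"OpenAI spread: {openai_ratio:.0f}x")
```

The same workload that costs a couple of dollars on the smallest model costs tens of dollars on the largest, which is the kind of difference the old multiplier system made harder to see.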

What this pricing reveals is that AI is not getting uniformly cheaper. The commodity layer is, and that's the headline most people are reading. But the frontier, the models producing genuinely better results on complex reasoning, is moving in the opposite direction. The shift to token-metered billing makes this divergence even more visible: under the old multiplier system, GitHub absorbed much of the real inference cost behind the scenes, and that subsidy is now ending. The gap between those tiers isn't narrowing; it's widening quickly.

Figure: infographic on AI costs, with three sections covering pricing models, per-token cost, and insights on AI affordability. The text highlights the divergence in costs.

The inequality you don't feel

Anthropic ran an experiment in late 2025, published in April 2026 as Project Deal, that I think illustrates where this leads. They created a classifieds-style marketplace for 69 employees in their San Francisco office, each given a $100 budget to buy and sell items. Rather than negotiating directly, participants' AI agents handled all transactions autonomously. Some agents ran on Claude Opus 4.5, then Anthropic's frontier model. Others ran on Claude Haiku 4.5, the lightweight option.

By the end of the study, 186 deals had been completed. Opus-powered sellers pulled in roughly $2.68 more per item on average, while Opus-powered buyers paid about $2.45 less. On a $100 budget with only a handful of transactions each, that's a meaningful difference in outcomes.
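To get a rough sense of scale, here is a back-of-envelope calculation using only the figures quoted above. It is not a result from the study: it assumes deals were spread evenly across participants and that each person's deals split evenly between buying and selling, neither of which the article states.

```python
# Figures quoted in this article.
DEALS = 186
PARTICIPANTS = 69
BUDGET_USD = 100.00
SELLER_GAIN_PER_ITEM = 2.68   # extra earned per item by Opus-powered sellers
BUYER_SAVING_PER_ITEM = 2.45  # saved per purchase by Opus-powered buyers

# Every deal has one buyer and one seller, so it counts once for each side.
deal_sides_per_participant = 2 * DEALS / PARTICIPANTS

# Assumption (not from the study): a participant's deals split evenly
# between buying and selling, so the average edge per deal is the midpoint.
average_edge_per_deal = (SELLER_GAIN_PER_ITEM + BUYER_SAVING_PER_ITEM) / 2

# Rough per-participant advantage of a frontier agent under those assumptions.
estimated_edge_usd = deal_sides_per_participant * average_edge_per_deal
share_of_budget = estimated_edge_usd / BUDGET_USD

print(f"Deals per participant (either side): {deal_sides_per_participant:.1f}")
print(f"Estimated edge: ${estimated_edge_usd:.2f} ({share_of_budget:.0%} of budget)")
```

Under these assumptions the edge comes out above a tenth of the starting budget, which is why a few dollars per item matters at this scale.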

The finding that probably matters most isn't the performance gap itself. It's what participants reported afterwards. Users with the weaker model described similar levels of satisfaction. They didn't feel disadvantaged. They didn't perceive anything was off. They carried on, unaware that their agent had consistently underperformed.

That finding has implications well beyond a controlled experiment. If people genuinely can't tell they're receiving a worse outcome, they can't make an informed choice about paying for a better one. And if the better outcome costs five to 24 times as much per token, the divide between those who can afford frontier-tier AI and those who can't widens quietly, without anyone on the wrong side realising there's a gap to close.

Where can this be seen?

This isn't only about individual subscriptions. Businesses choosing between AI model tiers are making decisions that directly affect what their teams can produce. An organisation paying for frontier access gets measurably better output on complex tasks: better analysis, better code, better judgement in ambiguous situations. An organisation that can't justify the cost, or simply doesn't know the gap exists, settles for a cheaper tier and gets on with it.

At the individual level, the dynamic is the same. A freelancer paying for a premium AI subscription gets output that is measurably stronger on tasks requiring nuanced reasoning. Someone using a free or basic tier gets output that feels equally capable but consistently falls short on the harder problems. Both believe they're using AI effectively.

Governments face the same challenge, arguably more acutely. Public sector organisations tend to be more budget-constrained and slower to adopt. If the quality gap between model tiers continues to widen, the capability gap between public and private sector widens with it. That's a concerning trajectory for public services, healthcare, and education, all areas where the quality of AI-assisted analysis and decision-making will increasingly matter.

None of this looks like exclusion from the outside. Everyone has access to AI. The tools feel impressive regardless of tier. Outputs are fluent, confident, and plausible across the board. The difference doesn't show up in the experience of using the tool. It shows up downstream: in the accuracy of the analysis, in the edge cases caught or missed, in the quality of the decisions that follow.

How should you respond?

I'd argue that a considered AI workflow has to be willing to use frontier models, but only where the task genuinely demands them. A large proportion of day-to-day AI work, drafting, summarising, reformatting, routine code generation, runs perfectly well on smaller models. The mistake is treating every interaction as though it requires the most expensive option available.
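One way to make that split explicit is a small routing table. The sketch below is illustrative only: the task labels, the `Tier` enum, and the `choose_tier` helper are hypothetical names, and defaulting unknown task types to the frontier tier is an assumption in the spirit of proving the workflow before stepping down.

```python
from enum import Enum


class Tier(Enum):
    """Hypothetical model tiers; real deployments may have more."""
    SMALL = "small-or-local"
    FRONTIER = "frontier"


# Routine work the article says runs well on smaller models.
ROUTINE_TASKS = {"drafting", "summarising", "reformatting", "routine_code"}


def choose_tier(task_type: str) -> Tier:
    """Route routine tasks to a small model and everything else to frontier.

    Unknown task types go to the frontier tier on purpose: the assumption
    is that a task only steps down once it has been shown not to suffer.
    """
    if task_type in ROUTINE_TASKS:
        return Tier.SMALL
    return Tier.FRONTIER


for task in ("summarising", "routine_code", "high_stakes_analysis"):
    print(f"{task} -> {choose_tier(task).value}")
```

A static table like this is only a starting point. Which tasks belong in the routine set is exactly the case-by-case judgement the rest of this section is about.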

The question is not "which model should I use for everything?" It is "to what degree can I step down the model without affecting the outcome?"

That's a case-by-case judgement, and not one you can make in the abstract. You build the workflow first, prove it works, understand where the model is genuinely contributing value, and then assess what can step down a tier. Starting with cost optimisation before the workflow itself is proven is optimising the wrong variable. And for many tasks, the bottleneck isn't the model at all. It's the retrieval architecture, the prompt design, or the surrounding process. Getting those right on a cheaper model often produces better results than throwing a frontier model at a poorly designed workflow.
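The step-down question can be framed as a simple check once a workflow has an evaluation score. This is a minimal sketch under stated assumptions: the tier names, the scores, and the tolerance are all invented for illustration, and `cheapest_acceptable_tier` is a hypothetical helper rather than part of any library.

```python
from typing import Callable, Sequence


def cheapest_acceptable_tier(
    tiers_cheapest_first: Sequence[str],
    score: Callable[[str], float],
    baseline_tier: str,
    tolerance: float,
) -> str:
    """Return the cheapest tier whose score is within tolerance of the baseline.

    `score` is whatever evaluation the proven workflow already has,
    for example accuracy on a fixed set of representative tasks.
    """
    baseline_score = score(baseline_tier)
    for tier in tiers_cheapest_first:
        if score(tier) >= baseline_score - tolerance:
            return tier
    # Nothing cheaper is good enough, so stay on the baseline.
    return baseline_tier


# Made-up evaluation scores for one workflow (illustrative only).
HYPOTHETICAL_SCORES = {"local": 0.71, "small": 0.90, "frontier": 0.91}

chosen = cheapest_acceptable_tier(
    tiers_cheapest_first=["local", "small", "frontier"],
    score=HYPOTHETICAL_SCORES.__getitem__,
    baseline_tier="frontier",
    tolerance=0.02,
)
print(f"Step down to: {chosen}")
```

The important design choice is that the baseline is measured first on the frontier tier, so the comparison is against a workflow that is already known to work rather than against a guess.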

Local open-weight models are increasingly part of this picture. Running on consumer hardware with no per-token cost, they change the economics of high-volume routine work considerably. The trade-offs are real: context windows are typically smaller, and the hardware needed to match frontier-level reasoning is not viable at consumer level. But for a meaningful share of tasks, a local model is a legitimate option that more organisations should be evaluating seriously.

The broader point is about building workflows that are cost-aware without being quality-blind. Using frontier models for everything is expensive and, for most tasks, unnecessary. Using the cheapest option across the board avoids the cost problem but introduces a quality problem that, as Anthropic's experiment suggests, most people won't notice until the consequences have already stacked up.

What does this mean for organisations?

The AI cost curve isn't a simple downward slope. It's a divergence. The commodity floor keeps getting cheaper while the frontier ceiling keeps pulling away. For anyone building AI into their operations, the practical response is to stop treating model selection as a blanket decision and start treating it as an ongoing, task-by-task operational judgement.

The gap between model tiers is real, it's widening, and the evidence so far suggests most people won't realise which side of it they're on.
