The False Floor
The subsidised era of AI inference is ending. Most organisations are building as though it isn't.
This week, I have been building an agentic environment in my home lab. Nothing glamorous, just a set of AI agents designed to handle various tasks, running on locally hosted models. Most of the online guides I read talked about using cloud models like Claude or ChatGPT for agents, but I wanted to see how well the agents would run on local open-source models.
The limitations became apparent quickly.
When you cannot just throw a task at a frontier model and let it reason its way to an answer, you have to think. Which agent actually needs to read the whole document, and which one only needs a summary? How much context does each step genuinely require? Do I need one agent trying to do everything, or four smaller ones each doing one thing well? I started with a monolithic architecture. I ended up with a small swarm of four agents, each scoped tightly to its task and running on a model matched to what that task actually demands.
I ran the numbers on what the same workflow would cost using a frontier model via API. Roughly one dollar per run. At the volumes I was considering that felt manageable, but then I thought about scale. What happens at a hundred runs a day? A thousand runs a day? What happens when the model’s next version uses three times as many tokens to reach the same answer, because the reasoning chain got longer? What happens when the price per token increases because the lab that set it decided the subsidy period was over?
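The arithmetic behind those questions is simple enough to sketch. The function below is my own illustration, not a pricing tool: it takes the roughly one dollar per run from above and multiplies it out over a 30-day month. The multipliers for token growth and price growth are assumptions, and the 1.5x price increase in particular is hypothetical.

```python
def monthly_cost(cost_per_run, runs_per_day, token_multiplier=1.0,
                 price_multiplier=1.0, days=30):
    """Projected monthly spend in dollars for one workflow."""
    return cost_per_run * runs_per_day * token_multiplier * price_multiplier * days


# Baseline from the text: roughly one dollar per run.
for runs in (10, 100, 1000):
    print(f"{runs:>5} runs/day: ${monthly_cost(1.00, runs):,.2f} per month")

# The same 1,000 runs a day after a model update triples token use.
tripled = monthly_cost(1.00, 1000, token_multiplier=3.0)
print(f"1000 runs/day, 3x tokens: ${tripled:,.2f} per month")

# Hypothetical: tokens triple and the per-token price also rises by 50%.
both = monthly_cost(1.00, 1000, token_multiplier=3.0, price_multiplier=1.5)
print(f"1000 runs/day, 3x tokens, 1.5x price: ${both:,.2f} per month")
```

Neither multiplier needs a change to your own code to take effect, which is what makes the projection uncomfortable.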
That is not a hypothetical. It is a question that most organisations deploying AI right now are not asking, and they should be.
Current frontier AI pricing is structurally and heavily subsidised. OpenAI reported $3.7 billion in revenue in 2024 and lost $5 billion doing it. By the end of 2025, the company’s CFO was reporting an annualised revenue run rate exceeding $20 billion. OpenAI is still projected to lose roughly $14 billion in 2026. Revenue growing more than fivefold on a run-rate basis does not mean the subsidy is ending; it means the subsidy is scaling.
The labs need adoption more than they need margin right now. Like most disruptive technologies, capturing the market comes first and profitability comes later. It is the same playbook that made cloud computing feel free until it didn’t, the same logic that kept ride-hailing cheap until the drivers needed paying. The difference is that organisations are not just buying a productivity tool this time. They are building processes around it. Automating workflows and training teams to depend on specific model behaviours. Embedding frontier API calls into the operational fabric of how they work.
When the floor moves, those decisions will be expensive to revisit.
Sam Altman said recently that he was “delighted to be wrong” about AI’s impact on white-collar jobs. The reversal was widely reported as reassurance, but what it actually signals is worth examining. It arrived the same week OpenAI reportedly filed IPO paperwork confidentially. A calmer narrative around AI’s economic disruption is considerably better for a public listing than the jobs apocalypse framing he was running twelve months ago. The people who set the price of the infrastructure you are building on have a direct financial interest in how you feel about it.
That is not a conspiracy. It is an incentive structure worth knowing about.
The floor is already moving. GitHub announced at the end of April that all Copilot plans would transition from flat-rate subscriptions to token-based billing on 1 June 2026. In its own announcement, GitHub explained why with unusual candour: “Today, a quick chat question and a multi-hour autonomous coding session can cost the user the same amount. GitHub has absorbed much of the escalating inference cost behind that usage, but the current premium request model is no longer sustainable.” GitHub is not an outlier. Cursor made a similar shift in June 2025, moving from request-based limits to credit pools tied to API costs, and communicated it so poorly that the company issued a public apology and offered refunds. Windsurf followed in March 2026. The headline monthly prices look the same, but the bills are going up. The market is converging on the same structure, and the reason is always the same: agentic workflows broke the flat-rate economics.
The tokenmaxxing stories of the past month are not cautionary tales about individual excess. They are what unconstrained frontier API access looks like at scale.
Uber deployed Claude Code to around 5,000 engineers and watched adoption climb from 32 percent in February to 84 percent by March. Per-engineer API costs reached between $500 and $2,000 a month. By April, the company had exhausted its entire planned 2026 AI budget only four months into the year. The CTO reported spending $1,200 in a single two-hour session.
In December 2025, Microsoft introduced Claude Code to thousands of engineers across its Experiences and Devices division, the team responsible for Windows, Microsoft 365, Outlook, Teams, and Surface. Engineers preferred it to the in-house alternative and used it constantly. By May, Microsoft was cancelling the licences, effective the last day of the financial year, because the expensive tool worked too well. That is the part that gets under-reported: the problem was not that engineers were wasting tokens. The problem was that they were not.
Elsewhere, an AI consultant told Axios that one of their enterprise clients ran up a $500 million bill on Claude in a single month. No spending caps, no usage controls, just unrestricted access and a workforce that used it. That figure comes from a single unnamed source and has not been independently confirmed. But the pattern it describes, with costs that compound invisibly until they arrive all at once, is consistent with everything else happening in this space right now.
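A spending cap does not need to be sophisticated to be useful. The sketch below is a minimal illustration under my own assumptions: the `SpendCap` class and `BudgetExceeded` error are hypothetical names, not part of any vendor SDK. The $500 cap borrows the lower end of the per-engineer monthly figure above, and the $5 and $25 per million tokens are the Opus rates listed in the sources.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a charge would push spend past the cap."""


class SpendCap:
    """Tracks spend against a monthly cap and refuses charges that exceed it."""

    def __init__(self, monthly_cap_usd):
        self.monthly_cap_usd = monthly_cap_usd
        self.spent_usd = 0.0

    def charge(self, input_tokens, output_tokens,
               input_price_per_m, output_price_per_m):
        """Record the cost of one call, or raise before the cap is breached."""
        cost = (input_tokens * input_price_per_m
                + output_tokens * output_price_per_m) / 1_000_000
        if self.spent_usd + cost > self.monthly_cap_usd:
            raise BudgetExceeded(
                f"${cost:,.2f} would exceed the ${self.monthly_cap_usd:,.2f} cap"
            )
        self.spent_usd += cost
        return cost


cap = SpendCap(monthly_cap_usd=500.0)
first = cap.charge(2_000_000, 200_000, 5.0, 25.0)
print(f"session cost ${first:.2f}, total ${cap.spent_usd:.2f}")

try:
    # A hypothetical runaway session: 100 million input tokens in one go.
    cap.charge(100_000_000, 1_000_000, 5.0, 25.0)
except BudgetExceeded as err:
    print(f"blocked: {err}")
```

In practice the check would sit in front of the API client and estimate tokens before the call is made. The point of the sketch is only that the refusal happens before the money is spent, not after the invoice arrives.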
The mechanism matters here, and it is specific to agentic workflows. A single prompt to a language model is a bounded transaction. An agent running a multi-step workflow is not. It reads context, reasons, makes decisions and calls tools. Then, before the next step, it re-reads everything: the original prompt, every response, every tool output. The context snowballs. A study posted to arXiv in April 2026 found agentic tasks consume up to 1,000 times more tokens than standard model interactions. Model updates that shift reasoning architecture can change your consumption profile overnight, with no change to your code. The workflow that cost one dollar per run this quarter may cost five next quarter, because the model got better at thinking and thinking costs tokens.
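The snowball is easier to see in numbers. This is a simplified model of my own, assuming the full history is re-sent as input on every step and ignoring any caching discount a provider might apply. The prompt size, tokens added per step, and step count are all hypothetical.

```python
def cumulative_input_tokens(prompt_tokens, tokens_added_per_step, steps):
    """Total input tokens billed when every step re-reads the full history."""
    total = 0
    context = prompt_tokens
    for _ in range(steps):
        total += context                   # the whole context is sent again
        context += tokens_added_per_step   # response and tool output are appended
    return total


# A single bounded prompt: one step, nothing appended.
single = cumulative_input_tokens(2_000, 0, 1)

# A hypothetical 20-step agent run that adds 3,000 tokens per step.
agent = cumulative_input_tokens(2_000, 3_000, 20)

print(f"single prompt: {single:,} input tokens")
print(f"20-step agent: {agent:,} input tokens ({agent / single:.0f}x)")
```

The total grows with the square of the step count, not linearly, so doubling the length of an agent run roughly quadruples the input bill.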
The headline pricing narrative is that AI is getting cheaper, but the rate cards tell a different story. Claude Opus 4.8 costs five times more per token than Claude Haiku 4.5. Google’s Gemini Flash tier, designed as the affordable option, has risen five-fold in input price in under a year. One independent developer noted last week that all three major labs appear to be “probing the price tolerance of their API customers.” Token consumption is also rising faster than anyone budgeted for. You are paying more per unit, for more units, and the unit count is accelerating.
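To make the rate cards concrete, here is a small worked comparison using the per-million-token prices listed in the sources at the end of this piece. The dictionary keys are my own labels, not API model identifiers, and the run size of 200,000 input tokens and 20,000 output tokens is a hypothetical workload.

```python
# (input, output) prices in dollars per million tokens, from the sources list.
PRICES = {
    "claude-haiku-4.5": (1.00, 5.00),
    "claude-opus-4.8": (5.00, 25.00),
    "gemini-2.5-flash": (0.30, 2.50),
    "gemini-3.5-flash": (1.50, 9.00),
}


def run_cost(model, input_tokens, output_tokens):
    """Dollar cost of one run at the listed per-million-token prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 200,000 input tokens and 20,000 output tokens per run.
for model in PRICES:
    print(f"{model:<18} ${run_cost(model, 200_000, 20_000):.2f} per run")

flash_increase = (run_cost("gemini-3.5-flash", 200_000, 20_000)
                  / run_cost("gemini-2.5-flash", 200_000, 20_000))
print(f"Flash tier, same workload: {flash_increase:.1f}x more expensive")
```

The blended increase depends on the ratio of input to output tokens, so it is worth running this against your own traffic profile and not relying on the headline multiple.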
There is a question that most organisations deploying agentic AI are not asking. It costs nothing to ask now and a great deal to answer later.
Which tasks in this workflow actually require frontier reasoning?
Routing a document, classifying a ticket, summarising a meeting transcript: none of these requires the most capable model on the planet. According to Epoch AI, the most capable open-weight models now lag frontier closed models by an average of four months on aggregate capability measures. On coding and production workloads specifically, independent benchmarks put the gap as low as two to three percentage points, while open-weight models cost six to seven times less per output token. The gap that remains is real, but common enterprise workloads are not in it.
Frontier models earn their place. Complex reasoning under ambiguity, novel analysis, judgement calls at the edge of a model’s capability: these are genuinely different tasks that do benefit from the best available models. The question is not whether to use frontier models. It is whether every step in every workflow needs them.
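One way to force that question is to make the routing decision explicit in code. The sketch below is illustrative only: the task names and the two tiers are hypothetical labels drawn from the examples above, not a description of my home lab setup or of any framework's API.

```python
from enum import Enum


class Tier(Enum):
    LOCAL_SMALL = "local-small"
    FRONTIER = "frontier"


# Every task type gets a deliberate decision about the tier it needs.
ROUTES = {
    "route_document": Tier.LOCAL_SMALL,
    "classify_ticket": Tier.LOCAL_SMALL,
    "summarise_transcript": Tier.LOCAL_SMALL,
    "novel_analysis": Tier.FRONTIER,
    "ambiguous_judgement": Tier.FRONTIER,
}


def pick_tier(task_type):
    """Return the tier chosen for a task, refusing tasks nobody has classified."""
    try:
        return ROUTES[task_type]
    except KeyError:
        raise ValueError(
            f"no tier decided for task type {task_type!r}; add it to ROUTES"
        ) from None


print(pick_tier("classify_ticket").value)
print(pick_tier("novel_analysis").value)
```

Raising on an unknown task type, instead of silently defaulting to the most capable model, is a design choice: it means a new workflow step cannot reach the frontier tier without someone deciding that it should.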
The organisations that are not asking this question are not making a considered architectural choice. They are making the path-of-least-resistance choice while the pricing floor is low and the pressure to deploy is high. Whilst that is understandable, it is also how you end up with a system you cannot change without rebuilding it.
Migration costs more than people budget for, because it is not just rewriting API calls. It is revalidating outputs, because different models produce subtly different results and the downstream processes were calibrated to the original ones. It is rewriting prompt logic tuned over months to a specific model’s behaviour. It is re-testing agentic chains where one agent’s output format feeds the next agent’s input. And it is redoing the governance and risk assessment you completed the first time, under the time pressure of a system already in production.
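The handoff between agents is where a model swap tends to break things quietly. A minimal sketch of a contract check follows, assuming agents exchange JSON. The field names and the two sample outputs are hypothetical, chosen to show a replacement model returning a label where the downstream agent expects a number.

```python
import json

# Fields the downstream agent expects, with their types (hypothetical contract).
REQUIRED_FIELDS = {"ticket_id": str, "category": str, "confidence": float}


def validate_handoff(raw_output):
    """Return a list of problems in one agent's output; empty means it conforms."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]

    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            problems.append(f"missing field: {name}")
        elif not isinstance(data[name], expected):
            actual = type(data[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {actual}")
    return problems


old_model = '{"ticket_id": "T-1", "category": "billing", "confidence": 0.92}'
new_model = '{"ticket_id": "T-1", "category": "billing", "confidence": "high"}'

print(validate_handoff(old_model))
print(validate_handoff(new_model))
```

A check like this catches format drift but not the subtler problem of values that are well-formed and differently calibrated. That part still needs a regression set of real inputs compared across both models.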
That is the real lock-in: not contractual, but architectural.
Last week, I wrote about AI sovereignty, about what it means for public institutions to run critical infrastructure on systems controlled by private shareholders in another jurisdiction. The inference pricing question is the same argument, just one layer down.
An organisation that has routed its operational workflows through a frontier API has not just taken on a vendor relationship. It has taken on exposure to that vendor’s pricing decisions, its infrastructure availability, its jurisdictional obligations, and its commercial priorities. For most organisations, that is a manageable business risk. For critical national infrastructure (energy, water, transport, healthcare), it is a different category of problem entirely. Embedding frontier API dependencies into operational technology creates single points of failure in systems that were previously distributed and resilient by design.
The argument for thinking about this now is simple. The constraint my home lab imposed on me, to think about what each agent needs, to match the model to the task, to design for predictability and cost, is the same constraint that every organisation will face eventually. The difference is that I just faced it at home. Organisations that defer it face it later, under budget pressure, with running systems, and with users who have reorganised their work around the existing architecture.
GitHub named the problem in its own announcement. A quick chat and a multi-hour agentic session should not cost the same. The flat-rate era assumed they could, and it was wrong. The organisations now building agentic workflows on frontier APIs without asking which tasks actually need that level of capability are making the same assumption. They are just making it more expensively.
The floor is not permanent, and the decisions made while it holds are not either. But reversing them later, on running systems, under budget pressure, will cost considerably more than asking the right questions now. The bill is coming. The only question is whether we see it coming.
Sources & Further Reading
GitHub Blog. (27 April 2026). GitHub Copilot is moving to usage-based billing. github.blog/news-insights/company-news/github-copilot-is-moving-to-usage-based-billing
Flexprice. (18 April 2026). The complete guide to Cursor pricing in 2026. Cites Cursor public apology of 4 July 2025 and June 2025 billing change. flexprice.io/blog/cursor-pricing-guide
Forbes / The Information. (April 2026). Uber burns through entire 2026 AI budget in four months after Claude Code deployment. Reported across multiple outlets.
The Verge / TechRadar / People Matters. (May 2026). Microsoft cancels Claude Code licences across Experiences and Devices division. Reported across multiple outlets.
Axios via Tech Startups. (May 2026). Enterprise client runs up $500 million Claude bill in a single month. Source: unnamed AI consultant. Unverified, unattributed. techstartups.com
Bai et al. (29 April 2026). How Do AI Agents Spend Your Money? Analysing and Predicting Token Consumption in Agentic Coding Tasks. arXiv:2604.22750. Authors include Erik Brynjolfsson, Stanford Digital Economy Lab. arxiv.org/abs/2604.22750
Simon Willison. (19 May 2026). Gemini 3.5 Flash: more expensive, but Google plan to use it for everything. Source of “probing the price tolerance of their API customers” quote. simonwillison.net/2026/May/19/gemini-35-flash
Edwards, J. and Emberson, L. (2026). Open models lag state-of-the-art closed models by 4 months. Epoch AI. epoch.ai/data-insights/open-closed-eci-gap
OpenAI financial figures: $3.7bn revenue / $5bn loss 2024 — multiple sources. $20bn ARR and $14bn projected 2026 loss — Fortune, CNBC, Reuters, January 2026. OpenAI CFO Sarah Friar blog post, 18 January 2026.
Claude API pricing: Anthropic platform pricing page. Haiku 4.5 at $1/$5 per million tokens; Opus 4.8 at $5/$25 per million tokens. anthropic.com
Gemini Flash pricing trajectory: Gemini 2.5 Flash at $0.30/$2.50 (June 2025) to Gemini 3.5 Flash at $1.50/$9.00 (May 2026). Google AI / Gemini API documentation and simonwillison.net analysis.
Deep Infra. (May 2026). Open-Source vs Closed-Source AI Models: Is the Gap Worth It? Cites 2–3 percentage point coding benchmark gap and 6–7x output token cost advantage for open-weight models. deepinfra.com/blog/open-source-vs-closed-source-ai-models-price-gap


