The Token Scarcity Era: From AI Subsidy to the New Constraint

TomT
Apr 1
Updated: Jun 11

The token scarcity era is the moment the economics of building with AI inverted — when "use the biggest model for everything" stopped being clever and started being expensive. This is Part 0 of The Token Scarcity Playbook: why every team shipping AI in 2026 is suddenly counting tokens, and the tactics that follow.

"Per-token pricing is the rate. Tokens-to-completion is the invoice."

From subsidized inference to metered billing
Demand growth outpacing compute supply
Cost-per-token vs. cost-per-task
Application-layer adaptation tactics
What this series covers

From subsidized inference to metered billing

For about two years, the smartest way to build with AI was to not think about cost at all.

The max-tier lab subscriptions — the $100, $200, and $300-a-month plans — were a quiet bargain for the people using them and a loss-maker for the companies selling them. In January 2025, Sam Altman said plainly that OpenAI was losing money on its $200/month ChatGPT Pro plan because people used it far more than the price assumed (Fortune). That was the whole era in one admission: the labs subsidized usage to win developers, and the most active developers extracted many times what they paid. When the marginal cost of an experiment rounds to zero, you stop weighing experiments — you just point an agent at the whole repository "to see." The subsidy didn't only lower costs; it removed a discipline nobody noticed was load-bearing.

That discipline came back over a few weeks in spring 2026. GitHub Copilot, the most widely deployed AI coding tool anywhere, switched from flat seat pricing to usage-based "AI Credits" billing (GitHub), and developers revolted at getting less for the same price (Visual Studio Magazine). Enterprises hit the wall from the other side: Walmart capped an internal AI tool after demand outran budget (Bloomberg), and Uber reportedly burned its annual AI-tooling budget in roughly four months, then imposed a hard per-developer cap (TechCrunch). None of them decided AI was a bad bet. They decided, all at once, that they had to manage it.

Demand growth outpacing compute supply

The subsidy couldn't last because the demand curve stopped looking like a curve.

Google gives the cleanest first-party read: in Q4 2025 its models were processing over 10 billion tokens per minute through its APIs, up from ~7 billion a quarter earlier. At the Q4 rate, one provider handles roughly 14 trillion tokens a day (Alphabet Q4 2025). Goldman Sachs projects that agentic AI will drive a ~24x increase in token consumption by 2030 (Goldman Sachs). The word doing the work there is agentic: a Stanford and Microsoft study found that agentic coding tasks consume on the order of 1,000x more tokens than chat, dominated by the input context an agent re-reads on every step (arXiv:2604.22750).
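Why re-reading context dominates can be seen in a small Python sketch. It assumes a simplified agent loop in which every step re-sends the full context and appends a fixed amount of new material. The function name and all the numbers (a 2,000-token starting context, 500 tokens added per step, 50 steps) are hypothetical and chosen only to show the shape of the growth; they are not taken from the study cited above.

```python
def cumulative_input_tokens(steps: int, base_context: int, added_per_step: int) -> int:
    """Total input tokens billed when every step re-sends the whole context so far.

    Toy model: the context starts at `base_context` tokens and grows by
    `added_per_step` tokens (tool output, intermediate reasoning) after each step.
    """
    total = 0
    context = base_context
    for _ in range(steps):
        total += context           # the model re-reads everything accumulated so far
        context += added_per_step  # new material is appended for the next step
    return total


# Illustrative numbers only, not measurements.
single_chat_turn = cumulative_input_tokens(steps=1, base_context=2_000, added_per_step=500)
agent_run = cumulative_input_tokens(steps=50, base_context=2_000, added_per_step=500)

print(single_chat_turn)  # 2000
print(agent_run)         # 712500
```

Under these assumptions the total grows roughly with the square of the step count, because each step pays again for everything added before it. Real agents cache, truncate, and summarize, so actual bills will differ.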

Supply couldn't keep pace. SemiAnalysis described a real silicon shortage, with AI on track to consume just under 60% of TSMC's leading-edge N3 capacity in 2026 (SemiAnalysis); HBM and DRAM moved into outright shortage (Fortune). The squeeze reached the labs directly — Anthropic, compute-constrained, struck a deal to use SpaceX Colossus capacity just to raise its usage limits (Anthropic). When a frontier lab rents a rocket company's data center to serve demand, the subsidy era is structurally over.

Cost-per-token vs. cost-per-task

The most useful correction of this era is also the simplest: the price per token is the rate, but tokens-to-completion is the invoice.

A model can win on sticker price and lose badly on the job, because one that "thinks out loud" burns multiples more tokens to reach the same answer. It's measured: the OckBench study found reasoning models spending up to 5x more tokens at equal accuracy — the same correct answer, five times the bill (arXiv:2511.05722). So "cheapest per token" and "cheapest per task" are different numbers that routinely disagree, and using the most expensive model for everything isn't a quality strategy — it's a tax for not having built the machinery to choose. The right metric is cost per correct outcome, not cost per million tokens.
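The gap between the rate and the invoice can be made concrete with a minimal Python sketch. The two model profiles, their prices, token counts, and success rates are invented for illustration; they are not figures from the OckBench study or from any provider's price list, and `cost_per_correct_outcome` is a hypothetical helper rather than an established formula from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    """Illustrative numbers only; not real prices or benchmark results."""
    name: str
    usd_per_million_tokens: float  # the "rate"
    tokens_per_task: int           # average tokens-to-completion
    success_rate: float            # fraction of tasks answered correctly


def cost_per_task(m: ModelProfile) -> float:
    """The 'invoice' for one attempt: rate times tokens actually consumed."""
    return m.usd_per_million_tokens * m.tokens_per_task / 1_000_000


def cost_per_correct_outcome(m: ModelProfile) -> float:
    """Failed attempts are billed too, so spread the cost over successes only."""
    return cost_per_task(m) / m.success_rate


# Hypothetical pair: the verbose model is 3x cheaper per token
# but spends 5x as many tokens to reach the same accuracy.
terse = ModelProfile("terse-model", 3.00, 2_000, 0.90)
verbose = ModelProfile("verbose-reasoner", 1.00, 10_000, 0.90)

for m in (terse, verbose):
    print(f"{m.name}: ${cost_per_task(m):.4f} per task, "
          f"${cost_per_correct_outcome(m):.4f} per correct outcome")
```

With these made-up inputs the model with the lower sticker price costs more per task and per correct outcome, which is the disagreement described above.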

Application-layer adaptation tactics

The same pressure that ended the free lunch is producing better engineering — the token cost of intelligence is falling fast even as total demand rises. Four tactics define how teams adapt, and each gets a part of this series:

Route, don't default. Send each task to the cheapest model that can actually do it, escalating only when the task demands it. This is the single biggest lever (Part 1).
Stay model-agnostic. Build so you can swap models as cheaper ones catch up — behind a stable contract, with tests that catch regressions before users do (Part 2).
Pick the right harness. Agent frameworks differ sharply in token overhead and control; the framework choice is itself a cost decision (Part 3).
Engineer the context. In agentic workloads the context is the cost, so manage it deliberately — offload, summarize, and only load the tools you need (Part 4).
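As a rough illustration of the first two tactics, the Python sketch below tries the cheapest model first and escalates only when a quality check fails, with every model hidden behind the same prompt-to-answer callable. This is one simple reading of routing (a cascade), not necessarily the approach the later parts describe. The stub models, the relative costs, and the `accept` check are all hypothetical stand-ins; a real system would call actual model APIs and use a real evaluator.

```python
from typing import Callable, Sequence, Tuple

# Stable contract: any model is just a callable from prompt to answer,
# so tiers can be swapped without touching the routing logic.
Model = Callable[[str], str]
Tier = Tuple[str, float, Model]  # (name, relative cost per call, model)


def route(
    prompt: str,
    tiers: Sequence[Tier],
    is_good_enough: Callable[[str, str], bool],
) -> Tuple[str, str, float]:
    """Try tiers cheapest first; escalate when the quality check fails.

    Returns (name of the model whose answer is returned, answer, total cost spent).
    """
    ordered = sorted(tiers, key=lambda t: t[1])
    if not ordered:
        raise ValueError("at least one tier is required")
    spent = 0.0
    name, answer = "", ""
    for name, cost, model in ordered:
        answer = model(prompt)
        spent += cost  # rejected attempts still count toward the bill
        if is_good_enough(prompt, answer):
            break
    # If no tier passed, this is the most expensive tier's attempt.
    return name, answer, spent


# Hypothetical stubs standing in for real model calls.
def small_model(prompt: str) -> str:
    return "4" if prompt == "2+2?" else "not sure"


def large_model(prompt: str) -> str:
    return "4" if prompt == "2+2?" else "a detailed answer"


def accept(prompt: str, answer: str) -> bool:
    return answer != "not sure"


TIERS = [("large", 10.0, large_model), ("small", 1.0, small_model)]

easy = route("2+2?", TIERS, accept)
hard = route("Explain this stack trace", TIERS, accept)
print(easy)  # ('small', '4', 1.0)
print(hard)  # ('large', 'a detailed answer', 11.0)
```

The hard prompt costs 11.0 rather than 10.0 because the failed cheap attempt is paid for as well. A cascade like this only saves money when most traffic is resolved at the lower tier.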

There's a human lever too: a KPMG / UT-Austin study of 1.4 million workplace AI interactions found the highest-impact users weren't better at prompt syntax — they framed problems and iterated deliberately, which also wastes fewer tokens (KPMG / UT-Austin). Precision is a discipline, and it's teachable.

What this series covers

The shortage isn't a passing squeeze to wait out — the supply chain behind it (power, fabs, memory) takes years to build, so efficiency is the defining constraint of the next several years. The rest of the playbook is the practical response:

Part 1 — How Model Routing Works: techniques, players, and frameworks for routing each task to the right model.
Part 2 — Making Applications Model-Agnostic by Design: swapping models safely, with testing and grouping.
Part 3 — Agent Frameworks Compared: LangGraph, LangChain, CrewAI, AWS Strands, Google ADK, and deepagents.
Part 4 — Building a Multi-Agent Orchestration System: routing plus a long-horizon harness, in production on AWS.

The subsidy era rewarded speed without thought. The scarcity era rewards the same speed with thought added back — which is a better way to build anyway.

Next → Part 1: How Model Routing Works.
