top of page

How Model Routing Works: Techniques, Players, and Frameworks

  • TomT
  • Apr 23
  • 8 min read

Model routing is the practice of sending each request to the cheapest model that can actually handle it, instead of defaulting every task to the flagship. It is the single largest cost lever in the token scarcity era — and, done well, it improves quality and cost at the same time. This is Part 1 of The Token Scarcity Playbook.

"The insight isn't that the cheap model beat the flagship. It's that routing beat brute force."

Table of Contents


The four approaches to routing

Strip away the marketing and essentially all production routing falls into four families, distinguished by what signal they use to decide — a taxonomy a 2026 survey of dynamic routing and cascading frames along three axes: when the decision is made, what information it uses, and how it is computed (arXiv:2603.04445).

Semantic / intent routing reads the incoming request — usually with a lightweight embedding model or a small fine-tuned classifier — maps it to a task category, and dispatches to the cheapest model that reliably handles that category. Cascading starts cheap and escalates only on failure. Budget-constrained routing treats cost as a hard ceiling and maximizes quality within it. Mechanistic routing is the frontier: it decides from the model's own internal activations rather than the surface text. The first two are mature and in production everywhere; the last two are where the research is moving. The common runtime shape is a classifier in front of a tiered set of models, with an escalation gate between tiers:

Figure: routing with cascading escalation.


Cascading and escalation

A cascade starts with the cheapest model and escalates only when a quality gate fails — a confidence score, the model's own self-verification, or a lightweight judge decides whether the cheap answer stands or the query moves up a tier. The idea is almost embarrassingly simple: most queries don't need your best model, so stop paying for it on the ones that don't. The 2026 routing survey catalogs cascading as one of the core paradigms and names the inefficiency it fixes — "overspending on easy queries, underspending on hard ones" (arXiv:2603.04445).

Routing has also generalized beyond a fixed model pool. UniRoute (Universal Model Routing) represents each model as a feature vector from its performance on representative prompts, and routes to the smallest capable model even among 30+ models unseen at training time — which matters in production, where new model versions arrive constantly (arXiv:2502.08773). Cascading's one weakness is latency: an escalation runs two models in sequence — fine for asynchronous work, painful for interactive UX — so choose it where you can absorb the occasional double-hop.


Budget-constrained routing

When cost is a hard line rather than something to minimize, the framing changes. PILOT (Preference-prior Informed LinUCB for Adaptive Routing, arXiv:2508.21141) treats routing as a contextual bandit — learning a shared embedding of queries and models from binary thumbs-up/down feedback, then allocating a fixed budget across incoming queries with an online multi-choice knapsack formulation. The result: 93% of GPT-4's performance at 25% of its cost in the multi-task setting. The practical appeal is the dial — operators can move the cost/quality trade-off at inference time without retraining the router.


Mechanistic routing

The newest family routes on signals from inside the model. "LLM Router: Rethinking Routing with Prefill Activations" (arXiv:2603.20895) makes routing decisions from internal prefill activations rather than surface features, on the premise that a model's hidden state during prefill already encodes how hard the query is and whether it's likely to succeed. It is the most accurate signal and the most expensive to implement, and it is mostly still in the lab — but it is the clearest direction of travel: per-step routing decisions, e.g. R2-Reasoner's reported 84% API-cost reduction by routing each reasoning step to a different model (arXiv:2506.05901).


Routing in production: the worker–advisor pattern

The most quoted production proof point in 2026 is Harvey's Legal Agent Benchmark. In its hybrid setup, an open-source model did the bulk of the work as the worker, and a frontier model (Claude Opus) was available as a callable advisor, invoked sparingly — under once per task on average. By Harvey's reported figures, the hybrid passed more of the benchmark at roughly $368 versus ~$954 for the frontier model run standalone (Harvey; Fireworks; ZenML).

The pattern generalizes: pay frontier prices only for the subproblems that genuinely need frontier reasoning, and let a cheaper worker handle the rest. Opus isn't the engine here — it's a tool the engine calls when it's stuck.

Figure: the worker–advisor pattern.

Implementation notes. Build the router as a small fine-tuned classifier or embedding-similarity match keyed to task classes, and keep the routing table in config so thresholds are tunable without a redeploy. Gate cascade escalation on a cheap signal — self-consistency, a logprob/confidence threshold, or a lightweight judge — and cap escalation depth to bound worst-case latency. For the worker–advisor pattern, expose the frontier model as a callable tool with a hard per-task call budget, and log every advisor invocation so you can confirm it stays sparse. Cascades suit async/batch work; for interactive UX, prefer routing up front with the classifier over escalating mid-request.

Routing below the application layer

Routing isn't only an application concern; it happens in the infrastructure too. NVIDIA's open-source vLLM Semantic Router uses a ModernBERT classifier to send reasoning-heavy queries to chain-of-thought models and simple ones to a fast path, reporting meaningful accuracy, latency, and token improvements (vLLM; Red Hat). One level lower still, NVIDIA Run:ai routes at the hardware layer with GPU fractioning — its published benchmarks show a 0.5-GPU fraction delivering 152,694 tokens/sec, about 77% of a full GPU's throughput, letting more workloads share fewer GPUs (NVIDIA). The lesson: "which model" is only one routing question; "which GPU fraction" is another, and both move the bill.


The commercial routing layer

If you don't want to build any of this, the productized options are now strong. OpenRouter's Auto Router (openrouter/auto) picks a model per prompt and charges the standard rate for whatever it selects (OpenRouter) — a one-line entry point, and OpenRouter's own Series B funding signals how durable this layer is judged to be (OpenRouter). LiteLLM gives you a unified, OpenAI-compatible interface across 100+ providers with fallback and cost tracking, for teams that want routing control on their own infrastructure (GitHub). And Factory Router reports delivering Opus-level performance at 20–25% lower cost by routing each task type to the point on the cost-quality curve that holds most of the quality (Factory).


Reference design: a production routing layer

Putting the pieces together, a production routing layer is six components: a request hits a router that classifies it; a routing policy maps the class to a default tier and cost budget; the model runs; a quality gate decides whether to return or escalate; telemetry tracks cost-per-outcome; and a failover path covers provider outages.


Figure: model-routing layer reference architecture.


What a reference design is. It's a blueprint you adapt, not a library you install — it names the components a routing layer needs and how they connect, so you can implement it on any stack (LangGraph, LiteLLM, or a custom service) and still get the same properties: cheap-by-default, escalate-on-need, observable, and resilient. The defining principle is separation of the routing decision from execution: the router decides what kind of request this is, a policy (plain versioned config) decides which model that kind gets, and the model simply runs. Keeping the decision in config rather than code means you can re-tune routing — or roll it back — without redeploying the application, and you can test the policy in isolation.


How a request flows through it. A request arrives and the router classifies it into a task class — "simple extraction," "code generation," "multi-step reasoning," and so on — using cheap embedding-similarity or a small classifier. That step is fast and nearly free, so the routing decision itself doesn't erode the savings it unlocks. The routing policy looks up that class and returns the default tier plus a cost budget. The selected model in the tier pool executes the task, and its output passes through the quality gate, which applies one cheap check — a confidence score, the model's own self-verification, or a small judge — and then either returns the result or escalates the same request to the next tier up. Every outcome is recorded by telemetry as cost-per-outcome, latency, and escalation rate — the signal that tells you whether the policy is still tuned correctly. If a provider is down or capacity-constrained, failover reroutes to an alternate so a single outage doesn't take the system down.


Worked example. A one-line documentation fix arrives. The router tags it "simple edit," the policy sends it to the fast tier, the model returns a clean diff, the gate's confidence check passes, and it returns — for a fraction of a cent, never touching the flagship. The next request is a complex architecture question: the router tags it "multi-step reasoning," the policy starts it on the mid tier, the gate's self-verification flags low confidence, and it escalates to the flagship, which answers. Two requests, two very different costs, one policy doing the deciding — that is the entire point of the design.

The six components, and what to build each with:

Component

Responsibility

Build it with

Router / classifier

Map each request to a task class

Embedding-similarity to class exemplars, or a small fine-tuned classifier (ModernBERT-class); vLLM Semantic Router

Routing policy

Task class → default tier + cost budget

Versioned config holding the Pareto point per class

Model tier pool

Candidate models per tier (fast / balanced / flagship)

A unified interface such as LiteLLM

Quality gate

Decide pass vs. escalate

Confidence/logprob threshold, self-verification, or a lightweight judge

Telemetry

Cost-per-outcome, latency, escalation rate

LangSmith or OpenTelemetry

Failover

Provider outage / capacity

LiteLLM fallback routing

Build steps:

  1. Define a task taxonomy and a golden eval set per class — and make cost per correct outcome the metric (Part 0), not cost per token.

  2. Build the router — start with embedding-similarity to per-class exemplars; graduate to a fine-tuned classifier only if accuracy demands it.

  3. Set the policy at the Pareto point — for each class, the cheapest tier that holds ~95% of frontier quality on that class's eval; record it in versioned config so it's testable and reversible.

  4. Add the escalation gate on a cheap signal, and cap escalation depth to bound worst-case latency.

  5. For agentic tasks, use the worker–advisor pattern — the frontier model as a budgeted, callable tool, not the default executor.

  6. Instrument and re-tune — watch escalation rate and cost-per-outcome, and revisit the policy as models and prices change.


Build vs. buy:

Your situation

Use

Fastest start, no infrastructure

Self-hosted, multi-provider, on-prem data

Routing as explicit, testable graph nodes

Domain quality + cost at scale

Worker–advisor (the Harvey pattern)


Adaptation tactics

Routing is the highest-leverage move in the scarcity era, and you don't need the research frontier to capture most of it:

  • Default cheap, escalate on signal. Start every task on the cheapest capable tier; promote only when a quality check or confidence score says to — the cascading pattern catalogued in the dynamic routing & cascading survey and generalized by UniRoute.

  • Use the worker–advisor pattern for agents. Make the frontier model a callable tool, not the default executor — proven in production by Harvey's Legal Agent Benchmark (hybrid ≈ $368 vs frontier-only ≈ $954; ZenML write-up).

  • Adopt a routing layer before building one. OpenRouter Auto Router or LiteLLM capture most of the savings in an afternoon; build custom routing (e.g. LangGraph conditional edges) only when you need explicit control.

  • Measure cost per correct outcome, not per token — verbose reasoning models can burn up to 5x more tokens at equal accuracy, so price-per-token routing can pick the model that's more expensive per task (Part 0).

The next part addresses the risk routing creates: once you're swapping models per task, your application has to survive those swaps without breaking.



References

  1. Dynamic Model Routing and Cascading for Efficient LLM Inference: A Survey (arXiv:2603.04445, Feb 2026)

  2. UniRoute — Universal Model Routing for Efficient LLM Inference (arXiv:2502.08773, 2025)

  3. PILOT — contextual-bandit budget routing, 93% of GPT-4 at 25% cost (arXiv:2508.21141)

  4. LLM Router: Rethinking Routing with Prefill Activations (arXiv:2603.20895)

  5. R2-Reasoner — per-step routing, ~84% API cost reduction (arXiv:2506.05901)

  6. Harvey — Introducing the Legal Agent Benchmark

  7. Fireworks AI — Kimi K2.5 post-training

  8. ZenML — Harvey hybrid worker/advisor architecture

  9. vLLM — Semantic Router

  10. Red Hat — vLLM Semantic Router

  11. NVIDIA — GPU fractioning throughput with Run:ai

  12. OpenRouter — Auto Router docs

  13. OpenRouter — Series B

  14. LiteLLM — GitHub

  15. Factory — Factory Router (Opus performance at 20–25% lower cost)

Recent Posts

See All

Comments


bottom of page