Making Applications Model-Agnostic by Design

TomT
May 7

A model-agnostic application is one you can move from one model to another — to capture a cheaper option or route per task — without rewriting logic or shipping silent regressions. Routing (Part 1) only pays off if your system survives the swaps. This is Part 2 of The Token Scarcity Playbook.

"The schema is the contract; the model is an implementation detail."

Why model swaps break things
Designing for agnosticism
Grouping models into tiers
The benchmarks that still differentiate
The testing stack
Reference design: the model-agnostic stack
Adaptation tactics

Why model swaps break things

Traditional software testing rests on an assumption LLMs violate: that the same input yields the same output. With an LLM it does not, not even at temperature zero.

This is well documented in recent work. The instability isn't only sampling: changing the serving configuration — GPU count, GPU type, batch size — shifts outputs too. A 2025 study showed that under bf16 greedy decoding (temperature zero), a reasoning model exhibited up to 9% accuracy variation and a 9,000-token swing in response length purely from infrastructure differences (arXiv:2506.09501). And repeating identical prompts five times at temperature zero still produced inconsistent results across multiple models — limited test-retest reliability even with sampling disabled (arXiv:2502.20747).

If a single model wobbles this much, swapping between different models — with different verbosity, formatting, and instruction-following defaults — is a far larger jump. The job of a model-agnostic design is to absorb that jump.

Designing for agnosticism

The core move is to make the model an implementation detail behind a stable contract. Three rules carry most of the weight.

First, define every interface as a typed schema, not freeform text. If model A and model B both return {"action": "search", "query": "..."}, the next step in your pipeline doesn't care how differently they reasoned to get there. The schema is the contract; the model is swappable underneath it.
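As a minimal sketch of this first rule, the contract for the {"action": ..., "query": ...} step might look like the following. It uses only the standard library to stay self-contained; in a real system Pydantic or JSON Schema would play this role. The names AgentStep, ContractError, parse_step, and the allowed action set are illustrative assumptions, not part of any library.

```python
import json
from dataclasses import dataclass

# Hypothetical action set for illustration; a real pipeline defines its own.
ALLOWED_ACTIONS = ("search", "answer")


class ContractError(ValueError):
    """Raised when a model's output does not satisfy the contract."""


@dataclass(frozen=True)
class AgentStep:
    action: str
    query: str


def parse_step(raw: str) -> AgentStep:
    """Parse raw model text into an AgentStep, or raise ContractError."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ContractError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ContractError("expected a JSON object")
    if set(data) != {"action", "query"}:
        raise ContractError(f"unexpected or missing keys: {sorted(data)}")
    action, query = data["action"], data["query"]
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise ContractError(f"unknown action: {action!r}")
    if not isinstance(query, str) or not query.strip():
        raise ContractError("query must be a non-empty string")
    return AgentStep(action=action, query=query)
```

Downstream code only ever sees an AgentStep. Two models that format their JSON differently (key order, whitespace) produce equal values, and a model that replies in prose fails loudly at the boundary instead of leaking into the next step.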

Second, write prompts to a capability tier, not a model's quirks. A prompt that assumes "a model that follows multi-step instructions and emits JSON" ports across providers; one that leans on a single model's idiosyncratic behavior does not. Verbose, explicit instructions travel better than clever short ones.

Third, separate the prompt from model selection. Keep prompts versioned in a registry so that when you swap models, the prompt is a controlled variable — you can test whether the existing prompt works on the new model instead of changing both at once.
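One way to picture that separation is an in-memory registry keyed by task class and version, sketched below. PromptRegistry, its methods, and the example task class are hypothetical names for illustration; a git-versioned directory or a hosted tool would back this in practice.

```python
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class PromptKey:
    task_class: str
    version: str


class PromptRegistry:
    """Maps (task_class, version) to an immutable prompt template."""

    def __init__(self) -> None:
        self._templates: dict[PromptKey, Template] = {}

    def register(self, task_class: str, version: str, text: str) -> None:
        key = PromptKey(task_class, version)
        if key in self._templates:
            # A published version never changes; edits get a new version.
            raise ValueError(f"{task_class}@{version} is already registered")
        self._templates[key] = Template(text)

    def render(self, task_class: str, version: str, **variables: str) -> str:
        # KeyError here means an unknown prompt or a missing variable.
        return self._templates[PromptKey(task_class, version)].substitute(**variables)


registry = PromptRegistry()
registry.register(
    "ticket_triage",  # hypothetical task class
    "v3",
    "Follow the steps in order and reply with a single JSON object.\n"
    "Ticket: $ticket",
)

# The rendered prompt does not depend on which model will receive it,
# so a model swap changes exactly one variable.
prompt = registry.render("ticket_triage", "v3", ticket="Login page returns 500")
```

Because render takes no model argument, testing a new model against the existing prompt is the default path rather than something you have to remember to do.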

The shape that results: business logic never touches a provider SDK directly; it emits a model-agnostic prompt against a versioned template, a router picks the tier, and a schema validator guarantees the output contract regardless of which model produced it.

Figure: model-agnostic application architecture.

Implementation notes. Define the contract with Pydantic or JSON Schema and enforce it with strict/constrained decoding where the provider supports it; on a validation miss, run a bounded repair loop before escalating a tier. Keep prompts in a real registry (a git-versioned directory or a tool like LangSmith) keyed by task_class + version, never inline in app code. Put a thin abstraction (LiteLLM or your own adapter) between business logic and providers so a swap is a config change, not a code change.
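The bounded repair loop and tier escalation can be sketched as follows. Each model is represented by a plain callable from prompt to raw text, standing in for whatever adapter sits in front of your providers; validate, call_with_repair, and the wording of the repair message are assumptions for illustration, not any library's API.

```python
import json
from typing import Callable

# Stand-in for an adapter call: takes a prompt, returns raw model text.
ModelFn = Callable[[str], str]


def validate(raw: str) -> dict:
    """Return the parsed output or raise ValueError if it breaks the contract."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if (
        not isinstance(data, dict)
        or not isinstance(data.get("action"), str)
        or not isinstance(data.get("query"), str)
    ):
        raise ValueError("expected an object with string 'action' and 'query'")
    return data


def call_with_repair(
    prompt: str,
    tiers: list[tuple[str, ModelFn]],
    max_repairs: int = 2,
) -> tuple[str, dict]:
    """Try each tier in order; allow max_repairs retries before escalating."""
    last_error: Exception | None = None
    for tier_name, model in tiers:
        attempt_prompt = prompt
        for _ in range(max_repairs + 1):  # one initial try plus the repairs
            raw = model(attempt_prompt)
            try:
                return tier_name, validate(raw)
            except ValueError as exc:
                last_error = exc
                # Feed the validation error back so the model can correct itself.
                attempt_prompt = (
                    f"{prompt}\n\nYour previous reply was rejected: {exc}\n"
                    "Reply with only the JSON object."
                )
    raise RuntimeError(f"all tiers failed the contract: {last_error}")
```

The loop is bounded on purpose: a model that cannot meet the contract after a couple of repairs is unlikely to get there on the next try, and escalating is cheaper than retrying indefinitely.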

Grouping models into tiers

In practice teams organize candidate models into a few interchangeable tiers — a fast/cheap tier, a balanced tier, and a flagship tier — and route within them. The tiering is a useful framing, not a law: interchangeability is always relative to a task. Two models that are interchangeable for code generation may diverge badly on structured legal extraction.

What makes two models genuinely interchangeable for a task is measurable: comparable schema-conformance on your output format, semantic equivalence on a golden set (better checked with an LLM-as-judge equivalence test than with string overlap — see the testing stack below), task-success parity within a few percent, and — critically — overlapping failure modes. Two models can match on average and still fail on completely different inputs, which is exactly the regression a swap introduces.
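The failure-mode point is easy to measure once you have per-case pass/fail results for both models on the same golden set. The sketch below uses Jaccard overlap of the failing case IDs as the measure; that choice, the function name, and the toy data are assumptions for illustration.

```python
def failure_overlap(incumbent: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Compare two models' pass/fail results keyed by golden-set case ID."""
    shared = incumbent.keys() & candidate.keys()
    if not shared:
        raise ValueError("the two result sets share no case IDs")
    fail_inc = {case for case in shared if not incumbent[case]}
    fail_cand = {case for case in shared if not candidate[case]}
    union = fail_inc | fail_cand
    return {
        "incumbent_pass_rate": (len(shared) - len(fail_inc)) / len(shared),
        "candidate_pass_rate": (len(shared) - len(fail_cand)) / len(shared),
        # 1.0 means both models fail on exactly the same cases, 0.0 on none.
        "failure_jaccard": len(fail_inc & fail_cand) / len(union) if union else 1.0,
        # Cases the candidate breaks that the incumbent handled.
        "new_failures": sorted(fail_cand - fail_inc),
    }


# Toy data: ten cases, each model fails two, but not the same two.
cases = [f"case-{i}" for i in range(10)]
incumbent = {c: c not in {"case-0", "case-1"} for c in cases}
candidate = {c: c not in {"case-8", "case-9"} for c in cases}
report = failure_overlap(incumbent, candidate)
```

In this toy run both pass rates are identical, yet the failure overlap is zero and new_failures lists two cases that only the candidate gets wrong, which is the situation an average score cannot show.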

The benchmarks that still differentiate

Standard benchmarks no longer help here: MMLU and HumanEval are saturated above 90% across frontier models. The benchmarks that still separate models in 2026 are harder and more specific:

BFCL (Berkeley Function-Calling Leaderboard) for tool-use accuracy, the failure mode that actually breaks agents. It is maintained as a continuously updated live leaderboard by the Gorilla project.
SciDesignBench (2026), scientist-grounded inverse-design tasks — where the best zero-shot model scores just 29.0% across 520 tasks in 14 domains, a vivid reminder that "saturated" is benchmark-specific (arXiv:2603.12724).

For agentic interchangeability, BFCL is the one to watch: tool-call reliability is what breaks pipelines, not trivia recall.

The testing stack

Catching swap regressions takes a layered evaluation pipeline, not a single accuracy number. Treat it as four gates, each of which can block the release on its own:

Figure: the four-gate model-swap validation pipeline.

It starts with a golden dataset built from real production failures, one per task class, not synthetic examples. The release decision should hinge on per-cohort deltas, not the aggregate: an overall pass rate creeping from 0.91 to 0.93 can hide a specific cohort collapsing from 0.94 to 0.83 — exactly the regression that reaches users.
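A per-cohort gate is only a few lines. The sketch below blocks on any cohort whose pass rate drops by more than a tolerance; the 0.02 tolerance, the cohort names, and all numbers other than the 0.94 to 0.83 collapse are made up for illustration.

```python
def blocked_cohorts(
    incumbent: dict[str, float],
    candidate: dict[str, float],
    max_drop: float = 0.02,  # assumed tolerance; tune per task class
) -> list[str]:
    """Return cohorts that regress by more than max_drop or were not evaluated."""
    blocked = []
    for cohort, baseline in incumbent.items():
        score = candidate.get(cohort)
        if score is None or baseline - score > max_drop:
            blocked.append(cohort)
    return sorted(blocked)


# Hypothetical pass rates per cohort on the golden set.
incumbent_scores = {"billing": 0.94, "onboarding": 0.90, "search": 0.89}
candidate_scores = {"billing": 0.83, "onboarding": 0.96, "search": 0.97}

failures = blocked_cohorts(incumbent_scores, candidate_scores)
release_allowed = not failures
```

Here two cohorts improve and one collapses, so the release is blocked on the "billing" cohort no matter what the aggregate says. A cohort missing from the candidate's results is treated as a failure rather than silently skipped.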

On top of that sits LLM-as-judge for semantic equivalence — fast and cheap, but to be used with eyes open. A late-2025 study found even top judges (GPT-5, Gemini 2.5 Pro) fail to hold consistent preferences in nearly a quarter of hard cases, and that human annotation itself is inconsistent enough to be a shaky "gold standard" (arXiv:2512.16041). The same work points to what actually helps: explicit rubrics, panel-of-judges aggregation, and fine-tuned judges — and you should always judge with a different model family than the one under test to avoid self-enhancement bias. For swap validation the question isn't "is this good," it's "is this equivalent enough to substitute."
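A panel that enforces the different-family rule can be sketched as below. Each judge is a callable that answers the substitutability question for one pair of outputs; the Judge type, panel_equivalent, the majority rule, and the family labels are assumptions for illustration, and the stub judges stand in for real model calls.

```python
from typing import Callable, NamedTuple


class Judge(NamedTuple):
    family: str  # model family of the judge, e.g. a provider or model line
    is_equivalent: Callable[[str, str], bool]  # (incumbent_out, candidate_out)


def panel_equivalent(
    judges: list[Judge],
    family_under_test: str,
    incumbent_out: str,
    candidate_out: str,
) -> bool:
    """Majority vote among judges that are not from the family under test."""
    eligible = [j for j in judges if j.family != family_under_test]
    if not eligible:
        raise ValueError("no judge from a different model family is available")
    votes = sum(1 for j in eligible if j.is_equivalent(incumbent_out, candidate_out))
    # Strict majority; a tie counts as "not equivalent" to stay conservative.
    return votes * 2 > len(eligible)


# Stub judges for illustration only.
panel = [
    Judge("family-a", lambda a, b: True),   # same family as the candidate: excluded
    Judge("family-b", lambda a, b: True),
    Judge("family-c", lambda a, b: False),
    Judge("family-d", lambda a, b: True),
]
verdict = panel_equivalent(panel, "family-a", "Refund approved.", "The refund is approved.")
```

Excluding same-family judges happens inside the function, so the rule cannot be skipped by a caller who forgets it.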

Two more layers harden it. The first is systematic regression testing across the swap: frameworks like ReCatcher compare a current model against a candidate update across logical correctness, code quality, and performance, and find correctness and error-handling the most regression-prone areas when models change (arXiv:2507.19390). The second is shadow deployment: running the candidate on a sample of live traffic before committing, to catch the distribution shift your golden set didn't anticipate. When prompts don't port cleanly, automated per-model prompt optimization can close the gap; Amazon's Promptimus reports gains of 3.18%–90.27% across seven models and nine tasks with no manual prompt engineering (Amazon Science).

Reference design: the model-agnostic stack

The pattern is contract-first: business logic talks to a versioned prompt and a typed output contract — never a provider SDK directly; an abstraction layer routes to the model; and an eval harness gates every change.

Figure: model-agnostic stack reference architecture.

Layer	Responsibility	Build it with
Typed output contract	The swap-safe boundary	Pydantic / JSON Schema + constrained decoding; a bounded repair loop
Prompt registry	Versioned, capability-tier prompts	Git-versioned directory or LangSmith
Abstraction layer	Provider-agnostic calls, fallback, cost tracking	LiteLLM
Provider pool	Tiers + data-boundary routing	Bedrock in-VPC (sensitive) + external (non-sensitive)
Eval harness	Golden sets, LLM-judge, slice/regression, shadow	LLM judge (different family); regression testing (ReCatcher); Langfuse shadow runs
Release gate	Decide on per-cohort deltas, not aggregate	A CI quality gate

Model-swap runbook:

Freeze the contract and prompt version — swap only the model, so any regression is isolated to it.
Run the candidate against the golden set for every task class.
Run an LLM-judge equivalence check against the incumbent, using a judge from a different model family (keep the judge-reliability caveats above in mind).
Run slice and regression detection to surface per-cohort deltas.
Shadow-deploy on 5–10% of live traffic to catch distribution shift.
Promote only if every per-cohort delta passes — otherwise block. The aggregate score is never the gate.
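The shadow-deploy step in the runbook can be sketched with deterministic, hash-based sampling, so a given request ID always lands in or out of the sample. The function names, the log structure, and the synchronous candidate call are simplifying assumptions; a production version would run the candidate off the request path.

```python
import hashlib
from typing import Callable

ModelFn = Callable[[str], str]  # stand-in for an adapter call


def in_shadow_sample(request_id: str, percent: float = 5.0) -> bool:
    """Deterministically select roughly `percent` of request IDs."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100


def serve(
    request_id: str,
    prompt: str,
    incumbent: ModelFn,
    candidate: ModelFn,
    shadow_log: list[tuple[str, str, str]],
    percent: float = 5.0,
) -> str:
    """Always answer with the incumbent; record the candidate for sampled requests."""
    answer = incumbent(prompt)
    if in_shadow_sample(request_id, percent):
        try:
            shadow_log.append((request_id, answer, candidate(prompt)))
        except Exception as exc:
            # A failing candidate must never affect the user-facing response.
            shadow_log.append((request_id, answer, f"<candidate error: {exc}>"))
    return answer
```

The user only ever sees the incumbent's answer. The logged pairs are what you feed into the judge and per-cohort checks before deciding whether to promote.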

Adaptation tactics

Make schemas the contract. Typed, validated outputs make the model swappable by construction; never let a downstream step parse freeform prose.
Gate releases on per-cohort deltas, not aggregate scores — that's where swap regressions hide.
Run an LLM judge with a different model family, and treat the question as substitutability, not quality.
Shadow-deploy before you switch. Real traffic finds what your golden set missed.

With routing in place and swaps made safe, the next question is what runs the agent itself — the framework that orchestrates the work.

Next → Part 3: Agent Frameworks Compared.