Making Applications Model-Agnostic by Design
- TomT
- May 7
A model-agnostic application is one you can move from one model to another — to capture a cheaper option or route per task — without rewriting logic or shipping silent regressions. Routing (Part 1) only pays off if your system survives the swaps. This is Part 2 of The Token Scarcity Playbook.
"The schema is the contract; the model is an implementation detail."
Why model swaps break things
Traditional software testing rests on an assumption LLMs violate: that the same input yields the same output. With an LLM it doesn't, not even at temperature zero.
This is well documented in recent work. The instability isn't only sampling: changing the serving configuration — GPU count, GPU type, batch size — shifts outputs too. A 2025 study showed that under bf16 greedy decoding (temperature zero), a reasoning model exhibited up to 9% accuracy variation and a 9,000-token swing in response length purely from infrastructure differences (arXiv:2506.09501). And repeating identical prompts five times at temperature zero still produced inconsistent results across multiple models — limited test-retest reliability even with sampling disabled (arXiv:2502.20747).
If a single model wobbles this much, swapping between different models — with different verbosity, formatting, and instruction-following defaults — is a far larger jump. The job of a model-agnostic design is to absorb that jump.
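One way to see this wobble on your own prompts is a small test-retest check. The sketch below is illustrative only: `retest_consistency` and the canned `make_flaky_model` stand-in are hypothetical names, and a real check would call your actual client and probably compare parsed outputs rather than raw strings.

```python
from collections import Counter
from typing import Callable


def retest_consistency(call_model: Callable[[str], str], prompt: str, runs: int = 5) -> float:
    """Fraction of runs that agree with the most common output (1.0 = fully stable)."""
    outputs = [call_model(prompt).strip() for _ in range(runs)]
    _, top_count = Counter(outputs).most_common(1)[0]
    return top_count / runs


def make_flaky_model(responses: list[str]) -> Callable[[str], str]:
    """Stand-in for a real client: cycles through canned responses to mimic wobble."""
    state = {"i": 0}

    def call(prompt: str) -> str:
        out = responses[state["i"] % len(responses)]
        state["i"] += 1
        return out

    return call


flaky = make_flaky_model(["yes", "yes", "no", "yes", "yes"])
score = retest_consistency(flaky, "Is 17 prime?")
print(score)  # 4 of 5 runs agree with the modal answer
```

A score below 1.0 on the incumbent alone is a useful baseline: it tells you how much disagreement to expect before any swap happens.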
Designing for agnosticism
The core move is to make the model an implementation detail behind a stable contract. Three rules carry most of the weight.
First, define every interface as a typed schema, not freeform text. If model A and model B both return {"action": "search", "query": "..."}, the next step in your pipeline doesn't care how differently they reasoned to get there. The schema is the contract; the model is swappable underneath it.
Second, write prompts to a capability tier, not a model's quirks. A prompt that assumes "a model that follows multi-step instructions and emits JSON" ports across providers; one that leans on a single model's idiosyncratic behavior does not. Verbose, explicit instructions travel better than clever short ones.
Third, separate the prompt from model selection. Keep prompts versioned in a registry so that when you swap models, the prompt is a controlled variable — you can test whether the existing prompt works on the new model instead of changing both at once.
The shape that results: business logic never touches a provider SDK directly; it emits a model-agnostic prompt against a versioned template, a router picks the tier, and a schema validator guarantees the output contract regardless of which model produced it.
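A minimal sketch of that shape, using only the standard library so it runs anywhere. The names here (`SearchAction`, `PROMPTS`, `plan_search`, the two lambda "models") are assumptions for illustration; in a real system the contract would more likely be a Pydantic model or a JSON Schema, and `call_model` would be your adapter rather than a lambda.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SearchAction:
    action: str
    query: str


def parse_search_action(raw: str) -> SearchAction:
    """Validate raw model text against the contract; raise ValueError on any miss."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or set(data) != {"action", "query"}:
        raise ValueError("expected exactly the keys 'action' and 'query'")
    if data["action"] != "search" or not isinstance(data["query"], str):
        raise ValueError("action must be 'search' and query must be a string")
    return SearchAction(action=data["action"], query=data["query"])


# Hypothetical registry: prompts keyed by (task_class, version), never inline.
PROMPTS = {
    ("search_planning", "v1"): (
        "Return only a JSON object with keys 'action' and 'query'.\n"
        "User request: {request}"
    ),
}


def plan_search(request: str, call_model: Callable[[str], str]) -> SearchAction:
    """Business logic: sees a prompt version and a typed result, never a provider SDK."""
    prompt = PROMPTS[("search_planning", "v1")].format(request=request)
    return parse_search_action(call_model(prompt))


# Two stand-in "models" with different formatting habits but the same contract.
model_a = lambda prompt: '{"action": "search", "query": "bf16 nondeterminism"}'
model_b = lambda prompt: '{\n  "query": "bf16 nondeterminism",\n  "action": "search"\n}'

result_a = plan_search("find papers on bf16 nondeterminism", model_a)
result_b = plan_search("find papers on bf16 nondeterminism", model_b)
print(result_a == result_b)  # downstream code cannot tell the models apart
```

The point of the design is that `plan_search` has one place where a model's output can enter, and nothing gets past it untyped.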

Figure: model-agnostic application architecture.
Implementation notes. Define the contract with Pydantic or JSON Schema and enforce it with strict/constrained decoding where the provider supports it; on a validation miss, run a bounded repair loop before escalating a tier. Keep prompts in a real registry (a git-versioned directory or a tool like LangSmith) keyed by task_class + version, never inline in app code. Put a thin abstraction (LiteLLM or your own adapter) between business logic and providers so a swap is a config change, not a code change.
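The bounded repair loop with tier escalation might look roughly like this. It is a sketch under stated assumptions: `call_with_repair`, the tier names, and the stub models are hypothetical, the validation is deliberately minimal, and the default of two repair attempts is an arbitrary choice rather than a recommendation from any provider.

```python
import json
from typing import Callable

ModelFn = Callable[[str], str]


def validate(raw: str) -> dict:
    """Minimal contract check; raises ValueError on a miss."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc.msg}") from exc
    if not isinstance(data, dict) or "action" not in data:
        raise ValueError("missing required key 'action'")
    return data


def call_with_repair(prompt: str, tiers: list[tuple[str, ModelFn]], max_repairs: int = 2):
    """Try each tier in order; within a tier, allow at most max_repairs repair attempts."""
    for tier_name, model in tiers:
        attempt_prompt = prompt
        for _ in range(max_repairs + 1):  # first try plus the bounded repairs
            raw = model(attempt_prompt)
            try:
                return tier_name, validate(raw)
            except ValueError as err:
                # Feed the validation error back so the model can correct itself.
                attempt_prompt = (
                    f"{prompt}\n\nYour previous output was rejected: {err}\n"
                    "Return only valid JSON matching the contract."
                )
    raise RuntimeError("all tiers exhausted without a valid output")


# Stand-ins: the cheap tier never emits JSON, the balanced tier does.
cheap_calls = []


def cheap(prompt: str) -> str:
    cheap_calls.append(prompt)
    return "I think you should search."


def balanced(prompt: str) -> str:
    return '{"action": "search", "query": "llm regression testing"}'


tier_used, payload = call_with_repair(
    "Plan the next step.", [("cheap", cheap), ("balanced", balanced)]
)
print(tier_used, len(cheap_calls))  # escalated after three failed cheap attempts
```

Because the loop is bounded, the worst-case cost of a validation miss is known in advance: `(max_repairs + 1)` calls per tier, then escalation.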
Grouping models into tiers
In practice teams organize candidate models into a few interchangeable tiers — a fast/cheap tier, a balanced tier, and a flagship tier — and route within them. The tiering is a useful framing, not a law: interchangeability is always relative to a task. Two models that are interchangeable for code generation may diverge badly on structured legal extraction.
What makes two models genuinely interchangeable for a task is measurable: comparable schema-conformance on your output format, semantic equivalence on a golden set (better checked with an LLM-as-judge equivalence test than with string overlap — see the testing stack below), task-success parity within a few percent, and — critically — overlapping failure modes. Two models can match on average and still fail on completely different inputs, which is exactly the regression a swap introduces.
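To make "match on average, fail on different inputs" concrete, here is a small sketch that computes success parity and failure-mode overlap on a golden set. The example IDs and pass/fail values are made up for illustration, and Jaccard overlap is just one reasonable way to quantify shared failures.

```python
def success_rate(results: dict[str, bool]) -> float:
    """Share of golden-set examples the model passed."""
    return sum(results.values()) / len(results)


def failure_overlap(a: dict[str, bool], b: dict[str, bool]) -> float:
    """Jaccard overlap of failing example IDs (1.0 = identical failures)."""
    fail_a = {k for k, ok in a.items() if not ok}
    fail_b = {k for k, ok in b.items() if not ok}
    union = fail_a | fail_b
    return len(fail_a & fail_b) / len(union) if union else 1.0


# Illustrative per-example outcomes on the same five golden-set items.
incumbent = {"q1": True, "q2": True, "q3": False, "q4": True, "q5": False}
candidate = {"q1": False, "q2": True, "q3": True, "q4": False, "q5": True}

parity_gap = abs(success_rate(incumbent) - success_rate(candidate))
overlap = failure_overlap(incumbent, candidate)
print(parity_gap, overlap)  # identical averages, zero shared failures
```

Here both models pass three of five examples, yet every failure is on a different input, so a swap would break two cases that currently work.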
The benchmarks that still differentiate
Standard benchmarks no longer help here: MMLU and HumanEval are saturated above 90% across frontier models. The benchmarks that still separate models in 2026 are harder and more specific:
BFCL (Berkeley Function-Calling Leaderboard) for tool-use accuracy — the failure mode that actually breaks agents, and a continuously updated live leaderboard (Gorilla).
SciDesignBench (2026), scientist-grounded inverse-design tasks — where the best zero-shot model scores just 29.0% across 520 tasks in 14 domains, a vivid reminder that "saturated" is benchmark-specific (arXiv:2603.12724).
For agentic interchangeability, BFCL is the one to watch: tool-call reliability is what breaks pipelines, not trivia recall.
The testing stack
Catching swap regressions takes a layered evaluation pipeline, not a single accuracy number. Treat it as four gates, each of which can block the release on its own:

Figure: the four-gate model-swap validation pipeline.
It starts with a golden dataset built from real production failures, one per task class, not synthetic examples. The release decision should hinge on per-cohort deltas, not the aggregate: an overall pass rate creeping from 0.91 to 0.93 can hide a specific cohort collapsing from 0.94 to 0.83 — exactly the regression that reaches users.
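The arithmetic behind that masking effect is easy to reproduce. The cohort names, counts, and the 0.02 tolerance below are assumptions chosen so the numbers line up with the figures above; they are not measurements.

```python
# cohort -> (incumbent passes, candidate passes, total examples); illustrative counts.
golden = {
    "general": (816, 847, 900),
    "billing": (94, 83, 100),
}
MAX_DROP = 0.02  # assumed per-cohort tolerance


def cohort_deltas(results):
    """Candidate pass rate minus incumbent pass rate, per cohort."""
    return {name: cand / n - inc / n for name, (inc, cand, n) in results.items()}


def aggregate_rates(results):
    """Overall pass rates for incumbent and candidate across all cohorts."""
    total = sum(n for _, _, n in results.values())
    inc = sum(i for i, _, _ in results.values()) / total
    cand = sum(c for _, c, _ in results.values()) / total
    return inc, cand


deltas = cohort_deltas(golden)
blocked = sorted(name for name, d in deltas.items() if d < -MAX_DROP)
agg_inc, agg_cand = aggregate_rates(golden)

print(f"aggregate: {agg_inc:.2f} -> {agg_cand:.2f}")
print(f"billing:   {94 / 100:.2f} -> {83 / 100:.2f}")
print("blocked by:", blocked)
```

The aggregate improves because the large cohort gains a little, while the small cohort loses eleven points. A gate that reads only the aggregate would promote this candidate.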
On top of that sits LLM-as-judge for semantic equivalence — fast and cheap, but to be used with eyes open. A late-2025 study found even top judges (GPT-5, Gemini 2.5 Pro) fail to hold consistent preferences in nearly a quarter of hard cases, and that human annotation itself is inconsistent enough to be a shaky "gold standard" (arXiv:2512.16041). The same work points to what actually helps: explicit rubrics, panel-of-judges aggregation, and fine-tuned judges — and you should always judge with a different model family than the one under test to avoid self-enhancement bias. For swap validation the question isn't "is this good," it's "is this equivalent enough to substitute."
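A panel-of-judges vote that also enforces the different-family rule could be sketched as follows. Everything here is hypothetical: the family labels, the `panel_verdict` helper, and the string-comparison "judges" that stand in for real model calls driven by an explicit rubric.

```python
from collections import Counter
from typing import Callable

# (incumbent_output, candidate_output) -> "equivalent" or "different"
Judge = Callable[[str, str], str]


def panel_verdict(judges: dict[str, Judge], family_under_test: str,
                  incumbent_out: str, candidate_out: str) -> str:
    """Majority vote over judges whose family differs from the model under test."""
    eligible = {fam: j for fam, j in judges.items() if fam != family_under_test}
    if not eligible:
        raise ValueError("no judge from a different model family is available")
    votes = Counter(j(incumbent_out, candidate_out) for j in eligible.values())
    verdict, count = votes.most_common(1)[0]
    # Require a strict majority; otherwise hand the pair to a human.
    return verdict if count > len(eligible) / 2 else "needs_review"


# Stand-in judges: real ones would prompt a model with a rubric.
lenient = lambda a, b: "equivalent" if a.strip().lower() == b.strip().lower() else "different"
strict = lambda a, b: "equivalent" if a == b else "different"

judges = {"family_a": strict, "family_b": lenient, "family_c": lenient, "family_d": strict}
verdict = panel_verdict(judges, "family_a", "Paris", "paris ")
print(verdict)  # family_a is excluded; two of the three remaining judges agree
```

Returning `needs_review` on a tie keeps the judge's known inconsistency from silently deciding a release.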
Two more layers harden it. The first is systematic regression testing across the swap: frameworks like ReCatcher compare a current model against a candidate update across logical correctness, code quality, and performance, and find correctness and error handling to be the most regression-prone areas when models change (arXiv:2507.19390). The second is shadow deployment: running the candidate on a sample of live traffic before committing, to catch the distribution shift your golden set didn't anticipate. When prompts don't port cleanly, automated per-model prompt optimization can close the gap; Amazon's Promptimus reports gains of 3.18%–90.27% across seven models and nine tasks with no manual prompt engineering (Amazon Science).
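For the shadow-deployment layer, one common approach is deterministic hash-based sampling, so the same request ID always lands in or out of the sample. This is a sketch, not a production pattern: `in_shadow_sample` and `handle` are hypothetical names, the 5% default is an assumed rate, and a real system would run the candidate call asynchronously so it never adds user-facing latency.

```python
import hashlib


def in_shadow_sample(request_id: str, percent: float) -> bool:
    """Deterministically assign a request to the shadow sample based on its ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100  # e.g. 5.0% -> buckets 0..499


def handle(request_id, prompt, incumbent, candidate, shadow_log, percent=5.0):
    """Serve the incumbent's answer; record the candidate's answer for sampled requests."""
    response = incumbent(prompt)  # the user always gets the incumbent's answer
    if in_shadow_sample(request_id, percent):
        # Candidate output is only logged for offline comparison, never returned.
        shadow_log.append((request_id, response, candidate(prompt)))
    return response


log = []
for i in range(1000):
    handle(f"req-{i}", "summarize this ticket", lambda p: "old", lambda p: "new", log)
print(len(log), "of 1000 requests shadowed")
```

The logged pairs can then go through the same equivalence and per-cohort checks as the golden set, but on inputs nobody curated.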
Reference design: the model-agnostic stack
The pattern is contract-first: business logic talks to a versioned prompt and a typed output contract — never a provider SDK directly; an abstraction layer routes to the model; and an eval harness gates every change.

Figure: model-agnostic stack reference architecture.
Layer | Responsibility | Build it with |
Typed output contract | The swap-safe boundary | Pydantic / JSON Schema + constrained decoding; a bounded repair loop |
Prompt registry | Versioned, capability-tier prompts | Git-versioned directory or LangSmith |
Abstraction layer | Provider-agnostic calls, fallback, cost tracking | LiteLLM or your own adapter |
Provider pool | Tiers + data-boundary routing | Bedrock in-VPC (sensitive) + external (non-sensitive) |
Eval harness | Golden sets, LLM-judge, slice/regression, shadow | |
Release gate | Decide on per-cohort deltas, not aggregate | A CI quality gate |
Model-swap runbook:
1. Freeze the contract and prompt version and swap only the model, so any regression is isolated to it.
2. Run the candidate against the golden set for every task class.
3. Run LLM-judge equivalence against the incumbent, using a judge from a different model family and keeping the judge-reliability caveats from the testing stack in mind.
4. Run slice and regression detection to surface per-cohort deltas.
5. Shadow-deploy on 5–10% of live traffic to catch distribution shift.
6. Promote only if every per-cohort delta passes; otherwise block. The aggregate score is never the gate.
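The runbook can be wired up as an ordered list of gates that stops at the first failure. The sketch below assumes hypothetical gate stubs with made-up results; real gates would call your eval harness and return whatever evidence your team needs for the release record.

```python
from typing import Callable, NamedTuple


class GateResult(NamedTuple):
    name: str
    passed: bool
    detail: str


Gate = Callable[[], GateResult]


def run_swap_gates(gates: list[Gate]) -> tuple[bool, list[GateResult]]:
    """Run gates in order and stop at the first failure; any gate can block."""
    results = []
    for gate in gates:
        result = gate()
        results.append(result)
        if not result.passed:
            return False, results
    return True, results


# Hypothetical stubs with illustrative outcomes.
def golden_set_gate():
    return GateResult("golden_set", True, "all task classes pass")


def judge_gate():
    return GateResult("llm_judge", True, "panel judged outputs equivalent")


def cohort_gate():
    return GateResult("cohort_deltas", False, "billing cohort dropped 0.11")


def shadow_gate():
    return GateResult("shadow", True, "no drift observed")


promote, trail = run_swap_gates([golden_set_gate, judge_gate, cohort_gate, shadow_gate])
print(promote, [r.name for r in trail])
```

Stopping early means the shadow deployment never starts for a candidate that already failed offline, which saves live-traffic exposure for candidates that have earned it.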
Adaptation tactics
Make schemas the contract. Typed, validated outputs make the model swappable by construction; never let a downstream step parse freeform prose.
Gate releases on per-cohort deltas, not aggregate scores — that's where swap regressions hide.
Run an LLM judge with a different model family, and treat the question as substitutability, not quality.
Shadow-deploy before you switch. Real traffic finds what your golden set missed.
With routing in place and swaps made safe, the next question is what runs the agent itself — the framework that orchestrates the work.
Next → Part 3: Agent Frameworks Compared.
References
Reproducibility vs. GPU/batch configuration: up to 9% accuracy variation at temperature 0 (arXiv:2506.09501, 2025)
Limited test-retest reliability at temperature 0 (arXiv:2502.20747, 2025)
ReCatcher: LLM regression testing across model updates (arXiv:2507.19390, 2025)
SciDesignBench: best model 29.0%, 520 tasks / 14 domains (arXiv:2603.12724, 2026)

