Making Applications Model-Agnostic by Design

  • TomT
  • May 7
  • 6 min read

A model-agnostic application is one you can move from one model to another — to capture a cheaper option or route per task — without rewriting logic or shipping silent regressions. Routing (Part 1) only pays off if your system survives the swaps. This is Part 2 of The Token Scarcity Playbook.

"The schema is the contract; the model is an implementation detail."


Why model swaps break things

Traditional software testing rests on an assumption LLMs violate: that the same input yields the same output. With an LLM it does not, not even at temperature zero.

This is well documented in recent work. The instability isn't only sampling: changing the serving configuration — GPU count, GPU type, batch size — shifts outputs too. A 2025 study showed that under bf16 greedy decoding (temperature zero), a reasoning model exhibited up to 9% accuracy variation and a 9,000-token swing in response length purely from infrastructure differences (arXiv:2506.09501). And repeating identical prompts five times at temperature zero still produced inconsistent results across multiple models — limited test-retest reliability even with sampling disabled (arXiv:2502.20747).

If a single model wobbles this much, swapping between different models — with different verbosity, formatting, and instruction-following defaults — is a far larger jump. The job of a model-agnostic design is to absorb that jump.


Designing for agnosticism

The core move is to make the model an implementation detail behind a stable contract. Three rules carry most of the weight.

First, define every interface as a typed schema, not freeform text. If model A and model B both return {"action": "search", "query": "..."}, the next step in your pipeline doesn't care how differently they reasoned to get there. The schema is the contract; the model is swappable underneath it.
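As a rough illustration of the first rule, here is a minimal sketch of a contract check for the {"action", "query"} shape above. It uses only the standard library as a stand-in for Pydantic or JSON Schema; the names `AgentStep`, `ContractError`, `parse_agent_step`, and the allowed action set are assumptions made for this example, not part of any library.

```python
import json
from dataclasses import dataclass

# Hypothetical action vocabulary; a real contract would list your own actions.
ALLOWED_ACTIONS = {"search", "answer"}


class ContractError(ValueError):
    """Raised when model output does not satisfy the output contract."""


@dataclass(frozen=True)
class AgentStep:
    action: str
    query: str


def parse_agent_step(raw: str) -> AgentStep:
    """Parse raw model text into an AgentStep, or raise ContractError."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ContractError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ContractError("top-level value must be a JSON object")
    if set(data) != {"action", "query"}:
        raise ContractError(f"unexpected or missing keys: {sorted(data)}")
    action, query = data["action"], data["query"]
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise ContractError(f"action must be one of {sorted(ALLOWED_ACTIONS)}")
    if not isinstance(query, str) or not query.strip():
        raise ContractError("query must be a non-empty string")
    return AgentStep(action=action, query=query)
```

Downstream code only ever receives an `AgentStep`, so it has no way to depend on how a particular model phrased its answer.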

Second, write prompts to a capability tier, not a model's quirks. A prompt that assumes "a model that follows multi-step instructions and emits JSON" ports across providers; one that leans on a single model's idiosyncratic behavior does not. Verbose, explicit instructions travel better than clever short ones.

Third, separate the prompt from model selection. Keep prompts versioned in a registry so that when you swap models, the prompt is a controlled variable — you can test whether the existing prompt works on the new model instead of changing both at once.
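One way to picture the third rule is an in-memory registry keyed by task class and version. This is a sketch under assumptions: `PromptRegistry`, `PromptVersion`, and the example task class are hypothetical, and a real registry would load templates from a git-versioned directory or a hosted tool rather than from code.

```python
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class PromptVersion:
    task_class: str
    version: str
    template: Template


class PromptRegistry:
    """Illustrative registry: prompts are immutable once registered."""

    def __init__(self) -> None:
        self._prompts: dict[tuple[str, str], PromptVersion] = {}

    def register(self, task_class: str, version: str, text: str) -> None:
        key = (task_class, version)
        if key in self._prompts:
            # Editing in place would make the prompt an uncontrolled variable.
            raise ValueError(f"{task_class}@{version} already registered; bump the version")
        self._prompts[key] = PromptVersion(task_class, version, Template(text))

    def render(self, task_class: str, version: str, **variables: str) -> str:
        # Note there is no model argument: the prompt is chosen independently
        # of whichever model the router later picks.
        prompt = self._prompts[(task_class, version)]
        return prompt.template.substitute(**variables)


registry = PromptRegistry()
registry.register(
    "search_planning",  # hypothetical task class
    "v1",
    'Reply with a JSON object with keys "action" and "query" for: $question',
)
print(registry.render("search_planning", "v1", question="cheap GPUs"))
```

`string.Template` is used here because its `$name` placeholders do not collide with the braces in JSON examples inside a prompt.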

The shape that results: business logic never touches a provider SDK directly; it emits a model-agnostic prompt against a versioned template, a router picks the tier, and a schema validator guarantees the output contract regardless of which model produced it.

Figure: model-agnostic application architecture.

Implementation notes. Define the contract with Pydantic or JSON Schema and enforce it with strict/constrained decoding where the provider supports it; on a validation miss, run a bounded repair loop before escalating a tier. Keep prompts in a real registry (a git-versioned directory or a tool like LangSmith) keyed by task_class + version, never inline in app code. Put a thin abstraction (LiteLLM or your own adapter) between business logic and providers so a swap is a config change, not a code change.
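The bounded repair loop with tier escalation might look roughly like the sketch below. Everything named here is an assumption for illustration: the `ModelCall` adapter signature, the tier names, the limit of two repairs per tier, and the simplified `validate` check. In a real system the adapter would wrap LiteLLM or your own provider client.

```python
import json
from typing import Callable, Optional

# Hypothetical adapter signature: (tier, prompt) -> raw model text.
ModelCall = Callable[[str, str], str]

TIERS = ["fast", "balanced", "flagship"]  # illustrative tier names
MAX_REPAIRS = 2  # repair attempts per tier before escalating (assumed value)


def validate(raw: str) -> Optional[dict]:
    """Return the parsed object if it satisfies the contract, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if (
        isinstance(data, dict)
        and isinstance(data.get("action"), str)
        and isinstance(data.get("query"), str)
    ):
        return data
    return None


def call_with_repair(
    call_model: ModelCall, prompt: str, start_tier: str = "fast"
) -> tuple[dict, str, int]:
    """Return (validated output, tier that produced it, total calls made)."""
    calls = 0
    for tier in TIERS[TIERS.index(start_tier):]:
        attempt_prompt = prompt
        for _ in range(1 + MAX_REPAIRS):
            raw = call_model(tier, attempt_prompt)
            calls += 1
            result = validate(raw)
            if result is not None:
                return result, tier, calls
            # Repair: show the model its invalid reply and restate the contract.
            attempt_prompt = (
                f"{prompt}\n\nYour previous reply did not match the required JSON "
                f"contract. Reply with only the JSON object.\nPrevious reply: {raw}"
            )
    raise RuntimeError(f"no tier produced contract-valid output after {calls} calls")
```

The loop is bounded in both directions: at most `1 + MAX_REPAIRS` calls per tier and at most `len(TIERS)` tiers, so a persistent validation miss fails loudly instead of retrying forever.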

Grouping models into tiers

In practice teams organize candidate models into a few interchangeable tiers — a fast/cheap tier, a balanced tier, and a flagship tier — and route within them. The tiering is a useful framing, not a law: interchangeability is always relative to a task. Two models that are interchangeable for code generation may diverge badly on structured legal extraction.

What makes two models genuinely interchangeable for a task is measurable:

  • Comparable schema conformance on your output format.

  • Semantic equivalence on a golden set, better checked with an LLM-as-judge equivalence test than with string overlap (see the testing stack below).

  • Task-success parity within a few percent.

  • Critically, overlapping failure modes.

Two models can match on average and still fail on completely different inputs, which is exactly the regression a swap introduces.
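To make the failure-mode point concrete, here is a small sketch that compares two models on the same golden set. The per-example pass/fail dictionaries, the 3-point success gap, and the 0.5 overlap floor are all assumptions for illustration; overlap is measured as the Jaccard index of the two sets of failing example IDs.

```python
def success_rate(results: dict[str, bool]) -> float:
    """Fraction of golden-set examples the model passed."""
    return sum(results.values()) / len(results)


def failure_overlap(a: dict[str, bool], b: dict[str, bool]) -> float:
    """Jaccard overlap of failing example IDs (1.0 means identical failures)."""
    failed_a = {ex for ex, ok in a.items() if not ok}
    failed_b = {ex for ex, ok in b.items() if not ok}
    union = failed_a | failed_b
    return len(failed_a & failed_b) / len(union) if union else 1.0


def interchangeable(
    a: dict[str, bool],
    b: dict[str, bool],
    max_gap: float = 0.03,      # assumed "within a few percent" threshold
    min_overlap: float = 0.5,   # assumed floor on shared failure modes
) -> bool:
    if a.keys() != b.keys():
        raise ValueError("both models must be scored on the same golden set")
    gap = abs(success_rate(a) - success_rate(b))
    return gap <= max_gap and failure_overlap(a, b) >= min_overlap


golden_ids = [f"ex{i}" for i in range(10)]
model_a = {ex: ex not in {"ex1", "ex2"} for ex in golden_ids}  # fails ex1, ex2
model_b = {ex: ex not in {"ex8", "ex9"} for ex in golden_ids}  # fails ex8, ex9

print(success_rate(model_a), success_rate(model_b))  # identical averages
print(failure_overlap(model_a, model_b))             # no shared failures
print(interchangeable(model_a, model_b))
```

Both toy models pass 8 of 10 examples, yet they share no failing inputs, so the check rejects the pair despite identical averages.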


The benchmarks that still differentiate

Standard benchmarks no longer help here: MMLU and HumanEval are saturated above 90% across frontier models. The benchmarks that still separate models in 2026 are harder and more specific:

  • BFCL (Berkeley Function-Calling Leaderboard) for tool-use accuracy — the failure mode that actually breaks agents, and a continuously updated live leaderboard (Gorilla).

  • SciDesignBench (2026) for scientist-grounded inverse-design tasks, where the best zero-shot model scores just 29.0% across 520 tasks in 14 domains: a vivid reminder that "saturated" is benchmark-specific (arXiv:2603.12724).

For agentic interchangeability, BFCL is the one to watch: tool-call reliability is what breaks pipelines, not trivia recall.


The testing stack

Catching swap regressions takes a layered evaluation pipeline, not a single accuracy number. Treat it as four gates, each of which can block the release on its own:

Figure: the four-gate model-swap validation pipeline.

It starts with a golden dataset built from real production failures, one per task class, not synthetic examples. The release decision should hinge on per-cohort deltas, not the aggregate: an overall pass rate creeping from 0.91 to 0.93 can hide a specific cohort collapsing from 0.94 to 0.83 — exactly the regression that reaches users.
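The arithmetic behind that warning is easy to reproduce. The sketch below uses made-up cohort sizes chosen so the aggregate moves from 0.91 to 0.93 while one cohort drops from 0.94 to 0.83; the cohort names, the counts, and the 2-point tolerance are assumptions for illustration.

```python
# (passed, total) per cohort; counts are invented to match the rates above.
baseline = {"legal_extraction": (94, 100), "everything_else": (816, 900)}
candidate = {"legal_extraction": (83, 100), "everything_else": (847, 900)}

MAX_COHORT_DROP = 0.02  # assumed tolerance for a per-cohort regression


def aggregate_rate(cohorts: dict[str, tuple[int, int]]) -> float:
    passed = sum(p for p, _ in cohorts.values())
    total = sum(t for _, t in cohorts.values())
    return passed / total


def blocked_cohorts(
    base: dict[str, tuple[int, int]],
    cand: dict[str, tuple[int, int]],
    max_drop: float = MAX_COHORT_DROP,
) -> dict[str, float]:
    """Return {cohort: delta} for every cohort that regressed beyond max_drop."""
    blocked = {}
    for cohort, (b_pass, b_total) in base.items():
        c_pass, c_total = cand[cohort]  # KeyError if the candidate skipped a cohort
        delta = c_pass / c_total - b_pass / b_total
        if delta < -max_drop:
            blocked[cohort] = round(delta, 4)
    return blocked


print(aggregate_rate(baseline), aggregate_rate(candidate))  # 0.91 0.93
print(blocked_cohorts(baseline, candidate))  # {'legal_extraction': -0.11}
```

The small cohort's 11-point collapse is outweighed by a 3.4-point gain on the large one, which is why the gate has to look at each cohort separately.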

On top of that sits LLM-as-judge for semantic equivalence — fast and cheap, but to be used with eyes open. A late-2025 study found even top judges (GPT-5, Gemini 2.5 Pro) fail to hold consistent preferences in nearly a quarter of hard cases, and that human annotation itself is inconsistent enough to be a shaky "gold standard" (arXiv:2512.16041). The same work points to what actually helps: explicit rubrics, panel-of-judges aggregation, and fine-tuned judges — and you should always judge with a different model family than the one under test to avoid self-enhancement bias. For swap validation the question isn't "is this good," it's "is this equivalent enough to substitute."
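Two of those mitigations, panel aggregation and keeping judges out of the families under test, can be sketched without calling any model. The function names and the family labels below are hypothetical, and the strict-majority rule (ties count as not equivalent) is one conservative choice among several.

```python
def check_judge_independence(
    judge_families: dict[str, str], families_under_test: set[str]
) -> None:
    """Raise if any judge shares a model family with the incumbent or candidate."""
    conflicted = sorted(
        judge for judge, family in judge_families.items()
        if family in families_under_test
    )
    if conflicted:
        raise ValueError(f"judges share a family with a model under test: {conflicted}")


def panel_says_equivalent(votes: dict[str, bool]) -> bool:
    """Strict majority of judges must vote 'equivalent'; ties fail closed."""
    if not votes:
        raise ValueError("panel is empty")
    return 2 * sum(votes.values()) > len(votes)


# Hypothetical panel: judge name -> model family label.
judges = {"judge_a": "family_x", "judge_b": "family_y", "judge_c": "family_z"}
check_judge_independence(judges, families_under_test={"family_p", "family_q"})

# Each vote answers "is the candidate's output substitutable for the incumbent's?"
votes = {"judge_a": True, "judge_b": True, "judge_c": False}
print(panel_says_equivalent(votes))
```

The votes themselves would come from rubric-driven judge calls; the sketch only covers how they are screened and combined.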

Two more layers harden it. The first is systematic regression testing across the swap: frameworks like ReCatcher compare a current model against a candidate update across logical correctness, code quality, and performance, and find correctness and error handling the most regression-prone areas when models change (arXiv:2507.19390). The second is shadow deployment: running the candidate on a sample of live traffic before committing, to catch the distribution shift your golden set didn't anticipate. When prompts don't port cleanly, automated per-model prompt optimization can close the gap; Amazon's Promptimus reports gains of 3.18%–90.27% across seven models and nine tasks with no manual prompt engineering (Amazon Science).
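Shadow deployment can be sketched as deterministic sampling plus a side-channel log. The code below is an assumption-laden illustration: `in_shadow_sample`, `handle`, and the 5% default are hypothetical, the model callables stand in for your adapter, and a production version would run the candidate call off the request path rather than inline.

```python
import hashlib
from typing import Callable

SHADOW_FRACTION = 0.05  # assumed default share of traffic to shadow


def in_shadow_sample(request_id: str, fraction: float = SHADOW_FRACTION) -> bool:
    """Deterministically assign a request to the shadow sample by hashing its ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < int(fraction * 2**64)


def handle(
    request_id: str,
    prompt: str,
    incumbent: Callable[[str], str],
    candidate: Callable[[str], str],
    shadow_log: list[dict],
    fraction: float = SHADOW_FRACTION,
) -> str:
    """Always serve the incumbent; record the candidate's answer for sampled requests."""
    answer = incumbent(prompt)
    if in_shadow_sample(request_id, fraction):
        try:
            shadow_answer = candidate(prompt)
        except Exception as exc:  # a candidate failure must never reach the user
            shadow_answer = f"<candidate error: {exc}>"
        shadow_log.append(
            {"request_id": request_id, "incumbent": answer, "candidate": shadow_answer}
        )
    return answer
```

Hashing the request ID rather than drawing a random number means the same request always lands in the same bucket, which makes shadow results reproducible when you replay traffic.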


Reference design: the model-agnostic stack

The pattern is contract-first: business logic talks to a versioned prompt and a typed output contract — never a provider SDK directly; an abstraction layer routes to the model; and an eval harness gates every change.


Figure: model-agnostic stack reference architecture.

Layer | Responsibility | Build it with
----- | -------------- | -------------
Typed output contract | The swap-safe boundary | Pydantic / JSON Schema + constrained decoding; a bounded repair loop
Prompt registry | Versioned, capability-tier prompts | Git-versioned directory or LangSmith
Abstraction layer | Provider-agnostic calls, fallback, cost tracking | LiteLLM or your own adapter
Provider pool | Tiers + data-boundary routing | Bedrock in-VPC (sensitive) + external (non-sensitive)
Eval harness | Golden sets, LLM-judge, slice/regression, shadow | LLM judge (different family); regression testing (ReCatcher); Langfuse shadow runs
Release gate | Decide on per-cohort deltas, not aggregate | A CI quality gate

Model-swap runbook:

  1. Freeze the contract and prompt version — swap only the model, so any regression is isolated to it.

  2. Run the candidate against the golden set for every task class.

  3. Run an LLM-judge equivalence check against the incumbent, using a judge from a different model family; the judge-reliability caveats from the testing stack apply.

  4. Run slice and regression detection to surface per-cohort deltas.

  5. Shadow-deploy on 5–10% of live traffic to catch distribution shift.

  6. Promote only if every per-cohort delta passes — otherwise block. The aggregate score is never the gate.
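The runbook's control flow is simple enough to sketch as a chain of gates that stops at the first failure. The `GateResult` type, the `run_swap_gates` function, and the stubbed gate results are hypothetical; each real gate would wrap your own golden-set run, judge panel, cohort check, or shadow analysis.

```python
from typing import Callable, NamedTuple


class GateResult(NamedTuple):
    name: str
    passed: bool
    detail: str


Gate = Callable[[], GateResult]


def run_swap_gates(gates: list[Gate]) -> tuple[bool, list[GateResult]]:
    """Run gates in order; the first failure blocks promotion and skips the rest."""
    results: list[GateResult] = []
    for gate in gates:
        result = gate()
        results.append(result)
        if not result.passed:
            return False, results
    return True, results


# Stub gates with invented outcomes, in runbook order.
gates: list[Gate] = [
    lambda: GateResult("golden_set", True, "all task classes run"),
    lambda: GateResult("judge_equivalence", True, "panel majority: equivalent"),
    lambda: GateResult("per_cohort_deltas", False, "one cohort regressed past tolerance"),
    lambda: GateResult("shadow_traffic", True, "never reached in this run"),
]

promote, results = run_swap_gates(gates)
print(promote, [r.name for r in results])
```

Because the contract and prompt version are frozen before the first gate runs, any failure reported here is attributable to the model alone.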


Adaptation tactics

  • Make schemas the contract. Typed, validated outputs make the model swappable by construction; never let a downstream step parse freeform prose.

  • Gate releases on per-cohort deltas, not aggregate scores — that's where swap regressions hide.

  • Run an LLM judge with a different model family, and treat the question as substitutability, not quality.

  • Shadow-deploy before you switch. Real traffic finds what your golden set missed.

With routing in place and swaps made safe, the next question is what runs the agent itself — the framework that orchestrates the work.


