Building a Multi-Agent Orchestration System on AWS

  • TomT
  • Jun 7
  • 5 min read

Updated: Jun 11

A multi-agent orchestration system routes diverse incoming requests to the right specialist and then executes each as a multi-step agentic task. The mistake most teams make is using one framework for both jobs and getting the worst of each. This is Part 4 of The Token Scarcity Playbook — assembling routing (Part 1), model-agnostic design (Part 2), and the right harness (Part 3) into one production system.

"AgentCore is the execution environment; the framework is the orchestration logic. They compose."

Table of Contents

  • The concerns a production system handles

  • The runtime: AgentCore

  • Connecting tools: AgentCore Gateway

  • State and memory: AgentCore Memory

  • Routing vs. the harness

  • The model-abstraction layer

  • A reference architecture

  • Adaptation tactics


The concerns a production system handles

A production multi-agent system has to handle far more than orchestration. It needs a runtime to host long, bursty, tool-heavy sessions; tool integration to reach real systems; state and memory so agents persist and remember; model abstraction so it can route per task and swap models; and observability to debug and evaluate. This stack assembles each of those — one section per concern below.

But the concern teams most often get wrong is orchestration, because it isn't one job — it's two structurally different ones, and forcing a single framework to do both is the classic mistake:

  • Problem A — request routing (wide and shallow): multiple frontends send diverse requests; classify intent and dispatch to the right tool or agent.

  • Problem B — the agent harness (narrow and deep): once routed, a complex task runs through planning, sub-agent spawning, long context, and many sequential tool calls.

Force one framework to do both and you get a router that's clumsy at long tasks, or a long-task harness bolted awkwardly onto intake. Keep them separate — then wrap both in the shared concerns: AgentCore for the runtime, Gateway for tools, AgentCore Memory for state and memory, LiteLLM for model abstraction, and LangSmith for observability. The rest of this part builds exactly that.


The runtime: AgentCore

Both problems still need somewhere to run, and that's where Amazon Bedrock AgentCore fits: it is a managed runtime, not a framework. It became generally available in Oct 2025 (AWS), and three properties make it well-suited to token-scarce agentic workloads.

Each session runs in a dedicated microVM with isolated CPU, memory, and filesystem (AWS docs). Sessions persist up to 8 hours — solving the 15-minute Lambda timeout that has long killed long-horizon agents. And billing is consumption-based with a detail that matters for agentic work: CPU charges stop during I/O wait, so the long stretches an agent spends waiting on a model or a tool don't bill CPU (AWS pricing). Crucially, it works with any framework (LangGraph, CrewAI, Strands) and any model, inside or outside Bedrock — so AgentCore doesn't compete with your framework choice from Part 3; it hosts it.
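
To make the billing point concrete, here is a rough arithmetic sketch comparing a session billed for CPU only while active against the same session billed for CPU on wall-clock time. The rates, the `Session` fields, and the assumption that memory is billed for the whole session are all illustrative placeholders, not AWS figures; check the pricing page for real numbers.

```python
from dataclasses import dataclass

# Hypothetical rates for illustration only; these are NOT AWS prices.
CPU_RATE_PER_VCPU_HOUR = 0.09
MEM_RATE_PER_GB_HOUR = 0.01


@dataclass
class Session:
    wall_clock_hours: float   # total session duration
    io_wait_fraction: float   # share of time spent waiting on a model or tool
    vcpus: float
    memory_gb: float


def memory_cost(s: Session) -> float:
    # Assumption: memory is billed for the full session, including I/O wait.
    return s.wall_clock_hours * s.memory_gb * MEM_RATE_PER_GB_HOUR


def cost_cpu_paused_on_io_wait(s: Session) -> float:
    """CPU is charged only for the hours the agent is actually computing."""
    active_hours = s.wall_clock_hours * (1.0 - s.io_wait_fraction)
    return active_hours * s.vcpus * CPU_RATE_PER_VCPU_HOUR + memory_cost(s)


def cost_cpu_billed_wall_clock(s: Session) -> float:
    """Baseline: CPU is charged for every hour the session exists."""
    return s.wall_clock_hours * s.vcpus * CPU_RATE_PER_VCPU_HOUR + memory_cost(s)


# A 2-hour session that spends 70% of its time waiting on models and tools.
session = Session(wall_clock_hours=2.0, io_wait_fraction=0.7, vcpus=1.0, memory_gb=2.0)
```

With these made-up rates, `cost_cpu_paused_on_io_wait(session)` comes out at less than half of `cost_cpu_billed_wall_clock(session)`. The gap widens as the I/O-wait share grows, which is the typical shape of a tool-heavy agent session.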


Connecting tools: AgentCore Gateway

Agents are only as useful as the tools they can reach. AgentCore Gateway turns existing services into agent tools with minimal code: it converts REST APIs into MCP (Model Context Protocol) servers and supports OpenAPI and Smithy models plus Lambda functions (AWS). It provides dual-sided authentication (inbound OAuth to the gateway; outbound IAM, API key, or OAuth to the target service), so tool access is governed on both ends. One distinction worth holding onto: the Gateway routes between tools, not between models. It is not your model-routing layer; don't conflate the two routing problems.


State and memory: AgentCore Memory

Long-horizon agents are only useful if they remember, and persisting that state yourself means operating a vector store or database. AgentCore Memory is the managed alternative: short-term memory to track the immediate conversation, and long-term memory that extracts and stores durable knowledge — user preferences, semantic facts, summaries — for retrieval across sessions, with semantic search over stored records and encryption at rest and in transit (AWS — AgentCore Memory · retrieve API). It's callable from any framework through the MemoryClient API, and AWS Strands wires it in natively via the AgentCoreMemorySessionManager (AWS — Strands SDK Memory). Paired with the runtime, it gives the system durable state and cross-session memory as a managed service rather than infrastructure you run.
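
The short-term versus long-term split can be pictured with a toy in-process model. This is not the MemoryClient API: the `ToyMemory` class, its method names, the "I prefer" extraction rule, and the word-overlap ranking are all hypothetical stand-ins for what the managed service does with real extraction and semantic search.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class ToyMemory:
    """Illustrative only: short-term turns per session, long-term facts per actor."""

    def __init__(self) -> None:
        # Short-term: raw conversation turns, scoped to one (actor, session).
        self._turns: Dict[Tuple[str, str], List[str]] = defaultdict(list)
        # Long-term: durable facts keyed by actor, so they outlive any session.
        self._facts: Dict[str, List[str]] = defaultdict(list)

    def add_turn(self, actor: str, session: str, text: str) -> None:
        self._turns[(actor, session)].append(text)
        # Toy extraction rule: treat "I prefer ..." statements as preferences.
        if text.lower().startswith("i prefer "):
            self._facts[actor].append(text)

    def recent(self, actor: str, session: str, k: int = 3) -> List[str]:
        """Return the last k turns of this session only."""
        return self._turns[(actor, session)][-k:]

    def retrieve(self, actor: str, query: str) -> List[str]:
        """Toy stand-in for semantic search: rank facts by shared words."""
        words = set(query.lower().split())
        scored = [(len(words & set(f.lower().split())), f) for f in self._facts[actor]]
        ranked = sorted(scored, key=lambda pair: -pair[0])
        return [fact for score, fact in ranked if score > 0]
```

In this sketch a preference stated in one session is invisible to `recent` in a new session but still comes back from `retrieve`, which is the cross-session behavior the paragraph above describes.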


Routing vs. the harness

With the runtime and tools in place, map the two problems onto two tools. For Problem A (routing), LangGraph's conditional edges are purpose-built: a supervisor node classifies intent and dispatches across many tool domains, and every routing decision is a named, testable, version-controlled node. For Problem B (the harness), a long-horizon framework like deepagents (Part 3) handles planning, sub-agent spawning, and context offload for a single deep task. The supervisor is the wide fan-out; the harness is the deep dive. They nest cleanly: a LangGraph supervisor dispatches to specialist agents, some of which are deep harnesses, others of which are simple ReAct nodes for a CRM lookup or a notification.
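
The nesting can be sketched without any framework. The code below is not LangGraph code: it only mimics the shape of a supervisor node feeding a route-name-to-node mapping, which is roughly what conditional edges express. The keyword rules, route names, and the four-step plan are hypothetical placeholders.

```python
from typing import Callable, Dict, List

State = Dict[str, object]


def classify_intent(state: State) -> str:
    """Supervisor node: map a request to a route name (toy keyword rules)."""
    text = str(state["request"]).lower()
    if "customer" in text or "crm" in text:
        return "crm_lookup"
    if "notify" in text or "alert" in text:
        return "notify"
    return "deep_task"


def crm_lookup(state: State) -> State:
    # Shallow specialist: one tool call, no planning.
    return {**state, "result": "crm record fetched", "steps": 1}


def notify(state: State) -> State:
    # Shallow specialist: one tool call, no planning.
    return {**state, "result": "notification sent", "steps": 1}


def deep_task(state: State) -> State:
    # Deep specialist: stands in for a long-horizon harness that plans
    # and then executes many sequential steps.
    plan: List[str] = ["plan", "gather", "draft", "review"]
    trace = [f"did:{step}" for step in plan]
    return {**state, "result": trace[-1], "steps": len(plan)}


# The routing table plays the role of the conditional-edge mapping:
# every route is a named entry that can be unit-tested and version-controlled.
ROUTES: Dict[str, Callable[[State], State]] = {
    "crm_lookup": crm_lookup,
    "notify": notify,
    "deep_task": deep_task,
}


def run(request: str) -> State:
    state: State = {"request": request}
    route = classify_intent(state)
    return ROUTES[route]({**state, "route": route})
```

Calling `run("Look up customer 42")` takes the shallow CRM path, while a request that matches no keyword falls through to the deep specialist. Because `classify_intent` is an ordinary function, routing decisions can be tested separately from anything the specialists do.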


The model-abstraction layer

Everything from Part 2 lands here as one component: LiteLLM as a model-abstraction layer between your orchestration and the providers — a unified, OpenAI-compatible interface across 100+ providers, with fallback, per-request cost tracking, and budget enforcement (GitHub; docs). Your nodes call LiteLLM; LiteLLM routes to Bedrock or elsewhere by policy. This is also where data-boundary decisions live: keep regulated workloads on Bedrock-native models inside your VPC, and only send non-sensitive tasks to external providers. Adding or swapping a provider then requires no change to orchestration code — model-agnostic by construction.
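
A minimal sketch of the policy side of such a layer follows: pick candidate models by data sensitivity, then fall back down the list on failure. It does not call LiteLLM. The model names are hypothetical, and `call_model` is an injected placeholder that, in a real deployment, would presumably wrap whatever client the abstraction layer exposes.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class ModelTarget:
    name: str
    in_boundary: bool  # True if served inside the regulated data boundary


# Hypothetical targets, in order of preference.
TARGETS: Sequence[ModelTarget] = (
    ModelTarget("bedrock/in-vpc-primary", in_boundary=True),
    ModelTarget("external/general-model", in_boundary=False),
    ModelTarget("bedrock/in-vpc-fallback", in_boundary=True),
)


def candidates(sensitive: bool, targets: Sequence[ModelTarget] = TARGETS) -> List[ModelTarget]:
    """Sensitive requests may only use in-boundary models; others may use any."""
    if sensitive:
        return [t for t in targets if t.in_boundary]
    return list(targets)


def complete(
    prompt: str,
    sensitive: bool,
    call_model: Callable[[str, str], str],
    targets: Sequence[ModelTarget] = TARGETS,
) -> str:
    """Try each allowed target in order; fall back to the next on any error."""
    errors: List[str] = []
    for target in candidates(sensitive, targets):
        try:
            return call_model(target.name, prompt)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(f"{target.name}: {exc}")
    raise RuntimeError("all targets failed: " + "; ".join(errors))
```

The orchestration code only ever calls `complete`. Adding or swapping a provider means editing `TARGETS`, and the data-boundary rule lives in one function rather than being scattered across nodes.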

For cross-framework estates, the A2A protocol, now supported by more than 150 organizations (Linux Foundation), lets agents built in different frameworks call each other, and langchain-mcp-adapters converts MCP tools into LangChain tools your LangGraph nodes can call directly (GitHub).


A reference architecture

Put together, the stack reads top to bottom — a wide supervisor for Problem A, a deep specialist pool for Problem B, a shared tool gateway, and a model-abstraction layer underneath:


Figure: reference architecture.


A scale data point for the harness layer: NVIDIA documented taking LangGraph agents from a single user to 1,000 users in production (NVIDIA), and AWS and NVIDIA published a reference build combining NeMo, AgentCore, and Strands (AWS). Both are evidence the pattern holds at scale.


Adaptation tactics

  • Separate routing from the harness. A wide supervisor (LangGraph) and a deep harness (deepagents) are different tools; don't force one to be both.

  • Treat the runtime as orthogonal. AgentCore hosts any framework and any model — pick the framework on its merits (Part 3), then host it.

  • Put a model-abstraction layer in the middle. LiteLLM makes routing and model-swapping a policy change, and is where your VPC/data boundary is enforced.

  • Let the billing model work for you. CPU charges stop during I/O wait, so long agentic sessions no longer bill CPU for the time they spend waiting on a model or a tool.

That closes the playbook: the token scarcity era made "use the biggest model for everything" untenable, and the response is a disciplined stack — route deliberately, stay model-agnostic, choose the right harness, and assemble it on a runtime built for long, bursty, tool-heavy work. The same speed as the subsidy era, with the thinking added back.


This concludes The Token Scarcity Playbook. Start at → Part 0: The Token Scarcity Era.

