A single large language model can handle a surprising amount of work. But some problems are genuinely too complex, too long, or too parallelizable for one model call to handle well. Multi-agent workflows, where specialized agents collaborate under an orchestrator, are the practical answer to that ceiling. The hard part is not building the agents. It is coordinating them without creating a fragile, expensive mess.
This guide covers the decision criteria, the architectural patterns, and the failure modes that matter most in production systems.
When Multiple Agents Are Worth the Complexity
Before you build an agent swarm, be honest about whether you need one. The overhead is real: more API calls, more latency, more ways for things to go wrong. Add agents when you face at least one of these conditions.
- Context window overflow. Even with the 1M-token context windows available on Claude Fable 5 (claude-fable-5), Claude Opus 4.8 (claude-opus-4-8), and Claude Sonnet 4.6 (claude-sonnet-4-6), some workloads, like processing an entire codebase plus documentation plus test suites simultaneously, still exceed practical limits. Splitting the work across agents that each see a relevant slice is a clean solution.
- Independent parallel subtasks. If your pipeline has steps that do not depend on each other, running them in parallel with separate agents cuts wall-clock time significantly.
- Genuinely different expertise profiles. A code-review agent tuned on static analysis patterns performs differently from a security-audit agent tuned on vulnerability patterns. Specialization through system prompts and tool access, not just model selection, produces better results than one generalist agent trying to do everything.
- Long-horizon tasks with checkpoints. Tasks that span many steps benefit from agents that hand off verified intermediate outputs rather than one model accumulating unbounded context and drift.
If none of these apply, a single well-prompted model call is almost always faster, cheaper, and easier to debug.
Core Architecture: Orchestrator and Workers
The most durable pattern is a two-tier design: one orchestrator agent that plans and routes, and multiple worker agents that execute specialized subtasks.
The Orchestrator
The orchestrator receives the top-level goal, decomposes it into subtasks, dispatches those subtasks to the right workers, collects results, and assembles the final output. It should not do heavy domain work itself. Its job is coordination.
Use a capable model here. Claude Fable 5 (claude-fable-5) is a strong choice for orchestrators that need deep reasoning about task decomposition, since poor decomposition cascades into poor results everywhere downstream.
Keep the orchestrator’s system prompt focused on planning and delegation. Give it a structured output format, typically JSON, so downstream parsing is deterministic.
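A structured plan is only deterministic to parse if you reject anything that does not match the expected shape. The sketch below shows one way to do that, assuming each task object carries `id`, `worker`, and `input` keys. Those key names and the `parse_plan` helper are illustrative assumptions, not something the pattern prescribes.

```python
import json

# Hypothetical required keys for each task object in the orchestrator's plan.
REQUIRED_TASK_KEYS = {"id", "worker", "input"}


def parse_plan(raw: str) -> list[dict]:
    """Parse the orchestrator's raw text into a validated list of task dicts."""
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"plan is not valid JSON: {exc}") from exc

    if not isinstance(plan, list):
        raise ValueError("plan must be a JSON array of task objects")

    for index, task in enumerate(plan):
        if not isinstance(task, dict):
            raise ValueError(f"task {index} is not a JSON object")
        missing = REQUIRED_TASK_KEYS - task.keys()
        if missing:
            raise ValueError(f"task {index} is missing keys: {sorted(missing)}")
    return plan
```

Failing at parse time keeps a malformed plan from ever reaching a worker, which is cheaper than discovering the problem several calls later.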
Worker Agents
Workers receive a well-scoped subtask and the context they need to complete it. They should not need to understand the full pipeline. Each worker has its own system prompt, its own tool access, and ideally its own model selection based on the nature of the work.
For high-volume, lower-complexity worker tasks, Claude Haiku 4.5 (claude-haiku-4-5) keeps costs manageable with a 200K-token context window that is sufficient for most subtasks. For tasks requiring deeper reasoning, step up to Sonnet or Opus.
State and Message Passing
Agents communicate through messages, not shared memory. Define a message schema early and stick to it. A typical message includes: the task description, the input data, any relevant prior context, and the expected output format. Ambiguous handoffs are one of the most common sources of silent failure in multi-agent systems.
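One way to pin the schema down is a small dataclass that every handoff must round-trip through. This is a minimal sketch: the `TaskMessage` class and its field names are assumptions chosen to mirror the four parts listed above.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass(frozen=True)
class TaskMessage:
    """One handoff between agents. Field names are illustrative assumptions."""
    task: str            # what the worker should do
    input_data: dict     # the data to operate on
    output_format: str   # description of the output the worker must return
    prior_context: list = field(default_factory=list)  # short summaries only

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "TaskMessage":
        data = json.loads(raw)
        # cls(**data) raises TypeError on unknown or missing fields,
        # so a malformed handoff fails loudly instead of silently.
        return cls(**data)
```

A worker would receive `msg.to_json()` as its user message, and the receiving side would rebuild the object with `TaskMessage.from_json(...)` before acting on it.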
# Minimal Python example using the anthropic SDK
import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from environment


def run_worker(task: dict) -> dict:
    """Send one well-scoped task to a worker and parse its JSON reply."""
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=2048,
        system="You are a precise data extraction agent. Return only valid JSON.",
        messages=[
            {"role": "user", "content": json.dumps(task)}
        ],
    )
    # json.loads raises if the worker returned anything other than valid JSON.
    return json.loads(response.content[0].text)


def run_orchestrator(goal: str) -> list:
    """Ask the orchestrator to decompose the goal into a list of task objects."""
    response = client.messages.create(
        model="claude-fable-5",
        max_tokens=4096,
        thinking={"type": "adaptive"},
        system="You are a task orchestrator. Decompose the goal into subtasks. Return a JSON array of task objects.",
        messages=[
            {"role": "user", "content": goal}
        ],
    )
    # With thinking enabled, the final text block comes last in the content list.
    return json.loads(response.content[-1].text)
Note the thinking={"type": "adaptive"} parameter on the orchestrator call. On Claude 4.6 and later models, adaptive thinking replaces the old fixed budget approach. The model allocates reasoning effort based on task complexity rather than a hard token ceiling you have to guess at upfront. Because thinking blocks can precede the answer in the response, the orchestrator reads the last content block rather than the first.
Connecting Agents to External Context with MCP
Many real-world agents need access to tools, databases, or APIs, not just the content passed in the message. The Model Context Protocol (MCP) is an open standard that provides a consistent interface for connecting AI models to external tools and data sources.
Rather than writing bespoke tool integrations for every agent, MCP-compatible agents can discover and use tools through a shared protocol. This matters in multi-agent systems because it means your orchestrator and your workers can share tool definitions, reducing duplication and keeping behavior consistent across the pipeline.
If you are building agents that need to read from databases, call internal APIs, or interact with filesystems, investing in MCP compatibility early pays dividends as the agent count grows.
Failure Modes to Plan For
Multi-agent systems fail in ways that single-model pipelines do not. Here are the ones that show up most often in practice.
Cascading Errors
A bad output from an early worker propagates silently into every downstream agent that depends on it. By the time the final output looks wrong, the original error is buried in intermediate results. Build validation checkpoints between stages. Have the orchestrator verify that worker outputs meet the expected schema before passing them forward.
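A validation checkpoint can be as small as a function that runs between stages and names the stage that failed. The sketch below is a hand-rolled stand-in for a real schema validator; `validate_output` and the `schema` mapping of field name to Python type are assumptions made for illustration.

```python
def validate_output(stage: str, output: dict, schema: dict) -> dict:
    """Check that `output` has every field in `schema` with the expected type.

    `schema` maps field name -> Python type. Raises ValueError naming the
    stage, so the failure is attributed where it happened.
    """
    problems = []
    for name, expected_type in schema.items():
        if name not in output:
            problems.append(f"missing field '{name}'")
        elif not isinstance(output[name], expected_type):
            problems.append(
                f"field '{name}' is {type(output[name]).__name__}, "
                f"expected {expected_type.__name__}"
            )
    if problems:
        raise ValueError(f"stage '{stage}' failed validation: " + "; ".join(problems))
    return output
```

Because the function returns the output unchanged on success, it can wrap each worker call inline, for example `validate_output("extract", result, {"title": str, "score": float})`.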
Context Accumulation and Drift
As agents pass results forward, the accumulated context grows. Workers late in the pipeline receive bloated inputs full of intermediate reasoning that may contradict or confuse their task. Be selective about what you pass forward. Summarize intermediate outputs rather than appending raw completions.
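Selective forwarding can be sketched as a filter that keeps only a bounded summary from each prior result. This assumes each worker returns a `summary` field alongside its raw output; that field name, the `id` key, and the `build_forward_context` helper are hypothetical.

```python
def build_forward_context(results: list[dict], max_chars: int = 500) -> list[dict]:
    """Keep only the id and a length-capped summary from each prior result."""
    forwarded = []
    for result in results:
        summary = result.get("summary", "")
        if len(summary) > max_chars:
            # Truncate rather than forward the full text; mark the cut.
            summary = summary[:max_chars].rstrip() + "..."
        forwarded.append({"id": result["id"], "summary": summary})
    return forwarded
```

Raw completions and intermediate reasoning stay in your logs, where they are useful for debugging, instead of travelling down the pipeline.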
Ambiguous Task Boundaries
If two workers have overlapping responsibilities, they will either duplicate work or each assume the other handled something critical. Define task boundaries precisely in the orchestrator’s decomposition logic. When in doubt, err toward scope definitions that are more explicit, not less.
Runaway Loops
Agents that can spawn sub-agents, or that retry on failure without a hard ceiling, can generate runaway API costs. Set explicit maximum iteration counts and maximum spawn depth at the orchestrator level. Treat these limits as non-negotiable hard stops, not soft suggestions.
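Hard stops are easiest to enforce when every agent call has to pass through one budget object. The following is a minimal sketch under that assumption; `RunBudget`, `BudgetExceeded`, `run_with_retries`, and the default limits are all illustrative names and values, and the retry helper assumes validation failures surface as `ValueError`.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a pipeline run hits one of its hard limits."""


class RunBudget:
    """Tracks total agent calls and spawn depth for a single pipeline run."""

    def __init__(self, max_calls: int = 50, max_depth: int = 3):
        self.max_calls = max_calls
        self.max_depth = max_depth
        self.calls = 0

    def charge(self, depth: int) -> None:
        """Call before every agent invocation, including retries."""
        if depth > self.max_depth:
            raise BudgetExceeded(f"spawn depth {depth} exceeds limit {self.max_depth}")
        if self.calls >= self.max_calls:
            raise BudgetExceeded(f"call limit {self.max_calls} reached")
        self.calls += 1


def run_with_retries(call, budget: RunBudget, depth: int = 0, max_attempts: int = 3):
    """Retry `call` up to `max_attempts` times, charging the budget each time."""
    last_error = None
    for _ in range(max_attempts):
        budget.charge(depth)  # raised outside the try, so it is never swallowed
        try:
            return call()
        except ValueError as exc:  # assumption: validation failures raise ValueError
            last_error = exc
    raise BudgetExceeded(f"gave up after {max_attempts} attempts") from last_error
```

The budget check sits outside the `try` block on purpose: a retry loop must not be able to catch and ignore the limit it is supposed to respect.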
Latency Compounding
Sequential agent chains add latency at every step. A five-step pipeline where each step takes two seconds is a ten-second minimum, before retries. Map out which steps genuinely depend on prior results and run everything that can run in parallel, in parallel. Use async patterns in your orchestration code.
// TypeScript parallel dispatch example
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from environment

async function runWorkers(tasks: string[]): Promise<string[]> {
  // Start every request before awaiting any of them so they run concurrently.
  const promises = tasks.map((task) =>
    client.messages.create({
      model: "claude-haiku-4-5",
      max_tokens: 1024,
      messages: [{ role: "user", content: task }],
    })
  );
  // Promise.all rejects as soon as any single request fails.
  const results = await Promise.all(promises);
  return results.map((r) => (r.content[0] as { text: string }).text);
}
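Mapping out which steps depend on which can itself be mechanical. The Python sketch below groups steps into dependency levels and runs each level concurrently with `asyncio.gather`. The `dependency_levels` and `run_levels` helpers, and the idea of passing in a `worker` coroutine, are assumptions for illustration; a real worker would make an API call where the stub would sit.

```python
import asyncio


def dependency_levels(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group steps into levels; each step depends only on earlier levels."""
    remaining = {step: set(d) for step, d in deps.items()}
    done = set()
    levels = []
    while remaining:
        ready = sorted(s for s, d in remaining.items() if d <= done)
        if not ready:
            raise ValueError("dependency cycle or unknown dependency")
        levels.append(ready)
        done.update(ready)
        for step in ready:
            del remaining[step]
    return levels


async def run_levels(deps, worker):
    """Run each level concurrently; the levels themselves run in order."""
    results = {}
    for level in dependency_levels(deps):
        outputs = await asyncio.gather(*(worker(step) for step in level))
        results.update(zip(level, outputs))
    return results
```

For a graph such as `{"fetch": set(), "parse": {"fetch"}, "lint": {"fetch"}, "report": {"parse", "lint"}}`, the four steps collapse into three levels, so a pipeline of two-second steps has a floor of about six seconds rather than eight.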
Observability Is Not Optional
You cannot debug what you cannot observe. Every agent call should log at minimum: the model used, the input token count, the output token count, the latency, and whether the output passed validation. Aggregate these logs at the pipeline level so you can see which agents are the cost and latency bottlenecks.
Trace IDs that flow across all agents in a single pipeline run are essential. Without them, correlating a bad final output back to the worker that caused it requires guesswork.
Consider structured logging from the start. Retrofitting observability into a running multi-agent system is painful.
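A structured log record per agent call might look like the sketch below. The helper names and record fields are assumptions; with the anthropic Python SDK, the token counts would typically come from `response.usage.input_tokens` and `response.usage.output_tokens`.

```python
import json
import time
import uuid


def new_trace_id() -> str:
    """Generate one ID per pipeline run and pass it to every agent call."""
    return uuid.uuid4().hex


def log_agent_call(trace_id, agent, model, input_tokens, output_tokens,
                   latency_s, valid, sink=print):
    """Emit one JSON log line per agent call and return the record."""
    record = {
        "trace_id": trace_id,
        "agent": agent,
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "latency_s": round(latency_s, 3),
        "valid": valid,
        "ts": time.time(),
    }
    sink(json.dumps(record))
    return record


def cost_by_agent(records):
    """Aggregate token totals per agent to spot cost bottlenecks."""
    totals = {}
    for r in records:
        t = totals.setdefault(
            r["agent"], {"input_tokens": 0, "output_tokens": 0, "calls": 0}
        )
        t["input_tokens"] += r["input_tokens"]
        t["output_tokens"] += r["output_tokens"]
        t["calls"] += 1
    return totals
```

Filtering the log lines by `trace_id` then reconstructs a single pipeline run end to end, including which worker produced the first invalid output.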
A Note on Model Selection Across the Pipeline
Not every agent needs the most capable model. A useful heuristic: use the most capable model at the orchestration layer where task decomposition and judgment matter most. Step down to faster, more economical models for workers doing well-scoped, repetitive, or format-heavy tasks. Reserve the step back up to more capable models for workers handling tasks that require nuanced reasoning or where errors are especially costly.
The 1M-token context window on Fable 5, Opus 4.8, and Sonnet 4.6 gives you room for rich context at the orchestrator level without forcing you to trim aggressively. Haiku 4.5’s 200K window covers the vast majority of worker-level tasks without compromise.
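The heuristic can be written down as a small routing function so that model choice is explicit rather than scattered across call sites. This is a sketch only: the model IDs are the ones named in this guide, while the tier names, the `pick_model` helper, and the choice of Sonnet as the step-up worker model are illustrative assumptions.

```python
# Model IDs are the ones named in this guide; the tier names are illustrative.
MODEL_BY_TIER = {
    "orchestrator": "claude-fable-5",
    "reasoning_worker": "claude-sonnet-4-6",
    "routine_worker": "claude-haiku-4-5",
}


def pick_model(role: str, needs_deep_reasoning: bool = False,
               high_error_cost: bool = False) -> str:
    """Choose a model for an agent based on its role and task profile."""
    if role == "orchestrator":
        return MODEL_BY_TIER["orchestrator"]
    # Step back up for workers whose tasks are nuanced or costly to get wrong.
    if needs_deep_reasoning or high_error_cost:
        return MODEL_BY_TIER["reasoning_worker"]
    return MODEL_BY_TIER["routine_worker"]
```

Keeping the mapping in one table also makes it easy to re-evaluate tiers when your logs show where cost and errors actually concentrate.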
Takeaway
Multi-agent workflows earn their complexity when your problem genuinely requires parallelism, specialization, or scale beyond what a single model call can handle. The patterns are straightforward: a focused orchestrator, well-scoped workers, strict message schemas, and validation between stages. The failures are equally predictable: cascading errors, runaway loops, and invisible latency debt. Build the observability in before you need it, set hard limits on iteration and recursion, and choose your models deliberately at each layer. Done right, multi-agent coordination turns problems that seemed intractable into pipelines that are straightforward to reason about and extend.
