Nothing makes an AI app feel sluggish faster than a blank screen while the model composes a long answer. A well-tuned model might take several seconds to finish a paragraph, and much longer for a detailed response. If your interface simply waits for the full completion before showing anything, users stare at a spinner and assume something is broken. Streaming fixes this. Instead of waiting for the entire response, you display tokens as they arrive, so text appears almost immediately and flows in naturally.
This guide covers why streaming improves perceived performance, how to implement it on the server using the official Anthropic SDKs, and how to consume it cleanly on the client. The techniques apply whether you are building a chat interface, an agent, or a document tool.
Why streaming matters
The key insight is the difference between actual latency and perceived latency. The total time to generate a response does not change when you stream. What changes is the time to first token, the moment the user sees something happening. Dropping that from several seconds to a few hundred milliseconds transforms how responsive the app feels, even though the full answer takes exactly as long.
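To make the distinction concrete, here is a minimal sketch that times any stream of text chunks and reports time to first token separately from total time. The names `measure_stream` and `fake_model_stream` are hypothetical helpers invented for this illustration, and the fake stream only simulates a model with short sleeps.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, str]:
    """Consume a stream of text chunks and time it.

    Returns (time_to_first_token, total_time, full_text). Both times are in
    seconds, measured from the moment this function is called.
    """
    start = clock()
    first_token_at = None
    parts = []
    for chunk in chunks:
        if first_token_at is None:
            first_token_at = clock()  # the user sees something at this point
        parts.append(chunk)
    end = clock()
    # An empty stream never produced a first token, so fall back to the total.
    ttft = (first_token_at if first_token_at is not None else end) - start
    return ttft, end - start, "".join(parts)


def fake_model_stream(n_chunks: int = 5, delay: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a model: one chunk every `delay` seconds."""
    for i in range(n_chunks):
        time.sleep(delay)
        yield f"chunk{i} "


if __name__ == "__main__":
    ttft, total, _ = measure_stream(fake_model_stream())
    print(f"first token after {ttft:.3f}s, full answer after {total:.3f}s")
```

Without streaming, the user waits for the total time before seeing anything. With streaming, the wait they notice is the first number.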
Streaming brings several concrete benefits:
- Faster feedback. Users get visual confirmation that the system is working within the first moment, not after the full generation completes.
- Natural reading pace. Text often arrives at roughly the speed a person reads, so users start consuming the answer before it finishes.
- Early cancellation. If the response is heading in the wrong direction, users can stop it early instead of waiting for content they will discard.
- Progress for long outputs. For lengthy generations, streaming is the difference between a usable experience and an apparent hang.
How streaming works over the wire
Under the hood, the Anthropic API delivers streamed responses as a sequence of server-sent events. Rather than a single JSON blob at the end, you receive a series of typed events: message_start when the message begins, then for each content block a content_block_start, one or more content_block_delta events carrying incremental text, and a content_block_stop, and finally message_stop when the message ends. The official SDKs abstract most of this away, but it helps to know that each chunk is a small delta you append to what you have already shown.
One practical note: models with adaptive thinking may emit separate thinking content before the final answer. You can choose whether to surface that reasoning to users or hold it back and only stream the answer text. Handle content blocks by type so you can route each stream to the right place in your UI.
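As a rough illustration of routing by type, the sketch below folds a list of already-decoded event payloads into separate thinking and answer text. The function name `route_events` and the sample events are made up for this example. The field names (`content_block_delta`, `text_delta`, `thinking_delta`) follow the streaming event shapes as I understand them, so check them against the API reference before relying on them.

```python
from typing import Dict, Iterable, List


def route_events(events: Iterable[dict]) -> Dict[str, str]:
    """Fold decoded stream events into separate thinking and answer text."""
    out: Dict[str, List[str]] = {"thinking": [], "text": []}
    for event in events:
        kind = event.get("type")
        if kind == "content_block_delta":
            delta = event["delta"]
            if delta["type"] == "text_delta":
                out["text"].append(delta["text"])          # answer text
            elif delta["type"] == "thinking_delta":
                out["thinking"].append(delta["thinking"])  # reasoning text
            # Other delta types (signatures, tool input) are ignored here.
        elif kind == "message_stop":
            break
    return {name: "".join(parts) for name, parts in out.items()}


# Hand-written sample events, trimmed to the fields this sketch reads.
sample_events = [
    {"type": "message_start", "message": {"role": "assistant"}},
    {"type": "content_block_start", "index": 0,
     "content_block": {"type": "thinking", "thinking": ""}},
    {"type": "content_block_delta", "index": 0,
     "delta": {"type": "thinking_delta", "thinking": "User wants a greeting."}},
    {"type": "content_block_stop", "index": 0},
    {"type": "content_block_start", "index": 1,
     "content_block": {"type": "text", "text": ""}},
    {"type": "content_block_delta", "index": 1,
     "delta": {"type": "text_delta", "text": "Hello, "}},
    {"type": "content_block_delta", "index": 1,
     "delta": {"type": "text_delta", "text": "world"}},
    {"type": "content_block_stop", "index": 1},
    {"type": "message_stop"},
]

if __name__ == "__main__":
    print(route_events(sample_events))
```

In a real UI you would call a render function per delta instead of collecting strings, but the dispatch on event and delta type stays the same.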
Server-side implementation with Python
The Python SDK exposes a streaming helper that yields events as they arrive. Set your ANTHROPIC_API_KEY environment variable, then use the stream context manager. Here is a minimal example using Claude Sonnet 4.6:
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Explain streaming responses in two paragraphs."}
    ],
) as stream:
    # Print each text delta as soon as it arrives.
    for text in stream.text_stream:
        print(text, end="", flush=True)
    # Once the stream is exhausted, fetch the fully assembled message.
    final_message = stream.get_final_message()
The text_stream iterator gives you just the incremental text, which is the common case for chat UIs. When you need finer control (for example, to distinguish thinking blocks from answer blocks), iterate over the raw events instead and switch on the event type.
Forwarding the stream to a web client
In most real apps, the model call happens on your server and the tokens need to reach a browser. The clean pattern is to keep your API key server-side and re-stream the output to the client over your own endpoint, typically as server-sent events. Here is a sketch using TypeScript and the Node SDK with a standard streaming response:
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

export async function handler(req: Request): Promise<Response> {
  const encoder = new TextEncoder();

  const stream = new ReadableStream({
    async start(controller) {
      const messageStream = client.messages.stream({
        model: "claude-sonnet-4-6",
        max_tokens: 1024,
        messages: [
          { role: "user", content: "Give me three tips for clean code." },
        ],
      });

      // Forward each text delta to the browser as one SSE event.
      messageStream.on("text", (text) => {
        controller.enqueue(
          encoder.encode(`data: ${JSON.stringify({ text })}\n\n`)
        );
      });

      // Send a sentinel so the client knows the answer is complete.
      messageStream.on("end", () => {
        controller.enqueue(encoder.encode("data: [DONE]\n\n"));
        controller.close();
      });

      messageStream.on("error", (err) => {
        controller.error(err);
      });
    },
  });

  return new Response(stream, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache",
      "Connection": "keep-alive",
    },
  });
}
This keeps credentials on the server, gives you a place to enforce auth and rate limits, and lets you reshape the payload before it reaches the browser. Never call the model API directly from client code with your key exposed.
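If your server is written in Python rather than Node, the same framing can be sketched as a small generator. The name `sse_frames` is hypothetical, and how you hand the resulting bytes to a streaming HTTP response depends on your web framework, which is not shown here.

```python
import json
from typing import Iterable, Iterator


def sse_frames(text_chunks: Iterable[str]) -> Iterator[bytes]:
    """Wrap text chunks as server-sent events, ending with a [DONE] sentinel."""
    for text in text_chunks:
        # json.dumps escapes newlines, so each payload stays on one line and
        # cannot be mistaken for the blank-line event delimiter.
        payload = json.dumps({"text": text})
        yield f"data: {payload}\n\n".encode("utf-8")
    yield b"data: [DONE]\n\n"


if __name__ == "__main__":
    for frame in sse_frames(["Hello, ", "world"]):
        print(frame)
```

In the earlier Python example, passing `stream.text_stream` as `text_chunks` would produce the same payload shape the TypeScript handler emits, under the assumption that your framework accepts an iterator of bytes as a response body.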
Consuming the stream in the browser
On the client, read the response body as a stream and append each chunk to your rendered output. Using the Fetch API and a reader keeps things dependency-free:
async function streamChat(prompt) {
  const res = await fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt }),
  });

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = ""; // decoded text that does not yet form a complete event

  while (true) {
    const { value, done } = await reader.read();
    if (done) break;

    // { stream: true } keeps multi-byte characters intact across chunks.
    buffer += decoder.decode(value, { stream: true });

    // Complete events end with a blank line; keep the unfinished tail.
    const lines = buffer.split("\n\n");
    buffer = lines.pop();

    for (const line of lines) {
      if (!line.startsWith("data: ")) continue;
      const payload = line.slice(6);
      if (payload === "[DONE]") return;
      const { text } = JSON.parse(payload);
      appendToUI(text);
    }
  }
}
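The buffering matters. Network chunks do not always align with event boundaries, so accumulate the decoded text in a buffer and split on the event delimiter rather than assuming each read is a complete event. Whatever follows the last delimiter stays in the buffer until the next read completes it.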
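The same buffering logic can be sketched in Python, which makes it easy to check that the result does not depend on where the chunks happen to split. `SSETextParser` is a hypothetical class written for this illustration. It assumes the payload format used in this guide (`data: {"text": ...}` events and a `[DONE]` sentinel), not the full server-sent events specification.

```python
import codecs
import json
from typing import List


class SSETextParser:
    """Incrementally parse `data: {...}` events from arbitrarily split bytes."""

    def __init__(self) -> None:
        # An incremental decoder holds back a partial multi-byte character
        # when a chunk boundary falls in the middle of it.
        self._decoder = codecs.getincrementaldecoder("utf-8")()
        self._buffer = ""
        self.done = False

    def feed(self, chunk: bytes) -> List[str]:
        """Return the text payloads completed by this chunk (possibly none)."""
        self._buffer += self._decoder.decode(chunk)
        # Everything before the last delimiter is complete; keep the tail.
        *events, self._buffer = self._buffer.split("\n\n")
        texts = []
        for event in events:
            if self.done or not event.startswith("data: "):
                continue
            payload = event[len("data: "):]
            if payload == "[DONE]":
                self.done = True
            else:
                texts.append(json.loads(payload)["text"])
        return texts


if __name__ == "__main__":
    wire = b'data: {"text": "Hello, "}\n\ndata: {"text": "world"}\n\ndata: [DONE]\n\n'
    parser = SSETextParser()
    pieces = []
    for i in range(0, len(wire), 5):  # deliberately awkward 5-byte chunks
        pieces.extend(parser.feed(wire[i:i + 5]))
    print("".join(pieces), parser.done)
```

Feeding the same bytes one at a time or all at once yields the same sequence of texts, which is the property the browser code relies on.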
Handling the rough edges
Streaming introduces a few concerns you should plan for rather than discover in production:
- Cancellation. Give users a stop button. On the client, abort the fetch with an AbortController. On the server, that dropped connection should stop the upstream model call so you are not billed for tokens no one will see.
- Errors mid-stream. A failure can happen after you have already shown partial text. Decide whether to keep the partial output with an error notice or roll it back. Surface a clear message either way.
- Rendering markdown safely. If you render markdown, re-parse the accumulated text on each update rather than parsing fragments, and sanitize output to avoid injection.
- Adaptive thinking. On Claude 4.6 and newer models you can enable adaptive thinking with thinking: {type: "adaptive"}. If you do, expect thinking content in the stream and decide whether to display it, collapse it behind a toggle, or discard it before the answer streams.
- Buffering by proxies. Some reverse proxies buffer responses and defeat streaming. Disable buffering for your streaming route so chunks flush immediately.
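The cancellation and mid-stream error cases can be handled in one place on the server. The following is a minimal sketch, assuming a hypothetical `guarded_stream` wrapper and an `is_cancelled` callback that your server would wire to its own disconnect detection. It guarantees that every stream ends with exactly one terminal event, so the client always knows whether the partial text it holds is complete.

```python
from typing import Callable, Iterable, Iterator, Tuple

# (kind, payload), where kind is "text", "done", "cancelled", or "error".
Event = Tuple[str, str]


def guarded_stream(
    chunks: Iterable[str],
    is_cancelled: Callable[[], bool] = lambda: False,
) -> Iterator[Event]:
    """Relay text chunks, always finishing with exactly one terminal event."""
    try:
        for chunk in chunks:
            if is_cancelled():
                # Stop pulling from the upstream. The caller is expected to
                # close the upstream stream so generation actually stops.
                yield ("cancelled", "")
                return
            yield ("text", chunk)
    except Exception as exc:
        # The upstream failed after partial output may have been shown.
        yield ("error", str(exc))
        return
    yield ("done", "")


def flaky_upstream() -> Iterator[str]:
    """Hypothetical upstream that fails after two chunks."""
    yield "partial "
    yield "answer"
    raise RuntimeError("connection reset")


if __name__ == "__main__":
    for kind, payload in guarded_stream(flaky_upstream()):
        print(kind, repr(payload))
```

On an "error" event the client can keep the partial text and attach a notice, or roll it back. The wrapper only reports what happened and leaves that choice to the UI.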
Choosing a model for streaming apps
Streaming works the same across the Claude lineup, so pick based on the task rather than the transport. For fast, high-volume interactions where responsiveness is paramount, Claude Haiku 4.5 is a strong fit. For most general chat and reasoning, Claude Sonnet 4.6 balances capability and speed. For the hardest problems, Claude Opus 4.8 and the most capable model, Claude Fable 5, are available. Opus, Sonnet, and Fable all offer a 1M-token context window, while Haiku offers 200K. Whichever you choose, streaming makes longer generations feel far more responsive.
Takeaway
Streaming does not make your model faster, but it makes your app feel dramatically faster by cutting time to first token and letting users read as content arrives. Keep your API key server-side, re-stream to the client over server-sent events, buffer carefully when parsing chunks, and plan for cancellation and mid-stream errors. Get these fundamentals right and you turn a silent wait into a responsive, trustworthy experience.
