Streaming LLM Responses: Server-Sent Events, Chunked Transfer, and the UX of Waiting
The 8-Second Problem
Your chatbot takes 6 seconds to think. Without streaming, your user stares at a blank screen, wonders if it crashed, and refreshes. You've lost them.
With streaming, the first token appears in 400 milliseconds. The same 6-second response now feels instant because the user is reading along as the model writes. Total time is identical. Perceived latency drops by more than 90%.
This isn't a minor polish item. Streaming is the single biggest UX improvement you can make to any LLM application. Every major chat interface — ChatGPT, Claude, Gemini — streams tokens for exactly this reason. And the technology behind it? A humble HTTP standard from 2006 called Server-Sent Events that found its killer app two decades later.
In this post, we'll build the complete streaming pipeline: from the raw bytes on the wire, through Python API clients and FastAPI relay servers, all the way to smooth browser rendering with a blinking cursor. We'll cover the edge cases that separate tutorials from production, compare latency across models, and build an interactive demo where you can watch streaming in action.
How Streaming Actually Works: The Wire Protocol
Before touching any SDK, let's see what streaming looks like at the HTTP level. When you set stream: true on an LLM API call, the server doesn't return a single JSON response. Instead, it opens a long-lived HTTP connection and sends data incrementally using Server-Sent Events (SSE).
SSE is dead simple. The server sets Content-Type: text/event-stream, and then sends a series of text messages. Each message has a data: field followed by a payload, and messages are separated by a blank line (\n\n):
```http
HTTP/1.1 200 OK
Content-Type: text/event-stream
Cache-Control: no-cache
Transfer-Encoding: chunked

# First SSE message — a token arrives
data: {"choices":[{"delta":{"content":"Hello"}}]}

# Second SSE message — another token
data: {"choices":[{"delta":{"content":" world"}}]}

# Stream terminates
data: [DONE]
```
That's it. Each data: line carries one chunk (usually one token), and the double newline tells the client "this message is complete, process it." The SSE spec also defines optional fields:
- `event:` — names the event type (Anthropic uses this; OpenAI doesn't)
- `id:` — event ID for resuming after disconnection
- `retry:` — tells the client how long to wait before reconnecting (in ms)
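As a concrete illustration, here is a minimal Python sketch of parsing one SSE message into these fields (string handling only, no SDK, and not a full implementation of the spec):

```python
def parse_sse_message(message: str) -> dict:
    """Parse a single SSE message (the text between blank-line separators)."""
    fields = {"event": "message", "data": []}
    for line in message.split("\n"):
        if line.startswith("data:"):
            # Strip the field name and the optional single leading space
            fields["data"].append(line[5:].lstrip(" "))
        elif line.startswith("event:"):
            fields["event"] = line[6:].strip()
        elif line.startswith("id:"):
            fields["id"] = line[3:].strip()
        elif line.startswith("retry:"):
            fields["retry"] = int(line[6:].strip())
    # Multiple data: lines in one message are joined with newlines per the spec
    fields["data"] = "\n".join(fields["data"])
    return fields

msg = 'event: content_block_delta\ndata: {"delta": {"text": "Hi"}}'
parsed = parse_sse_message(msg)
```

Feeding it an Anthropic-style message yields `{"event": "content_block_delta", "data": ...}`; an OpenAI-style message keeps the default `"message"` event type.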
Why SSE instead of WebSockets? Because LLM streaming is one-directional. The server streams tokens to the client — the client doesn't need to send anything back mid-stream. SSE is simpler, works over standard HTTP, passes through proxies and load balancers without special configuration, and supports automatic reconnection. WebSockets are full-duplex and binary-capable, which is overkill here.
Under the hood, SSE uses HTTP/1.1's Transfer-Encoding: chunked (the server doesn't know the total response size upfront). On HTTP/2, chunked encoding doesn't exist — the protocol's binary framing handles it natively. This matters because HTTP/1.1 browsers limit you to 6 concurrent SSE connections per domain, while HTTP/2 allows 100+ concurrent streams on a single TCP connection.
Consuming Streams from LLM APIs
Let's move from theory to code. Both OpenAI and Anthropic support streaming, but their event formats differ in instructive ways.
OpenAI: delta objects
OpenAI streams ChatCompletionChunk objects. The key difference from a normal response: instead of message.content, you get delta.content — just the new token, not the full text so far. You concatenate them yourself:
```python
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain streaming in 3 sentences."}],
    stream=True,
    stream_options={"include_usage": True},  # get token counts
)

full_response = ""
for chunk in stream:
    # The final chunk has empty choices but carries usage data
    if chunk.choices:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
            full_response += delta
    # Usage arrives in that final chunk (only with include_usage set)
    if chunk.usage:
        print(f"\n[Tokens: {chunk.usage.total_tokens}]")
```
On the wire, OpenAI sends only data: lines — no event: field. Each line contains a JSON chat.completion.chunk object. The stream ends with the literal string data: [DONE] (not valid JSON — the SDK handles this for you).
Anthropic: typed event stream
Anthropic takes a more structured approach. Each SSE message has both an event: type and a data: payload, giving you a clear state machine:
```python
import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain streaming in 3 sentences."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

# After the stream closes, get the full message with usage
message = stream.get_final_message()
print(f"\n[Tokens: {message.usage.input_tokens + message.usage.output_tokens}]")
```
Under the hood, Anthropic's stream emits a sequence of typed events: message_start (opens the message), content_block_start (opens a text block), a series of content_block_delta events (each carrying a text_delta with a token), content_block_stop, message_delta (carries stop_reason and cumulative usage), and message_stop. There's no [DONE] sentinel — message_stop is the termination signal.
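If you drop below the `text_stream` convenience, handling the stream is a dispatch over those event types. Here is a sketch using dict-shaped stand-ins for the events (the real SDK yields typed objects, but the field layout mirrors the wire format described above):

```python
def handle_event(event: dict, state: dict) -> dict:
    """Dispatch one Anthropic-style stream event; a sketch, not exhaustive."""
    if event["type"] == "content_block_delta" and event["delta"]["type"] == "text_delta":
        state["text"] += event["delta"]["text"]          # accumulate tokens
    elif event["type"] == "message_delta":
        state["stop_reason"] = event["delta"].get("stop_reason")
    elif event["type"] == "message_stop":
        state["done"] = True                             # termination signal
    return state

# A dict-shaped stand-in for the event sequence described above
events = [
    {"type": "message_start"},
    {"type": "content_block_start"},
    {"type": "content_block_delta", "delta": {"type": "text_delta", "text": "Hello"}},
    {"type": "content_block_delta", "delta": {"type": "text_delta", "text": " world"}},
    {"type": "content_block_stop"},
    {"type": "message_delta", "delta": {"stop_reason": "end_turn"}},
    {"type": "message_stop"},
]

state = {"text": "", "done": False}
for ev in events:
    handle_event(ev, state)
```

After the loop, `state` holds the accumulated text, the stop reason, and the done flag: exactly the state machine the SDK maintains for you.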
The raw view: what the SDK hides
SDKs are convenient, but they hide the SSE parsing. Here's what it looks like to consume a streaming response with raw HTTP, using httpx:
```python
import httpx
import json

def stream_raw(prompt: str, api_key: str):
    """Consume OpenAI streaming response with raw HTTP."""
    with httpx.stream(
        "POST",
        "https://api.openai.com/v1/chat/completions",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        json={
            "model": "gpt-4o",
            "messages": [{"role": "user", "content": prompt}],
            "stream": True,
        },
    ) as response:
        buffer = ""
        for raw_bytes in response.iter_bytes():
            buffer += raw_bytes.decode("utf-8")
            # SSE messages are separated by double newlines
            while "\n\n" in buffer:
                message, buffer = buffer.split("\n\n", 1)
                for line in message.split("\n"):
                    if line.startswith("data: "):
                        payload = line[6:]  # strip "data: " prefix
                        if payload == "[DONE]":
                            return
                        chunk = json.loads(payload)
                        delta = chunk["choices"][0]["delta"]
                        if "content" in delta:
                            print(delta["content"], end="", flush=True)
```
The key insight: network chunks don't align with SSE messages. A single iter_bytes() read might contain half a message, or three messages. You accumulate text in a buffer and split on \n\n to extract complete messages. This is exactly the parsing every SSE library does internally.
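That buffer-and-split logic can be isolated into a tiny incremental parser. This sketch (pure Python, no network) shows that arbitrary chunk boundaries still produce complete messages:

```python
class SSEBuffer:
    """Accumulates raw text and yields complete SSE data payloads."""

    def __init__(self):
        self.buffer = ""

    def feed(self, chunk: str):
        """Feed one network chunk; yield the data payload of each complete message."""
        self.buffer += chunk
        while "\n\n" in self.buffer:
            # Everything before the first blank line is one complete message
            message, self.buffer = self.buffer.split("\n\n", 1)
            for line in message.split("\n"):
                if line.startswith("data: "):
                    yield line[6:]

# The same stream, deliberately split at awkward byte boundaries:
parser = SSEBuffer()
received = []
for chunk in ['data: {"text": "Hel', 'lo"}\n\ndata: ', '[DONE]\n\n']:
    received.extend(parser.feed(chunk))
```

The first chunk ends mid-message and yields nothing; the second completes it. The parser's output is independent of how the network happened to slice the bytes, which is the whole point.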
The Server Relay: FastAPI as a Streaming Proxy
In most real applications, the browser doesn't talk directly to OpenAI or Anthropic. Your backend sits in between — adding authentication, logging, rate limiting, and content filtering. The challenge: your backend receives a stream from the LLM API and must relay it to the browser without buffering the entire response first.
FastAPI's StreamingResponse makes this straightforward with an async generator:
```python
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI
import json

app = FastAPI()
client = AsyncOpenAI()

async def relay_stream(prompt: str, request: Request):
    """Relay OpenAI stream to browser as SSE."""
    stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    try:
        async for chunk in stream:
            # If the client disconnected, stop burning tokens
            if await request.is_disconnected():
                await stream.close()
                return
            delta = chunk.choices[0].delta.content if chunk.choices else None
            if delta:
                yield f"data: {json.dumps({'text': delta})}\n\n"
        yield "data: [DONE]\n\n"
    except Exception:
        await stream.close()
        raise

@app.post("/api/chat")
async def chat(request: Request):
    body = await request.json()
    return StreamingResponse(
        relay_stream(body["prompt"], request),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"},
    )
```
A few critical details here:
- Client disconnect detection — The `request.is_disconnected()` check is essential. If a user closes the tab mid-stream, you want to cancel the upstream LLM request immediately. Without this, you keep receiving (and paying for) tokens that nobody will read. On a long response, that's real money.
- `X-Accel-Buffering: no` — Nginx (and many reverse proxies) buffer responses by default. This header tells the proxy to pass chunks through immediately. Without it, your carefully streamed tokens arrive in one big batch — defeating the entire purpose.
- `Cache-Control: no-cache` — Prevents intermediate caches from holding the response. SSE connections must be passed through in real time.
For production SSE with more features (automatic keep-alive pings, named events, retry headers), check out the `sse-starlette` library. Its `EventSourceResponse` handles SSE formatting automatically and adds configurable ping intervals to keep connections alive through aggressive proxy timeouts.
Browser Consumption: From Bytes to UI
The stream reaches the browser. Now we need to parse it and render tokens smoothly. There are two approaches, and the choice matters.
EventSource: the simple path
The browser has a built-in SSE client called EventSource. It handles parsing, automatic reconnection, and event dispatch:
```javascript
const source = new EventSource("/api/chat?prompt=Hello");

source.onmessage = (event) => {
  if (event.data === "[DONE]") {
    source.close();
    return;
  }
  const { text } = JSON.parse(event.data);
  document.getElementById("output").textContent += text;
};

source.onerror = () => source.close();
```
Clean and simple — but EventSource has a fatal limitation for LLM apps: it only supports GET requests. You can't POST a JSON body with your conversation history. You're limited to stuffing everything into URL query parameters, which caps out around 2,000 characters. That's one or two messages of context, at best.
fetch + ReadableStream: the production path
The modern approach uses fetch() with a ReadableStream. You get full control over the request (POST, custom headers, JSON body) and parse SSE yourself:
```javascript
async function streamChat(prompt) {
  const response = await fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt }),
  });

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { value, done } = await reader.read();
    if (done) break;

    buffer += decoder.decode(value, { stream: true });

    // Split on double newline — the SSE message boundary
    const messages = buffer.split("\n\n");
    buffer = messages.pop(); // last element is incomplete — keep buffering

    for (const msg of messages) {
      const line = msg.trim();
      if (!line.startsWith("data: ")) continue;
      const payload = line.slice(6);
      if (payload === "[DONE]") return;
      const { text } = JSON.parse(payload);
      appendToken(text); // your rendering function
    }
  }
}
```
The critical line is messages.pop(). Network chunks don't respect SSE boundaries — a single reader.read() might return half a message. By splitting on \n\n and keeping the last fragment in the buffer, we handle partial messages correctly.
Smooth rendering with requestAnimationFrame
A naive appendToken that updates the DOM on every token causes visible jank, especially at high token rates (100+ tokens/sec). The fix: batch DOM updates to the browser's paint cycle:
```javascript
let tokenQueue = [];
let animating = false;

function appendToken(text) {
  tokenQueue.push(text);
  if (!animating) {
    animating = true;
    requestAnimationFrame(flushTokens);
  }
}

function flushTokens() {
  if (tokenQueue.length === 0) {
    animating = false;
    return;
  }
  // Flush all queued tokens in one DOM write
  const output = document.getElementById("output");
  output.textContent += tokenQueue.join("");
  tokenQueue = [];
  requestAnimationFrame(flushTokens);
}
```
This batches all tokens that arrive between paint frames into a single DOM update. At 60fps, that's one update every ~16ms — smooth for the user, efficient for the browser.
The Hard Parts: Streaming Edge Cases
Tutorials make streaming look easy. Production makes it hard. Here are the edge cases that will bite you.
Streaming structured output
When your LLM returns JSON, it arrives character by character: `{`, then `"na`, then `me":`, then `"Al`... You can't parse until the full JSON is complete. The pattern: accumulate the entire response, then parse after the stream ends.
```javascript
let jsonBuffer = "";

for await (const token of stream) {
  jsonBuffer += token;
  // Optionally: try parsing on each token for progressive display
  try {
    const partial = JSON.parse(jsonBuffer);
    renderPartialResult(partial); // display what we have so far
  } catch {
    // Not valid JSON yet — keep accumulating
  }
}

const result = JSON.parse(jsonBuffer); // final, complete parse
```
The try/catch on every token might seem expensive, but JSON.parse is fast and this lets you show partial results progressively — displaying extracted fields as they become complete. (This ties back to our structured output post — streaming and structured output are inherently in tension.)
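Here is the same accumulate-and-retry pattern sketched in Python, with a short token list standing in for a live stream:

```python
import json

def accumulate_json(tokens):
    """Accumulate streamed tokens; return (partial parses seen, final parse)."""
    buffer = ""
    partials = []
    for token in tokens:
        buffer += token
        try:
            # Succeeds only once the buffer is a complete JSON document
            partials.append(json.loads(buffer))
        except json.JSONDecodeError:
            pass  # not valid JSON yet, keep accumulating
    return partials, json.loads(buffer)

# Tokens as they might arrive from a model emitting {"name": "Alice"}
tokens = ['{"na', 'me":', ' "Al', 'ice"', '}']
partials, final = accumulate_json(tokens)
```

Note that for a flat object like this, the parse only succeeds on the very last token; progressive display pays off on longer outputs, such as arrays where early elements complete while later ones are still streaming.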
Streaming with tool use
When an agent model decides to call a tool, the stream switches modes. You've been receiving text tokens, then suddenly you get a tool call object. With Anthropic, this is a new content_block_start with type: "tool_use", followed by input_json_delta events that stream the tool arguments as partial JSON. You must buffer the entire tool call, execute the tool, then feed the result back to continue generation. (See our agents post for the full pattern.)
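A sketch of that buffering step, using dict-shaped stand-ins for Anthropic's tool-use events (the real SDK yields typed objects with the same fields):

```python
import json

def collect_tool_call(events):
    """Buffer input_json_delta fragments until the tool call is complete."""
    name = None
    fragments = []
    for event in events:
        if event["type"] == "content_block_start" and event["content_block"]["type"] == "tool_use":
            name = event["content_block"]["name"]
        elif event["type"] == "content_block_delta" and event["delta"]["type"] == "input_json_delta":
            fragments.append(event["delta"]["partial_json"])
        elif event["type"] == "content_block_stop":
            break  # the tool call block is finished
    # The arguments only parse once every fragment has arrived
    return name, json.loads("".join(fragments))

events = [
    {"type": "content_block_start", "content_block": {"type": "tool_use", "name": "get_weather"}},
    {"type": "content_block_delta", "delta": {"type": "input_json_delta", "partial_json": '{"city": "Par'}},
    {"type": "content_block_delta", "delta": {"type": "input_json_delta", "partial_json": 'is"}'}},
    {"type": "content_block_stop"},
]
name, args = collect_tool_call(events)
```

Once `name` and `args` are complete, you execute the tool and append its result to the conversation before requesting the next streamed turn.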
Cancellation propagation
When the user hits "Stop generating," you need to propagate the cancellation through every layer:
- Browser: `reader.cancel()` or `AbortController.abort()`
- Server: detect the closed connection, cancel the upstream API call
- LLM API: stop generating tokens (you stop paying immediately)
If any link in this chain breaks, you're burning tokens and money on a response nobody will read. The FastAPI relay code above handles this with request.is_disconnected() — but test it. Many production setups have a proxy layer (Nginx, Cloudflare) that keeps the upstream connection alive even after the client disconnects.
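At the relay layer itself, a try/finally inside the generator is a simple way to guarantee the upstream cleanup runs on every exit path: when the framework closes the generator after a disconnect, GeneratorExit passes through the finally block. A minimal synchronous sketch (async generators behave the same way via `aclose()`):

```python
def relay(upstream, cleanup):
    """Yield tokens from upstream; guarantee cleanup on any exit path."""
    try:
        for token in upstream:
            yield token
    finally:
        # Runs on normal completion, on exceptions, AND on .close()
        cleanup()

closed = []
gen = relay(iter(["a", "b", "c"]), lambda: closed.append(True))
first = next(gen)   # consume one token...
gen.close()         # ...then the client disconnects mid-stream
```

Here `cleanup` stands in for whatever cancels the upstream API call. The point is that it fires even when the consumer abandons the generator partway through.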
Token counting mid-stream
For real-time cost display, you need to count tokens as they stream. OpenAI provides usage data in the final chunk if you set stream_options.include_usage. Anthropic sends cumulative usage in the message_delta event. For mid-stream estimates before those arrive, count tokens client-side using a rough heuristic: one token is roughly 0.75 words of English text (about 1.3 tokens per word), or use a tokenizer library like tiktoken on your backend.
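A sketch of that rule of thumb, using the common approximation that one token is about 0.75 English words (the `estimate_cost` helper and its price argument are hypothetical, for a running cost display):

```python
def estimate_tokens(text: str) -> int:
    """Rough client-side estimate: ~1.3 tokens per English word."""
    return round(len(text.split()) * 1.3)

def estimate_cost(text: str, usd_per_million_tokens: float) -> float:
    """Hypothetical running-cost figure from the token estimate."""
    return estimate_tokens(text) * usd_per_million_tokens / 1_000_000

est = estimate_tokens("the quick brown fox")  # 4 words -> ~5 tokens
```

This is only for display while the stream is in flight; once the final usage chunk or `message_delta` arrives, replace the estimate with the provider's exact count.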
Latency Across Models: TTFT and ITL
Two metrics define the streaming experience:
- TTFT (Time to First Token) — How long until the first token appears. This is what the user perceives as "response time." Determined by prompt processing time, model size, and server queue depth.
- ITL (Inter-Token Latency) — Time between consecutive tokens. Determines how smooth the streaming looks. At <10ms ITL, text appears to flow; at >50ms, it visibly stutters.
Here's how popular models compare (data from Artificial Analysis, live benchmarks):
| Model | TTFT | Output Speed | Per-Token |
|---|---|---|---|
| GPT-4o | 0.45s | 145 tok/s | ~6.9ms |
| GPT-4o mini | ~0.35s | ~200 tok/s | ~5ms |
| Claude 4.5 Sonnet | 1.24s | 71 tok/s | ~14ms |
| Claude Sonnet 4.6 | 0.85s | 54 tok/s | ~18.6ms |
| Gemini 2.0 Flash | ~0.34s | ~300 tok/s | ~3.3ms |
| Llama 4 Scout (Groq) | ~0.33s | ~2600 tok/s | ~0.4ms |
A few things jump out. The small, optimized models (Gemini Flash, Llama on Groq) are absurdly fast — Llama on Groq streams at 2,600 tokens/second, faster than most people can read. The frontier models (GPT-4o, Claude Sonnet) are slower but still well within the "feels smooth" range at 50-150 tokens/second.
The key insight: streaming doesn't make the model faster. The total generation time is the same whether you stream or not. What streaming does is move the user's perceived wait from TTFT + total_generation to just TTFT. A 500-token response from Claude 4.5 Sonnet takes ~8.3 seconds either way — but with streaming, the user starts reading after 1.2 seconds instead of waiting 8.3.
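The arithmetic behind that comparison, using the Claude 4.5 Sonnet numbers from the table:

```python
def generation_time(n_tokens: int, ttft: float, tokens_per_sec: float) -> float:
    """Total wall-clock time for a streamed response."""
    return ttft + n_tokens / tokens_per_sec

# 500 tokens at TTFT 1.24s and 71 tok/s
total = generation_time(500, ttft=1.24, tokens_per_sec=71)
# Without streaming the user waits `total`; with streaming, only the TTFT
```

Streaming changes neither term of the sum; it only changes which of the two the user experiences as waiting.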
Try It: Streaming Playground
Watch tokens stream in real time. Compare streaming vs. no streaming to see the UX difference.
Putting It All Together
Here's the complete pipeline, end to end:
```text
# The streaming journey of a single token:

Browser POST /api/chat {"prompt": "..."}
        |
        v
FastAPI receives request, calls LLM API with stream=True
        |
        v
LLM generates token "Hello" → SSE chunk: data: {"text":"Hello"}\n\n
        |
        v
FastAPI yields chunk through StreamingResponse
        |
        v
Reverse proxy (nginx) passes through (X-Accel-Buffering: no)
        |
        v
Browser fetch() → ReadableStream → reader.read() → decode → parse
        |
        v
appendToken("Hello") → requestAnimationFrame → DOM update → user sees "Hello"
```
When to use what:
- EventSource — Quick prototypes, GET-only endpoints, situations where you don't need auth headers
- fetch + ReadableStream — Production apps, POST requests, custom auth, full control
- StreamingResponse — Simple relay with manual SSE formatting, zero dependencies
- sse-starlette — Production backend SSE with auto keep-alive, named events, client disconnect handling
And when does streaming not matter? Batch processing. If you're classifying 10,000 documents overnight, nobody's watching. Streaming adds per-token overhead that actually slows total throughput. For batch workloads, disable streaming and optimize for throughput. Streaming and batching are complementary tools: one optimizes latency, the other throughput.
Similarly, if you're building a caching layer, you need the complete response before writing to cache. The pattern is "stream to the user, buffer in parallel, cache after completion." And if you're evaluating LLM systems, streaming adds noise to timing measurements — measure TTFT and total time separately for meaningful benchmarks.
Streaming is the connective tissue between "I can call an LLM" and "I can build something that feels good to use." The protocol is simple. The edge cases are where the real engineering lives. Now go make something stream.
References & Further Reading
- MDN — Using Server-Sent Events — comprehensive browser API reference for SSE
- WHATWG HTML Spec — Server-Sent Events — the official specification
- OpenAI — Chat Completions Streaming — streaming chunk format and options
- Anthropic — Streaming Messages — event types and delta format
- sse-starlette — production-ready SSE for FastAPI/Starlette
- Artificial Analysis — LLM Speed Leaderboard — live TTFT and output speed benchmarks
- Simon Willison — How streaming LLM APIs work — clear comparison of OpenAI vs Anthropic streaming
- MDN — Using Readable Streams — the fetch + ReadableStream pattern