
Streaming LLM Responses: Server-Sent Events, Chunked Transfer, and the UX of Waiting

The 8-Second Problem

Your chatbot takes 8 seconds to respond. Without streaming, your user stares at a blank screen, wonders if it crashed, and refreshes. You've lost them.

With streaming, the first token appears in 400 milliseconds. The same 8-second response now feels instant because the user is reading along as the model writes. Total time is identical. Perceived latency drops by over 90%.

This isn't a minor polish item. Streaming is the single biggest UX improvement you can make to any LLM application. Every major chat interface — ChatGPT, Claude, Gemini — streams tokens for exactly this reason. And the technology behind it? A humble web standard from the mid-2000s called Server-Sent Events that found its killer app nearly two decades later.

In this post, we'll build the complete streaming pipeline: from the raw bytes on the wire, through Python API clients and FastAPI relay servers, all the way to smooth browser rendering with a blinking cursor. We'll cover the edge cases that separate tutorials from production, compare latency across models, and build an interactive demo where you can watch streaming in action.

How Streaming Actually Works: The Wire Protocol

Before touching any SDK, let's see what streaming looks like at the HTTP level. When you set stream: true on an LLM API call, the server doesn't return a single JSON response. Instead, it opens a long-lived HTTP connection and sends data incrementally using Server-Sent Events (SSE).

SSE is dead simple. The server sets Content-Type: text/event-stream, and then sends a series of text messages. Each message has a data: field followed by a payload, and messages are separated by a blank line (\n\n):

# HTTP response headers
HTTP/1.1 200 OK
Content-Type: text/event-stream
Cache-Control: no-cache
Transfer-Encoding: chunked

# First SSE message — a token arrives
data: {"choices":[{"delta":{"content":"Hello"}}]}

# Second SSE message — another token
data: {"choices":[{"delta":{"content":" world"}}]}

# Stream terminates
data: [DONE]

That's it. Each data: line carries one chunk (usually one token), and the double newline tells the client "this message is complete, process it." The SSE spec also defines optional fields: event: names a message type so clients can dispatch on it, id: lets a reconnecting client resume via the Last-Event-ID header, and retry: tells the client how many milliseconds to wait before reconnecting. Lines starting with a colon are comments, often used as keep-alive pings.
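To make the format concrete, here's a minimal Python parser for a single SSE message. It covers the data, event, id, and retry fields plus comment lines; it's an illustrative sketch, not a spec-complete client (no BOM handling, no CRLF normalization, no Last-Event-ID state):

```python
def parse_sse_message(message: str) -> dict:
    """Parse one SSE message (the text between blank-line separators)."""
    fields = {"event": "message", "data": []}  # "message" is the default event type
    for line in message.split("\n"):
        if not line or line.startswith(":"):
            continue  # empty line or comment (often a keep-alive ping)
        name, _, value = line.partition(":")
        value = value.removeprefix(" ")  # spec: strip at most one leading space
        if name == "data":
            fields["data"].append(value)  # multiple data lines join with \n
        elif name in ("event", "id", "retry"):
            fields[name] = value
    fields["data"] = "\n".join(fields["data"])
    return fields

msg = parse_sse_message('event: content_block_delta\ndata: {"text": "Hi"}')
print(msg["event"], msg["data"])  # content_block_delta {"text": "Hi"}
```

This is essentially the dispatch logic EventSource runs in the browser: accumulate fields until a blank line, then emit one event.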

Why SSE instead of WebSockets? Because LLM streaming is one-directional. The server streams tokens to the client — the client doesn't need to send anything back mid-stream. SSE is simpler, works over standard HTTP, passes through proxies and load balancers without special configuration, and supports automatic reconnection. WebSockets are full-duplex and binary-capable, which is overkill here.

Under the hood, SSE uses HTTP/1.1's Transfer-Encoding: chunked (the server doesn't know the total response size upfront). On HTTP/2, chunked encoding doesn't exist — the protocol's binary framing handles it natively. This matters because HTTP/1.1 browsers limit you to 6 concurrent SSE connections per domain, while HTTP/2 allows 100+ concurrent streams on a single TCP connection.

Consuming Streams from LLM APIs

Let's move from theory to code. Both OpenAI and Anthropic support streaming, but their event formats differ in instructive ways.

OpenAI: delta objects

OpenAI streams ChatCompletionChunk objects. The key difference from a normal response: instead of message.content, you get delta.content — just the new token, not the full text so far. You concatenate them yourself:

from openai import OpenAI

client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain streaming in 3 sentences."}],
    stream=True,
    stream_options={"include_usage": True},  # get token counts
)

full_response = ""
for chunk in stream:
    # The final chunk has empty choices but carries usage data
    if chunk.choices:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
            full_response += delta

    # With include_usage, usage arrives in that final chunk
    if chunk.usage:
        print(f"\n[Tokens: {chunk.usage.total_tokens}]")

On the wire, OpenAI sends only data: lines — no event: field. Each line contains a JSON chat.completion.chunk object. The stream ends with the literal string data: [DONE] (not valid JSON — the SDK handles this for you).

Anthropic: typed event stream

Anthropic takes a more structured approach. Each SSE message has both an event: type and a data: payload, giving you a clear state machine:

import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain streaming in 3 sentences."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

    # Still inside the with block: fetch the complete message with usage
    message = stream.get_final_message()
    print(f"\n[Tokens: {message.usage.input_tokens + message.usage.output_tokens}]")

Under the hood, Anthropic's stream emits a sequence of typed events: message_start (opens the message), content_block_start (opens a text block), a series of content_block_delta events (each carrying a text_delta with a token), content_block_stop, message_delta (carries stop_reason and cumulative usage), and message_stop. There's no [DONE] sentinel — message_stop is the termination signal.
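The SDK walks this state machine for you, but it's easy to see in miniature. Below is a toy reducer fed with hand-written (type, payload) pairs — simulated events, not real SDK objects — that accumulates text and picks up the stop reason:

```python
def reduce_events(events):
    """Fold a sequence of (event_type, payload) pairs into (text, stop_reason)."""
    text, stop_reason = [], None
    for etype, payload in events:
        if etype == "content_block_delta" and payload.get("type") == "text_delta":
            text.append(payload["text"])      # one token per delta
        elif etype == "message_delta":
            stop_reason = payload.get("stop_reason")
        elif etype == "message_stop":
            break                             # termination signal; no [DONE] sentinel
    return "".join(text), stop_reason

# A hand-written event sequence mirroring the order described above
events = [
    ("message_start", {}),
    ("content_block_start", {"type": "text"}),
    ("content_block_delta", {"type": "text_delta", "text": "Hello"}),
    ("content_block_delta", {"type": "text_delta", "text": " world"}),
    ("content_block_stop", {}),
    ("message_delta", {"stop_reason": "end_turn"}),
    ("message_stop", {}),
]
print(reduce_events(events))  # ('Hello world', 'end_turn')
```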

The raw view: what the SDK hides

SDKs are convenient, but they hide the SSE parsing. Here's what it looks like to consume a streaming response with raw HTTP, using httpx:

import httpx
import json

def stream_raw(prompt: str, api_key: str):
    """Consume OpenAI streaming response with raw HTTP."""
    with httpx.stream(
        "POST",
        "https://api.openai.com/v1/chat/completions",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        json={
            "model": "gpt-4o",
            "messages": [{"role": "user", "content": prompt}],
            "stream": True,
        },
    ) as response:
        buffer = ""
        # iter_text() decodes incrementally, so a multi-byte UTF-8
        # character split across two network chunks can't corrupt the text
        for text in response.iter_text():
            buffer += text
            # SSE messages are separated by double newlines
            while "\n\n" in buffer:
                message, buffer = buffer.split("\n\n", 1)
                for line in message.split("\n"):
                    if line.startswith("data: "):
                        payload = line[6:]  # strip "data: " prefix
                        if payload == "[DONE]":
                            return
                        chunk = json.loads(payload)
                        delta = chunk["choices"][0]["delta"]
                        if "content" in delta:
                            print(delta["content"], end="", flush=True)

The key insight: network chunks don't align with SSE messages. A single iter_bytes() read might contain half a message, or three messages. You accumulate text in a buffer and split on \n\n to extract complete messages. This is exactly the parsing every SSE library does internally.

The Server Relay: FastAPI as a Streaming Proxy

In most real applications, the browser doesn't talk directly to OpenAI or Anthropic. Your backend sits in between — adding authentication, logging, rate limiting, and content filtering. The challenge: your backend receives a stream from the LLM API and must relay it to the browser without buffering the entire response first.

FastAPI's StreamingResponse makes this straightforward with an async generator:

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI
import json

app = FastAPI()
client = AsyncOpenAI()

async def relay_stream(prompt: str, request: Request):
    """Relay OpenAI stream to browser as SSE."""
    stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    try:
        async for chunk in stream:
            # If the client disconnected, stop burning tokens
            if await request.is_disconnected():
                await stream.close()
                return
            delta = chunk.choices[0].delta.content if chunk.choices else None
            if delta:
                yield f"data: {json.dumps({'text': delta})}\n\n"
        yield "data: [DONE]\n\n"
    except Exception:
        await stream.close()
        raise

@app.post("/api/chat")
async def chat(request: Request):
    body = await request.json()
    return StreamingResponse(
        relay_stream(body["prompt"], request),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"},
    )

A few critical details here:

  1. request.is_disconnected() is checked on every chunk, so a closed browser tab stops the upstream call instead of silently burning tokens.
  2. X-Accel-Buffering: no tells Nginx-style reverse proxies not to buffer the response; without it, your carefully streamed tokens can arrive in one big flush.
  3. media_type="text/event-stream" and Cache-Control: no-cache mark the response as SSE and keep intermediaries from caching it.
  4. The data: ...\n\n framing and the [DONE] sentinel mirror OpenAI's wire format, giving the browser code an unambiguous termination signal.

For production SSE with more features (automatic keep-alive pings, named events, retry headers), check out the sse-starlette library. Its EventSourceResponse handles SSE formatting automatically and adds configurable ping intervals to keep connections alive through aggressive proxy timeouts.

Browser Consumption: From Bytes to UI

The stream reaches the browser. Now we need to parse it and render tokens smoothly. There are two approaches, and the choice matters.

EventSource: the simple path

The browser has a built-in SSE client called EventSource. It handles parsing, automatic reconnection, and event dispatch:

const source = new EventSource("/api/chat?prompt=Hello");

source.onmessage = (event) => {
    if (event.data === "[DONE]") {
        source.close();
        return;
    }
    const { text } = JSON.parse(event.data);
    document.getElementById("output").textContent += text;
};

source.onerror = () => source.close();

Clean and simple — but EventSource has a fatal limitation for LLM apps: it only supports GET requests. You can't POST a JSON body with your conversation history. You're limited to stuffing everything into URL query parameters, which caps out around 2,000 characters. That's one or two messages of context, at best.

fetch + ReadableStream: the production path

The modern approach uses fetch() with a ReadableStream. You get full control over the request (POST, custom headers, JSON body) and parse SSE yourself:

async function streamChat(prompt) {
    const response = await fetch("/api/chat", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ prompt }),
    });

    const reader = response.body.getReader();
    const decoder = new TextDecoder();
    let buffer = "";

    while (true) {
        const { value, done } = await reader.read();
        if (done) break;

        buffer += decoder.decode(value, { stream: true });

        // Split on double newline — the SSE message boundary
        const messages = buffer.split("\n\n");
        buffer = messages.pop();  // last element is incomplete — keep buffering

        for (const msg of messages) {
            const line = msg.trim();
            if (!line.startsWith("data: ")) continue;

            const payload = line.slice(6);
            if (payload === "[DONE]") return;

            const { text } = JSON.parse(payload);
            appendToken(text);  // your rendering function
        }
    }
}

The critical line is messages.pop(). Network chunks don't respect SSE boundaries — a single reader.read() might return half a message. By splitting on \n\n and keeping the last fragment in the buffer, we handle partial messages correctly.

Smooth rendering with requestAnimationFrame

A naive appendToken that updates the DOM on every token causes visible jank, especially at high token rates (100+ tokens/sec). The fix: batch DOM updates to the browser's paint cycle:

let tokenQueue = [];
let animating = false;

function appendToken(text) {
    tokenQueue.push(text);
    if (!animating) {
        animating = true;
        requestAnimationFrame(flushTokens);
    }
}

function flushTokens() {
    if (tokenQueue.length === 0) {
        animating = false;
        return;
    }
    // Flush all queued tokens in one DOM write
    const output = document.getElementById("output");
    output.textContent += tokenQueue.join("");
    tokenQueue = [];
    requestAnimationFrame(flushTokens);
}

This batches all tokens that arrive between paint frames into a single DOM update. At 60fps, that's one update every ~16ms — smooth for the user, efficient for the browser.

The Hard Parts: Streaming Edge Cases

Tutorials make streaming look easy. Production makes it hard. Here are the edge cases that will bite you.

Streaming structured output

When your LLM returns JSON, it arrives character by character: {, then "na, then me":, then "Al... A standard parser rejects incomplete JSON, so the baseline pattern is: accumulate the entire response, then parse after the stream ends.

let jsonBuffer = "";
for await (const token of stream) {
    jsonBuffer += token;
}
// Parse once, after the stream ends
const result = JSON.parse(jsonBuffer);

Strict JSON.parse is all-or-nothing: every prefix of an incomplete object is invalid, so a per-token try/catch would only succeed once the document is already complete. Progressive display (rendering extracted fields as they finish) needs a lenient parser that repairs the partial document before each parse attempt, closing any open string, array, and object. (This ties back to our structured output post: streaming and structured output are inherently in tension.)
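One way to build that lenient parsing is to repair the buffer before a strict parse. Here's a naive Python sketch (illustrative only; production lenient-JSON libraries also handle trailing commas, partial literals like tru, and more exotic escapes):

```python
import json

def parse_partial_json(buffer: str):
    """Best-effort parse of an incomplete JSON document.

    Repair strategy: close an unterminated string, then close open
    brackets/braces in reverse order, then try a strict parse.
    """
    stack, in_string, escaped = [], False, False
    for ch in buffer:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]":
            if stack:
                stack.pop()
    repaired = buffer + ('"' if in_string else "") + "".join(reversed(stack))
    try:
        return json.loads(repaired)
    except json.JSONDecodeError:
        return None  # still unparseable, e.g. the buffer ends mid-key

print(parse_partial_json('{"name": "Al'))    # {'name': 'Al'}
print(parse_partial_json('{"items": [1, 2'))  # {'items': [1, 2]}
```

Call this on each token; whenever it returns a dict, render the fields that are present so far.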

Streaming with tool use

When an agent model decides to call a tool, the stream switches modes. You've been receiving text tokens, then suddenly you get a tool call object. With Anthropic, this is a new content_block_start with type: "tool_use", followed by input_json_delta events that stream the tool arguments as partial JSON. You must buffer the entire tool call, execute the tool, then feed the result back to continue generation. (See our agents post for the full pattern.)

Cancellation propagation

When the user hits "Stop generating," you need to propagate the cancellation through every layer:

  1. Browser: reader.cancel() or AbortController.abort()
  2. Server: detect the closed connection, cancel the upstream API call
  3. LLM API: stop generating tokens (you stop paying immediately)

If any link in this chain breaks, you're burning tokens and money on a response nobody will read. The FastAPI relay code above handles this with request.is_disconnected() — but test it. Many production setups have a proxy layer (Nginx, Cloudflare) that keeps the upstream connection alive even after the client disconnects.
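The server-side link in that chain can be sketched as a toy asyncio simulation (all names here are hypothetical; fake_upstream stands in for the LLM API stream). The point is that the relay checks a disconnect flag on every chunk and breaks out, so the upstream generator is never fully consumed:

```python
import asyncio

async def relay(upstream, send, disconnected: asyncio.Event):
    """Pull tokens from upstream; stop the moment the client is gone."""
    async for token in upstream:
        if disconnected.is_set():
            break  # stop pulling; remaining tokens are never generated
        await send(token)

async def fake_upstream(n: int):
    # Stand-in for the LLM API's token stream
    for i in range(n):
        yield f"tok{i}"
        await asyncio.sleep(0)

async def main():
    received = []
    disconnected = asyncio.Event()

    async def send(token):
        received.append(token)
        if len(received) == 3:
            disconnected.set()  # user hits "Stop generating" after 3 tokens

    await relay(fake_upstream(100), send, disconnected)
    return received

received = asyncio.run(main())
print(received)  # ['tok0', 'tok1', 'tok2']
```

In the real relay, "break" means closing the upstream SSE connection, which is what tells the provider to stop generating (and billing).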

Token counting mid-stream

For real-time cost display, you need to count tokens as they stream. OpenAI provides usage data in the final chunk if you set stream_options.include_usage. Anthropic sends cumulative usage in the message_delta event. For mid-stream estimates before those arrive, use a rough heuristic: one token is about 0.75 English words, i.e. roughly 1.3 tokens per word (or about 4 characters per token). For exact counts, run a tokenizer library like tiktoken on your backend.
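As a back-of-envelope sketch (the ~1.3 tokens-per-word ratio is an English-text rule of thumb, and the price parameter is a placeholder, not any provider's actual rate):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~1.3 tokens per whitespace-separated word.

    Heuristic only. Code, non-English text, and unusual punctuation can
    deviate wildly; use a real tokenizer (e.g. tiktoken) for billing.
    """
    return round(len(text.split()) * 1.3)

def estimate_cost(text: str, usd_per_million_tokens: float) -> float:
    # Price per million tokens is a placeholder; check your provider's rates
    return estimate_tokens(text) / 1_000_000 * usd_per_million_tokens

print(estimate_tokens("Streaming makes waiting shorter"))  # 5
```

Run estimate_tokens on the accumulated buffer each frame for a live counter, then swap in the provider's exact usage numbers when the final chunk arrives.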

Latency Across Models: TTFT and ITL

Two metrics define the streaming experience: TTFT (time to first token), the wait before anything appears on screen, and ITL (inter-token latency), the gap between successive tokens once the stream is flowing. Output speed in tokens/second is just the inverse of ITL.

Here's how popular models compare (data from Artificial Analysis, live benchmarks):

Model                   TTFT      Output Speed    Per-Token
GPT-4o                  0.45s     145 tok/s       ~6.9ms
GPT-4o mini             ~0.35s    ~200 tok/s      ~5ms
Claude 4.5 Sonnet       1.24s     71 tok/s        ~14ms
Claude Sonnet 4.6       0.85s     54 tok/s        ~18.6ms
Gemini 2.0 Flash        ~0.34s    ~300 tok/s      ~3.3ms
Llama 4 Scout (Groq)    ~0.33s    ~2600 tok/s     ~0.4ms

A few things jump out. The small, optimized models (Gemini Flash, Llama on Groq) are absurdly fast — Llama on Groq streams at 2,600 tokens/second, faster than most people can read. The frontier models (GPT-4o, Claude Sonnet) are slower but still well within the "feels smooth" range at 50-150 tokens/second.

The key insight: streaming doesn't make the model faster. The total generation time is the same whether you stream or not. What streaming does is move the user's perceived wait from TTFT + total_generation to just TTFT. A 500-token response from Claude 4.5 Sonnet takes ~8.3 seconds either way — but with streaming, the user starts reading after 1.2 seconds instead of waiting 8.3.
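The arithmetic is worth making explicit. This two-line model plugs in the Claude 4.5 Sonnet row from the table above:

```python
def stream_timings(ttft_s: float, tokens: int, tok_per_s: float):
    """With streaming, the user starts reading at TTFT;
    without it, they wait for the whole generation."""
    total = ttft_s + tokens / tok_per_s
    return {"starts_reading_at": ttft_s, "finishes_at": round(total, 1)}

# 500 tokens at 1.24s TTFT and 71 tok/s
print(stream_timings(1.24, 500, 71))
# {'starts_reading_at': 1.24, 'finishes_at': 8.3}
```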

Try It: Streaming Playground

Watch tokens stream in real time. Compare streaming vs. no streaming to see the UX difference.


Putting It All Together

Here's the complete pipeline, end to end:

# The streaming journey of a single token:

Browser POST /api/chat {"prompt": "..."}
    |
    v
FastAPI receives request, calls LLM API with stream=True
    |
    v
LLM generates token "Hello" → SSE chunk: data: {"text":"Hello"}\n\n
    |
    v
FastAPI yields chunk through StreamingResponse
    |
    v
Reverse proxy (nginx) passes through (X-Accel-Buffering: no)
    |
    v
Browser fetch() → ReadableStream → reader.read() → decode → parse
    |
    v
appendToken("Hello") → requestAnimationFrame → DOM update → user sees "Hello"

When to use what:

  1. EventSource — prototypes and simple GET-only streams, where the prompt fits in a URL.
  2. fetch + ReadableStream — any real chat UI: POST bodies, custom headers, conversation history, abort support.
  3. StreamingResponse with an async generator — a minimal FastAPI relay; reach for sse-starlette when you need keep-alive pings, named events, and retry headers.

And when does streaming not matter? Batch processing. If you're classifying 10,000 documents overnight, nobody's watching. Streaming adds per-token overhead that actually slows total throughput. For batch workloads, disable streaming and optimize for throughput. Streaming and batching are complementary tools: one optimizes latency, the other throughput.

Similarly, if you're building a caching layer, you need the complete response before writing to cache. The pattern is "stream to the user, buffer in parallel, cache after completion." And if you're evaluating LLM systems, streaming adds noise to timing measurements — measure TTFT and total time separately for meaningful benchmarks.

Streaming is the connective tissue between "I can call an LLM" and "I can build something that feels good to use." The protocol is simple. The edge cases are where the real engineering lives. Now go make something stream.
