If you’ve ever used ChatGPT or Claude, you’ve seen it: text that appears word by word, character by character, creating that satisfying “typewriter” effect. But when you try to implement this in your own application, you quickly discover it’s not as simple as calling an API and rendering the response.
Streaming LLM responses is one of the most common challenges developers face when building AI-powered applications in 2024-2025. The gap between a sluggish app that makes users wait 10+ seconds for a complete response and a responsive interface that starts showing content immediately often decides whether users adopt your product or abandon it.
In this comprehensive guide, we’ll dive deep into the mechanics of streaming LLM responses, from the underlying protocols to production-ready implementations. Whether you’re building with OpenAI, Anthropic, Ollama, or any other LLM provider, the patterns you’ll learn here are universally applicable.
Why Streaming Matters: The Psychology of Perceived Performance
Before we dive into the technical implementation, let’s understand why streaming is so critical.
The Cold, Hard Numbers
When a user sends a prompt to an LLM, the time to first token (TTFT) and total generation time can vary dramatically:
| Model | Avg. TTFT | Total Response Time (500 tokens) |
|---|---|---|
| GPT-4 Turbo | 0.5-1.5s | 8-15s |
| Claude 3 Opus | 0.8-2.0s | 10-20s |
| GPT-3.5 Turbo | 0.2-0.5s | 3-6s |
| Llama 3 70B (local) | 0.1-0.3s | 15-45s |
Without streaming, your users stare at a loading spinner for the entire response time. With streaming, they start seeing content within the TTFT window—a 10-20x improvement in perceived responsiveness.
Cognitive Load and User Trust
Research in UX psychology suggests that:
- Users perceive streaming interfaces as 40% faster than buffered responses, even when total time is identical
- Progressive content display reduces perceived wait time and user anxiety
- The “typewriter effect” creates a sense of the AI “thinking,” which paradoxically increases trust
Understanding the Streaming Pipeline
Let’s break down what happens when you stream an LLM response:
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   LLM API   │───▶│   Backend   │───▶│  Transport  │───▶│  Frontend   │
│  (OpenAI)   │    │   Server    │    │  Protocol   │    │   (React)   │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
    Tokens             Chunks             SSE/WS           DOM Updates
Each stage has its own considerations:
- LLM API Layer: The provider sends tokens as they’re generated
- Backend Server: Transforms API responses into a format suitable for your transport
- Transport Protocol: How you send data to the browser (SSE, WebSockets, or HTTP streaming)
- Frontend Rendering: Efficiently updating the DOM without causing jank
Part 1: Server-Sent Events (SSE) — The Industry Standard
Server-Sent Events (SSE) is the de facto standard for LLM streaming. It’s what OpenAI, Anthropic, and most LLM APIs use natively. Let’s understand why and how to implement it.
What is SSE?
SSE is a web standard that allows servers to push data to web clients over HTTP. Unlike WebSockets, it’s:
- Unidirectional: Server → Client only
- HTTP-based: Works through proxies, CDNs, and load balancers
- Auto-reconnecting: Built-in reconnection with Last-Event-ID
- Text-based: Each message is a text event (a minimal native client is sketched below)
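Here's a quick look at consuming an SSE endpoint with the browser's built-in EventSource, which gives you the auto-reconnect behavior above for free. Note that EventSource only supports GET requests, which is why the React hook later in this guide reads the stream via fetch instead; the query-string endpoint shown here is purely illustrative.
// Native SSE client — the browser reconnects automatically and resends Last-Event-ID
const source = new EventSource('/api/chat/stream?prompt=Hello');
const output = document.querySelector('#output');
source.onmessage = (event) => {
  if (event.data === '[DONE]') {
    source.close(); // stop listening once the server signals completion
    return;
  }
  const { content } = JSON.parse(event.data);
  if (output) output.textContent += content;
};
source.onerror = () => {
  // EventSource retries on its own; call source.close() here to give up instead
  console.warn('SSE connection error, the browser will attempt to reconnect');
};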
The SSE Protocol Format
An SSE stream follows a specific text format:
event: message
data: {"content": "Hello"}

event: message
data: {"content": " world"}

event: done
data: [DONE]
Key rules:
- Each field is on its own line: field: value
- Messages are separated by a double newline (\n\n)
- The data: field contains your payload (usually JSON)
- Optional event:, id:, and retry: fields (a small serialization helper for this format is sketched below)
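Because these rules are easy to get subtly wrong (a missing blank line silently merges two events), it can help to centralize serialization in one small helper. A minimal sketch that mirrors the format above:
// Serializes a single SSE event; the trailing blank line terminates the message
function formatSSE(data: unknown, event?: string, id?: string): string {
  let frame = '';
  if (event) frame += `event: ${event}\n`;
  if (id) frame += `id: ${id}\n`;
  frame += `data: ${JSON.stringify(data)}\n\n`;
  return frame;
}
// Usage on the server: res.write(formatSSE({ content: 'Hello' }, 'message'));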
Node.js Backend Implementation
Let’s build a production-ready SSE endpoint that proxies OpenAI’s streaming API:
// server.js - Express + OpenAI Streaming
import express from 'express';
import OpenAI from 'openai';
import cors from 'cors';
const app = express();
app.use(cors());
app.use(express.json());
const openai = new OpenAI({
apiKey: process.env.OPENAI_API_KEY,
});
app.post('/api/chat/stream', async (req, res) => {
const { messages, model = 'gpt-4-turbo-preview' } = req.body;
// Set SSE headers
res.setHeader('Content-Type', 'text/event-stream');
res.setHeader('Cache-Control', 'no-cache');
res.setHeader('Connection', 'keep-alive');
res.setHeader('X-Accel-Buffering', 'no'); // Disable nginx buffering
// Flush headers immediately
res.flushHeaders();
try {
const stream = await openai.chat.completions.create({
model,
messages,
stream: true,
stream_options: { include_usage: true }, // Get token usage in stream
});
let totalTokens = 0;
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content;
const finishReason = chunk.choices[0]?.finish_reason;
// Track usage if available (last chunk contains usage)
if (chunk.usage) {
totalTokens = chunk.usage.total_tokens;
}
if (content) {
// Send content chunk
res.write(`data: ${JSON.stringify({
type: 'content',
content
})}\n\n`);
}
if (finishReason) {
// Send completion signal with metadata
res.write(`data: ${JSON.stringify({
type: 'done',
finishReason,
usage: { totalTokens }
})}\n\n`);
}
}
} catch (error) {
console.error('Stream error:', error);
// Send error event
res.write(`data: ${JSON.stringify({
type: 'error',
message: error.message
})}\n\n`);
} finally {
res.end();
}
});
app.listen(3001, () => {
console.log('Server running on http://localhost:3001');
});
Critical Implementation Details
Several subtle issues can break your streaming implementation:
1. Proxy and Load Balancer Buffering
Nginx, Cloudflare, and many reverse proxies buffer responses by default. This destroys streaming. Add these headers:
res.setHeader('X-Accel-Buffering', 'no'); // Nginx
res.setHeader('Cache-Control', 'no-cache, no-transform'); // CDNs
For Cloudflare specifically, you may need to disable “Auto Minify” and enable “Chunked Transfer Encoding” in your dashboard.
2. Connection Timeouts
Long-running streams can hit timeout limits. Implement heartbeats:
// Send heartbeat every 15 seconds to keep connection alive
const heartbeat = setInterval(() => {
res.write(': heartbeat\n\n'); // SSE comment, ignored by clients
}, 15000);
// Clean up on close
req.on('close', () => {
clearInterval(heartbeat);
});
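One related piece of cleanup: when the client disconnects, you probably also want to abort the upstream OpenAI request so tokens stop being generated (and billed) for a response nobody will see. A sketch, assuming the per-request signal option in the openai Node SDK v4:
// Tie the upstream request's lifetime to the client connection
const upstreamAbort = new AbortController();
const stream = await openai.chat.completions.create(
  { model, messages, stream: true },
  { signal: upstreamAbort.signal } // request option supported by openai-node v4
);
req.on('close', () => {
  clearInterval(heartbeat);
  upstreamAbort.abort(); // halts upstream generation as soon as possible
});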
3. Backpressure Handling
If the client can’t consume data fast enough, you need to handle backpressure:
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content;
if (content) {
const data = `data: ${JSON.stringify({ type: 'content', content })}\n\n`;
// Check if the write buffer is full
const canContinue = res.write(data);
if (!canContinue) {
// Wait for drain event before continuing
await new Promise(resolve => res.once('drain', resolve));
}
}
}
Part 2: Frontend Implementation with React
Now let’s build a React frontend that consumes our SSE stream with proper error handling, cancellation, and a smooth UI.
The Core Streaming Hook
// hooks/useStreamingChat.ts
import { useState, useCallback, useRef } from 'react';
interface Message {
role: 'user' | 'assistant';
content: string;
}
interface StreamState {
isStreaming: boolean;
error: string | null;
usage: { totalTokens: number } | null;
}
export function useStreamingChat() {
const [messages, setMessages] = useState<Message[]>([]);
const [streamState, setStreamState] = useState<StreamState>({
isStreaming: false,
error: null,
usage: null,
});
const abortControllerRef = useRef<AbortController | null>(null);
const sendMessage = useCallback(async (userMessage: string) => {
// Cancel any existing stream
abortControllerRef.current?.abort();
abortControllerRef.current = new AbortController();
const newMessages: Message[] = [
...messages,
{ role: 'user', content: userMessage },
];
setMessages([...newMessages, { role: 'assistant', content: '' }]);
setStreamState({ isStreaming: true, error: null, usage: null });
try {
const response = await fetch('/api/chat/stream', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ messages: newMessages }),
signal: abortControllerRef.current.signal,
});
if (!response.ok) {
throw new Error(`HTTP ${response.status}: ${response.statusText}`);
}
const reader = response.body?.getReader();
if (!reader) throw new Error('No response body');
const decoder = new TextDecoder();
let buffer = '';
let assistantContent = '';
while (true) {
const { done, value } = await reader.read();
if (done) break;
buffer += decoder.decode(value, { stream: true });
// Process complete SSE messages
const lines = buffer.split('\n\n');
buffer = lines.pop() || ''; // Keep incomplete message in buffer
for (const line of lines) {
if (!line.startsWith('data: ')) continue;
const jsonStr = line.slice(6); // Remove 'data: ' prefix
if (jsonStr === '[DONE]') continue;
try {
const data = JSON.parse(jsonStr);
if (data.type === 'content') {
assistantContent += data.content;
setMessages(prev => {
const updated = [...prev];
updated[updated.length - 1] = {
role: 'assistant',
content: assistantContent,
};
return updated;
});
} else if (data.type === 'done') {
setStreamState(prev => ({
...prev,
usage: data.usage,
}));
} else if (data.type === 'error') {
throw new Error(data.message);
}
} catch (parseError) {
console.warn('Failed to parse SSE message:', line);
}
}
}
} catch (error) {
if ((error as Error).name === 'AbortError') {
// User cancelled, not an error
return;
}
setStreamState(prev => ({
...prev,
error: (error as Error).message,
}));
// Remove empty assistant message on error
setMessages(prev => prev.slice(0, -1));
} finally {
setStreamState(prev => ({ ...prev, isStreaming: false }));
}
}, [messages]);
const cancelStream = useCallback(() => {
abortControllerRef.current?.abort();
setStreamState(prev => ({ ...prev, isStreaming: false }));
}, []);
const clearMessages = useCallback(() => {
setMessages([]);
setStreamState({ isStreaming: false, error: null, usage: null });
}, []);
return {
messages,
streamState,
sendMessage,
cancelStream,
clearMessages,
};
}
Rendering Streaming Content Efficiently
When content updates 10-50 times per second during streaming, naive React rendering can cause performance issues. Here’s how to optimize:
// components/MessageContent.tsx
import { memo, useMemo } from 'react';
import { marked } from 'marked';
import DOMPurify from 'dompurify';
interface MessageContentProps {
content: string;
isStreaming: boolean;
}
// Memoize to prevent re-renders from parent updates
export const MessageContent = memo(function MessageContent({
content,
isStreaming,
}: MessageContentProps) {
// Only parse markdown when not streaming (expensive operation)
const renderedContent = useMemo(() => {
if (isStreaming) {
// During streaming, just show raw text with basic formatting
return content.split('\n').map((line, i) => (
<span key={i}>
{line}
{i < content.split('\n').length - 1 && <br />}
</span>
));
}
// After streaming complete, render full markdown
const html = marked(content, { breaks: true, gfm: true });
const sanitized = DOMPurify.sanitize(html);
return <div dangerouslySetInnerHTML={{ __html: sanitized }} />;
}, [content, isStreaming]);
return (
<div className="message-content">
{renderedContent}
{isStreaming && <span className="cursor-blink">▊</span>}
</div>
);
});
The Complete Chat Component
// components/StreamingChat.tsx
import { useState, useRef, useEffect, FormEvent } from 'react';
import { useStreamingChat } from '../hooks/useStreamingChat';
import { MessageContent } from './MessageContent';
export function StreamingChat() {
const [input, setInput] = useState('');
const messagesEndRef = useRef<HTMLDivElement>(null);
const inputRef = useRef<HTMLTextAreaElement>(null);
const {
messages,
streamState,
sendMessage,
cancelStream,
clearMessages,
} = useStreamingChat();
// Auto-scroll to bottom on new messages
useEffect(() => {
messagesEndRef.current?.scrollIntoView({ behavior: 'smooth' });
}, [messages]);
// Auto-resize textarea
useEffect(() => {
if (inputRef.current) {
inputRef.current.style.height = 'auto';
inputRef.current.style.height = `${inputRef.current.scrollHeight}px`;
}
}, [input]);
const handleSubmit = async (e: FormEvent) => {
e.preventDefault();
if (!input.trim() || streamState.isStreaming) return;
const message = input.trim();
setInput('');
await sendMessage(message);
};
const handleKeyDown = (e: React.KeyboardEvent) => {
if (e.key === 'Enter' && !e.shiftKey) {
e.preventDefault();
handleSubmit(e);
}
};
return (
<div className="chat-container">
{/* Messages Area */}
<div className="messages-area">
{messages.length === 0 ? (
<div className="empty-state">
<h2>Start a conversation</h2>
<p>Send a message to begin chatting with AI</p>
</div>
) : (
messages.map((message, index) => (
<div
key={index}
className={`message message-${message.role}`}
>
<div className="message-avatar">
{message.role === 'user' ? '👤' : '🤖'}
</div>
<MessageContent
content={message.content}
isStreaming={
streamState.isStreaming &&
index === messages.length - 1 &&
message.role === 'assistant'
}
/>
</div>
))
)}
<div ref={messagesEndRef} />
</div>
{/* Error Display */}
{streamState.error && (
<div className="error-banner">
<span>⚠️ {streamState.error}</span>
<button onClick={clearMessages}>Dismiss</button>
</div>
)}
{/* Token Usage */}
{streamState.usage && (
<div className="usage-info">
Tokens used: {streamState.usage.totalTokens}
</div>
)}
{/* Input Area */}
<form onSubmit={handleSubmit} className="input-area">
<textarea
ref={inputRef}
value={input}
onChange={(e) => setInput(e.target.value)}
onKeyDown={handleKeyDown}
placeholder="Type a message... (Shift+Enter for new line)"
disabled={streamState.isStreaming}
rows={1}
/>
{streamState.isStreaming ? (
<button type="button" onClick={cancelStream} className="cancel-btn">
⏹ Stop
</button>
) : (
<button type="submit" disabled={!input.trim()}>
Send →
</button>
)}
</form>
</div>
);
}
Part 3: Alternative Approaches — When SSE Isn’t Enough
While SSE covers 90% of use cases, some scenarios require different approaches.
WebSockets for Bidirectional Communication
Use WebSockets when you need:
- Real-time interrupts: Allow users to stop generation mid-stream
- Multiplexing: Multiple concurrent streams over one connection
- Bidirectional control: Server-initiated status updates
// WebSocket streaming implementation
class StreamingWebSocket {
private ws: WebSocket;
private messageHandlers = new Map<string, (data: any) => void>();
constructor(url: string) {
this.ws = new WebSocket(url);
this.ws.onmessage = (event) => {
const data = JSON.parse(event.data);
const handler = this.messageHandlers.get(data.streamId);
if (handler) handler(data);
};
}
async streamChat(
messages: Message[],
onChunk: (content: string) => void
): Promise<{ streamId: string; cancel: () => void }> {
const streamId = crypto.randomUUID();
this.messageHandlers.set(streamId, (data) => {
if (data.type === 'content') {
onChunk(data.content);
}
});
this.ws.send(JSON.stringify({
type: 'start_stream',
streamId,
messages,
}));
return {
streamId,
cancel: () => {
this.ws.send(JSON.stringify({ type: 'cancel', streamId }));
this.messageHandlers.delete(streamId);
},
};
}
}
HTTP/2 Multiplexing
If your infrastructure supports HTTP/2, the practical advantage is multiplexing: many concurrent SSE streams can share a single TCP connection without HTTP-level head-of-line blocking, so one slow stream doesn't stall the others. (HTTP/2 Server Push has been deprecated and removed from major browsers, so don't build a streaming strategy on it.)
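If you want to experiment, Node's built-in http2 module can serve the same SSE payload over HTTP/2. A minimal sketch, assuming local TLS certificates (key.pem / cert.pem) since browsers require HTTPS for HTTP/2; the interval stands in for the real LLM token loop:
// Minimal SSE-over-HTTP/2 server using Node's http2 module
import http2 from 'node:http2';
import fs from 'node:fs';
const server = http2.createSecureServer({
  key: fs.readFileSync('key.pem'),
  cert: fs.readFileSync('cert.pem'),
});
server.on('stream', (stream, headers) => {
  if (headers[':path'] !== '/api/chat/stream') {
    stream.respond({ ':status': 404 });
    stream.end();
    return;
  }
  stream.respond({
    ':status': 200,
    'content-type': 'text/event-stream',
    'cache-control': 'no-cache, no-transform',
  });
  // Placeholder: replace this interval with your LLM token loop
  let count = 0;
  const timer = setInterval(() => {
    stream.write(`data: ${JSON.stringify({ content: `chunk ${count++}` })}\n\n`);
    if (count === 5) {
      clearInterval(timer);
      stream.end('data: [DONE]\n\n');
    }
  }, 250);
  stream.on('close', () => clearInterval(timer));
});
server.listen(3002);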
Using the Vercel AI SDK
For production applications, consider the Vercel AI SDK, which abstracts away much of this complexity:
// Using Vercel AI SDK
import { useChat } from 'ai/react';
export function Chat() {
const { messages, input, handleInputChange, handleSubmit, isLoading } = useChat({
api: '/api/chat',
});
return (
<div>
{messages.map(m => (
<div key={m.id}>{m.role}: {m.content}</div>
))}
<form onSubmit={handleSubmit}>
<input value={input} onChange={handleInputChange} />
<button type="submit" disabled={isLoading}>Send</button>
</form>
</div>
);
}
Part 4: Production Considerations
Error Recovery and Retry Logic
Implement exponential backoff for transient failures:
// streamFromAPI is assumed to be an async generator that yields content tokens.
// Note: a retry restarts the stream from the beginning, so callers should
// discard any partial output they received before the failure.
async function* streamWithRetry(
messages: Message[],
maxRetries = 3
): AsyncGenerator<string> {
let attempt = 0;
while (attempt < maxRetries) {
try {
yield* streamFromAPI(messages);
return; // Success, exit
} catch (error) {
attempt++;
if (attempt >= maxRetries) throw error;
// Exponential backoff: 1s, 2s, 4s
const delay = Math.pow(2, attempt - 1) * 1000;
await new Promise(resolve => setTimeout(resolve, delay));
}
}
}
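Consuming it looks like any other async generator; appendToUI here is a hypothetical render callback standing in for whatever your UI layer does with each token:
// Usage: append retried tokens to the UI as they arrive
for await (const token of streamWithRetry(messages)) {
  appendToUI(token); // hypothetical — e.g. a setState call in your framework
}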
Rate Limiting and Queue Management
When multiple users are streaming simultaneously:
// Simple token bucket rate limiter
class RateLimiter {
private tokens: number;
private lastRefill: number;
constructor(
private maxTokens: number,
private refillRate: number // tokens per second
) {
this.tokens = maxTokens;
this.lastRefill = Date.now();
}
async acquire(): Promise<boolean> {
this.refill();
if (this.tokens >= 1) {
this.tokens -= 1;
return true;
}
return false;
}
private refill() {
const now = Date.now();
const elapsed = (now - this.lastRefill) / 1000;
this.tokens = Math.min(
this.maxTokens,
this.tokens + elapsed * this.refillRate
);
this.lastRefill = now;
}
}
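Here's one way the limiter could be wired into the streaming endpoint from Part 1; the shared instance and the 429 response are assumptions, not part of the class above:
// Hypothetical wiring: a single limiter shared across requests
const limiter = new RateLimiter(10, 0.5); // bursts of 10, refills one slot every 2s
app.post('/api/chat/stream', async (req, res) => {
  if (!(await limiter.acquire())) {
    // Over the limit — tell the client to back off before opening a stream
    res.status(429).json({ error: 'Rate limit exceeded, try again shortly' });
    return;
  }
  // ...proceed with the SSE headers and OpenAI stream from Part 1
});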
Monitoring and Observability
Track these key metrics:
- Time to First Token (TTFT): When users first see content
- Tokens per Second (TPS): Generation speed
- Stream Completion Rate: Percentage of streams that complete without error
- Connection Duration: How long streams stay open
// Prometheus-style metrics
import { Counter, Histogram } from 'prom-client';
const streamDuration = new Histogram({
name: 'llm_stream_duration_seconds',
help: 'Duration of LLM streaming requests',
buckets: [0.5, 1, 2, 5, 10, 30, 60],
});
const tokensGenerated = new Counter({
name: 'llm_tokens_generated_total',
help: 'Total tokens generated',
});
const streamErrors = new Counter({
name: 'llm_stream_errors_total',
help: 'Total streaming errors',
labelNames: ['error_type'],
});
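And here's a sketch of wiring these metrics into the streaming handler from Part 1. The TTFT histogram is an addition (it isn't defined above), and counting one token per streamed chunk is an approximation:
// A possible extra metric for time to first token
const ttftSeconds = new Histogram({
  name: 'llm_ttft_seconds',
  help: 'Time to first token',
  buckets: [0.1, 0.25, 0.5, 1, 2, 5],
});
// Inside the SSE handler:
const endTimer = streamDuration.startTimer(); // prom-client helper, observes when called
const startedAt = Date.now();
let sawFirstToken = false;
try {
  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content;
    if (content) {
      if (!sawFirstToken) {
        sawFirstToken = true;
        ttftSeconds.observe((Date.now() - startedAt) / 1000);
      }
      tokensGenerated.inc(); // approximate: one increment per streamed chunk
      // ...write the SSE chunk as in Part 1
    }
  }
} catch (error) {
  streamErrors.inc({ error_type: (error as Error).name });
  throw error;
} finally {
  endTimer(); // records total duration into llm_stream_duration_seconds
}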
Cost Optimization
Streaming doesn’t reduce API costs, but you can optimize:
- Token estimation: Predict output length and warn users before expensive operations
- Caching: Cache common responses (with appropriate invalidation)
- Model selection: Use cheaper models for simple tasks, expensive models for complex ones
// Intelligent model selection
function selectModel(prompt: string): string {
const estimatedComplexity = analyzePromptComplexity(prompt);
if (estimatedComplexity < 0.3) {
return 'gpt-3.5-turbo'; // $0.0005/1K tokens
} else if (estimatedComplexity < 0.7) {
return 'gpt-4-turbo-preview'; // $0.01/1K tokens
} else {
return 'gpt-4'; // $0.03/1K tokens
}
}
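The selectModel function assumes an analyzePromptComplexity helper. Here is one illustrative heuristic based on length and a few keyword signals; it's a placeholder for whatever scoring makes sense in your domain, not a definitive method:
// Illustrative only: returns a rough 0–1 complexity score for a prompt
function analyzePromptComplexity(prompt: string): number {
  let score = 0;
  // Longer prompts tend to need more capable models
  score += Math.min(prompt.length / 2000, 0.4);
  // Keywords that hint at reasoning- or code-heavy work
  if (/\b(analyze|prove|refactor|debug|architecture|trade-?offs?)\b/i.test(prompt)) {
    score += 0.3;
  }
  // Code fences suggest code generation or review
  if (prompt.includes('```')) score += 0.2;
  // Multi-step instructions
  if (/\b(step|first|then|finally)\b/i.test(prompt)) score += 0.1;
  return Math.min(score, 1);
}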
Part 5: Handling Edge Cases
Very Long Responses
For responses that might exceed browser memory:
// Virtual scrolling for very long responses
import { FixedSizeList as List } from 'react-window';
function VirtualizedMessage({ content }: { content: string }) {
const lines = content.split('\n');
return (
<List
height={400}
itemCount={lines.length}
itemSize={24}
width="100%"
>
{({ index, style }) => (
<div style={style}>{lines[index]}</div>
)}
</List>
);
}
Code Blocks in Streaming Content
Detect incomplete code blocks to avoid syntax highlighting flicker:
function isCompleteCodeBlock(content: string): boolean {
// Count triple-backtick fences; an even count means every block is closed
const fenceCount = (content.match(/```/g) || []).length;
return fenceCount % 2 === 0;
}
function MessageContent({ content, isStreaming }: Props) {
const shouldHighlight = !isStreaming || isCompleteCodeBlock(content);
// Only apply syntax highlighting when appropriate
const processed = shouldHighlight
? highlightCode(content)
: escapeHtml(content);
return <div dangerouslySetInnerHTML={{ __html: processed }} />;
}
Mobile and Low-Bandwidth Considerations
// Adaptive chunking based on connection speed
function getChunkingStrategy(): 'immediate' | 'batched' {
if ('connection' in navigator) {
const connection = (navigator as any).connection;
if (connection.effectiveType === '2g' ||
connection.effectiveType === 'slow-2g') {
return 'batched'; // Reduce render frequency
}
}
return 'immediate';
}
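A hedged sketch of what the 'batched' strategy could look like on the consuming side: buffer incoming chunks and flush them to state on a timer instead of on every chunk. The 100 ms interval is an arbitrary starting point, and the setMessages reference in the usage comment points back to the hook from Part 2:
// Buffers streamed chunks and flushes them on a timer to cut render frequency
function createChunkBatcher(
  onFlush: (batchedContent: string) => void,
  intervalMs = 100
) {
  let pending = '';
  const timer = setInterval(() => {
    if (pending) {
      onFlush(pending);
      pending = '';
    }
  }, intervalMs);
  return {
    push(chunk: string) {
      pending += chunk;
    },
    stop() {
      clearInterval(timer);
      if (pending) onFlush(pending); // flush whatever is left on shutdown
    },
  };
}
// Usage inside the streaming loop:
// const batcher = createChunkBatcher(text => { /* append text via setMessages */ });
// if (getChunkingStrategy() === 'batched') batcher.push(data.content); else render immediately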
Conclusion
Streaming LLM responses is no longer optional—it’s expected by users who’ve been trained by ChatGPT and Claude. The good news is that with the patterns in this guide, you can build streaming experiences that match or exceed what the major providers offer.
Key takeaways:
- SSE is your default choice: Simple, HTTP-based, works through proxies
- Handle backpressure and timeouts: Production environments are harsh
- Optimize rendering: Markdown parsing and DOM updates are expensive during streaming
- Monitor everything: TTFT, TPS, and error rates are your key metrics
- Plan for edge cases: Long responses, code blocks, and mobile users need special handling
The code examples in this guide are production-tested patterns. Take them, adapt them to your stack, and build AI experiences that feel magical.
Happy streaming! 🚀
💡 Note: This article was originally published on the Pockit Blog.
