The Complete Guide to Streaming LLM Responses in Web Applications: From SSE to Real-Time UI

If you’ve ever used ChatGPT or Claude, you’ve seen it: text that appears word by word, character by character, creating that satisfying “typewriter” effect. But when you try to implement this in your own application, you quickly discover it’s not as simple as calling an API and rendering the response.

Streaming LLM responses is one of the most common challenges developers face when building AI-powered applications in 2024-2025. The difference between a sluggish app that makes users wait 10+ seconds for a complete response and a responsive interface that starts showing content immediately can mean the difference between user adoption and abandonment.

In this comprehensive guide, we’ll dive deep into the mechanics of streaming LLM responses, from the underlying protocols to production-ready implementations. Whether you’re building with OpenAI, Anthropic, Ollama, or any other LLM provider, the patterns you’ll learn here are universally applicable.

Why Streaming Matters: The Psychology of Perceived Performance

Before we dive into the technical implementation, let’s understand why streaming is so critical.

The Cold, Hard Numbers

When a user sends a prompt to an LLM, the time to first token (TTFT) and total generation time can vary dramatically:

Model                  Avg. TTFT    Total Response Time (500 tokens)
GPT-4 Turbo            0.5-1.5s     8-15s
Claude 3 Opus          0.8-2.0s     10-20s
GPT-3.5 Turbo          0.2-0.5s     3-6s
Llama 3 70B (local)    0.1-0.3s     15-45s

Without streaming, your users stare at a loading spinner for the entire response time. With streaming, they start seeing content within the TTFT window—a 10-20x improvement in perceived responsiveness.

Cognitive Load and User Trust

Research in UX psychology shows that:

  • Users perceive streaming interfaces as 40% faster than buffered responses, even when total time is identical
  • Progressive content display reduces perceived wait time and user anxiety
  • The “typewriter effect” creates a sense of the AI “thinking,” which paradoxically increases trust

Understanding the Streaming Pipeline

Let’s break down what happens when you stream an LLM response:

┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   LLM API   │───▶│   Backend   │───▶│  Transport  │───▶│   Frontend  │
│  (OpenAI)   │    │   Server    │    │  Protocol   │    │    (React)  │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
     Tokens            Chunks              SSE/WS           DOM Updates

Each stage has its own considerations:

  1. LLM API Layer: The provider sends tokens as they’re generated
  2. Backend Server: Transforms API responses into a format suitable for your transport
  3. Transport Protocol: How you send data to the browser (SSE, WebSockets, or HTTP streaming)
  4. Frontend Rendering: Efficiently updating the DOM without causing jank

Part 1: Server-Sent Events (SSE) — The Industry Standard

Server-Sent Events (SSE) is the de facto standard for LLM streaming. It’s what OpenAI, Anthropic, and most LLM APIs use natively. Let’s understand why and how to implement it.

What is SSE?

SSE is a web standard that allows servers to push data to web clients over HTTP. Unlike WebSockets, it’s:

  • Unidirectional: Server → Client only
  • HTTP-based: Works through proxies, CDNs, and load balancers
  • Auto-reconnecting: Built-in reconnection with Last-Event-ID
  • Text-based: Each message is a text event
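
These properties map directly onto the browser's built-in EventSource API. Here is a minimal client sketch, assuming a GET endpoint at /api/stream and a placeholder appendToUI render helper (EventSource is GET-only; for POST bodies you'll use the fetch-based reader shown in Part 2):

// Browser-side EventSource client (GET-only)
const source = new EventSource('/api/stream');

source.onmessage = (event) => {
  if (event.data === '[DONE]') {
    source.close(); // prevent the automatic reconnect once the stream is finished
    return;
  }
  const { content } = JSON.parse(event.data);
  appendToUI(content); // placeholder: append the token to your UI
};

source.onerror = () => {
  // The browser reconnects automatically and resends Last-Event-ID
  // if the server included id: fields; call source.close() to opt out.
};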

The SSE Protocol Format

An SSE stream follows a specific text format:

event: message
data: {"content": "Hello"}

event: message
data: {"content": " world"}

event: done
data: [DONE]

Key rules:

  • Each field is on its own line: field: value
  • Messages are separated by double newlines (\n\n)
  • The data: field contains your payload (usually JSON)
  • Optional event:, id:, and retry: fields
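
To make these rules concrete, here is an illustrative helper (not from any library) that serializes one message; note that a multi-line payload must be split across multiple data: lines:

// Serialize a single SSE message according to the rules above
function formatSSE(payload: unknown, event?: string, id?: string): string {
  const lines: string[] = [];
  if (event) lines.push(`event: ${event}`);
  if (id) lines.push(`id: ${id}`);
  const data = typeof payload === 'string' ? payload : JSON.stringify(payload);
  for (const line of data.split('\n')) {
    lines.push(`data: ${line}`); // one data: line per payload line
  }
  return lines.join('\n') + '\n\n'; // the blank line terminates the message
}

// formatSSE({ content: 'Hello' }, 'message')
// => 'event: message\ndata: {"content":"Hello"}\n\n'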

Node.js Backend Implementation

Let’s build a production-ready SSE endpoint that proxies OpenAI’s streaming API:

// server.js - Express + OpenAI Streaming
import express from 'express';
import OpenAI from 'openai';
import cors from 'cors';

const app = express();
app.use(cors());
app.use(express.json());

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

app.post('/api/chat/stream', async (req, res) => {
  const { messages, model = 'gpt-4-turbo-preview' } = req.body;

  // Set SSE headers
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');
  res.setHeader('X-Accel-Buffering', 'no'); // Disable nginx buffering

  // Flush headers immediately
  res.flushHeaders();

  try {
    const stream = await openai.chat.completions.create({
      model,
      messages,
      stream: true,
      stream_options: { include_usage: true }, // Get token usage in stream
    });

    let totalTokens = 0;

    for await (const chunk of stream) {
      const content = chunk.choices[0]?.delta?.content;
      const finishReason = chunk.choices[0]?.finish_reason;

      // Track usage if available (last chunk contains usage)
      if (chunk.usage) {
        totalTokens = chunk.usage.total_tokens;
      }

      if (content) {
        // Send content chunk
        res.write(`data: ${JSON.stringify({ 
          type: 'content', 
          content 
        })}\n\n`);
      }

      if (finishReason) {
        // Send completion signal with metadata
        res.write(`data: ${JSON.stringify({ 
          type: 'done', 
          finishReason,
          usage: { totalTokens }
        })}\n\n`);
      }
    }
  } catch (error) {
    console.error('Stream error:', error);

    // Send error event
    res.write(`data: ${JSON.stringify({ 
      type: 'error', 
      message: error.message 
    })}\n\n`);
  } finally {
    res.end();
  }
});

app.listen(3001, () => {
  console.log('Server running on http://localhost:3001');
});

Critical Implementation Details

Several subtle issues can break your streaming implementation:

1. Proxy and Load Balancer Buffering

Nginx, Cloudflare, and many reverse proxies buffer responses by default. This destroys streaming. Add these headers:

res.setHeader('X-Accel-Buffering', 'no');  // Nginx
res.setHeader('Cache-Control', 'no-cache, no-transform'); // CDNs

For Cloudflare specifically, you may need to disable response-rewriting features such as “Auto Minify” and make sure the route bypasses caching, since both can buffer or transform the stream.

2. Connection Timeouts

Long-running streams can hit timeout limits. Implement heartbeats:

// Send heartbeat every 15 seconds to keep connection alive
const heartbeat = setInterval(() => {
  res.write(': heartbeat\n\n'); // SSE comment, ignored by clients
}, 15000);

// Clean up on close
req.on('close', () => {
  clearInterval(heartbeat);
});

3. Backpressure Handling

If the client can’t consume data fast enough, you need to handle backpressure:

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) {
    const data = `data: ${JSON.stringify({ type: 'content', content })}\n\n`;

    // Check if the write buffer is full
    const canContinue = res.write(data);
    if (!canContinue) {
      // Wait for drain event before continuing
      await new Promise(resolve => res.once('drain', resolve));
    }
  }
}

Part 2: Frontend Implementation with React

Now let’s build a React frontend that consumes our SSE stream with proper error handling, cancellation, and a smooth UI.

The Core Streaming Hook

// hooks/useStreamingChat.ts
import { useState, useCallback, useRef } from 'react';

interface Message {
  role: 'user' | 'assistant';
  content: string;
}

interface StreamState {
  isStreaming: boolean;
  error: string | null;
  usage: { totalTokens: number } | null;
}

export function useStreamingChat() {
  const [messages, setMessages] = useState<Message[]>([]);
  const [streamState, setStreamState] = useState<StreamState>({
    isStreaming: false,
    error: null,
    usage: null,
  });

  const abortControllerRef = useRef<AbortController | null>(null);

  const sendMessage = useCallback(async (userMessage: string) => {
    // Cancel any existing stream
    abortControllerRef.current?.abort();
    abortControllerRef.current = new AbortController();

    const newMessages: Message[] = [
      ...messages,
      { role: 'user', content: userMessage },
    ];

    setMessages([...newMessages, { role: 'assistant', content: '' }]);
    setStreamState({ isStreaming: true, error: null, usage: null });

    try {
      const response = await fetch('/api/chat/stream', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ messages: newMessages }),
        signal: abortControllerRef.current.signal,
      });

      if (!response.ok) {
        throw new Error(`HTTP ${response.status}: ${response.statusText}`);
      }

      const reader = response.body?.getReader();
      if (!reader) throw new Error('No response body');

      const decoder = new TextDecoder();
      let buffer = '';
      let assistantContent = '';

      while (true) {
        const { done, value } = await reader.read();
        if (done) break;

        buffer += decoder.decode(value, { stream: true });

        // Process complete SSE messages
        const lines = buffer.split('\n\n');
        buffer = lines.pop() || ''; // Keep incomplete message in buffer

        for (const line of lines) {
          if (!line.startsWith('data: ')) continue;

          const jsonStr = line.slice(6); // Remove 'data: ' prefix
          if (jsonStr === '[DONE]') continue;

          // Parse in its own try/catch so a server-sent error event
          // isn't swallowed as a parse failure
          let data;
          try {
            data = JSON.parse(jsonStr);
          } catch {
            console.warn('Failed to parse SSE message:', line);
            continue;
          }

          if (data.type === 'content') {
            assistantContent += data.content;
            setMessages(prev => {
              const updated = [...prev];
              updated[updated.length - 1] = {
                role: 'assistant',
                content: assistantContent,
              };
              return updated;
            });
          } else if (data.type === 'done') {
            setStreamState(prev => ({
              ...prev,
              usage: data.usage,
            }));
          } else if (data.type === 'error') {
            throw new Error(data.message);
          }
        }
      }
    } catch (error) {
      if ((error as Error).name === 'AbortError') {
        // User cancelled, not an error
        return;
      }

      setStreamState(prev => ({
        ...prev,
        error: (error as Error).message,
      }));

      // Remove empty assistant message on error
      setMessages(prev => prev.slice(0, -1));
    } finally {
      setStreamState(prev => ({ ...prev, isStreaming: false }));
    }
  }, [messages]);

  const cancelStream = useCallback(() => {
    abortControllerRef.current?.abort();
    setStreamState(prev => ({ ...prev, isStreaming: false }));
  }, []);

  const clearMessages = useCallback(() => {
    setMessages([]);
    setStreamState({ isStreaming: false, error: null, usage: null });
  }, []);

  return {
    messages,
    streamState,
    sendMessage,
    cancelStream,
    clearMessages,
  };
}

Rendering Streaming Content Efficiently

When content updates 10-50 times per second during streaming, naive React rendering can cause performance issues. Here’s how to optimize:

// components/MessageContent.tsx
import { memo, useMemo } from 'react';
import { marked } from 'marked';
import DOMPurify from 'dompurify';

interface MessageContentProps {
  content: string;
  isStreaming: boolean;
}

// Memoize to prevent re-renders from parent updates
export const MessageContent = memo(function MessageContent({
  content,
  isStreaming,
}: MessageContentProps) {
  // Only parse markdown when not streaming (expensive operation)
  const renderedContent = useMemo(() => {
    if (isStreaming) {
      // During streaming, just show raw text with basic formatting
      const lines = content.split('\n');
      return lines.map((line, i) => (
        <span key={i}>
          {line}
          {i < lines.length - 1 && <br />}
        </span>
      ));
    }

    // After streaming complete, render full markdown
    const html = marked(content, { breaks: true, gfm: true });
    const sanitized = DOMPurify.sanitize(html);
    return <div dangerouslySetInnerHTML={{ __html: sanitized }} />;
  }, [content, isStreaming]);

  return (
    <div className="message-content">
      {renderedContent}
      {isStreaming && <span className="cursor-blink"></span>}
    </div>
  );
});

The Complete Chat Component

// components/StreamingChat.tsx
import { useState, useRef, useEffect, FormEvent } from 'react';
import { useStreamingChat } from '../hooks/useStreamingChat';
import { MessageContent } from './MessageContent';

export function StreamingChat() {
  const [input, setInput] = useState('');
  const messagesEndRef = useRef<HTMLDivElement>(null);
  const inputRef = useRef<HTMLTextAreaElement>(null);

  const {
    messages,
    streamState,
    sendMessage,
    cancelStream,
    clearMessages,
  } = useStreamingChat();

  // Auto-scroll to bottom on new messages
  useEffect(() => {
    messagesEndRef.current?.scrollIntoView({ behavior: 'smooth' });
  }, [messages]);

  // Auto-resize textarea
  useEffect(() => {
    if (inputRef.current) {
      inputRef.current.style.height = 'auto';
      inputRef.current.style.height = `${inputRef.current.scrollHeight}px`;
    }
  }, [input]);

  const handleSubmit = async (e: FormEvent) => {
    e.preventDefault();
    if (!input.trim() || streamState.isStreaming) return;

    const message = input.trim();
    setInput('');
    await sendMessage(message);
  };

  const handleKeyDown = (e: React.KeyboardEvent) => {
    if (e.key === 'Enter' && !e.shiftKey) {
      e.preventDefault();
      handleSubmit(e);
    }
  };

  return (
    <div className="chat-container">
      {/* Messages Area */}
      <div className="messages-area">
        {messages.length === 0 ? (
          <div className="empty-state">
            <h2>Start a conversation</h2>
            <p>Send a message to begin chatting with AI</p>
          </div>
        ) : (
          messages.map((message, index) => (
            <div
              key={index}
              className={`message message-${message.role}`}
            >
              <div className="message-avatar">
                {message.role === 'user' ? '👤' : '🤖'}
              </div>
              <MessageContent
                content={message.content}
                isStreaming={
                  streamState.isStreaming &&
                  index === messages.length - 1 &&
                  message.role === 'assistant'
                }
              />
            </div>
          ))
        )}
        <div ref={messagesEndRef} />
      </div>

      {/* Error Display */}
      {streamState.error && (
        <div className="error-banner">
          <span>⚠️ {streamState.error}</span>
          <button onClick={clearMessages}>Dismiss</button>
        </div>
      )}

      {/* Token Usage */}
      {streamState.usage && (
        <div className="usage-info">
          Tokens used: {streamState.usage.totalTokens}
        </div>
      )}

      {/* Input Area */}
      <form onSubmit={handleSubmit} className="input-area">
        <textarea
          ref={inputRef}
          value={input}
          onChange={(e) => setInput(e.target.value)}
          onKeyDown={handleKeyDown}
          placeholder="Type a message... (Shift+Enter for new line)"
          disabled={streamState.isStreaming}
          rows={1}
        />

        {streamState.isStreaming ? (
          <button type="button" onClick={cancelStream} className="cancel-btn">
            ⏹ Stop
          </button>
        ) : (
          <button type="submit" disabled={!input.trim()}>
            Send →
          </button>
        )}
      </form>
    </div>
  );
}

Part 3: Alternative Approaches — When SSE Isn’t Enough

While SSE covers 90% of use cases, some scenarios require different approaches.

WebSockets for Bidirectional Communication

Use WebSockets when you need:

  • Real-time interrupts: Allow users to stop generation mid-stream
  • Multiplexing: Multiple concurrent streams over one connection
  • Bidirectional control: Server-initiated status updates

// WebSocket streaming implementation
class StreamingWebSocket {
  private ws: WebSocket;
  private messageHandlers = new Map<string, (data: any) => void>();

  constructor(url: string) {
    this.ws = new WebSocket(url);

    this.ws.onmessage = (event) => {
      const data = JSON.parse(event.data);
      const handler = this.messageHandlers.get(data.streamId);
      if (handler) handler(data);
    };
  }

  async streamChat(
    messages: Message[],
    onChunk: (content: string) => void
  ): Promise<{ streamId: string; cancel: () => void }> {
    const streamId = crypto.randomUUID();

    this.messageHandlers.set(streamId, (data) => {
      if (data.type === 'content') {
        onChunk(data.content);
      }
    });

    this.ws.send(JSON.stringify({
      type: 'start_stream',
      streamId,
      messages,
    }));

    return {
      streamId,
      cancel: () => {
        this.ws.send(JSON.stringify({ type: 'cancel', streamId }));
        this.messageHandlers.delete(streamId);
      },
    };
  }
}

Streaming over HTTP/2

HTTP/2 Server Push has been deprecated and removed from major browsers, so it isn't a practical vehicle for LLM streaming. What HTTP/2 does give you is multiplexing: several concurrent streamed responses can share a single TCP connection without HTTP-level head-of-line blocking, which matters when a user has multiple chats or tool calls streaming at once.
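
If you want to see what this looks like without Express, here is a minimal sketch using Node's built-in http2 module; the TLS paths, the /api/chat/stream route, and the hard-coded word list are all placeholders for illustration:

// Minimal HTTP/2 SSE-style streaming with Node's http2 module
import { createSecureServer } from 'http2';
import { readFileSync } from 'fs';

const server = createSecureServer({
  key: readFileSync('./key.pem'),   // placeholder cert paths
  cert: readFileSync('./cert.pem'), // browsers require TLS for HTTP/2
});

server.on('stream', (stream, headers) => {
  if (headers[':path'] !== '/api/chat/stream') {
    stream.respond({ ':status': 404 });
    stream.end();
    return;
  }

  stream.respond({
    ':status': 200,
    'content-type': 'text/event-stream',
    'cache-control': 'no-cache, no-transform',
  });

  // Fake token stream: one word every 100ms instead of a real LLM call
  const words = ['Hello', ' from', ' an', ' HTTP/2', ' stream'];
  let i = 0;
  const timer = setInterval(() => {
    if (i >= words.length) {
      stream.end('data: [DONE]\n\n');
      clearInterval(timer);
      return;
    }
    stream.write(`data: ${JSON.stringify({ content: words[i++] })}\n\n`);
  }, 100);

  stream.on('close', () => clearInterval(timer));
});

server.listen(3002);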

Using the Vercel AI SDK

For production applications, consider the Vercel AI SDK, which abstracts away much of this complexity:

// Using Vercel AI SDK
import { useChat } from 'ai/react';

export function Chat() {
  const { messages, input, handleInputChange, handleSubmit, isLoading } = useChat({
    api: '/api/chat',
  });

  return (
    <div>
      {messages.map(m => (
        <div key={m.id}>{m.role}: {m.content}</div>
      ))}
      <form onSubmit={handleSubmit}>
        <input value={input} onChange={handleInputChange} />
        <button type="submit" disabled={isLoading}>Send</button>
      </form>
    </div>
  );
}

Part 4: Production Considerations

Error Recovery and Retry Logic

Implement exponential backoff for transient failures:

// Must be an async generator (function*) to use yield*; note that a failure
// after partial output will restart the stream from the beginning
async function* streamWithRetry(
  messages: Message[],
  maxRetries = 3
): AsyncGenerator<string> {
  let attempt = 0;

  while (attempt < maxRetries) {
    try {
      yield* streamFromAPI(messages);
      return; // Success, exit
    } catch (error) {
      attempt++;

      if (attempt >= maxRetries) throw error;

      // Exponential backoff: 1s, 2s, 4s
      const delay = Math.pow(2, attempt - 1) * 1000;
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}

Rate Limiting and Queue Management

When multiple users are streaming simultaneously:

// Simple token bucket rate limiter
class RateLimiter {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private maxTokens: number,
    private refillRate: number // tokens per second
  ) {
    this.tokens = maxTokens;
    this.lastRefill = Date.now();
  }

  async acquire(): Promise<boolean> {
    this.refill();

    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }

    return false;
  }

  private refill() {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.maxTokens,
      this.tokens + elapsed * this.refillRate
    );
    this.lastRefill = now;
  }
}
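
Wiring this into the Part 1 endpoint might look like the following sketch; the bucket size and refill rate are arbitrary, and a real deployment would keep one bucket per user or API key rather than a single global one:

// Reject requests up front when the bucket is empty
const limiter = new RateLimiter(10, 0.5); // burst of 10, refills 1 token every 2s

app.post('/api/chat/stream', async (req, res) => {
  if (!(await limiter.acquire())) {
    res.status(429).json({ error: 'Too many requests, please slow down.' });
    return;
  }

  // ...proceed with the SSE streaming logic from Part 1
});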

Monitoring and Observability

Track these key metrics:

  1. Time to First Token (TTFT): When users first see content
  2. Tokens per Second (TPS): Generation speed
  3. Stream Completion Rate: Percentage of streams that complete without error
  4. Connection Duration: How long streams stay open

// Prometheus-style metrics
import { Counter, Histogram } from 'prom-client';

const streamDuration = new Histogram({
  name: 'llm_stream_duration_seconds',
  help: 'Duration of LLM streaming requests',
  buckets: [0.5, 1, 2, 5, 10, 30, 60],
});

const tokensGenerated = new Counter({
  name: 'llm_tokens_generated_total',
  help: 'Total tokens generated',
});

const streamErrors = new Counter({
  name: 'llm_stream_errors_total',
  help: 'Total streaming errors',
  labelNames: ['error_type'],
});
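
Defining the metrics is only half the job; they also need to be recorded inside the streaming handler. A sketch of how that could look with the counters above (the surrounding loop is the one from Part 1):

// Record duration, token count, and errors around the streaming loop
const endTimer = streamDuration.startTimer();
try {
  for await (const chunk of stream) {
    if (chunk.choices[0]?.delta?.content) {
      tokensGenerated.inc(); // one increment per streamed chunk
    }
    // ...write the SSE chunk to the response as before
  }
} catch (error) {
  streamErrors.inc({ error_type: (error as Error).name || 'unknown' });
  throw error;
} finally {
  endTimer(); // observes elapsed seconds into the histogram
}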

Cost Optimization

Streaming doesn’t reduce API costs, but you can optimize:

  1. Token estimation: Predict output length and warn users before expensive operations
  2. Caching: Cache common responses (with appropriate invalidation)
  3. Model selection: Use cheaper models for simple tasks, expensive models for complex ones

// Intelligent model selection
function selectModel(prompt: string): string {
  const estimatedComplexity = analyzePromptComplexity(prompt);

  if (estimatedComplexity < 0.3) {
    return 'gpt-3.5-turbo'; // $0.0005/1K tokens
  } else if (estimatedComplexity < 0.7) {
    return 'gpt-4-turbo-preview'; // $0.01/1K tokens  
  } else {
    return 'gpt-4'; // $0.03/1K tokens
  }
}
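
For the caching idea (item 2 above), even a simple exact-match cache can help with repeated prompts. This sketch is an assumption-laden illustration (in-memory only, no semantic matching, fixed TTL):

// Naive in-memory cache for completed responses
const responseCache = new Map<string, { content: string; createdAt: number }>();
const CACHE_TTL_MS = 60 * 60 * 1000; // 1 hour

function getCachedResponse(messages: Message[]): string | null {
  const key = JSON.stringify(messages);
  const hit = responseCache.get(key);
  if (!hit) return null;
  if (Date.now() - hit.createdAt > CACHE_TTL_MS) {
    responseCache.delete(key); // expired
    return null;
  }
  return hit.content;
}

function setCachedResponse(messages: Message[], content: string): void {
  responseCache.set(JSON.stringify(messages), {
    content,
    createdAt: Date.now(),
  });
}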

Part 5: Handling Edge Cases

Very Long Responses

For responses that might exceed browser memory:

// Virtual scrolling for very long responses
import { FixedSizeList as List } from 'react-window';

function VirtualizedMessage({ content }: { content: string }) {
  const lines = content.split('\n');

  return (
    <List
      height={400}
      itemCount={lines.length}
      itemSize={24}
      width="100%"
    >
      {({ index, style }) => (
        <div style={style}>{lines[index]}</div>
      )}
    </List>
  );
}

Code Blocks in Streaming Content

Detect incomplete code blocks to avoid syntax highlighting flicker:

function isCompleteCodeBlock(content: string): boolean {
  // Count triple-backtick fences; an even count means every block is closed
  const fenceCount = (content.match(/```/g) || []).length;
  return fenceCount % 2 === 0;
}

function MessageContent({ content, isStreaming }: Props) {
  const shouldHighlight = !isStreaming || isCompleteCodeBlock(content);

  // Only apply syntax highlighting when appropriate
  const processed = shouldHighlight
    ? highlightCode(content)
    : escapeHtml(content);

  return <div dangerouslySetInnerHTML={{ __html: processed }} />;
}

Mobile and Low-Bandwidth Considerations

// Adaptive chunking based on connection speed
function getChunkingStrategy(): 'immediate' | 'batched' {
  if ('connection' in navigator) {
    const connection = (navigator as any).connection;

    if (connection.effectiveType === '2g' || 
        connection.effectiveType === 'slow-2g') {
      return 'batched'; // Reduce render frequency
    }
  }

  return 'immediate';
}
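
When the strategy is 'batched', one way to apply it is to buffer incoming chunks and flush them to state on a timer instead of on every chunk; the 250ms interval below is an arbitrary starting point:

// Accumulate chunks and flush them at most every `intervalMs`
function createBatchedRenderer(
  flush: (text: string) => void,
  intervalMs = 250
) {
  let pending = '';
  const timer = setInterval(() => {
    if (pending) {
      flush(pending);
      pending = '';
    }
  }, intervalMs);

  return {
    push(chunk: string) {
      pending += chunk; // called for every streamed chunk
    },
    stop() {
      clearInterval(timer);
      if (pending) flush(pending); // flush whatever is left
    },
  };
}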


Conclusion

Streaming LLM responses is no longer optional—it’s expected by users who’ve been trained by ChatGPT and Claude. The good news is that with the patterns in this guide, you can build streaming experiences that match or exceed what the major providers offer.

Key takeaways:

  1. SSE is your default choice: Simple, HTTP-based, works through proxies
  2. Handle backpressure and timeouts: Production environments are harsh
  3. Optimize rendering: Markdown parsing and DOM updates are expensive during streaming
  4. Monitor everything: TTFT, TPS, and error rates are your key metrics
  5. Plan for edge cases: Long responses, code blocks, and mobile users need special handling

The code examples in this guide are production-tested patterns. Take them, adapt them to your stack, and build AI experiences that feel magical.

Happy streaming! 🚀

💡 Note: This article was originally published on the Pockit Blog.
