Integrating LLMs Into Frontend: Streaming Responses, Optimistic UI, and Error Handling in React
Adding an AI feature to a React app in 2025 takes an afternoon. Building one that handles streaming correctly, degrades gracefully, stays within rate limits, and does not tank your Core Web Vitals takes considerably longer.
We shipped our first LLM-powered feature — a content generation assistant for our CMS — in March 2025. The initial prototype took two days. Making it production-ready took three weeks. The gap was almost entirely in the frontend engineering: streaming UX, error states, cancellation, rate limiting, and performance.
This article covers what we built and why each piece matters.
The Streaming Foundation
LLM responses are slow by nature — a full response can take 10–30 seconds. Streaming (server-sent tokens) is essential for a usable UX. Without it, users stare at a spinner for half a minute before seeing anything.
The browser's Fetch API supports streaming natively. The key is reading the response body as a stream rather than waiting for the full response:
// lib/stream-llm.ts

// Minimal error type carrying the HTTP status alongside the message
export class LLMError extends Error {
  constructor(public status: number, message: string) {
    super(message);
    this.name = 'LLMError';
  }
}

export async function* streamLLM(
  prompt: string,
  signal: AbortSignal,
): AsyncGenerator<string> {
  const response = await fetch('/api/ai/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt }),
    signal,
  });

  if (!response.ok) {
    const error = await response.json().catch(() => ({}));
    throw new LLMError(response.status, error.message ?? 'Generation failed');
  }

  const reader = response.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  try {
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;

      // Server sends newline-delimited JSON: {"token":"Hello"}
      // Buffer the tail: a JSON line can be split across network chunks
      buffer += decoder.decode(value, { stream: true });
      const lines = buffer.split('\n');
      buffer = lines.pop() ?? ''; // keep the incomplete trailing line

      for (const line of lines) {
        if (!line.trim()) continue;
        const { token } = JSON.parse(line) as { token: string };
        yield token;
      }
    }
  } finally {
    reader.releaseLock();
  }
}
The generator function lets callers consume tokens one at a time without buffering the entire response in memory. This matters for long-form content generation.
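The client side assumes a server route that emits newline-delimited JSON. As a sketch of that wire format, here is the encoding plus a hypothetical handler shape; `pipeTokens` and `encodeTokenLine` are our illustrative names, not part of any framework API:

```typescript
// One token per line on the wire: {"token":"Hello"}\n
export function encodeTokenLine(token: string): string {
  return JSON.stringify({ token }) + '\n';
}

// Hypothetical handler shape: pull tokens from some upstream async iterable
// and write each one out as its own NDJSON line.
export async function pipeTokens(
  upstreamTokens: AsyncIterable<string>,
  write: (chunk: string) => void,
): Promise<void> {
  for await (const token of upstreamTokens) {
    write(encodeTokenLine(token));
  }
}
```

Whatever server stack you use, the contract that matters is one JSON object per line, flushed as tokens arrive.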
The useGeneration Hook
The streaming function is the plumbing. The hook is what React components actually use. It manages state, cancellation, and error handling in one place:
// hooks/use-generation.ts
import { useCallback, useState } from 'react';
import { LLMError, streamLLM } from '../lib/stream-llm';

type GenerationState =
  | { status: 'idle' }
  | { status: 'streaming'; content: string; abort: () => void }
  | { status: 'done'; content: string }
  | { status: 'error'; error: string };

export function useGeneration() {
  const [state, setState] = useState<GenerationState>({ status: 'idle' });

  const generate = useCallback(async (prompt: string) => {
    const controller = new AbortController();
    setState({ status: 'streaming', content: '', abort: () => controller.abort() });

    try {
      let content = '';
      for await (const token of streamLLM(prompt, controller.signal)) {
        content += token;
        // Functional update guarded by status: tokens that race in after
        // cancellation are dropped instead of resurrecting the streaming state
        setState(prev =>
          prev.status === 'streaming'
            ? { ...prev, content }
            : prev
        );
      }
      setState({ status: 'done', content });
    } catch (err) {
      if (err instanceof DOMException && err.name === 'AbortError') {
        setState({ status: 'idle' }); // User cancelled — not an error
        return;
      }
      const message = err instanceof LLMError ? err.message : 'Unexpected error';
      setState({ status: 'error', error: message });
    }
  }, []);

  return { state, generate };
}
Distinguishing AbortError from real errors is important. When the user cancels generation, resetting to idle is the correct behavior. Showing an error state would be confusing — the user intentionally stopped the action.
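That check can be factored into a small predicate so every caller classifies cancellation the same way; the helper name here is ours:

```typescript
// True only for user-initiated cancellation via AbortController.abort().
// Anything else (network failure, HTTP error, parse error) is a real error.
export function isUserCancellation(err: unknown): boolean {
  return err instanceof DOMException && err.name === 'AbortError';
}
```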
Optimistic UI for Chat
In a chat interface, the user's message should appear immediately — before the server acknowledges it. This is optimistic UI: render the expected outcome immediately, then reconcile with the server response.
// components/ChatThread.tsx
import { useEffect, useState } from 'react';
import { useGeneration } from '../hooks/use-generation';

type Message = {
  id: string;
  role: 'user' | 'assistant';
  content: string;
  status: 'sent' | 'pending' | 'error';
};

export function ChatThread() {
  const [messages, setMessages] = useState<Message[]>([]);
  const { state, generate } = useGeneration();

  const sendMessage = (text: string) => {
    const userMsg: Message = {
      id: crypto.randomUUID(),
      role: 'user',
      content: text,
      status: 'sent',
    };
    // Optimistically add user message + placeholder for assistant
    const assistantPlaceholder: Message = {
      id: crypto.randomUUID(),
      role: 'assistant',
      content: '',
      status: 'pending',
    };
    setMessages(prev => [...prev, userMsg, assistantPlaceholder]);
    generate(text);
  };

  // Sync streaming content into the placeholder message
  useEffect(() => {
    if (state.status === 'streaming' || state.status === 'done') {
      setMessages(prev => prev.map((msg, i) =>
        i === prev.length - 1
          ? { ...msg, content: state.content, status: state.status === 'done' ? 'sent' : 'pending' }
          : msg
      ));
    }
  }, [state]);

  return (
    <section aria-label="Chat thread" aria-live="polite">
      {messages.map(msg => <ChatMessage key={msg.id} message={msg} />)}
      {state.status === 'streaming' && (
        <button onClick={state.abort} aria-label="Stop generating">Stop</button>
      )}
    </section>
  );
}
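One thing the component leaves implicit is reconciliation on failure: if generation errors out, the pending placeholder should flip to an error state rather than sit pending forever. A sketch of that transition as a pure function over the same Message shape (the helper is ours, and it deliberately keeps any partial content that streamed in before the failure):

```typescript
type Message = {
  id: string;
  role: 'user' | 'assistant';
  content: string;
  status: 'sent' | 'pending' | 'error';
};

// Mark the trailing pending assistant message as errored, preserving
// whatever partial content already streamed in.
export function markPendingAsError(messages: Message[]): Message[] {
  return messages.map((msg, i) =>
    i === messages.length - 1 && msg.status === 'pending'
      ? { ...msg, status: 'error' as const }
      : msg
  );
}
```

Wire it to the hook's error state the same way the streaming sync effect works, and the chat thread never shows a permanently spinning message.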
Rate Limiting on the Frontend
Server-side rate limiting is necessary. Frontend rate limiting is an additional quality-of-life layer — it gives users immediate feedback before the server rejects the request, and it protects against accidental rapid-fire submissions (double-clicks, keyboard repeats).
// hooks/use-rate-limit.ts
import { useCallback, useRef } from 'react';

export function useRateLimit(maxRequests: number, windowMs: number) {
  const timestamps = useRef<number[]>([]);

  const check = useCallback((): { allowed: boolean; retryAfterMs: number } => {
    const now = Date.now();
    const windowStart = now - windowMs;

    // Evict timestamps that have fallen out of the sliding window
    timestamps.current = timestamps.current.filter(t => t > windowStart);

    if (timestamps.current.length >= maxRequests) {
      const oldest = timestamps.current[0];
      return { allowed: false, retryAfterMs: oldest + windowMs - now };
    }

    timestamps.current.push(now);
    return { allowed: true, retryAfterMs: 0 };
  }, [maxRequests, windowMs]);

  return check;
}
// Usage: 5 requests per 60 seconds
const checkRateLimit = useRateLimit(5, 60_000);

const handleGenerate = () => {
  const { allowed, retryAfterMs } = checkRateLimit();
  if (!allowed) {
    const seconds = Math.ceil(retryAfterMs / 1000);
    showToast(`Please wait ${seconds}s before generating again.`);
    return;
  }
  generate(prompt);
};
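The same sliding-window logic works outside React, and injecting the clock makes it deterministic to test. A framework-free sketch with our own names:

```typescript
// Sliding-window limiter: at most maxRequests calls per windowMs.
// The clock is injectable so the window can be tested deterministically.
export function createRateLimiter(
  maxRequests: number,
  windowMs: number,
  now: () => number = Date.now,
) {
  let timestamps: number[] = [];

  return function check(): { allowed: boolean; retryAfterMs: number } {
    const current = now();
    timestamps = timestamps.filter(t => t > current - windowMs);

    if (timestamps.length >= maxRequests) {
      return { allowed: false, retryAfterMs: timestamps[0] + windowMs - current };
    }

    timestamps.push(current);
    return { allowed: true, retryAfterMs: 0 };
  };
}
```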
Core Web Vitals Under LLM Load
Streaming token updates re-render the output component on every token. At typical LLM speeds (30–80 tokens/second), this can cause significant main thread pressure and hurt INP.
Two mitigations we use:
1. Throttle state updates
// Batch updates every 50ms rather than every token
const flushTimer = useRef<ReturnType<typeof setTimeout> | null>(null);
const bufferRef = useRef('');

for await (const token of streamLLM(prompt, signal)) {
  bufferRef.current += token;
  if (!flushTimer.current) {
    flushTimer.current = setTimeout(() => {
      setContent(bufferRef.current);
      flushTimer.current = null;
    }, 50);
  }
}

// Final flush: cancel any pending timer and commit the last buffered tokens
if (flushTimer.current) clearTimeout(flushTimer.current);
flushTimer.current = null;
setContent(bufferRef.current);
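The buffering pattern generalizes into a small helper that owns the timer and the final flush; this is our sketch, not library code:

```typescript
// Coalesce rapid token pushes into at most one onFlush per interval.
export function createTokenBuffer(
  onFlush: (content: string) => void,
  intervalMs = 50,
) {
  let buffer = '';
  let timer: ReturnType<typeof setTimeout> | null = null;

  return {
    push(token: string) {
      buffer += token;
      if (!timer) {
        // Only one flush scheduled per interval, regardless of token rate
        timer = setTimeout(() => {
          timer = null;
          onFlush(buffer);
        }, intervalMs);
      }
    },
    // Call when the stream closes: cancel the pending timer, emit everything
    flush() {
      if (timer) {
        clearTimeout(timer);
        timer = null;
      }
      onFlush(buffer);
    },
  };
}
```

In the hook, `onFlush` is just `setContent`, and `flush()` runs after the `for await` loop completes.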
2. Use CSS for the typing cursor, not JavaScript
/* The blinking cursor is pure CSS — no JS timer needed */
.streaming-content::after {
  content: '▋';
  animation: blink 1s step-end infinite;
  color: var(--c-accent);
}

@keyframes blink {
  0%, 100% { opacity: 1; }
  50% { opacity: 0; }
}
We also apply content-visibility: auto to the output container if it can scroll out of view. The browser will skip paint for off-screen content, reducing the rendering cost of rapid updates.
Error Handling and Degradation
LLM APIs have higher error rates than typical REST APIs: timeouts, rate limit rejections, model overload responses, and partial content truncation are all real. Design error states proactively:
- Timeout — set a maximum wait for the first token (not the full response). If no token arrives in 10s, abort and show an error.
- Partial response — if the stream closes mid-sentence, show the partial content with a note that generation was incomplete. Do not discard it.
- Rate limit (429) — surface the retry-after time to the user. Do not silently retry in the background.
- Model unavailable (503) — offer a fallback path or a manual alternative where possible.
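The first-token timeout from the list above can be sketched as a wrapper around the streaming generator; the wrapper name and default are ours:

```typescript
// Abort only if nothing arrives before the deadline. A slow full response
// is fine; a silent upstream is not.
export async function* withFirstTokenTimeout(
  tokens: AsyncGenerator<string>,
  controller: AbortController,
  timeoutMs = 10_000,
): AsyncGenerator<string> {
  let timer: ReturnType<typeof setTimeout> | null = setTimeout(
    () => controller.abort(),
    timeoutMs,
  );
  try {
    for await (const token of tokens) {
      if (timer) {
        clearTimeout(timer); // first token arrived: disarm the timeout
        timer = null;
      }
      yield token;
    }
  } finally {
    if (timer) clearTimeout(timer);
  }
}
```

Because the wrapper aborts through the same AbortController, the timeout surfaces as an AbortError in the existing error path; you would then branch on whether cancellation was user-initiated before resetting to idle.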
Key Takeaways
- Use async generators for streaming — they compose naturally with React's state model
- Distinguish AbortError from real errors — user cancellation is not an error state
- Add optimistic UI for chat interfaces — user messages must appear immediately
- Layer frontend rate limiting on top of server-side rate limiting for better UX
- Throttle streaming state updates to 50ms batches to protect INP
- Use CSS for the typing cursor — not a JavaScript timer
- Design for partial responses and model unavailability from the start — LLM APIs are less reliable than typical REST APIs