Integrating LLMs Into Frontend: Streaming Responses, Optimistic UI, and Error Handling in React
Adding an AI feature to a React app in 2025 takes an afternoon. Building one that handles streaming correctly, degrades gracefully, stays within rate limits, and does not tank your Core Web Vitals takes considerably longer.
We shipped our first LLM-powered feature — a content generation assistant for our CMS — in March 2025. The initial prototype took two days. Making it production-ready took three weeks. The gap was almost entirely in the frontend engineering: streaming UX, error states, cancellation, rate limiting, and performance.
This article covers what we built and why each piece matters.
The Streaming Foundation
LLM responses are slow by nature — a full response can take 10–30 seconds. Streaming (server-sent tokens) is essential for a usable UX. Without it, users stare at a spinner for half a minute before seeing anything.
The browser's Fetch API supports streaming natively. The key is reading the response body as a stream rather than waiting for the full response:
// lib/stream-llm.ts

// Minimal error type carrying the HTTP status alongside the message
export class LLMError extends Error {
  constructor(public status: number, message: string) {
    super(message);
    this.name = 'LLMError';
  }
}

export async function* streamLLM(
  prompt: string,
  signal: AbortSignal,
): AsyncGenerator<string> {
  const response = await fetch('/api/ai/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt }),
    signal,
  });

  if (!response.ok) {
    const error = await response.json().catch(() => ({}));
    throw new LLMError(response.status, error.message ?? 'Generation failed');
  }

  const reader = response.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  try {
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;

      // Server sends newline-delimited JSON: {"token":"Hello"}
      // Buffer the tail: a JSON line can be split across network chunks
      buffer += decoder.decode(value, { stream: true });
      const lines = buffer.split('\n');
      buffer = lines.pop() ?? ''; // keep the incomplete trailing line

      for (const line of lines) {
        if (!line.trim()) continue;
        const { token } = JSON.parse(line) as { token: string };
        yield token;
      }
    }
  } finally {
    reader.releaseLock();
  }
}
The generator function lets callers consume tokens one at a time without buffering the entire response in memory. This matters for long-form content generation.
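The client side assumes a server route that emits newline-delimited JSON. As a sketch of that wire format, here is the encoding plus a hypothetical handler shape; `pipeTokens` and `encodeTokenLine` are our illustrative names, not part of any framework API:

```typescript
// One token per line on the wire: {"token":"Hello"}\n
export function encodeTokenLine(token: string): string {
  return JSON.stringify({ token }) + '\n';
}

// Hypothetical handler shape: pull tokens from some upstream async iterable
// and write each one out as its own NDJSON line.
export async function pipeTokens(
  upstreamTokens: AsyncIterable<string>,
  write: (chunk: string) => void,
): Promise<void> {
  for await (const token of upstreamTokens) {
    write(encodeTokenLine(token));
  }
}
```

Whatever server stack you use, the contract that matters is one JSON object per line, flushed as tokens arrive.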
The useGeneration Hook
The streaming function is the plumbing. The hook is what React components actually use. It manages state, cancellation, and error handling in one place:
// hooks/use-generation.ts
import { useCallback, useState } from 'react';
import { LLMError, streamLLM } from '../lib/stream-llm';

type GenerationState =
  | { status: 'idle' }
  | { status: 'streaming'; content: string; abort: () => void }
  | { status: 'done'; content: string }
  | { status: 'error'; error: string };

export function useGeneration() {
  const [state, setState] = useState<GenerationState>({ status: 'idle' });

  const generate = useCallback(async (prompt: string) => {
    const controller = new AbortController();
    setState({ status: 'streaming', content: '', abort: () => controller.abort() });

    try {
      let content = '';
      for await (const token of streamLLM(prompt, controller.signal)) {
        content += token;
        // Functional update guarded by status: tokens that race in after
        // cancellation are dropped instead of resurrecting the streaming state
        setState(prev =>
          prev.status === 'streaming'
            ? { ...prev, content }
            : prev
        );
      }
      setState({ status: 'done', content });
    } catch (err) {
      if (err instanceof DOMException && err.name === 'AbortError') {
        setState({ status: 'idle' }); // User cancelled — not an error
        return;
      }
      const message = err instanceof LLMError ? err.message : 'Unexpected error';
      setState({ status: 'error', error: message });
    }
  }, []);

  return { state, generate };
}
Distinguishing AbortError from real errors is important. When the user cancels generation, resetting to idle is the correct behavior. Showing an error state would be confusing — the user intentionally stopped the action.
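That check can be factored into a small predicate so every caller classifies cancellation the same way; the helper name here is ours:

```typescript
// True only for user-initiated cancellation via AbortController.abort().
// Anything else (network failure, HTTP error, parse error) is a real error.
export function isUserCancellation(err: unknown): boolean {
  return err instanceof DOMException && err.name === 'AbortError';
}
```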
Optimistic UI for Chat
In a chat interface, the user's message should appear immediately — before the server acknowledges it. This is optimistic UI: render the expected outcome immediately, then reconcile with the server response.
// components/ChatThread.tsx
import { useEffect, useState } from 'react';
import { useGeneration } from '../hooks/use-generation';

type Message = {
  id: string;
  role: 'user' | 'assistant';
  content: string;
  status: 'sent' | 'pending' | 'error';
};

export function ChatThread() {
  const [messages, setMessages] = useState<Message[]>([]);
  const { state, generate } = useGeneration();

  const sendMessage = (text: string) => {
    const userMsg: Message = {
      id: crypto.randomUUID(),
      role: 'user',
      content: text,
      status: 'sent',
    };
    // Optimistically add user message + placeholder for assistant
    const assistantPlaceholder: Message = {
      id: crypto.randomUUID(),
      role: 'assistant',
      content: '',
      status: 'pending',
    };
    setMessages(prev => [...prev, userMsg, assistantPlaceholder]);
    generate(text);
  };

  // Sync streaming content into the placeholder message
  useEffect(() => {
    if (state.status === 'streaming' || state.status === 'done') {
      setMessages(prev => prev.map((msg, i) =>
        i === prev.length - 1
          ? { ...msg, content: state.content, status: state.status === 'done' ? 'sent' : 'pending' }
          : msg
      ));
    }
  }, [state]);

  return (
    <section aria-label="Chat thread" aria-live="polite">
      {messages.map(msg => <ChatMessage key={msg.id} message={msg} />)}
      {state.status === 'streaming' && (
        <button onClick={state.abort} aria-label="Stop generating">Stop</button>
      )}
    </section>
  );
}
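One thing the component leaves implicit is reconciliation on failure: if generation errors out, the pending placeholder should flip to an error state rather than sit pending forever. A sketch of that transition as a pure function over the same Message shape (the helper is ours, and it deliberately keeps any partial content that streamed in before the failure):

```typescript
type Message = {
  id: string;
  role: 'user' | 'assistant';
  content: string;
  status: 'sent' | 'pending' | 'error';
};

// Mark the trailing pending assistant message as errored, preserving
// whatever partial content already streamed in.
export function markPendingAsError(messages: Message[]): Message[] {
  return messages.map((msg, i) =>
    i === messages.length - 1 && msg.status === 'pending'
      ? { ...msg, status: 'error' as const }
      : msg
  );
}
```

Wire it to the hook's error state the same way the streaming sync effect works, and the chat thread never shows a permanently spinning message.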
Rate Limiting on the Frontend
Server-side rate limiting is necessary. Frontend rate limiting is an additional quality-of-life layer — it gives users immediate feedback before the server rejects the request, and it protects against accidental rapid-fire submissions (double-clicks, keyboard repeats).
// hooks/use-rate-limit.ts
import { useCallback, useRef } from 'react';

export function useRateLimit(maxRequests: number, windowMs: number) {
  const timestamps = useRef<number[]>([]);

  const check = useCallback((): { allowed: boolean; retryAfterMs: number } => {
    const now = Date.now();
    const windowStart = now - windowMs;

    // Evict timestamps that have fallen out of the sliding window
    timestamps.current = timestamps.current.filter(t => t > windowStart);

    if (timestamps.current.length >= maxRequests) {
      const oldest = timestamps.current[0];
      return { allowed: false, retryAfterMs: oldest + windowMs - now };
    }

    timestamps.current.push(now);
    return { allowed: true, retryAfterMs: 0 };
  }, [maxRequests, windowMs]);

  return check;
}
// Usage: 5 requests per 60 seconds
const checkRateLimit = useRateLimit(5, 60_000);

const handleGenerate = () => {
  const { allowed, retryAfterMs } = checkRateLimit();
  if (!allowed) {
    const seconds = Math.ceil(retryAfterMs / 1000);
    showToast(`Please wait ${seconds}s before generating again.`);
    return;
  }
  generate(prompt);
};
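The same sliding-window logic works outside React, and injecting the clock makes it deterministic to test. A framework-free sketch with our own names:

```typescript
// Sliding-window limiter: at most maxRequests calls per windowMs.
// The clock is injectable so the window can be tested deterministically.
export function createRateLimiter(
  maxRequests: number,
  windowMs: number,
  now: () => number = Date.now,
) {
  let timestamps: number[] = [];

  return function check(): { allowed: boolean; retryAfterMs: number } {
    const current = now();
    timestamps = timestamps.filter(t => t > current - windowMs);

    if (timestamps.length >= maxRequests) {
      return { allowed: false, retryAfterMs: timestamps[0] + windowMs - current };
    }

    timestamps.push(current);
    return { allowed: true, retryAfterMs: 0 };
  };
}
```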
Core Web Vitals Under LLM Load
Streaming token updates re-render the output component on every token. At typical LLM speeds (30–80 tokens/second), this can cause significant main thread pressure and hurt INP.
Two mitigations we use:
1. Throttle state updates
// Batch updates every 50ms rather than every token
const flushTimer = useRef<ReturnType<typeof setTimeout> | null>(null);
const bufferRef = useRef('');

for await (const token of streamLLM(prompt, signal)) {
  bufferRef.current += token;
  if (!flushTimer.current) {
    flushTimer.current = setTimeout(() => {
      setContent(bufferRef.current);
      flushTimer.current = null;
    }, 50);
  }
}

// Final flush: cancel any pending timer and commit the last buffered tokens
if (flushTimer.current) clearTimeout(flushTimer.current);
flushTimer.current = null;
setContent(bufferRef.current);
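The buffering pattern generalizes into a small helper that owns the timer and the final flush; this is our sketch, not library code:

```typescript
// Coalesce rapid token pushes into at most one onFlush per interval.
export function createTokenBuffer(
  onFlush: (content: string) => void,
  intervalMs = 50,
) {
  let buffer = '';
  let timer: ReturnType<typeof setTimeout> | null = null;

  return {
    push(token: string) {
      buffer += token;
      if (!timer) {
        // Only one flush scheduled per interval, regardless of token rate
        timer = setTimeout(() => {
          timer = null;
          onFlush(buffer);
        }, intervalMs);
      }
    },
    // Call when the stream closes: cancel the pending timer, emit everything
    flush() {
      if (timer) {
        clearTimeout(timer);
        timer = null;
      }
      onFlush(buffer);
    },
  };
}
```

In the hook, `onFlush` is just `setContent`, and `flush()` runs after the `for await` loop completes.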
2. Use CSS for the typing cursor, not JavaScript
/* The blinking cursor is pure CSS — no JS timer needed */
.streaming-content::after {
  content: '▋';
  animation: blink 1s step-end infinite;
  color: var(--c-accent);
}

@keyframes blink {
  0%, 100% { opacity: 1; }
  50% { opacity: 0; }
}
We also apply content-visibility: auto to the output container if it can scroll out of view. The browser will skip paint for off-screen content, reducing the rendering cost of rapid updates.
Error Handling and Degradation
LLM APIs have higher error rates than typical REST APIs: timeouts, rate limit rejections, model overload responses, and partial content truncation are all real. Design error states proactively:
- Timeout — set a maximum wait for the first token (not the full response). If no token arrives in 10s, abort and show an error.
- Partial response — if the stream closes mid-sentence, show the partial content with a note that generation was incomplete. Do not discard it.
- Rate limit (429) — surface the retry-after time to the user. Do not silently retry in the background.
- Model unavailable (503) — offer a fallback path or a manual alternative where possible.
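The first-token timeout from the list above can be sketched as a wrapper around the streaming generator; the wrapper name and default are ours:

```typescript
// Abort only if nothing arrives before the deadline. A slow full response
// is fine; a silent upstream is not.
export async function* withFirstTokenTimeout(
  tokens: AsyncGenerator<string>,
  controller: AbortController,
  timeoutMs = 10_000,
): AsyncGenerator<string> {
  let timer: ReturnType<typeof setTimeout> | null = setTimeout(
    () => controller.abort(),
    timeoutMs,
  );
  try {
    for await (const token of tokens) {
      if (timer) {
        clearTimeout(timer); // first token arrived: disarm the timeout
        timer = null;
      }
      yield token;
    }
  } finally {
    if (timer) clearTimeout(timer);
  }
}
```

Because the wrapper aborts through the same AbortController, the timeout surfaces as an AbortError in the existing error path; you would then branch on whether cancellation was user-initiated before resetting to idle.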
Key Takeaways
- Use async generators for streaming — they compose naturally with React's state model
- Distinguish AbortError from real errors — user cancellation is not an error state
- Add optimistic UI for chat interfaces — user messages must appear immediately
- Layer frontend rate limiting on top of server-side rate limiting for better UX
- Throttle streaming state updates to 50ms batches to protect INP
- Use CSS for the typing cursor — not a JavaScript timer
- Design for partial responses and model unavailability from the start — LLM APIs are less reliable than typical REST APIs