LLM rate limits and retries: a reliability checklist for AI apps (2026)
A practical checklist to make AI apps reliable under rate limits and transient failures. Learn how to set timeouts, retries, backoff, idempotency, and fallbacks without creating hidden loops or runaway costs.
Table of Contents
- Conclusion
- Explanation
- Practical Guide
- Step 1: classify your LLM operations (5 minutes)
- Step 2: set timeouts intentionally (5 minutes)
- Step 3: implement backoff correctly (10 minutes)
- Step 4: make side effects idempotent (10 minutes)
- Step 5: add fallbacks and degraded modes (5 minutes)
- Step 6: add cost guardrails (5 minutes)
- Pitfalls
- Checklist
- FAQ
- 1) Should I retry on every error?
- 2) What is the most common mistake?
- 3) How do I prevent retry storms?
- Internal links
- Disclaimer
How do you handle LLM rate limits and retries without breaking reliability or blowing up cost?
Conclusion
Rate limits and transient failures are normal for LLM providers. A reliable AI app needs a repeatable policy for:
- timeouts
- retries with backoff
- idempotency and deduplication
- fallback models and degraded modes
- cost guardrails
If you do this well, users see “slower but works.” If you do it poorly, you get infinite retry loops, duplicated actions, and surprise bills.
Explanation
LLM calls fail in predictable ways:
- HTTP 429 (rate limit)
- 5xx (provider issues)
- network timeouts
- streaming disconnects
The tricky part is not “retry or not.” It’s avoiding these failure modes:
- retry storms (your retries become the outage)
- duplicate side effects (double emails, double writes)
- hidden loops in agents (tool call retries plus LLM retries)
- runaway cost (retries on expensive models)
The right approach is to separate:
- compute retries (safe to repeat) from
- side-effect actions (must be idempotent)
Practical Guide
Step 1: classify your LLM operations (5 minutes)
For each LLM call, label it:
- safe to retry (pure text generation)
- risky to retry (triggers tool calls, writes data)
Rule:
- never retry a side effect unless you can guarantee idempotency
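The classification above can be sketched as a small registry. The operation names and the `can_retry` helper here are hypothetical, just to show the shape of the rule:

```python
from enum import Enum

class RetryClass(Enum):
    SAFE = "safe"                 # pure generation, no side effects
    SIDE_EFFECT = "side_effect"   # triggers tool calls or writes

# Hypothetical registry: label every LLM operation once, up front.
OPERATIONS = {
    "summarize_text": RetryClass.SAFE,
    "send_followup_email": RetryClass.SIDE_EFFECT,
}

def can_retry(op: str, has_idempotency_key: bool) -> bool:
    """Never retry a side effect unless idempotency is guaranteed."""
    cls = OPERATIONS.get(op, RetryClass.SIDE_EFFECT)  # unknown ops default to cautious
    return cls is RetryClass.SAFE or has_idempotency_key
```

Defaulting unknown operations to `SIDE_EFFECT` means a forgotten label fails safe (no retry) rather than duplicating an action.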
Step 2: set timeouts intentionally (5 minutes)
Define:
- connect timeout
- overall request timeout
- streaming timeout
Also define:
- max tokens and max output size
Without timeouts, your queue becomes your outage.
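One way to keep these values intentional is a single timeout policy object passed to every call site, rather than defaults scattered across clients. The field names and numbers below are illustrative starting points, not recommendations for your workload:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TimeoutPolicy:
    connect_s: float = 5.0        # TCP/TLS handshake
    total_s: float = 30.0         # whole request, including streaming
    stream_idle_s: float = 10.0   # max gap between streamed chunks
    max_output_tokens: int = 1024 # cap output size alongside time

def remaining_budget(policy: TimeoutPolicy, elapsed_s: float) -> float:
    """Time left before the overall deadline; <= 0 means abort, don't retry."""
    return policy.total_s - elapsed_s
```

Checking `remaining_budget` before each retry also prevents retries from extending a request past its deadline.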
Step 3: implement backoff correctly (10 minutes)
Use:
- exponential backoff with jitter
- a hard cap on total attempts
Suggested starting point:
- max attempts: 3
- base delay: 250ms
- max delay: 4s
For 429 responses:
- respect the Retry-After header when the provider sends one
- queue the request rather than retrying in a tight loop
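A minimal sketch of the delay calculation, using full jitter and honoring Retry-After when present (the parameter defaults mirror the starting point above):

```python
import random
from typing import Optional

MAX_ATTEMPTS = 3  # hard cap on total attempts

def backoff_delay(attempt: int,
                  base_s: float = 0.25,
                  cap_s: float = 4.0,
                  retry_after_s: Optional[float] = None) -> float:
    """Delay before retry `attempt` (0-indexed), in seconds."""
    if retry_after_s is not None:
        # Provider told us when to come back; if this exceeds your
        # latency budget, queue the request instead of waiting inline.
        return min(retry_after_s, cap_s)
    exp = min(cap_s, base_s * (2 ** attempt))
    return random.uniform(0, exp)  # full jitter desynchronizes clients
```

Full jitter (uniform between 0 and the exponential ceiling) is what breaks up synchronized retry waves; pure exponential backoff without jitter keeps clients in lockstep.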
Step 4: make side effects idempotent (10 minutes)
Every “do something” step should accept an idempotency key.
Examples:
- sending email
- creating a ticket
- writing to CRM
- charging payments
Minimum pattern:
- request_id becomes the idempotency key
- store a short-lived dedupe record
If you can’t dedupe, you can’t retry safely.
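The minimum pattern can be sketched with an in-memory store; in production you would back this with Redis or a database row with a TTL. `send_email` is a hypothetical side effect:

```python
import time
from typing import Dict

class DedupeStore:
    """Short-lived dedupe records keyed by idempotency key (in-memory stand-in)."""
    def __init__(self, ttl_s: float = 3600.0):
        self._expires: Dict[str, float] = {}
        self._ttl = ttl_s

    def first_time(self, idempotency_key: str) -> bool:
        now = time.monotonic()
        expiry = self._expires.get(idempotency_key)
        if expiry is not None and expiry > now:
            return False  # duplicate within TTL: skip the side effect
        self._expires[idempotency_key] = now + self._ttl
        return True

def send_email(store: DedupeStore, request_id: str, body: str) -> bool:
    """request_id doubles as the idempotency key; True means actually sent."""
    if not store.first_time(f"email:{request_id}"):
        return False
    # ... real send would happen here ...
    return True
```

With this in place, a retried request carrying the same `request_id` becomes a no-op instead of a second email.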
Step 5: add fallbacks and degraded modes (5 minutes)
Fallback options:
- cheaper/smaller model for retries
- cached response for common queries
- “read-only mode” (disable tools)
- “draft mode” (don’t auto-send actions)
The goal is predictable behavior, not perfect output.
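A fallback chain can be expressed as an ordered series of cheaper options, ending in a fixed degraded message so behavior stays predictable. The callables here stand in for model clients; the degraded string is an assumption, not a prescribed UX:

```python
from typing import Callable, Dict

def generate(prompt: str,
             primary: Callable[[str], str],
             fallback: Callable[[str], str],
             cache: Dict[str, str]) -> str:
    """Try the primary model once; on failure, serve cache, then a cheaper model."""
    try:
        return primary(prompt)
    except Exception:
        if prompt in cache:
            return cache[prompt]      # cached response for common queries
        try:
            return fallback(prompt)   # cheaper/smaller model for the retry
        except Exception:
            # Last resort: predictable degraded mode, no tools, no actions.
            return "[degraded] Unable to generate a response right now."
```

Note the primary is tried exactly once here; the retry budget is spent on the cheaper model, which is one way to cap cost under provider outages.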
Step 6: add cost guardrails (5 minutes)
Log per request:
- tokens_in, tokens_out
- total attempts
- model used per attempt
Guardrails:
- cap retries on expensive models
- cap tokens per request
- per-account rate limits
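The logging and the expensive-model cap can be combined in one per-request record. The model names and the attempt cap are illustrative assumptions:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

EXPENSIVE_MODELS = {"big-model"}   # hypothetical model names
MAX_EXPENSIVE_ATTEMPTS = 2         # first try + one retry on expensive models

@dataclass
class RequestCostLog:
    request_id: str
    # One entry per attempt: (model, tokens_in, tokens_out)
    attempts: List[Tuple[str, int, int]] = field(default_factory=list)

    def record(self, model: str, tokens_in: int, tokens_out: int) -> None:
        self.attempts.append((model, tokens_in, tokens_out))

    def total_tokens(self) -> int:
        return sum(ti + to for _, ti, to in self.attempts)

def allow_attempt(log: RequestCostLog, model: str) -> bool:
    """Cap attempts on expensive models; cheap models follow the global retry cap."""
    if model not in EXPENSIVE_MODELS:
        return True
    used = sum(1 for m, _, _ in log.attempts if m in EXPENSIVE_MODELS)
    return used < MAX_EXPENSIVE_ATTEMPTS
```

Because every attempt records its model, the same log answers both questions: what did this request cost, and should the next retry be allowed at all.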
Pitfalls
- retrying tool calls without idempotency keys
- no jitter (synchronized retries)
- retrying on 4xx that will never succeed
- agent loops that re-issue the same tool call
- using the same expensive model for every retry
Checklist
- [ ] I classified LLM calls as safe-to-retry vs side-effect
- [ ] Timeouts are explicitly set (connect + total + streaming)
- [ ] Max tokens/output limits exist
- [ ] Retries use exponential backoff with jitter
- [ ] Total attempts are capped
- [ ] 429 handling respects Retry-After when available
- [ ] Side effects use idempotency keys
- [ ] Duplicate actions are deduped by request_id
- [ ] Fallback model exists for retry/degraded mode
- [ ] Tools can be disabled (read-only/degraded)
- [ ] Cost per request is logged (tokens, attempts, model)
- [ ] Per-account limits exist to prevent abuse
FAQ
1) Should I retry on every error?
No. Retry only transient errors (429, 5xx, timeouts). Do not retry deterministic 4xx.
2) What is the most common mistake?
Retrying side effects without idempotency. That’s how you get duplicated writes and emails.
3) How do I prevent retry storms?
Add jitter, cap attempts, and use queues for 429 instead of tight loops.
Internal links
- Hub: AI development
Disclaimer
General ops guidance only.