Implementing Circuit Breakers for LLM APIs: SRE Patterns for AI Reliability

Author
  Nino, Senior Tech Editor

You have finally shipped your flagship AI feature. Users are onboarding, the feedback is glowing, and retention is spiking. Then, at 2:00 PM on a Tuesday, OpenAI returns a 429 Too Many Requests error for every single call. Within minutes, your entire product is down, your support queue is overflowing, and your social media mentions are a disaster. If you have built anything on top of LLM APIs, you have likely felt this specific, acute pain.

The irony is that we solved this exact problem in distributed systems over fifteen years ago. Concepts like circuit breakers, health checks, and failover chains are standard in microservice architectures. However, most LLM integrations today remain remarkably fragile—often consisting of raw API calls wrapped in a simple try/except block and a prayer. As an SRE who has spent a decade building reliability into production systems at scale, I was shocked to see how the industry regressed when it came to AI infrastructure.

When you use a high-performance aggregator like n1n.ai, you gain access to multiple backends, but you still need a logic layer to handle failures gracefully. This post covers the architectural patterns that actually work in production.

Why LLM APIs Require a Different Reliability Strategy

Standard REST APIs are generally predictable. If a database is slow, it stays slow until the load decreases. LLM providers, however, introduce unique failure modes that traditional retry logic cannot handle effectively:

  1. Aggressive and Unpredictable Rate Limits: Unlike a standard SaaS API where you might have 1,000 requests per minute, LLM limits (TPM/RPM) can fluctuate based on the provider's internal cluster load. An OpenAI 429 can hit mid-conversation without any warning.
  2. Extreme Latency Variance: In a typical microservice, a 500ms response is "slow." In the LLM world, the same prompt might take 800ms one minute and 15 seconds the next, depending on the current queue depth for models like Claude 3.5 Sonnet or OpenAI o3.
  3. Frequent Regional Outages: Every major provider has experienced multi-hour outages in the past year. Relying on a single provider is a single point of failure (SPOF).
  4. Silent Quality Degradation: During peak load, some providers may route traffic to smaller, quantized versions of their models. The API returns a 200 OK, but the response quality drops significantly.

Most developers attempt to handle this with "Prayer-Based Reliability":

# The "prayer-based reliability" approach
import time

from openai import OpenAI

client = OpenAI()

def call_llm_with_hope(messages):
    for attempt in range(3):
        try:
            return client.chat.completions.create(
                model="gpt-4o",
                messages=messages,
            )
        except Exception:
            # Exponential backoff: 1s, 2s, 4s
            time.sleep(2 ** attempt)
    raise RuntimeError("All retries failed")

This is better than nothing, but it has critical flaws. You are forcing your user to wait through every retry (potentially 15+ seconds of latency). You are hammering a provider that is already struggling, contributing to the "thundering herd" problem. Most importantly, your system is not learning from these failures.

The Circuit Breaker Pattern

The circuit breaker pattern comes from electrical engineering. When the current exceeds safe levels, the breaker trips and stops the flow to prevent a fire. In software, it means: if a service is failing, stop sending it traffic immediately instead of waiting for each request to time out.

By integrating n1n.ai, you can easily switch between models when a circuit trips. The state machine for a robust circuit breaker looks like this:

  • CLOSED (Healthy): Requests flow normally. If a request fails, increment a failure counter.
  • OPEN (Broken): If the failure threshold is reached (e.g., 5 failures in 1 minute), the circuit trips to OPEN. All subsequent requests fail instantly without even calling the API. This protects your user experience by failing fast.
  • HALF-OPEN (Testing): After a "cool-down" period, the breaker allows exactly one probe request through. If it succeeds, the circuit closes. If it fails, the circuit reopens and the cool-down timer restarts.

Implementation in Python

import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, cool_down_seconds=30):
        self.state = "CLOSED"
        self.failure_count = 0
        self.threshold = failure_threshold
        self.cool_down = cool_down_seconds
        self.last_failure_time = None

    def can_execute(self):
        if self.state == "CLOSED":
            return True

        if self.state == "OPEN":
            # Check if cool-down period has passed
            elapsed = time.time() - self.last_failure_time
            if elapsed >= self.cool_down:
                self.state = "HALF_OPEN"
                return True  # Allow one probe request
            return False  # Fail fast

        if self.state == "HALF_OPEN":
            return False  # Only one probe at a time

        return False

    def record_success(self):
        self.failure_count = 0
        self.state = "CLOSED"

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.threshold:
            self.state = "OPEN"

The key insight here is that when a circuit is open, you fail in microseconds. Your user gets a fast fallback response instead of watching a loading spinner for 30 seconds while retries pile up against a dead endpoint.
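To make the fail-fast behavior concrete, here is a minimal harness that drives the breaker against a permanently failing endpoint. The CircuitBreaker class from above is repeated verbatim so the snippet runs standalone, and flaky_call is a stand-in for a dead provider:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, cool_down_seconds=30):
        self.state = "CLOSED"
        self.failure_count = 0
        self.threshold = failure_threshold
        self.cool_down = cool_down_seconds
        self.last_failure_time = None

    def can_execute(self):
        if self.state == "CLOSED":
            return True
        if self.state == "OPEN":
            if time.time() - self.last_failure_time >= self.cool_down:
                self.state = "HALF_OPEN"
                return True  # Allow one probe request
            return False  # Fail fast
        return False  # HALF_OPEN: a probe is already in flight

    def record_success(self):
        self.failure_count = 0
        self.state = "CLOSED"

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.threshold:
            self.state = "OPEN"

def flaky_call():
    # Simulated dead endpoint: every request times out
    raise TimeoutError("upstream timed out")

breaker = CircuitBreaker(failure_threshold=3, cool_down_seconds=30)
results = []

for _ in range(5):
    if not breaker.can_execute():
        # Served in microseconds -- no network call, no spinner
        results.append("fast_fallback")
        continue
    try:
        flaky_call()
        breaker.record_success()
        results.append("ok")
    except TimeoutError:
        breaker.record_failure()
        results.append("failed")

print(results)        # first 3 attempts fail, then the breaker short-circuits
print(breaker.state)  # OPEN
```

The first three attempts pay the full timeout cost; every attempt after that returns instantly from the open breaker until the cool-down expires.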

Failover Chains: The Multi-Model Strategy

A circuit breaker alone only tells you something is broken. You still need somewhere to send the traffic. This is where Failover Chains come into play. Instead of depending on a single model, you define an ordered list of targets.

Using a service like n1n.ai allows you to unify multiple providers (OpenAI, Anthropic, Google, DeepSeek) under a single interface, making failover chains trivial to implement.

failover_chain = [
    {"model": "gpt-4o", "provider": "openai"},
    {"model": "claude-3-5-sonnet", "provider": "anthropic"},
    {"model": "deepseek-v3", "provider": "deepseek"},
]

async def route_with_failover(messages, chain):
    route_trace = []

    for target in chain:
        model = target["model"]
        breaker = get_circuit_breaker(model)

        if not breaker.can_execute():
            route_trace.append({"model": model, "action": "skipped", "reason": "circuit_open"})
            continue

        try:
            start = time.monotonic()
            response = await call_model(model, messages)
            latency = time.monotonic() - start

            breaker.record_success()
            update_latency_tracker(model, latency)
            return response, route_trace

        except Exception as e:
            breaker.record_failure()
            route_trace.append({"model": model, "action": "failed", "reason": str(e)})
            continue

    raise AllProvidersFailedError(route_trace)

Dynamic Routing via Exponential Smoothing

Static failover chains are a great start, but model performance is dynamic. A model that was fast at 8:00 AM might be sluggish by noon. To solve this, we track real-time latency using Exponential Smoothing. Simple averaging is insufficient because it weights yesterday's data as heavily as today's.

def update_latency(model, new_latency_ms):
    alpha = 0.2  # Weight for the newest measurement
    current_avg = get_avg_latency(model)

    # Recent measurements matter more
    smoothed = (current_avg * (1 - alpha)) + (new_latency_ms * alpha)
    set_avg_latency(model, smoothed)
    return smoothed

With an alpha of 0.2, a recovering model's tracked average moves roughly three-quarters of the way to its new latency within 5-6 requests (the stale value retains only 0.8^6 ≈ 26% of its weight). This allows your routing logic to automatically pick the fastest healthy model at any given moment.
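You can verify that convergence speed in a few lines. The starting average of 15 seconds and the recovered latency of 800 ms are illustrative numbers, not measurements:

```python
def smooth(current_avg, new_latency_ms, alpha=0.2):
    # Exponentially weighted moving average: recent samples dominate
    return current_avg * (1 - alpha) + new_latency_ms * alpha

avg = 15000.0  # model was answering in 15 s during an incident
for _ in range(6):
    avg = smooth(avg, 800.0)  # incident ends; fresh samples arrive at 800 ms

print(round(avg))  # ~4522 ms: about 74% of the way back down after 6 samples
```

A plain running average over the same history would still be dominated by the incident-era samples, which is exactly why smoothing is used here.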

Advanced Weighted Routing

In a production environment, you don't always want the fastest model. Sometimes you want the cheapest, or the most reliable for a specific task like RAG (Retrieval-Augmented Generation). You can define weight vectors for different business goals:

ROUTING_STRATEGIES = {
    "balanced":    {"success_rate": 0.4, "latency": 0.3, "cost": 0.3},
    "speed":       {"success_rate": 0.2, "latency": 0.6, "cost": 0.2},
    "cost":        {"success_rate": 0.2, "latency": 0.2, "cost": 0.6},
}

def score_model(model, strategy="balanced"):
    weights = ROUTING_STRATEGIES[strategy]
    stats = get_model_stats(model)

    # Normalize metrics into [0, 1] (higher is better); the thresholds
    # are config values capping the worst acceptable latency and cost
    inv_latency = 1.0 - min(stats["avg_latency"] / max_latency_threshold, 1.0)
    inv_cost = 1.0 - min(stats["avg_cost"] / max_cost_threshold, 1.0)

    score = (
        weights["success_rate"] * stats["success_rate"] +
        weights["latency"] * inv_latency +
        weights["cost"] * inv_cost
    )
    return score

This architecture allows you to set optimization_goal: "cost" for background batch jobs and optimization_goal: "speed" for your real-time chat interface, all using the same underlying infrastructure.
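Here is a self-contained sketch of strategy-driven selection. The STATS table, the normalization ceilings (MAX_LATENCY_MS, MAX_COST), and all per-model numbers are hypothetical; in production they would come from the latency tracker and your billing data:

```python
MAX_LATENCY_MS = 10000.0  # assumed ceiling: worst acceptable latency
MAX_COST = 0.05           # assumed ceiling: worst acceptable $ per request

ROUTING_STRATEGIES = {
    "balanced": {"success_rate": 0.4, "latency": 0.3, "cost": 0.3},
    "speed":    {"success_rate": 0.2, "latency": 0.6, "cost": 0.2},
    "cost":     {"success_rate": 0.2, "latency": 0.2, "cost": 0.6},
}

# Hypothetical rolling stats, as the tracker above would accumulate them
STATS = {
    "gpt-4o":            {"success_rate": 0.99, "avg_latency": 1800, "avg_cost": 0.030},
    "claude-3-5-sonnet": {"success_rate": 0.98, "avg_latency": 2400, "avg_cost": 0.024},
    "deepseek-v3":       {"success_rate": 0.96, "avg_latency": 5200, "avg_cost": 0.004},
}

def score_model(model, strategy="balanced"):
    w = ROUTING_STRATEGIES[strategy]
    s = STATS[model]
    # Normalize into [0, 1] so higher is always better
    inv_latency = 1.0 - min(s["avg_latency"] / MAX_LATENCY_MS, 1.0)
    inv_cost = 1.0 - min(s["avg_cost"] / MAX_COST, 1.0)
    return (w["success_rate"] * s["success_rate"]
            + w["latency"] * inv_latency
            + w["cost"] * inv_cost)

def pick(strategy):
    return max(STATS, key=lambda m: score_model(m, strategy))

print(pick("speed"))  # the low-latency model wins
print(pick("cost"))   # the cheap model wins
```

With these (invented) numbers, the "speed" strategy selects gpt-4o while the "cost" strategy selects deepseek-v3 -- the same infrastructure, two different business goals.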

Replay Testing: The Safety Net

The most overlooked pattern is Replay Testing. Before deploying a change to your routing policy, you should replay historical traffic through the new algorithm to see how it would have performed. This prevents the dreaded "I updated the config and now latency doubled" scenario.

By logging every request's model, latency, and success status, you can simulate a week's worth of traffic in seconds. This gives your team the confidence to iterate on infrastructure without risking the production environment.
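A minimal replay harness might look like the sketch below. The log schema of (model, latency_ms, success) tuples and the candidate policy are illustrative assumptions, not a prescribed format:

```python
# Hypothetical request log: every production call recorded as
# (model, latency_ms, success)
HISTORY = [
    ("gpt-4o", 1200, True),
    ("gpt-4o", 9000, False),
    ("claude-3-5-sonnet", 2100, True),
    ("gpt-4o", 1100, True),
    ("claude-3-5-sonnet", 2300, True),
]

def replay(history, policy):
    """Simulate a candidate routing policy against recorded traffic."""
    served = failures = total_latency = 0
    for model, latency_ms, success in history:
        if not policy(model):
            continue  # the candidate policy would not have routed here
        served += 1
        total_latency += latency_ms
        if not success:
            failures += 1
    return {
        "served": served,
        "failures": failures,
        "avg_latency": total_latency / served if served else None,
    }

# Candidate change: stop routing to gpt-4o entirely. What would have happened?
report = replay(HISTORY, policy=lambda model: model != "gpt-4o")
print(report)
```

Running candidate policies over last week's log like this turns "I think this config is better" into a number you can compare before anything ships.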

Summary

Building a resilient AI product requires moving beyond simple API calls. By implementing circuit breakers, you protect your users from hung requests. By using failover chains, you eliminate single points of failure. And by utilizing dynamic routing, you optimize for both cost and performance.

Integrating these patterns manually can be complex, which is why developers choose n1n.ai. It provides the high-speed, multi-model backbone necessary to implement these SRE patterns with minimal overhead. Whether you are building with LangChain, LlamaIndex, or custom code, reliability should be a first-class citizen in your AI stack.

Get a free API key at n1n.ai