Without circuit breakers, cascading failures can take down your entire platform in under 90 seconds. Michael Nygard popularized the circuit breaker pattern in his book Release It! (first edition 2007, second edition 2018) to prevent exactly this scenario: when Service A depends on Service B, and B starts timing out, A's threads pile up waiting for responses and exhaust its connection pool. A then fails, and the failure cascades to Service C, which depends on A. The circuit breaker pattern breaks this chain by detecting when a dependency is struggling and short-circuiting calls to it, returning an immediate error or a cached fallback rather than waiting for a timeout.
The circuit breaker is a state machine with three states. Closed (normal operation): requests flow through, failures are counted. If failures exceed a threshold within a time window, the breaker trips to Open. Open (failure mode): all requests are immediately rejected with an error — no calls are made to the failing service. After a configured timeout, the breaker moves to Half-Open. Half-Open (probe): a limited number of test requests are allowed through. If they succeed, the breaker closes. If they fail, it returns to Open. This self-healing behavior is what makes the circuit breaker valuable — it automatically recovers when the downstream service recovers.
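The three states can be sketched as a small class. This is a minimal illustration rather than a production implementation: `SimpleCircuitBreaker`, `BreakerOptions`, and `BreakerOpenError` are hypothetical names, the breaker counts consecutive failures instead of a failure rate, and it does not cap the number of concurrent probes in Half-Open.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

interface BreakerOptions {
  failureThreshold: number; // consecutive failures that trip the breaker
  openMs: number;           // how long to stay Open before probing
  probesToClose: number;    // successful probes needed to close again
  now?: () => number;       // injectable clock (assumption: useful for tests)
}

class BreakerOpenError extends Error {}

class SimpleCircuitBreaker {
  private state: BreakerState = 'closed';
  private failures = 0;
  private probeSuccesses = 0;
  private openedAt = 0;
  private readonly opts: BreakerOptions;
  private readonly now: () => number;

  constructor(opts: BreakerOptions) {
    this.opts = opts;
    this.now = opts.now ?? (() => Date.now());
  }

  getState(): BreakerState {
    // Open -> Half-Open once the wait duration has elapsed
    if (this.state === 'open' && this.now() - this.openedAt >= this.opts.openMs) {
      this.state = 'half-open';
      this.probeSuccesses = 0;
    }
    return this.state;
  }

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.getState() === 'open') {
      // Open: reject immediately, never touch the downstream service
      throw new BreakerOpenError('circuit open: call rejected');
    }
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (err) {
      this.onFailure();
      throw err;
    }
  }

  private onSuccess(): void {
    if (this.state === 'half-open') {
      this.probeSuccesses += 1;
      if (this.probeSuccesses >= this.opts.probesToClose) {
        this.state = 'closed'; // Half-Open -> Closed
        this.failures = 0;
      }
    } else {
      this.failures = 0; // a success resets the consecutive-failure count
    }
  }

  private onFailure(): void {
    this.failures += 1;
    // Any failed probe, or too many failures while Closed, opens the circuit
    if (this.state === 'half-open' || this.failures >= this.opts.failureThreshold) {
      this.state = 'open';
      this.openedAt = this.now();
    }
  }
}
```

Passing the clock in as `now` keeps the Open to Half-Open transition testable without real waiting.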
Circuit breaker configuration requires three key numbers: failure rate threshold (percentage of failures that trips the breaker — typically 50%), wait duration in open state (how long to wait before trying half-open — typically 10-30 seconds), and permitted calls in half-open state (how many test requests to allow — typically 3-5). Getting these numbers right for your workload requires measurement: check your service's normal failure rate and response time distribution before setting thresholds. Too sensitive, and transient errors trip the breaker unnecessarily. Too lenient, and cascading failures propagate before the breaker activates.
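To make the failure rate threshold concrete, here is a sketch of a count-based sliding window. `FailureRateWindow` is a hypothetical name, and the `minimumCalls` guard is an added assumption (not one of the three numbers above) that stops a handful of early failures from tripping the breaker. The sketch trips at or above the threshold; whether to use "at" or "strictly above" is a design choice.

```typescript
class FailureRateWindow {
  private readonly outcomes: boolean[] = []; // true = failed call
  private readonly size: number;
  private readonly minimumCalls: number;
  private readonly thresholdPercent: number;

  constructor(size: number, minimumCalls: number, thresholdPercent: number) {
    this.size = size;
    this.minimumCalls = minimumCalls;
    this.thresholdPercent = thresholdPercent;
  }

  // Record one call outcome, keeping only the most recent `size` outcomes
  record(failed: boolean): void {
    this.outcomes.push(failed);
    if (this.outcomes.length > this.size) this.outcomes.shift();
  }

  failureRatePercent(): number {
    if (this.outcomes.length === 0) return 0;
    const failures = this.outcomes.filter((failed) => failed).length;
    return (failures / this.outcomes.length) * 100;
  }

  // Trip only when enough calls were observed AND the rate reaches the threshold
  shouldTrip(): boolean {
    return (
      this.outcomes.length >= this.minimumCalls &&
      this.failureRatePercent() >= this.thresholdPercent
    );
  }
}

// Example: window of 10 calls, at least 5 observed, 50% threshold
const rateWindow = new FailureRateWindow(10, 5, 50);
```

Measuring your normal failure rate first tells you how far above it `thresholdPercent` needs to sit.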
Circuit Breaker State Machine

              Failure rate > 50%
┌──────────┐ ───────────────────► ┌──────────┐
│  CLOSED  │                      │   OPEN   │
│ (normal) │ ◄─────────────────── │(rejected)│
└──────────┘ Success in half-open └────┬─────┘
     │                                 │
     │ Requests pass through           │ All requests rejected
     │ Failures counted                │ No calls to downstream
     │                                 │
     │                         After 30s timeout
     │                                 │
     │                          ┌──────▼──────┐
     │ ◄─ 5 test requests ok ── │  HALF-OPEN  │
     │    2 fail → back to Open │  (probing)  │
                                └─────────────┘
Cascading Failure WITHOUT Circuit Breaker:
  Service A times out (30s) → thread pool exhausted
  → Service B calls A → B times out → B thread pool exhausted
  → Service C calls B → C fails
  ENTIRE PLATFORM DOWN in < 90 seconds

WITH Circuit Breaker:
  A fails → CB opens → B gets immediate error → B uses fallback
  → C continues normally with cached/degraded response

From building the multi-level approval workflow in Commsult's ERP: wrap external service calls (email provider, PDF generator, payment gateway) with circuit breakers, but not internal service calls that share the same database. The circuit breaker is for protecting against network-level failures to external or remote dependencies. Internal module calls that fail are typically programming errors; they should surface immediately as errors, not be circuit-broken.
One option for NestJS is nestjs-resilience4j, which wraps the Java Resilience4j library's concepts in a NestJS-native API. Alternatively, the cockatiel library provides a lightweight TypeScript-native implementation with Circuit Breaker, Retry, and Timeout policies that compose well. For a NestJS service calling an external HTTP API, wrap the call with a circuit breaker policy; the library handles state tracking, error counting, and timeout logic. Expose the circuit state via a health check endpoint so your load balancer can route around an open-circuit service instance.
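// Using 'cockatiel' (functional API, v3+): lightweight TypeScript resilience policies
import { Get, Injectable, Logger } from '@nestjs/common';
import {
  CircuitState,
  ConsecutiveBreaker,
  ExponentialBackoff,
  TimeoutStrategy,
  circuitBreaker,
  handleAll,
  retry,
  timeout,
  wrap,
} from 'cockatiel';

// Compose, innermost first: Timeout → Retry → Circuit Breaker
export const emailBreaker = circuitBreaker(handleAll, {
  halfOpenAfter: 30_000, // 30s before probing
  breaker: new ConsecutiveBreaker(5), // open after 5 consecutive failures
});
const retryPolicy = retry(handleAll, {
  maxAttempts: 3,
  backoff: new ExponentialBackoff(),
});
// Aggressive rejects after 5s even if the wrapped call ignores the abort signal;
// Cooperative only signals cancellation and waits for the call to honor it.
const timeoutPolicy = timeout(5_000, TimeoutStrategy.Aggressive);

// email.service.ts: wrapping an external SMTP API
// (smtpClient, templateService and retryQueue are injected via the constructor, omitted here)
@Injectable()
export class EmailService {
  private readonly logger = new Logger(EmailService.name);
  // wrap() takes the outermost policy first: breaker, then retry, then timeout
  private readonly policy = wrap(emailBreaker, retryPolicy, timeoutPolicy);

  async sendInvoice(data: InvoiceEmailData): Promise<void> {
    try {
      await this.policy.execute(() =>
        this.smtpClient.send({
          to: data.recipientEmail,
          subject: 'Invoice Ready',
          html: this.templateService.render('invoice', data),
        }),
      );
    } catch (error) {
      // Circuit is open OR all retries failed:
      // log and queue for a later retry via BullMQ
      this.logger.error('Email send failed, queuing for retry', { error, data });
      await this.retryQueue.add('email:retry', data, { delay: 60_000 });
    }
  }
}

// Health endpoint (a method inside a controller class): expose circuit state.
// pdfCircuitBreaker and paymentCircuitBreaker are created the same way as emailBreaker.
@Get('/health/circuits')
getCircuitHealth() {
  return {
    // state is a CircuitState enum value; map it to its name ('Closed', 'Open', 'HalfOpen', 'Isolated')
    emailService: CircuitState[emailBreaker.state],
    pdfService: CircuitState[pdfCircuitBreaker.state],
    paymentGateway: CircuitState[paymentCircuitBreaker.state],
  };
}

Circuit breakers work best in combination with retry policies and timeouts. A typical composition, from innermost to outermost: Timeout (fail fast if a response takes too long) → Retry (retry on transient errors) → Circuit Breaker (trip if calls still fail after retries). In code, outer policies wrap inner ones: circuitBreaker.execute(() => retry.execute(() => timeout.execute(() => apiCall()))). Be careful with how retries and the breaker interact, because the nesting order decides what the breaker counts. With the breaker outermost, as here, it records one failure per call, after the retries are exhausted. If the retry wraps the breaker instead, every attempt passes through the breaker and retries amplify the failure signal: with 3 attempts per call, three failed requests register as 9 failure events, enough to trip a threshold of 5 partway through the second request. Tune your retry count and circuit breaker threshold together.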
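To make the counting concrete, the sketch below simulates three failing requests with three attempts each and reports how many failures a breaker would record under each nesting order. `countingBreaker`, `withRetry`, and `simulate` are hypothetical helpers written for this illustration; the stand-in breaker only counts and never opens, so the raw numbers are visible.

```typescript
type Call<T> = () => T;

// Stand-in for a breaker: records every failure that passes through it
function countingBreaker() {
  let failures = 0;
  return {
    run<T>(fn: Call<T>): T {
      try {
        return fn();
      } catch (err) {
        failures += 1;
        throw err;
      }
    },
    get failures(): number {
      return failures;
    },
  };
}

// Calls fn up to `attempts` times, rethrowing the last error if all fail
function withRetry<T>(attempts: number, fn: Call<T>): T {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return fn();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

const alwaysFails = (): string => {
  throw new Error('downstream unavailable');
};

function simulate(requests: number, attempts: number, breakerOutside: boolean): number {
  const breaker = countingBreaker();
  for (let i = 0; i < requests; i++) {
    try {
      if (breakerOutside) {
        breaker.run(() => withRetry(attempts, alwaysFails)); // breaker sees the final outcome
      } else {
        withRetry(attempts, () => breaker.run(alwaysFails)); // breaker sees every attempt
      }
    } catch {
      // the caller would apply its fallback here
    }
  }
  return breaker.failures;
}

console.log(simulate(3, 3, true));  // 3
console.log(simulate(3, 3, false)); // 9
```

The same three requests produce either 3 or 9 failure events, which is why the threshold only means something once the nesting order is fixed.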
An open circuit breaker means something is wrong downstream — but it also means your service is silently dropping requests. Without monitoring, you'll discover an open circuit breaker when a user complains their request 'just returns an error instantly.' Instrument every circuit breaker with metrics: track state transitions (closed→open, open→half-open→closed), rejection count, and the error rate that triggered the trip. Alert when a circuit breaker transitions to Open state. In NestJS, expose circuit breaker state via a /health endpoint that your monitoring system polls.
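A small recorder along these lines could back those metrics. It is a sketch under assumptions: `BreakerMetrics` is a hypothetical class, the `alert` callback stands in for whatever paging or alerting hook you use, and you would call `recordTransition` and `recordRejection` from your breaker library's state-change and rejection hooks.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

class BreakerMetrics {
  // Keyed by "from->to", for example "closed->open"
  readonly transitions = new Map<string, number>();
  rejections = 0;
  private readonly alert: (message: string) => void;

  constructor(alert: (message: string) => void) {
    this.alert = alert;
  }

  recordTransition(name: string, from: BreakerState, to: BreakerState): void {
    const key = `${from}->${to}`;
    this.transitions.set(key, (this.transitions.get(key) ?? 0) + 1);
    // Alert on every transition into Open, as recommended above
    if (to === 'open') {
      this.alert(`circuit "${name}" opened (was ${from})`);
    }
  }

  // Call once for every request the open breaker rejects
  recordRejection(): void {
    this.rejections += 1;
  }
}
```

In a real service these counters would feed your metrics backend rather than live in memory.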
When the circuit breaker rejects a request, what should your service return? The fallback strategy depends on the business criticality of the downstream call. Options: (1) Cached response — return the last successful response from a cache (good for read-only data that doesn't change rapidly); (2) Default response — return a safe default (empty list, zero balance); (3) Degraded response — return partial data from a different, available source; (4) Error response — return a structured error with context (503 Service Unavailable with Retry-After header). Avoid silent success returns (returning an empty success that hides the failure from the caller).
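Two of these options, cached response and structured error, can be combined in one helper. This is a sketch: `fallbackFor` and `FallbackResponse` are hypothetical names, and the `X-Degraded` header is an illustrative convention for flagging stale data, not a standard header.

```typescript
interface FallbackResponse<T> {
  status: number;
  headers: Record<string, string>;
  body: T | { error: string; retryAfterSeconds: number };
}

// Prefer the last known good value; otherwise fail loudly with a 503
function fallbackFor<T>(cached: T | undefined, retryAfterSeconds: number): FallbackResponse<T> {
  if (cached !== undefined) {
    return {
      status: 200,
      headers: { 'X-Degraded': 'cached' }, // tell the caller this data may be stale
      body: cached,
    };
  }
  return {
    status: 503,
    headers: { 'Retry-After': String(retryAfterSeconds) },
    body: { error: 'dependency_unavailable', retryAfterSeconds },
  };
}
```

Either branch tells the caller that something is degraded, which is the opposite of a silent success.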
One effective deployment option is putting circuit breakers in your API gateway: the gateway tracks the health of each downstream service and trips the breaker when a service's error rate spikes. This gives you protection at the perimeter without requiring every service to implement its own breakers. Kong provides this through passive health checks on upstream targets, which its documentation also calls circuit breakers; AWS API Gateway does not ship a built-in circuit breaker, so there the breaker has to live in the service or integration behind the gateway. In a NestJS gateway, implement it using an interceptor that tracks response codes and response times per upstream service, tripping a per-service circuit when thresholds are crossed. This way, a degraded service doesn't affect your gateway's overall response time.
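The per-upstream bookkeeping such an interceptor needs might look like the sketch below. `UpstreamBreakers` is a hypothetical class, it treats any 5xx status as a failure, and it simplifies Half-Open: once the open period ends, requests are allowed again, and a single further failure reopens the circuit. Response-time tracking is left out.

```typescript
interface UpstreamHealth {
  consecutiveFailures: number;
  openUntil: number; // timestamp (ms) until which calls are rejected
}

class UpstreamBreakers {
  private readonly health = new Map<string, UpstreamHealth>();
  private readonly threshold: number;
  private readonly openMs: number;

  constructor(threshold: number, openMs: number) {
    this.threshold = threshold;
    this.openMs = openMs;
  }

  // Check before proxying: is this upstream's circuit currently closed?
  allow(upstream: string, now: number): boolean {
    const h = this.health.get(upstream);
    return h === undefined || now >= h.openUntil;
  }

  // Call after each proxied response with the upstream's status code
  record(upstream: string, statusCode: number, now: number): void {
    const h = this.health.get(upstream) ?? { consecutiveFailures: 0, openUntil: 0 };
    if (statusCode >= 500) {
      h.consecutiveFailures += 1;
      if (h.consecutiveFailures >= this.threshold) {
        h.openUntil = now + this.openMs; // trip (or re-trip) this upstream only
      }
    } else {
      h.consecutiveFailures = 0;
      h.openUntil = 0;
    }
    this.health.set(upstream, h);
  }
}
```

Because state is keyed by upstream name, one failing service trips only its own circuit and the others keep serving traffic.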
Circuit breakers are not useful if you can't verify they work. Write integration tests that inject failures into the downstream dependency (mock the HTTP call to throw errors) and verify: (1) the breaker transitions to Open after the threshold; (2) requests are rejected immediately in Open state; (3) the breaker transitions to Half-Open after the wait duration; (4) the breaker closes after successful test requests in Half-Open. Also chaos test in staging — use a tool like Chaos Monkey or simple firewall rules to make a dependency unavailable and verify your circuit breakers activate and your system degrades gracefully rather than failing completely.