The circuit breaker pattern is a critical resilience mechanism in distributed systems. Like its electrical namesake, a software circuit breaker detects failures and stops the flow of requests to a failing service.
This gives the downstream service time to recover while preventing resource exhaustion in calling services. However, circuit breakers add operational complexity that requires careful monitoring.
What is the Circuit Breaker Pattern?
The circuit breaker pattern implements three states that control request flow to a protected resource.
The Three States
                                ┌─────────────────┐
                                │                 │
                                ▼                 │
┌──────┐   failures exceed   ┌──────┐             │
│CLOSED│──────threshold─────►│ OPEN │             │
└──────┘                     └──────┘             │
   ▲                            │                 │
   │                         timeout              │
   │                         expires              │
   │                            │                 │
   │   probes     ┌─────────┐   │                 │
   └───succeed────│HALF-OPEN│◄──┘                 │
                  └─────────┘                     │
                       │                          │
                  probes fail                     │
                       └──────────────────────────┘
| State | Behavior |
|---|---|
| Closed | Requests pass through; failures are monitored |
| Open | Requests fail immediately; no downstream calls |
| Half-Open | Limited probe requests test recovery |
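The three states can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not how any particular library implements the pattern: the class name `SimpleCircuitBreaker`, its option names, and the injectable `now` clock are all hypothetical, and it trips on consecutive failures for simplicity, whereas production breakers more often use an error rate over a window.

```javascript
// Minimal circuit breaker state machine (hypothetical names throughout).
class SimpleCircuitBreaker {
  constructor({ failureThreshold = 3, resetTimeoutMs = 30000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
    this.state = 'closed';
    this.failures = 0;
    this.openedAt = 0;
  }

  // Returns true if a request may be sent downstream.
  allowRequest() {
    if (this.state === 'open' && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = 'halfOpen'; // timeout expired: let a probe through
    }
    return this.state !== 'open';
  }

  // A success closes the circuit and resets the consecutive-failure count.
  recordSuccess() {
    this.failures = 0;
    this.state = 'closed';
  }

  // A failed probe reopens immediately; otherwise open at the threshold.
  recordFailure() {
    this.failures += 1;
    if (this.state === 'halfOpen' || this.failures >= this.failureThreshold) {
      this.state = 'open';
      this.openedAt = this.now();
    }
  }
}
```

A caller would check `allowRequest()` before each downstream call and report the outcome with `recordSuccess()` or `recordFailure()`. The sketch does not cap the number of concurrent half-open probes, which a real implementation would need to do.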
Popular Implementations
- Netflix Hystrix (legacy)
- Resilience4j (Java)
- Polly (.NET)
- Istio (service mesh level)
Why Circuit Breaker Monitoring Matters
Protection vs. Outage Risk
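Circuit breakers are designed to protect your system, but a misconfigured breaker can cause an outage of its own. Thresholds that are too sensitive cut off a healthy dependency, while thresholds that are too lenient never trip and provide no protection.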
State Transitions as Health Signals
The state transitions of circuit breakers provide valuable signals:
| Pattern | Potential Issue |
|---|---|
| Frequent trips | Flaky dependency needs attention |
| Slow recovery | Half-open probe settings too conservative |
| Never opens | Thresholds too lenient to provide protection |
Interaction with Other Patterns
Circuit breakers interact with other reliability mechanisms:
- Retries: Retried calls can each count as a failure, so a circuit might open faster than expected
- Timeouts: Determine how long a slow call takes to register as a failure
- Bulkheads: Cap concurrency, which can limit the requests available for half-open probing
Understanding the complete picture requires correlated monitoring.
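The retry interaction is easy to underestimate, so a short worked example may help. The sketch assumes a count-based failure threshold, a breaker that wraps each individual attempt (so retries happen outside it), and a dependency where every attempt fails. `logicalRequestsToTrip` is a hypothetical helper, not part of any library.

```javascript
// How many failing logical requests it takes to reach the failure threshold
// when every attempt, including retries, is counted by the breaker.
function logicalRequestsToTrip(failureThreshold, maxRetries) {
  const attemptsPerRequest = maxRetries + 1; // first try plus retries
  return Math.ceil(failureThreshold / attemptsPerRequest);
}

console.log(logicalRequestsToTrip(10, 0)); // 10
console.log(logicalRequestsToTrip(10, 3)); // 3
```

Under these assumptions, three retries mean the circuit opens after 3 failing user requests instead of 10, which is why retry and breaker settings are worth reviewing together.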
Incident Context
During incidents, circuit breaker status provides critical context. Knowing which circuits are open helps operators:
- Understand the scope of impact
- Guide recovery priorities
- Identify root cause vs. symptoms
How to Monitor Circuit Breakers
Instrument State Transitions
Every transition should emit an event with context:
circuitBreaker.on('stateChange', (oldState, newState) => {
  metrics.emit('circuit_breaker_transition', {
    circuit: circuitBreaker.name,
    from: oldState,
    to: newState,
    timestamp: Date.now(),
    failureCount: circuitBreaker.stats.failures,
    errorRate: circuitBreaker.stats.errorRate
  });
});
Track:
- Which circuit transitioned
- What triggered the transition
- Timestamp
- Relevant metrics at transition time
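Once transitions are recorded with timestamps, the time spent in each state can be derived afterwards. The following is a minimal sketch, assuming the events belong to a single circuit, carry the `to` and `timestamp` fields shown above, and are sorted by time. `timeInStateMs` is a hypothetical helper.

```javascript
// Sum the time a circuit spent in `state`, given its transition events.
// Each event marks the moment the circuit entered `event.to`.
function timeInStateMs(transitions, state, windowEndMs) {
  let total = 0;
  for (let i = 0; i < transitions.length; i++) {
    if (transitions[i].to !== state) continue;
    // The state lasts until the next transition, or the end of the window.
    const end = i + 1 < transitions.length ? transitions[i + 1].timestamp : windowEndMs;
    total += end - transitions[i].timestamp;
  }
  return total;
}
```

Dividing the result for `'open'` by the window length gives the fraction of time requests were being rejected, which is often easier to reason about than a raw count of trips.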
Track State as a Metric
Use numeric values for time-series visualization:
// State values for metrics
const STATE_VALUES = {
  closed: 0,
  halfOpen: 1,
  open: 2
};

setInterval(() => {
  Object.entries(circuitBreakers).forEach(([name, cb]) => {
    metrics.gauge('circuit_breaker_state', STATE_VALUES[cb.state], {
      circuit: name
    });
  });
}, 1000);
This enables correlation with other system metrics over time.
Monitor Underlying Metrics
Track the metrics that drive circuit breaker decisions:
const circuitBreaker = new CircuitBreaker(callService, {
  errorThresholdPercentage: 50,
  volumeThreshold: 10,
  timeout: 3000,
  // Expose metrics
  stats: {
    publish: (stats) => {
      metrics.gauge('circuit_breaker_error_rate', stats.errorRate, { circuit: 'payment' });
      metrics.gauge('circuit_breaker_latency_p99', stats.latency.p99, { circuit: 'payment' });
      metrics.counter('circuit_breaker_requests', stats.total, { circuit: 'payment' });
    }
  }
});
Understanding these metrics helps distinguish between:
- Necessary protection (downstream is truly failing)
- False positives (thresholds too sensitive)
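To judge whether a trip was justified, it helps to see how an error-rate threshold and a volume threshold typically combine. The sketch below is an assumption about what `errorThresholdPercentage` and `volumeThreshold` mean based on their names, not the decision logic of any specific library. `shouldTrip` and the `total` and `failures` fields are hypothetical.

```javascript
// Decide whether a window of requests should trip the circuit.
function shouldTrip({ total, failures }, { errorThresholdPercentage, volumeThreshold }) {
  // Too few requests to judge: a couple of failures at low traffic
  // should not open the circuit.
  if (total < volumeThreshold) return false;
  const errorRate = (failures / total) * 100;
  return errorRate >= errorThresholdPercentage;
}

const options = { errorThresholdPercentage: 50, volumeThreshold: 10 };
console.log(shouldTrip({ total: 4, failures: 4 }, options));  // false
console.log(shouldTrip({ total: 10, failures: 5 }, options)); // true
console.log(shouldTrip({ total: 20, failures: 9 }, options)); // false
```

Replaying recorded request and failure counts through a function like this can show whether a given trip would have happened under a different threshold, which is one way to spot settings that are too sensitive.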
Measure Fallback Behavior
Track what happens when circuits are open:
circuitBreaker.fallback(async () => {
  metrics.increment('circuit_breaker_fallback_executed', { circuit: 'payment' });
  // requestKey is assumed to identify the current request in the cache
  const cachedResult = await cache.get(requestKey);
  if (cachedResult) {
    metrics.increment('circuit_breaker_fallback_cache_hit', { circuit: 'payment' });
    return cachedResult;
  }
  metrics.increment('circuit_breaker_fallback_cache_miss', { circuit: 'payment' });
  throw new Error('Service unavailable');
});
If fallbacks return cached data, track cache hit rates and staleness.
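Staleness can only be measured if cached entries remember when they were stored. The sketch below assumes each cache entry is wrapped as `{ value, storedAt }`. The `makeStalenessTracker` helper, its `maxAgeMs` option, and the entry shape are hypothetical and would need adapting to a real cache client.

```javascript
// Track fallback cache hit rate and how old the served entries are.
function makeStalenessTracker({ maxAgeMs, now = Date.now }) {
  const stats = { hits: 0, misses: 0, staleHits: 0 };
  return {
    stats,
    // `entry` is { value, storedAt } on a hit, or undefined on a miss.
    record(entry) {
      if (!entry) {
        stats.misses += 1;
        return { hit: false };
      }
      stats.hits += 1;
      const ageMs = now() - entry.storedAt;
      const stale = ageMs > maxAgeMs;
      if (stale) stats.staleHits += 1;
      return { hit: true, ageMs, stale };
    },
    // Fraction of fallback lookups served from cache.
    hitRate() {
      const total = stats.hits + stats.misses;
      return total === 0 ? 0 : stats.hits / total;
    },
  };
}
```

Calling `record()` with whatever the cache returned inside the fallback yields both counters and a per-request age, which could be emitted as a histogram alongside the hit and miss counters.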
Create Topology Dashboards
Visualize circuit breaker state across your service mesh:
// Build circuit breaker status for all services
function getCircuitBreakerStatus() {
  return services.map(service => ({
    name: service.name,
    circuits: service.circuitBreakers.map(cb => ({
      target: cb.targetService,
      state: cb.state,
      errorRate: cb.stats.errorRate,
      lastTransition: cb.lastTransitionTime
    }))
  }));
}
A topology view with circuit states highlighted helps operators quickly assess system resilience posture.
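One useful derived view groups open circuits by the service they target: a dependency that many callers have tripped on is a more likely root cause than one seen by a single caller. The sketch below assumes the input has the shape returned by `getCircuitBreakerStatus()` above and that open circuits report the state string `'open'`. `openCircuitsByTarget` is a hypothetical helper.

```javascript
// Group open circuits by target service, most widely affected first.
function openCircuitsByTarget(status) {
  const byTarget = new Map();
  for (const service of status) {
    for (const circuit of service.circuits) {
      if (circuit.state !== 'open') continue;
      const callers = byTarget.get(circuit.target) ?? [];
      callers.push(service.name);
      byTarget.set(circuit.target, callers);
    }
  }
  return [...byTarget.entries()]
    .map(([target, callers]) => ({ target, callers }))
    .sort((a, b) => b.callers.length - a.callers.length);
}
```

The first entry in the result is the target with the most callers currently rejecting requests, which is a reasonable place to start an investigation.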
Circuit Breaker Monitoring Best Practices
Alert on State Changes, Not Just Open Circuits
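A circuit that oscillates between states can signal a problem even if it never stays open for long. Count recent transitions and alert when they exceed a threshold:
const transitionThreshold = 5;
const timeWindow = 60000; // 1 minute

// `transitions` is assumed to hold the timestamps (ms) of this circuit's
// state changes; keep only those inside the window.
const recentTransitions = transitions.filter(
  (timestamp) => Date.now() - timestamp <= timeWindow
);

if (recentTransitions.length > transitionThreshold) {
  alerting.send('circuit_breaker_oscillating', {
    circuit: circuitName,
    transitions: recentTransitions.length,
    window: '1m'
  });
}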
Set Alert Delays Appropriately
Brief trips are often expected, so base alert severity on how long a circuit has stayed open:
# Alert rules
- alert: CircuitBreakerOpen
  expr: circuit_breaker_state == 2
  for: 2m  # Only alert if open for 2+ minutes
  labels:
    severity: warning
- alert: CircuitBreakerOpenCritical
  expr: circuit_breaker_state == 2
  for: 5m  # Escalate if open for 5+ minutes
  labels:
    severity: critical
Monitor Configuration Alongside Behavior
Track thresholds, timeouts, and probe settings as part of deployment verification:
metrics.gauge('circuit_breaker_config', 1, {
  circuit: 'payment',
  error_threshold: circuitBreaker.options.errorThresholdPercentage,
  timeout: circuitBreaker.options.timeout,
  reset_timeout: circuitBreaker.options.resetTimeout
});
Configuration drift can cause unexpected behavior changes.
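Drift can also be checked directly at deploy time by comparing the running options against the values the team expects. The following is a minimal sketch. `findConfigDrift` is a hypothetical helper, and the expected values would come from wherever the intended configuration is documented.

```javascript
// Report every option whose running value differs from the expected one.
function findConfigDrift(expected, actual) {
  const drift = [];
  for (const key of Object.keys(expected)) {
    if (actual[key] !== expected[key]) {
      drift.push({ option: key, expected: expected[key], actual: actual[key] });
    }
  }
  return drift;
}

const expected = { errorThresholdPercentage: 50, timeout: 3000, resetTimeout: 30000 };
const running = { errorThresholdPercentage: 50, timeout: 10000, resetTimeout: 30000 };
console.log(findConfigDrift(expected, running));
// [ { option: 'timeout', expected: 3000, actual: 10000 } ]
```

An empty result means the running configuration matches expectations. A non-empty one could fail deployment verification or raise a low-severity alert.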
Correlate with Deployments
Circuits that start opening after deployments might indicate introduced regressions:
deploymentTracker.onDeploy((deployment) => {
  metrics.annotation('deployment', {
    service: deployment.service,
    version: deployment.version
  });
});
Automatic correlation helps developers quickly identify whether their changes affected resilience.
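One simple form of that correlation is to list the deployments that landed shortly before a circuit opened. The sketch below assumes each deployment record carries a `timestamp` in milliseconds in addition to the `service` and `version` fields used above. `deploymentsBeforeTrip` and the 15-minute default lookback are hypothetical choices.

```javascript
// Find deployments that happened within `lookbackMs` before a circuit tripped,
// most recent first. Returns a new array; the input is not modified.
function deploymentsBeforeTrip(tripTimestampMs, deployments, lookbackMs = 15 * 60 * 1000) {
  return deployments
    .filter((d) => d.timestamp <= tripTimestampMs && tripTimestampMs - d.timestamp <= lookbackMs)
    .sort((a, b) => b.timestamp - a.timestamp);
}
```

A deployment appearing in this list is a candidate cause, not proof of one. The lookback window should roughly match how long a regression usually takes to surface in the error rate.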
Test with Chaos Engineering
Verify circuit breaker behavior through chaos engineering:
describe('Circuit Breaker Behavior', () => {
  it('should open when downstream fails', async () => {
    // Simulate downstream failure
    mockDownstream.failAll();
    // Make requests until circuit opens
    for (let i = 0; i < 20; i++) {
      try { await circuitBreaker.fire(); } catch {}
    }
    expect(circuitBreaker.state).toBe('open');
  });

  // Relies on the circuit still being open from the previous test
  it('should recover when downstream recovers', async () => {
    // Restore downstream
    mockDownstream.succeedAll();
    // Wait for half-open
    await sleep(circuitBreaker.options.resetTimeout);
    // Probe should succeed
    await circuitBreaker.fire();
    expect(circuitBreaker.state).toBe('closed');
  });
});
Monitoring validates that these tests reflect production behavior.
Document Expected Behavior
Document expected circuit breaker behavior for each protected dependency:
## Payment Service Circuit Breaker
**Expected behavior:**
- Opens after 50% error rate over 10 requests
- Stays open for 30 seconds before half-open
- Single successful probe closes circuit
**Known scenarios:**
- Opens briefly during payment provider maintenance windows (expected)
- Should NOT open during normal traffic (investigate if seen)
This helps operators distinguish between expected protection and unexpected failures.
Conclusion
Circuit breaker monitoring transforms resilience patterns from black-box protection to visible, manageable infrastructure. By tracking state transitions, underlying failure metrics, and fallback behavior, teams gain confidence that circuit breakers are protecting the system as intended.
Key Takeaways
- Monitor all state transitions with context
- Track the metrics that drive circuit breaker decisions
- Alert on patterns, not just individual states
- Test circuit breaker behavior with chaos engineering
Invest in comprehensive circuit breaker monitoring as part of your overall observability strategy. Well-monitored circuit breakers become trusted resilience mechanisms rather than mysterious sources of potential problems.