Health Checking in Distributed Systems
Issue #122: System Design Interview Roadmap • Reliability & Resilience
What We'll Learn Today
Health Check Types: Shallow, deep, and dependency-aware checks
Failure Propagation: How unhealthy services cascade through systems
Circuit Integration: Health checks as circuit breaker triggers
Observability Patterns: Monitoring health across service meshes
The Silent Killer of Distributed Systems
Your payment service appears healthy: CPU at 30%, memory stable, responding to pings. Yet customers can't complete purchases. The database connection pool is exhausted, but your health check only validates the HTTP listener. This scenario reveals health checking's hidden complexity: distinguishing between technical availability (the process is up and answering requests) and business capability (the service can actually complete the work it exists to do).
Netflix discovered this during a major outage when its recommendation service passed all health checks while silently failing to connect to its user data store. Traffic kept flowing to broken instances for 23 minutes because shallow health checks missed the critical dependency failure.
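To make the gap concrete, here is a minimal Python sketch of the payment-service scenario: a shallow check that only confirms the process can answer, next to a deep check that tries to obtain a database connection. The names ConnectionPool, HealthResult, shallow_check, and deep_check are hypothetical, and the pool is a toy stand-in built on a semaphore rather than any real driver's API.

```python
import threading
from dataclasses import dataclass


class ConnectionPool:
    """Toy stand-in for a database connection pool (hypothetical, for illustration)."""

    def __init__(self, size: int) -> None:
        self._slots = threading.BoundedSemaphore(size)

    def try_acquire(self, timeout: float) -> bool:
        # Returns False if no connection frees up within the timeout.
        return self._slots.acquire(timeout=timeout)

    def release(self) -> None:
        self._slots.release()


@dataclass
class HealthResult:
    healthy: bool
    detail: str


def shallow_check() -> HealthResult:
    # Only proves the process is up and able to answer a request.
    return HealthResult(True, "listener responding")


def deep_check(pool: ConnectionPool, timeout: float = 0.05) -> HealthResult:
    # Proves the service can obtain the dependency it needs to do real work.
    if not pool.try_acquire(timeout):
        return HealthResult(False, "database connection pool exhausted")
    pool.release()
    return HealthResult(True, "database connection available")


# Simulate the scenario: every connection is checked out and never returned.
pool = ConnectionPool(size=2)
pool.try_acquire(0.05)
pool.try_acquire(0.05)

print(shallow_check())   # reports healthy even though no purchase can succeed
print(deep_check(pool))  # reports unhealthy because the pool is exhausted
```

The short timeout in deep_check is a deliberate choice in this sketch: a health check that blocks for a long time on a struggling dependency can itself pile up and add load to an instance that is already in trouble.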