Database Failover Strategies Compared
Issue #140: System Design Interview Roadmap • Section 5: Reliability & Resilience
When Your Database Goes Dark at 3 AM
Your payment service has just processed $2M in transactions when alerts flood your phone: the primary database is unresponsive. Half your team scrambles to manually promote a replica while the other half argues over whether to wait for the automated system. Meanwhile, every second of downtime costs $50K in lost revenue.
This scenario plays out monthly across tech companies because database failover isn’t about deciding whether to fail over—it’s about understanding when, how fast, and what you’re willing to sacrifice. Today we’ll dissect six failover strategies, expose their hidden trade-offs, and show you exactly when each approach saves or destroys your system.
The Failover Trilemma: Pick Your Poison
Every failover strategy navigates three competing forces:
Recovery Time Objective (RTO): How fast can you restore service? Manual failover: 15-60 minutes. Automatic: 30-60 seconds. The difference between losing $50K or $50M.
Recovery Point Objective (RPO): How much data can you lose? Synchronous replication: zero data loss but 50% write throughput penalty. Asynchronous: risk losing 5-30 seconds of transactions during failover.
Operational Complexity: Simple systems fail predictably. Complex systems fail mysteriously. Adding automation reduces RTO but increases the blast radius when automation fails.
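The trilemma numbers above can be turned into a rough back-of-the-envelope model. The sketch below is illustrative only: the $50K-per-second figure comes from the opening scenario, while the write rate and average transaction value are hypothetical placeholders you would replace with your own metrics.

```python
# Rough cost model for the RTO/RPO trade-off described above.
# COST_PER_SECOND comes from the opening scenario; write_tps and
# avg_txn_value below are hypothetical illustration values.

COST_PER_SECOND = 50_000  # $50K lost per second of downtime


def failover_cost(rto_seconds: float, rpo_seconds: float,
                  write_tps: float, avg_txn_value: float) -> dict:
    """Estimate revenue lost to downtime (RTO) and transactions lost
    to replication lag (RPO) for a given failover strategy."""
    lost_txns = rpo_seconds * write_tps
    return {
        "downtime_cost": rto_seconds * COST_PER_SECOND,
        "lost_transactions": lost_txns,
        "lost_txn_value": lost_txns * avg_txn_value,
    }


# Manual failover (15 min RTO) vs. automatic (45 s RTO),
# both with async replication losing ~10 s of writes.
manual = failover_cost(rto_seconds=15 * 60, rpo_seconds=10,
                       write_tps=1_000, avg_txn_value=80.0)
auto = failover_cost(rto_seconds=45, rpo_seconds=10,
                     write_tps=1_000, avg_txn_value=80.0)

print(manual["downtime_cost"])  # 45000000  -> $45M for a 15-minute outage
print(auto["downtime_cost"])    # 2250000   -> $2.25M for a 45-second outage
```

Even with generous assumptions, the model makes the article's point concrete: shaving RTO from minutes to seconds dominates the cost equation, while the RPO term stays fixed for a given replication mode.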
The insight most engineers miss: You don’t choose one strategy—you layer them. Stripe uses automatic failover for reads, semi-automatic for writes, and manual override for edge cases. This layered approach handles 99.9% of failures automatically while preventing automation from making catastrophic decisions during complex scenarios.
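The layered approach can be sketched as a small decision policy. This is a minimal illustration, not Stripe's actual implementation: the failure categories and action strings are hypothetical names chosen for clarity.

```python
# Sketch of a layered failover policy: automatic for read failures,
# semi-automatic for write failures, manual override for ambiguous cases.
# All names here are illustrative, not from any real system.
from enum import Enum, auto


class FailureKind(Enum):
    REPLICA_READ = auto()      # a read replica stops responding
    PRIMARY_WRITE = auto()     # the write primary stops responding
    SPLIT_BRAIN_RISK = auto()  # conflicting health signals / partition


def failover_action(kind: FailureKind, operator_approved: bool = False) -> str:
    """Return the next action under a layered failover policy."""
    if kind is FailureKind.REPLICA_READ:
        # Low blast radius: safe to automate fully.
        return "auto: reroute reads to healthy replicas"
    if kind is FailureKind.PRIMARY_WRITE:
        # Semi-automatic: automation stages the promotion,
        # but a human confirms before writes move.
        return ("promote standby to primary" if operator_approved
                else "page on-call: promotion staged, awaiting approval")
    # Ambiguous signals: automation must not guess.
    return "manual: freeze automation, hand control to operators"
```

The key design choice mirrors the text: the riskier the action (rerouting reads → promoting a primary → resolving a possible split-brain), the more human judgment is kept in the loop.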
Six Strategies Dissected

