System Design Interview Roadmap

Redundancy Patterns in System Design

Issue #114: System Design Interview Roadmap • Section 5: Reliability & Resilience

Sumedh
Aug 10, 2025

When Netflix Stayed Online During the Great AWS Outage

In April 2011, when a major outage in AWS's US-East region knocked many well-known sites offline for hours, most affected companies watched helplessly as their services vanished. Netflix? Its streaming kept running for millions of subscribers with little visible disruption. The secret wasn't luck. It was redundancy built so deeply into the architecture that losing a large slice of AWS infrastructure was just another Tuesday.

Today, you'll master the redundancy patterns that separate resilient systems from fragile ones, and build a working demonstration that shows exactly how these patterns prevent catastrophic failures.

What You'll Master Today

  • Core Redundancy Patterns: Active-Active, Active-Passive, and N+1 redundancy strategies

  • Geographic Distribution: Multi-region deployments that survive datacenter failures

  • Data Redundancy: Replication strategies that minimize data loss when a node or site fails

  • Service Redundancy: Load balancing and failover mechanisms

  • Implementation: Build a resilient system with automatic failover

The Redundancy Spectrum: Beyond Simple Duplication

Most engineers think redundancy means "run two of everything." This oversimplification leads to expensive systems that still fail catastrophically. True redundancy requires understanding failure correlation and designing around it.
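To see why failure correlation matters, here is a rough back-of-the-envelope sketch. It assumes a deliberately simplified model that is not from the original text: each replica is independently available with probability a, and a shared dependency (the same power feed, the same bad deploy, the same DNS provider) fails with probability shared_failure and takes every replica down at once. The function names are illustrative only.

```python
def parallel_availability(a: float, n: int) -> float:
    """Availability of n replicas when their failures are fully independent.

    The group is down only if every replica is down at the same time.
    """
    return 1 - (1 - a) ** n


def with_common_cause(a: float, n: int, shared_failure: float) -> float:
    """Same group, but a shared dependency can take all replicas down together.

    Assumption: the shared dependency fails independently of the replicas.
    """
    return (1 - shared_failure) * parallel_availability(a, n)


if __name__ == "__main__":
    single = parallel_availability(0.99, 1)
    pair = parallel_availability(0.99, 2)
    correlated_pair = with_common_cause(0.99, 2, shared_failure=0.005)
    print(f"one replica:               {single:.6f}")
    print(f"two independent replicas:  {pair:.6f}")
    print(f"two replicas, shared risk: {correlated_pair:.6f}")
```

Under this model the group can never be more available than its shared dependency, no matter how many replicas you add. That is the sense in which "two of everything" can be expensive and still fragile.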

Active-Active: The Performance Multiplier

In Active-Active redundancy, all instances serve traffic simultaneously. This pattern doesn't just provide redundancy—it multiplies your system's capacity while eliminating single points of failure.
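As a minimal sketch of the idea, the toy router below spreads requests round-robin across every healthy instance and simply skips instances marked as down. The class name ActiveActivePool and its methods are hypothetical, and a real load balancer would add health probes, connection draining, and weighting.

```python
from itertools import count


class ActiveActivePool:
    """Toy round-robin router in which every healthy instance serves traffic."""

    def __init__(self, instances):
        self.instances = list(instances)
        self.healthy = set(self.instances)
        self._counter = count()

    def mark_down(self, name):
        """Stop routing to an instance (e.g. after a failed health check)."""
        self.healthy.discard(name)

    def mark_up(self, name):
        """Resume routing to a known instance."""
        if name in self.instances:
            self.healthy.add(name)

    def route(self):
        """Return the instance that should serve the next request."""
        live = [i for i in self.instances if i in self.healthy]
        if not live:
            raise RuntimeError("no healthy instances available")
        return live[next(self._counter) % len(live)]


if __name__ == "__main__":
    pool = ActiveActivePool(["us-east", "us-west", "eu-west"])
    print([pool.route() for _ in range(3)])
    pool.mark_down("us-west")  # simulate a failure
    print([pool.route() for _ in range(4)])
```

Note that there is no separate failover step: the surviving instances were already serving traffic, so losing one only changes how the load is divided.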

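The Hidden Complexity: Maintaining data consistency across active instances. When more than one instance can accept writes, you need either conflict resolution or careful data partitioning so that each record has a single writer. Amazon's DynamoDB global tables, for example, are eventually consistent across regions and resolve concurrent updates with last-writer-wins, while Google's Spanner relies on tightly synchronized clocks (TrueTime) to provide strong, externally consistent transactions.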
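Last-writer-wins can be sketched in a few lines. This is an illustration of the general technique, not DynamoDB's actual implementation. The Versioned type, its fields, and the replica-id tie-break are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    """A value tagged with the write time and the replica that accepted it."""
    value: str
    timestamp: float
    replica_id: str


def lww_merge(a: Versioned, b: Versioned) -> Versioned:
    """Resolve a conflict by keeping the write with the latest timestamp.

    Ties are broken by replica_id so that every replica picks the same
    winner regardless of the order in which it sees the two writes.
    """
    return max(a, b, key=lambda v: (v.timestamp, v.replica_id))


if __name__ == "__main__":
    east = Versioned("shipping=express", timestamp=100.0, replica_id="us-east")
    west = Versioned("shipping=standard", timestamp=100.5, replica_id="us-west")
    print(lww_merge(east, west).value)
```

The merge is deterministic and order-independent, which is what lets replicas converge. The cost is that the losing write is silently discarded, and the outcome depends on how well the replicas' clocks agree.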

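Production Insight: Netflix runs Active-Active across three AWS regions. With each region normally serving about a third of global traffic, every region has to keep spare headroom: surviving the loss of one region requires each remaining region to handle at least 50% of global traffic, and sizing each region for 100% tolerates the loss of two. When one region fails, traffic is shifted to the others with little or no user impact.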
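The headroom arithmetic behind a multi-region Active-Active setup can be sketched as follows. The function names are hypothetical, and the model assumes traffic is split evenly across regions and redistributed evenly among the survivors.

```python
def required_capacity_fraction(regions: int, tolerated_failures: int = 1) -> float:
    """Share of global traffic each region must be able to carry.

    If `tolerated_failures` regions are lost, the survivors split 100% of
    the traffic evenly between them.
    """
    survivors = regions - tolerated_failures
    if survivors <= 0:
        raise ValueError("at least one region must survive")
    return 1 / survivors


def normal_utilization(regions: int, tolerated_failures: int = 1) -> float:
    """How busy each region is on a normal day, relative to its capacity."""
    normal_share = 1 / regions
    return normal_share / required_capacity_fraction(regions, tolerated_failures)


if __name__ == "__main__":
    for failures in (1, 2):
        cap = required_capacity_fraction(3, failures)
        util = normal_utilization(3, failures)
        print(f"3 regions, tolerate {failures}: capacity {cap:.0%}, utilization {util:.0%}")
```

The trade-off is visible in the utilization figure: the more simultaneous region failures you want to ride out, the more capacity sits idle on a normal day.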
