Rolling Deployments: Strategies and Patterns

Issue #125: System Design Interview Roadmap • Section 5: Reliability & Resilience

Sep 12, 2025

What You'll Master Today

Rolling deployment mechanics and why they're superior to big-bang releases
Advanced rollout strategies including canary integration and traffic shaping
Hidden failure patterns that can turn gradual deployments into system-wide outages
Production-grade health checking that prevents bad deployments from spreading

The Gradual Revolution

Rolling deployments represent the difference between jumping off a cliff and walking down a slope. Instead of replacing all instances simultaneously (big-bang deployment), you update them incrementally, maintaining service availability while monitoring each step.
Think of it like renovating a busy restaurant—you keep serving customers while upgrading one section at a time, ensuring you can always fall back to the working areas if something goes wrong.

Core Strategies: Beyond Simple Instance Replacement

1. Percentage-Based Rollouts

Most teams start with simple "replace 25% every 5 minutes" strategies, but production systems require more sophistication:

Rolling Strategy:
  maxUnavailable: 1        # Never lose more than 1 instance
  maxSurge: 1             # Add 1 extra during transition
  updateBatch: 20%        # Update 20% at a time
  healthCheckGracePeriod: 60s

The hidden insight: batch size directly impacts blast radius. Large batches deploy faster but spread failures wider. Netflix uses dynamic batch sizing—starting with 1% for new services, scaling to 25% for proven stable services.

2. Dependency-Aware Sequencing

Rolling deployments must respect service dependencies. Updating a database schema before application code can break everything in between.

The pattern: dependency-first ordering where downstream services update before upstream consumers. Kubernetes operators automate this through resource ordering and readiness gates.

3. Traffic-Aware Scheduling

Smart rolling deployments consider traffic patterns. Updating instances during peak hours reduces available capacity when you need it most.

Production insight: time-box deployments around traffic valleys. Spotify deploys during European nights to minimize user impact, using automated scheduling based on traffic forecasts.

Advanced Patterns: Where Rolling Deployments Excel

Circuit Breaker Integration

Rolling deployments should integrate with circuit breakers to prevent bad code from cascading. When new instances fail health checks, circuit breakers can automatically route traffic back to stable versions.

def rolling_health_check(instance):
    if circuit_breaker.is_open():
        return False  # Don't promote unhealthy instances
    return instance.health_check() and performance_check(instance)

Progressive Traffic Shifting

Beyond simple instance replacement, modern rolling deployments gradually shift traffic percentages:

5% traffic → New instance for 5 minutes
25% traffic → Monitor error rates and latency
100% traffic → Full promotion if metrics are healthy

This approach catches issues that only appear under real load, something static health checks miss.

Hidden Failure Modes: What Can Go Wrong

The Thundering Herd Problem

When all instances restart health checks simultaneously, they can overwhelm downstream dependencies. The solution: staggered health check initialization with random delays.

Memory Leak Amplification

Rolling deployments can mask gradual memory leaks. Each new instance starts with fresh memory, hiding the fact that long-running instances would eventually crash.

Monitor memory growth rates across instance lifecycles to catch this pattern early.

Configuration Drift During Rollouts

If configuration changes between the start and end of a rolling deployment, some instances may have different settings. Use configuration snapshots that lock config for the entire rollout duration.

Real-World Implementations

Kubernetes Rolling Updates

Kubernetes provides sophisticated rolling update controls through deployment specifications:

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
  progressDeadlineSeconds: 600
  revisionHistoryLimit: 10

The key insight: progressDeadlineSeconds prevents stuck deployments from hanging indefinitely, automatically triggering rollbacks after timeout.

AWS ECS Rolling Deployments

ECS integrates rolling deployments with Application Load Balancers for seamless traffic management:

New tasks register with ALB
Health checks verify readiness
Traffic gradually shifts to new tasks
Old tasks drain connections before termination

Google Cloud Run Gradual Rollouts

Cloud Run's traffic splitting enables percentage-based rollouts without infrastructure management:

gcloud run services update-traffic myservice \
  --to-revisions=new-revision=10,old-revision=90

This approach lets you test new code with real traffic while maintaining majority stability.

Production Implementation Insights

Health Check Design

Effective rolling deployments require multi-layer health checking:

Liveness checks: Is the process running?
Readiness checks: Can it handle traffic?
Business logic checks: Are core features working?

The pattern: Start with shallow checks for fast feedback, then deeper validation before full traffic routing.

Rollback Automation

Rolling deployments should include automatic rollback triggers:

Error rate exceeds baseline by 3x
Latency P95 increases by 50%
Health check failure rate > 20%

Implement these as deployment policies, not manual reactions.

Observability Integration

Every rolling deployment needs comprehensive monitoring:

@monitor_deployment
def rolling_update():
    for batch in deployment_batches:
        deploy_batch(batch)
        wait_for_health()
        validate_metrics()
        if anomaly_detected():
            rollback_immediately()
            break

Track deployment velocity, failure rates, and MTTR to optimize your rollout strategies over time.

Your Production Checklist

Before implementing rolling deployments:

✅ Define health check criteria that catch real problems
✅ Set up automated rollback triggers based on business metrics
✅ Plan dependency update ordering to avoid compatibility issues
✅ Configure traffic shaping for gradual load introduction
✅ Implement comprehensive monitoring throughout the rollout process

Hands-On Challenge

Build a rolling deployment system that can update a 3-instance web service with zero downtime. Include health checks, traffic routing, and rollback capabilities. Measure how long deployments take and what failure scenarios your system can handle.

The demo implementation below provides a complete working example you can experiment with and extend for your own systems.

GitHub Link:

https://github.com/sysdr/sdir/tree/main/Rolling_Deployments/rolling-deployment-demo

Next week in Issue #126, we'll explore Tenant Isolation in Multi-Tenant Systems, building on the deployment strategies we've learned to ensure one tenant's updates don't impact others.

System Design Interview Roadmap