Distributed System Basics: Why They're Hard
When a single machine can't handle your load, welcome to the beautiful chaos of distributed systems.
Building distributed systems feels like conducting an orchestra where half the musicians might fall asleep mid-performance, the sheet music occasionally disappears, and you still need to deliver a flawless symphony. Unlike monolithic applications where you control every variable, distributed systems force you to embrace uncertainty as a fundamental design principle.
The real challenge isn't just splitting your application across multiple machines—it's the emergent behaviors that arise from network partitions, partial failures, and the subtle timing dependencies that can bring down entire platforms. Let me share why even seasoned architects find distributed systems humbling and, more importantly, how to build systems that thrive in this chaos.
The Fundamental Reality: Partial Failures Are Your New Normal
In a monolithic system, operations either succeed completely or fail completely. A database write either commits or rolls back. A function call either returns a value or throws an exception. This binary world gives us mental clarity and predictable debugging paths.
Distributed systems shatter this clarity. When you make a network call, you enter a realm of three possible outcomes: success, failure, or the most treacherous of all—timeout. That timeout doesn't tell you whether your request succeeded, failed, or is still processing somewhere in the network fabric. This fundamental uncertainty propagates through every layer of your system architecture.
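To make those three outcomes concrete, here is a minimal sketch in Python. The downstream endpoint, payload, and the 2-second timeout are hypothetical; the point is that the timeout branch gives the caller no information about what actually happened on the other side.

```python
import requests

# Hypothetical downstream endpoint, for illustration only.
SERVICE_URL = "https://downstream.internal/api/resource"

def call_downstream(payload):
    try:
        resp = requests.post(SERVICE_URL, json=payload, timeout=2.0)  # seconds
        resp.raise_for_status()
        return "success"    # the server confirmed it processed the request
    except requests.Timeout:
        return "unknown"    # the request may or may not have been applied
    except requests.RequestException:
        return "failure"    # the server (or the network) explicitly rejected it
```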
Consider this seemingly simple scenario: your payment service calls the inventory service to reserve items, then calls the payment processor to charge the customer. If the inventory call times out, you're left in limbo. Did the items get reserved? Should you retry? Should you assume failure and return an error to the customer? Each choice carries consequences that ripple through your system's reliability and user experience.
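One widely used way out of that limbo (a general pattern, not something this article prescribes) is to make the reservation call idempotent, so that retrying after a timeout cannot double-reserve items. The endpoint, header name, and client below are a hypothetical sketch of that idea:

```python
import uuid
import requests

INVENTORY_URL = "https://inventory.internal/reserve"  # hypothetical endpoint

def reserve_with_retry(order_id, items, attempts=3):
    # One idempotency key per logical operation: the server is expected to
    # deduplicate requests carrying the same key, so a retry after a
    # timeout cannot reserve the items twice.
    idempotency_key = str(uuid.uuid4())
    last_error = None
    for _ in range(attempts):
        try:
            resp = requests.post(
                INVENTORY_URL,
                json={"order_id": order_id, "items": items},
                headers={"Idempotency-Key": idempotency_key},
                timeout=2.0,
            )
            resp.raise_for_status()
            return resp.json()
        except requests.Timeout as exc:
            last_error = exc   # outcome unknown; retrying is safe only because of the key
        except requests.RequestException as exc:
            last_error = exc
            break              # explicit failure; a fuller version would distinguish retryable 5xx from non-retryable 4xx
    raise RuntimeError("reservation outcome unresolved") from last_error
```

This only resolves the ambiguity if the inventory service actually deduplicates requests by that key, and the payment-processor call needs the same treatment before you charge the customer.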
The Network: Your Unreliable Foundation
The network between your services introduces latency variance that follows long-tail distributions—not the normal distributions we often assume in our calculations. While 99% of your requests might complete in under 100ms, that troublesome 1% can take seconds or simply vanish into the networking ether.
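A quick, purely synthetic illustration of that gap, using a lognormal distribution (the parameters are made up) to mimic a heavy-tailed latency profile:

```python
import random
import statistics

random.seed(42)

# Synthetic latencies in milliseconds; a lognormal gives the heavy tail
# that real service-to-service calls tend to exhibit.
latencies = [random.lognormvariate(mu=3.5, sigma=0.8) for _ in range(100_000)]

mean = statistics.mean(latencies)
cuts = statistics.quantiles(latencies, n=100)  # 99 percentile cut points
p50, p99 = cuts[49], cuts[98]

print(f"mean: {mean:.1f} ms, p50: {p50:.1f} ms, p99: {p99:.1f} ms")
```

With these made-up parameters the 99th percentile lands several times above the mean, which is exactly the gap that average-based assumptions hide.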
This variability isn't just about raw performance; it fundamentally changes how you must architect for reliability. Traditional timeout values based on average latency will either be too aggressive (causing unnecessary retries and cascade failures) or too lenient (leading to resource exhaustion and poor user experience).
The sophisticated approach involves percentile-based SLA boundaries derived from actual latency histograms. Netflix's Hystrix library pioneered this by using rolling window percentiles to dynamically adjust timeout thresholds, recognizing that network conditions change throughout the day and across different service dependencies.
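A sketch of that rolling-window idea (not Hystrix's actual implementation; the window size, percentile, and headroom factor are illustrative defaults): record each observed latency and derive the timeout from a high percentile of the recent window.

```python
import statistics
from collections import deque

class RollingPercentileTimeout:
    """Derive a timeout from a high percentile of recently observed latencies.

    Simplified sketch of the rolling-window idea, not Hystrix's actual code;
    all defaults here are illustrative.
    """

    def __init__(self, window_size=1000, percentile=99, headroom=1.5, floor_ms=50.0):
        self.samples = deque(maxlen=window_size)  # oldest samples fall off automatically
        self.percentile = percentile
        self.headroom = headroom
        self.floor_ms = floor_ms

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def current_timeout_ms(self):
        if len(self.samples) < 100:
            return self.floor_ms * 10  # not enough data yet; fall back to a generous default
        cuts = statistics.quantiles(self.samples, n=100)
        p = cuts[self.percentile - 1]  # e.g. index 98 is the 99th-percentile cut point
        return max(self.floor_ms, p * self.headroom)
```

Each completed call feeds record(), and each new call reads current_timeout_ms(), so the threshold drifts with the same changing network conditions described above.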
State Consistency: The Eternal Trade-off

