What Is a Distributed System
A distributed system is a collection of independent computers connected by a network that cooperate to accomplish some goal. Each computer has its own processor, memory, operating system, and clock. There is no shared address space and no shared notion of time.
Processes on different machines each have access to their local operating system mechanisms, but those mechanisms apply only within a single system. Shared memory, pipes, message queues, and kernel-managed synchronization primitives such as semaphores or mutexes cannot be used for coordination across machines.
All coordination in a distributed system must therefore be performed explicitly through message passing.
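As a minimal illustration of explicit message passing, the sketch below has a client and a server coordinate purely by exchanging bytes over a TCP socket. The address, port, and message contents are arbitrary choices for the example, and the two roles run as threads in one process only for convenience; in a real deployment they would be separate machines.

```python
import socket
import threading
import time

HOST, PORT = "127.0.0.1", 9000  # arbitrary example address and port

def server():
    # The "server" process: accepts one connection and answers one message.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((HOST, PORT))
        srv.listen()
        conn, _addr = srv.accept()
        with conn:
            request = conn.recv(1024)          # coordination arrives as a message
            conn.sendall(b"ack: " + request)   # and the reply is another message

def client():
    # The "client" process: no shared memory, no shared clock, only messages.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
        cli.connect((HOST, PORT))
        cli.sendall(b"hello")
        print(cli.recv(1024))  # b'ack: hello'

t = threading.Thread(target=server)
t.start()
time.sleep(0.2)  # crude: give the server thread a moment to start listening
client()
t.join()
```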
A well-designed distributed system presents a single system image: it appears to users as a single coherent system, hiding the complexity of distribution behind a unified interface.
Failures are expected and are often partial failures, meaning that some components fail while others continue to operate.
No global knowledge exists in a distributed system. Each component knows only its own state and information received from others, which may be delayed or outdated.
Why Distributed Systems Exist
Distributed systems are built to overcome the limitations of single machines and centralized designs.
Scale is a primary driver. Vertical scaling is limited by hardware constraints, power, and cost. Horizontal scaling allows systems to grow incrementally by adding machines.
Moore’s Law historically enabled performance gains through faster hardware, but those gains have slowed; the industry has shifted toward multicore processors and heterogeneous systems.
Amdahl’s Law limits speedup from parallelism when some portion of a workload remains sequential.
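As a quick worked example (the 10% sequential fraction is just an illustrative number), Amdahl's Law gives the maximum speedup on N processors as 1 / (s + (1 - s)/N), where s is the sequential fraction:

```python
def amdahl_speedup(sequential_fraction, n_processors):
    """Upper bound on speedup when a fraction of the work cannot be parallelized."""
    s = sequential_fraction
    return 1.0 / (s + (1.0 - s) / n_processors)

# With 10% sequential work, no number of processors can exceed a 10x speedup.
for n in (2, 8, 64, 1024):
    print(n, round(amdahl_speedup(0.10, n), 2))
# 2 -> 1.82, 8 -> 4.71, 64 -> 8.77, 1024 -> 9.91 (approaching 1/0.10 = 10)
```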
Collaboration and network effects increase the value of systems as more participants join.
Other motivations include reducing latency through geographic distribution, supporting mobility across devices such as phones and IoT sensors, allowing incremental growth from small deployments to large-scale systems, and delegating infrastructure to cloud providers.
Transparency
Transparency is the design goal of hiding the fact that resources are distributed across multiple computers. Users and applications interact with the system as if it were a single machine.
Examples include hiding where resources are located, masking failures, and allowing resources to move or be replicated without affecting access.
Full transparency is rarely achievable due to network delays, partial failures, and consistency constraints.
Failure and Fault Tolerance
Distributed systems experience partial failure, unlike centralized systems, which typically fail all-or-nothing.
Fault tolerance relies on redundancy and avoiding single points of failure.
In series systems, the failure of any component causes system failure, making large systems unreliable when all components must function.
In parallel systems, redundant components improve availability by allowing the system to continue operating as long as some components remain functional.
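A quick numeric illustration of the implication, using a made-up per-component availability of 90% (the trend matters more than the exact formulas):

```python
a = 0.90  # assumed availability of each independent component (illustrative)
n = 3     # number of components

series = a ** n               # series: every component must be up
parallel = 1 - (1 - a) ** n   # parallel: at least one redundant copy must be up

print(round(series, 3))    # 0.729 -> chaining components lowers availability
print(round(parallel, 3))  # 0.999 -> redundancy raises availability
```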
Availability measures how often a system is usable and is often expressed in “nines” (e.g., “five nines” means 99.999% uptime). Reliability concerns whether the system behaves correctly and how long it runs before failing (time-to-failure).
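For intuition about the “nines”, the allowed downtime per year shrinks by a factor of ten with each added nine; a quick back-of-the-envelope calculation:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

for label, availability in [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]:
    downtime = (1 - availability) * MINUTES_PER_YEAR
    print(f"{label}: about {downtime:.1f} minutes of downtime per year")
# three nines: ~525.6 min (~8.8 hours), four nines: ~52.6 min, five nines: ~5.3 min
```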
Failure Models
A fail-stop failure occurs when a component halts and produces no further output, and its failure can be detected.
A fail-silent failure occurs when a component produces no output, but other components cannot reliably distinguish failure from delay.
Fail-restart failures involve components that crash and later restart, possibly with lost or stale state.
Network partitions divide systems into isolated groups that cannot communicate.
Byzantine failures occur when components continue running but do not follow the system specification, leading to incorrect, inconsistent, or misleading behavior. Causes range from hardware faults such as a stuck bit, to software bugs, to malicious interference.
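In practice, fail-silent behavior is usually handled with a timeout-based failure detector like the sketch below (the timeout value and peer names are made up for the example). Note that it can only suspect a peer: a missed heartbeat may just as easily mean a slow network as a crashed node.

```python
import time

HEARTBEAT_TIMEOUT = 3.0  # seconds without a heartbeat before suspecting a peer (assumed)

class FailureDetector:
    """Suspects peers that have not sent a heartbeat recently.

    A suspected peer may merely be slow; without a bound on message delay
    there is no way to be certain it has actually crashed.
    """
    def __init__(self):
        self.last_heartbeat = {}  # peer id -> time the last heartbeat arrived

    def record_heartbeat(self, peer):
        self.last_heartbeat[peer] = time.monotonic()

    def suspected(self, peer):
        last = self.last_heartbeat.get(peer)
        return last is None or (time.monotonic() - last) > HEARTBEAT_TIMEOUT

detector = FailureDetector()
detector.record_heartbeat("node-b")
print(detector.suspected("node-b"))  # False: a heartbeat just arrived
print(detector.suspected("node-c"))  # True: never heard from it, so it is suspected
```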
Caching vs. Replication
Replication creates multiple authoritative copies of data to improve availability and fault tolerance. Replicas must be kept consistent, and replica failures trigger recovery procedures.
Caching stores temporary, derived copies to reduce latency and load. Caches are expendable; a cache miss simply fetches from the authoritative source.
The key distinction: losing a cache costs performance; losing all replicas loses data.
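A minimal sketch of that distinction, with a hypothetical `fetch_from_replicas` standing in for a read against the authoritative, replicated store:

```python
AUTHORITATIVE_STORE = {"user:42": "Alice"}  # stand-in for replicated, durable data

cache = {}  # temporary, derived copies; safe to lose at any time

def fetch_from_replicas(key):
    # Hypothetical read against the authoritative copies (e.g., a replicated database).
    return AUTHORITATIVE_STORE[key]

def get(key):
    if key in cache:                      # cache hit: fast path
        return cache[key]
    value = fetch_from_replicas(key)      # cache miss: costs latency, not correctness
    cache[key] = value
    return value

print(get("user:42"))  # miss -> fetched from the authoritative store, then cached
cache.clear()          # losing the entire cache is harmless...
print(get("user:42"))  # ...the next read simply pays the miss penalty again
```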
Network Timing Models
Synchronous networks have a known upper bound on message delivery time, making failure detection straightforward.
Partially synchronous networks have an upper bound that exists but is not known in advance.
Asynchronous networks have no upper bound on message delivery time; this is the model that best describes the Internet. It makes it impossible to distinguish a failed node from a slow one, complicates failure detection, and limits what guarantees protocols can provide.
Security
Security in distributed systems differs from centralized systems because services run on remote machines, communication travels over public networks, and trust boundaries are unclear.
Key concerns include authentication (who is making a request), authorization (what they are allowed to do), encryption (protecting data in transit), integrity checking (detecting tampering), and audit logging (recording actions).
Service Architectures
In the client-server model, clients send requests to servers, which process them and return responses. The model is simple, but the server can become a bottleneck or a single point of failure.
Multi-tier architectures separate concerns into layers (presentation, application logic, data storage) that can be scaled independently.
Microservices architectures decompose applications into small, autonomous services with well-defined interfaces. This is flexible but complex, and long dependency chains create availability challenges.
Peer-to-peer (P2P) systems have no central server; all participants communicate directly. Most practical P2P systems use a hybrid P2P model with servers for coordination.
Worker pools (also called processor pools or compute clusters) assign tasks to available computing resources on demand.
Cloud computing provides resources as a network service: IaaS (virtual machines), PaaS (application platforms), and SaaS (complete applications).
Communication Fundamentals
Distributed systems rely exclusively on message passing for coordination.
Network communication is slower, variable, and unreliable compared to local computation. Messages may be delayed, duplicated, reordered, or lost, and these behaviors must be handled explicitly.
Internet Design Principles
The Internet is a packet-switched network designed to scale and survive failures.
It follows the end-to-end principle, which places complexity at the endpoints rather than inside the network.
The network provides best-effort delivery, meaning packets are attempted but not guaranteed to arrive, arrive once, arrive in order, or arrive within a fixed time.
Recovery, reliability, ordering, and security are implemented by software at the endpoints.
Fate sharing places communication state at the endpoints so failures affect only the participants already involved.
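To make endpoint responsibility concrete, here is a sketch of a client that implements its own timeout-and-retry logic on top of UDP, since the network itself only promises best-effort delivery. The server address and retry parameters are arbitrary assumptions for the example.

```python
import socket

SERVER = ("127.0.0.1", 5300)  # assumed example address; nothing need be listening there
TIMEOUT = 1.0                 # seconds to wait for a reply before retransmitting
MAX_TRIES = 3

def request_with_retries(payload):
    # The network may drop, delay, duplicate, or reorder the datagram;
    # recovery is the endpoint's job (end-to-end principle).
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(TIMEOUT)
        for _attempt in range(MAX_TRIES):
            sock.sendto(payload, SERVER)
            try:
                reply, _addr = sock.recvfrom(2048)
                return reply               # a reply arrived in time
            except socket.timeout:
                continue                   # no reply within TIMEOUT: retransmit
            except OSError:
                continue                   # e.g. ICMP "port unreachable": also retry
    return None                            # give up and let the caller decide

print(request_with_retries(b"ping"))  # None unless something answers at SERVER
```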
Latency and Throughput
Latency measures the time it takes for a single message or request to travel from sender to receiver.
Throughput (bandwidth) measures how much data can be transferred per unit time.
Latency and throughput are related but distinct. A system can have high throughput but high latency, or low latency but low throughput.
Many design choices in distributed systems trade latency for throughput. Reliability, ordering, and congestion control can improve throughput for sustained transfers while increasing latency for individual messages.
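A rough first-order model of the trade-off (the numbers below are illustrative assumptions) is transfer time ≈ latency + size / bandwidth: small requests are dominated by latency, bulk transfers by bandwidth.

```python
def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """First-order estimate: one propagation delay plus serialization time."""
    return latency_s + size_bytes / bandwidth_bytes_per_s

LATENCY = 0.050           # 50 ms delay (illustrative)
BANDWIDTH = 125_000_000   # ~1 Gbit/s expressed as bytes per second

print(transfer_time(1_000, LATENCY, BANDWIDTH))          # ~0.05001 s: latency dominates
print(transfer_time(1_000_000_000, LATENCY, BANDWIDTH))  # ~8.05 s: bandwidth dominates
```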
Layered Networking Model
Networking is structured as a layered stack.
The data link layer handles communication on a local network.
The network layer, implemented by IP, routes packets between machines across networks.
The transport layer provides process-to-process communication using ports.
Higher layers implement application-specific protocols.
IP Networking
IP provides connectionless, unreliable datagram delivery between machines.
Each packet is routed independently, with no guarantees of delivery or ordering.
Transport Protocols: TCP and UDP
TCP provides a reliable, ordered byte stream with congestion control and retransmission. It simplifies application development but can increase latency due to ordering constraints.
Head-of-line blocking occurs when delivery of later data is delayed because earlier data has not yet arrived, even if later data has already been received. This can increase latency even when sufficient network capacity is available.
UDP provides best-effort, unordered datagram delivery with minimal overhead. Reliability and ordering, if needed, must be implemented by the application.
Port numbers in TCP and UDP headers complete the addressing needed for process-to-process communication. While an IP address identifies a machine, a port number identifies a specific socket on that machine. A process may open multiple sockets, each with its own port.
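A small sketch of that addressing claim: one process opens two UDP sockets, the operating system gives each its own port, and the full endpoint address is the (IP address, port) pair.

```python
import socket

# One process, two sockets: each gets its own port number.
sock_a = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock_b = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock_a.bind(("127.0.0.1", 0))  # port 0 asks the OS to pick any free port
sock_b.bind(("127.0.0.1", 0))

print(sock_a.getsockname())  # e.g. ('127.0.0.1', 54012) -- the (IP address, port) pair
print(sock_b.getsockname())  # a different port, so datagrams reach the right socket

sock_a.close()
sock_b.close()
```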
Choosing Between TCP and UDP
TCP is widely used because it provides a simple and powerful abstraction for reliable communication.
UDP is used when low latency matters, when applications can tolerate loss, or when applications want control over reliability and timing.
Protocols such as DNS and NTP use UDP due to short messages and simple retry semantics.
Choosing between TCP and UDP is a design decision about where responsibility for correctness and recovery should reside.
QUIC
QUIC is a modern transport protocol built on UDP.
It provides TCP-like reliability and congestion control while supporting multiple independent streams to avoid head-of-line blocking.
QUIC runs in user space and exemplifies the end-to-end principle.
Key Points
Distributed systems trade local simplicity for scalability, availability, and flexibility.
The architecture determines whether adding components improves or harms availability.
Transparency is a design goal but is rarely fully achievable.
Network timing assumptions affect what guarantees a system can provide.
Networking assumptions, particularly best-effort delivery and endpoint responsibility, shape all distributed system designs.
Understanding communication semantics is essential for reasoning about correctness, performance, and failure.
What You Don’t Need to Study
- Historical system details (SAGE, Sabre specifics)
- Specific processor specs (core counts, transistor counts)
- Dennard scaling/law (this shows how power/heat limited clock speed growth)
- STCO (system technology co-optimization): this is very tangential
- Metcalfe’s Law (understand that network effects exist, but the name/formula isn’t core)
- Specific heterogeneous computing components (GPUs, neural engines, etc.)
- The six specific transparency types (location, migration, replication, concurrency, failure, parallelism); but understand the general concept that distributed systems aim to hide distribution from users
- Specific cloud provider product names
- Probability formulas for series/parallel systems (but understand the implications)
- OSI layer numbers
- Socket API details