What Is a Distributed System
A distributed system is a collection of independent computers connected by a network that cooperate to accomplish some goal. Each computer has its own processor, memory, operating system, and clock. There is no shared address space and no shared notion of time.
Processes on different machines each have access to their local operating system mechanisms, but those mechanisms apply only within a single system. Shared memory, pipes, message queues, and kernel-managed synchronization primitives such as semaphores or mutexes cannot be used for coordination across machines.
All coordination in a distributed system must therefore be performed explicitly through message passing.
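As a minimal illustration of explicit message passing, the sketch below has a client and a server coordinate purely by exchanging bytes over a TCP socket. The address, port, and message contents are arbitrary choices for the example, and the two roles run as threads in one process only for convenience; in a real deployment they would be separate machines.

```python
import socket
import threading
import time

HOST, PORT = "127.0.0.1", 9000  # arbitrary example address and port

def server():
    # The "server" process: accepts one connection and answers one message.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((HOST, PORT))
        srv.listen()
        conn, _addr = srv.accept()
        with conn:
            request = conn.recv(1024)          # coordination arrives as a message
            conn.sendall(b"ack: " + request)   # and the reply is another message

def client():
    # The "client" process: no shared memory, no shared clock, only messages.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
        cli.connect((HOST, PORT))
        cli.sendall(b"hello")
        print(cli.recv(1024))  # b'ack: hello'

t = threading.Thread(target=server)
t.start()
time.sleep(0.2)  # crude: give the server thread a moment to start listening
client()
t.join()
```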
A well-designed distributed system presents a single system image: it appears to users as a single coherent system, hiding the complexity of distribution behind a unified interface.
Failures are expected and are often partial failures, meaning that some components fail while others continue to operate.
No global knowledge exists in a distributed system. Each component knows only its own state and information received from others, which may be delayed or outdated.
Why Distributed Systems Exist
Distributed systems are built to overcome the limitations of single machines and centralized designs.
Scale is a primary driver. Vertical scaling is limited by hardware constraints, power, and cost. Horizontal scaling allows systems to grow incrementally by adding machines.
Moore’s Law historically enabled performance gains through faster hardware, but those gains have slowed; the industry has shifted toward multicore processors and heterogeneous systems.
Amdahl’s Law limits speedup from parallelism when some portion of a workload remains sequential.
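As a quick worked example (the 10% sequential fraction is just an illustrative number), Amdahl's Law gives the maximum speedup on N processors as 1 / (s + (1 - s)/N), where s is the sequential fraction:

```python
def amdahl_speedup(sequential_fraction, n_processors):
    """Upper bound on speedup when a fraction of the work cannot be parallelized."""
    s = sequential_fraction
    return 1.0 / (s + (1.0 - s) / n_processors)

# With 10% sequential work, no number of processors can exceed a 10x speedup.
for n in (2, 8, 64, 1024):
    print(n, round(amdahl_speedup(0.10, n), 2))
# 2 -> 1.82, 8 -> 4.71, 64 -> 8.77, 1024 -> 9.91 (approaching 1/0.10 = 10)
```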
Collaboration and network effects increase the value of systems as more participants join.
Other motivations include reducing latency through geographic distribution, supporting mobility across devices such as phones and IoT sensors, allowing incremental growth from small deployments to large-scale systems, and delegating infrastructure to cloud providers.
Transparency
Transparency is the design goal of hiding the fact that resources are distributed across multiple computers. Users and applications interact with the system as if it were a single machine.
Examples include hiding where resources are located, masking failures, and allowing resources to move or be replicated without affecting access.
Full transparency is rarely achievable due to network delays, partial failures, and consistency constraints.
Failure and Fault Tolerance
Distributed systems experience partial failure, unlike centralized systems, which typically fail all-or-nothing.
Fault tolerance relies on redundancy and avoiding single points of failure.
In series systems, the failure of any component causes system failure, making large systems unreliable when all components must function.
In parallel systems, redundant components improve availability by allowing the system to continue operating as long as some components remain functional.
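A quick numeric illustration of the implication, using a made-up per-component availability of 90% (the trend matters more than the exact formulas):

```python
a = 0.90  # assumed availability of each independent component (illustrative)
n = 3     # number of components

series = a ** n               # series: every component must be up
parallel = 1 - (1 - a) ** n   # parallel: at least one redundant copy must be up

print(round(series, 3))    # 0.729 -> chaining components lowers availability
print(round(parallel, 3))  # 0.999 -> redundancy raises availability
```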
Availability measures how often a system is usable and is often expressed in “nines” (e.g., “five nines” means 99.999% uptime). Reliability concerns whether the system behaves correctly and how long it runs before failing (time-to-failure).
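For intuition about the “nines”, the allowed downtime per year shrinks by a factor of ten with each added nine; a quick back-of-the-envelope calculation:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

for label, availability in [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]:
    downtime = (1 - availability) * MINUTES_PER_YEAR
    print(f"{label}: about {downtime:.1f} minutes of downtime per year")
# three nines: ~525.6 min (~8.8 hours), four nines: ~52.6 min, five nines: ~5.3 min
```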
Failure Models
A fail-stop failure occurs when a component halts and produces no further output, and its failure can be detected.
A fail-silent failure occurs when a component produces no output, but other components cannot reliably distinguish failure from delay.
Fail-restart failures involve components that crash and later restart, possibly with lost or stale state.
Network partitions divide systems into isolated groups that cannot communicate.
Byzantine failures occur when components continue running but do not follow the system specification, leading to incorrect, inconsistent, or misleading behavior. Causes range from hardware faults such as a stuck bit, to software bugs, to malicious interference.
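In practice, fail-silent behavior is usually handled with a timeout-based failure detector like the sketch below (the timeout value and peer names are made up for the example). Note that it can only suspect a peer: a missed heartbeat may just as easily mean a slow network as a crashed node.

```python
import time

HEARTBEAT_TIMEOUT = 3.0  # seconds without a heartbeat before suspecting a peer (assumed)

class FailureDetector:
    """Suspects peers that have not sent a heartbeat recently.

    A suspected peer may merely be slow; without a bound on message delay
    there is no way to be certain it has actually crashed.
    """
    def __init__(self):
        self.last_heartbeat = {}  # peer id -> time the last heartbeat arrived

    def record_heartbeat(self, peer):
        self.last_heartbeat[peer] = time.monotonic()

    def suspected(self, peer):
        last = self.last_heartbeat.get(peer)
        return last is None or (time.monotonic() - last) > HEARTBEAT_TIMEOUT

detector = FailureDetector()
detector.record_heartbeat("node-b")
print(detector.suspected("node-b"))  # False: a heartbeat just arrived
print(detector.suspected("node-c"))  # True: never heard from it, so it is suspected
```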
Caching vs. Replication
Replication creates multiple authoritative copies of data to improve availability and fault tolerance. Replicas must be kept consistent, and replica failures trigger recovery procedures.
Caching stores temporary, derived copies to reduce latency and load. Caches are expendable; a cache miss simply fetches from the authoritative source.
The key distinction: losing a cache costs performance; losing all replicas loses data.
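A minimal sketch of that distinction, with a hypothetical `fetch_from_replicas` standing in for a read against the authoritative, replicated store:

```python
AUTHORITATIVE_STORE = {"user:42": "Alice"}  # stand-in for replicated, durable data

cache = {}  # temporary, derived copies; safe to lose at any time

def fetch_from_replicas(key):
    # Hypothetical read against the authoritative copies (e.g., a replicated database).
    return AUTHORITATIVE_STORE[key]

def get(key):
    if key in cache:                      # cache hit: fast path
        return cache[key]
    value = fetch_from_replicas(key)      # cache miss: costs latency, not correctness
    cache[key] = value
    return value

print(get("user:42"))  # miss -> fetched from the authoritative store, then cached
cache.clear()          # losing the entire cache is harmless...
print(get("user:42"))  # ...the next read simply pays the miss penalty again
```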
Network Timing Models
Synchronous networks have a known upper bound on message delivery time, making failure detection straightforward.
Partially synchronous networks have an upper bound that exists but is not known in advance.
Asynchronous networks have no upper bound on message delivery time; this is the model that best describes the Internet. It makes it impossible to distinguish a failed node from a slow one, complicates failure detection, and limits what guarantees protocols can provide.
Security
Security in distributed systems differs from centralized systems because services run on remote machines, communication travels over public networks, and trust boundaries are unclear.
Key concerns include authentication (who is making a request), authorization (what they are allowed to do), encryption (protecting data in transit), integrity checking (detecting tampering), and audit logging (recording actions).
Service Architectures
In the client-server model, clients send requests to servers, which process them and return responses. The model is simple, but the server can become a bottleneck or a single point of failure.
Multi-tier architectures separate concerns into layers (presentation, application logic, data storage) that can be scaled independently.
Microservices architectures decompose applications into small, autonomous services with well-defined interfaces. This is flexible but complex, and long dependency chains create availability challenges.
Peer-to-peer (P2P) systems have no central server; all participants communicate directly. Most practical P2P systems use a hybrid P2P model with servers for coordination.
Worker pools (also called processor pools or compute clusters) assign tasks to available computing resources on demand.
Cloud computing provides resources as a network service: IaaS (virtual machines), PaaS (application platforms), and SaaS (complete applications).
Communication Fundamentals
Distributed systems rely exclusively on message passing for coordination.
Network communication is slower, variable, and unreliable compared to local computation. Messages may be delayed, duplicated, reordered, or lost, and these behaviors must be handled explicitly.
Internet Design Principles
The Internet is a packet-switched network designed to scale and survive failures.
It follows the end-to-end principle, which places complexity at the endpoints rather than inside the network.
The network provides best-effort delivery, meaning packets are attempted but not guaranteed to arrive, arrive once, arrive in order, or arrive within a fixed time.
Recovery, reliability, ordering, and security are implemented by software at the endpoints.
Fate sharing places communication state at the endpoints so failures affect only the participants already involved.
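To make endpoint responsibility concrete, here is a sketch of a client that implements its own timeout-and-retry logic on top of UDP, since the network itself only promises best-effort delivery. The server address and retry parameters are arbitrary assumptions for the example.

```python
import socket

SERVER = ("127.0.0.1", 5300)  # assumed example address; nothing need be listening there
TIMEOUT = 1.0                 # seconds to wait for a reply before retransmitting
MAX_TRIES = 3

def request_with_retries(payload):
    # The network may drop, delay, duplicate, or reorder the datagram;
    # recovery is the endpoint's job (end-to-end principle).
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(TIMEOUT)
        for _attempt in range(MAX_TRIES):
            sock.sendto(payload, SERVER)
            try:
                reply, _addr = sock.recvfrom(2048)
                return reply               # a reply arrived in time
            except socket.timeout:
                continue                   # no reply within TIMEOUT: retransmit
            except OSError:
                continue                   # e.g. ICMP "port unreachable": also retry
    return None                            # give up and let the caller decide

print(request_with_retries(b"ping"))  # None unless something answers at SERVER
```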
Latency and Throughput
Latency measures the time it takes for a single message or request to travel from sender to receiver.
Throughput (bandwidth) measures how much data can be transferred per unit time.
Latency and throughput are related but distinct. A system can have high throughput but high latency, or low latency but low throughput.
Many design choices in distributed systems trade latency for throughput. Reliability, ordering, and congestion control can improve throughput for sustained transfers while increasing latency for individual messages.
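A rough first-order model of the trade-off (the numbers below are illustrative assumptions) is transfer time ≈ latency + size / bandwidth: small requests are dominated by latency, bulk transfers by bandwidth.

```python
def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """First-order estimate: one propagation delay plus serialization time."""
    return latency_s + size_bytes / bandwidth_bytes_per_s

LATENCY = 0.050           # 50 ms delay (illustrative)
BANDWIDTH = 125_000_000   # ~1 Gbit/s expressed as bytes per second

print(transfer_time(1_000, LATENCY, BANDWIDTH))          # ~0.05001 s: latency dominates
print(transfer_time(1_000_000_000, LATENCY, BANDWIDTH))  # ~8.05 s: bandwidth dominates
```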
Layered Networking Model
Networking is structured as a layered stack.
The data link layer handles communication on a local network.
The network layer, implemented by IP, routes packets between machines across networks.
The transport layer provides process-to-process communication using ports.
Higher layers implement application-specific protocols.
IP Networking
IP provides connectionless, unreliable datagram delivery between machines.
Each packet is routed independently, with no guarantees of delivery or ordering.
Transport Protocols: TCP and UDP
TCP provides a reliable, ordered byte stream with congestion control and retransmission. It simplifies application development but can increase latency due to ordering constraints.
Head-of-line blocking occurs when delivery of later data is delayed because earlier data has not yet arrived, even if later data has already been received. This can increase latency even when sufficient network capacity is available.
UDP provides best-effort, unordered datagram delivery with minimal overhead. Reliability and ordering, if needed, must be implemented by the application.
Port numbers in TCP and UDP headers complete the addressing needed for process-to-process communication. While an IP address identifies a machine, a port number identifies a specific socket on that machine. A process may open multiple sockets, each with its own port.
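A small sketch of that addressing claim: one process opens two UDP sockets, the operating system gives each its own port, and the full endpoint address is the (IP address, port) pair.

```python
import socket

# One process, two sockets: each gets its own port number.
sock_a = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock_b = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock_a.bind(("127.0.0.1", 0))  # port 0 asks the OS to pick any free port
sock_b.bind(("127.0.0.1", 0))

print(sock_a.getsockname())  # e.g. ('127.0.0.1', 54012) -- the (IP address, port) pair
print(sock_b.getsockname())  # a different port, so datagrams reach the right socket

sock_a.close()
sock_b.close()
```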
Choosing Between TCP and UDP
TCP is widely used because it provides a simple and powerful abstraction for reliable communication.
UDP is used when low latency matters, when applications can tolerate loss, or when applications want control over reliability and timing.
Protocols such as DNS and NTP use UDP due to short messages and simple retry semantics.
Choosing between TCP and UDP is a design decision about where responsibility for correctness and recovery should reside.
QUIC
QUIC is a modern transport protocol built on UDP.
It provides TCP-like reliability and congestion control while supporting multiple independent streams to avoid head-of-line blocking.
QUIC runs in user space and exemplifies the end-to-end principle.
Key Points
Distributed systems trade local simplicity for scalability, availability, and flexibility.
The architecture determines whether adding components improves or harms availability.
Transparency is a design goal but is rarely fully achievable.
Network timing assumptions affect what guarantees a system can provide.
Networking assumptions, particularly best-effort delivery and endpoint responsibility, shape all distributed system designs.
Understanding communication semantics is essential for reasoning about correctness, performance, and failure.
What You Don’t Need to Study
- Historical system details (SAGE, Sabre specifics)
- Specific processor specs (core counts, transistor counts)
- Dennard scaling/law (this shows how power/heat limited clock speed growth)
- STCO (system technology co-optimization): this is very tangential
- Metcalfe’s Law (understand that network effects exist, but the name/formula isn’t core)
- Specific heterogeneous computing components (GPUs, neural engines, etc.)
- The six specific transparency types (location, migration, replication, concurrency, failure, parallelism); but understand the general concept that distributed systems aim to hide distribution from users
- Specific cloud provider product names
- Probability formulas for series/parallel systems (but understand the implications)
- OSI layer numbers
- Socket API details