Disclaimer: This study guide attempts to touch upon the most important topics that may be covered on the exam but does not claim to necessarily cover everything that one needs to know for the exam. Finally, don’t take the one hour time window in the title literally.
Last update: Mon Feb 16 15:27:05 2026
Week 1: Introduction and Network Communication
What Is a Distributed System?
A distributed system is a collection of independent computers connected by a network that cooperate to accomplish some goal. Each computer has its own processor, memory, operating system, and clock. There is no shared address space and no shared notion of time.
Processes on different machines each have access to their local operating system mechanisms, but those mechanisms apply only within a single system. Shared memory, pipes, message queues, and kernel-managed synchronization primitives such as semaphores or mutexes cannot be used for coordination across machines.
All coordination in a distributed system must therefore be performed explicitly through message passing.
A well-designed distributed system presents a single system image: it appears to users as a single coherent system, hiding the complexity of distribution behind a unified interface.
Failures are expected and are often partial: some components fail while others continue to operate.
No global knowledge exists in a distributed system. Each component knows only its own state and information received from others, which may be delayed or outdated.
Why Distributed Systems Exist
Distributed systems are built to overcome the limitations of single machines and centralized designs.
Scale is a primary driver. Vertical scaling is limited by hardware constraints, power, and cost. Horizontal scaling allows systems to grow incrementally by adding machines.
Moore’s Law historically enabled performance gains through faster hardware, but those gains have slowed. Performance gains shifted toward multicore and heterogeneous systems.
Amdahl’s Law shows there are limits to parallel processing. Speedup is limited when some portion of a workload remains sequential.
Collaboration and network effects increase the value of systems as more participants join.
Other motivations include reducing latency through geographic distribution, supporting mobility across devices such as phones and IoT sensors, allowing incremental growth from small deployments to large-scale systems, and delegating infrastructure to cloud providers.
Transparency
Transparency is the design goal of hiding the fact that resources are distributed across multiple computers. Users and applications interact with the system as if it were a single machine. Examples include hiding where resources are located, masking failures, and allowing resources to move or be replicated without affecting access.
Full transparency is rarely achievable due to network delays, partial failures, and consistency constraints.
Failure and Fault Tolerance
Distributed systems experience partial failure, unlike centralized systems, which typically fail all-or-nothing.
Fault tolerance relies on redundancy and avoiding single points of failure.
In series systems, the failure of any component causes system failure, making large systems unreliable when all components must function. In parallel systems, redundant components improve availability by allowing the system to continue operating as long as some components remain functional.
Availability measures how often a system is usable and is often expressed in “nines” (e.g., “five nines” means 99.999% uptime). Reliability concerns correctness and time-to-failure.
Failure Models
Systems fail in different ways:
- A fail-stop failure occurs when a component halts and produces no further output, and its failure can be detected.
- A fail-silent failure occurs when a component produces no output, but other components cannot reliably distinguish failure from delay.
- Fail-restart failures involve components that crash and later restart, possibly with lost or stale state.
- Network partitions divide systems into isolated groups that cannot communicate.
- An omission failure occurs when a message fails to be sent or received, or is lost or damaged in the network.
- Byzantine failures occur when components continue running but do not follow the system specification, leading to incorrect, inconsistent, or misleading behavior. This can cover a range of issues: from a stuck bit to bugs to malicious interference.
Caching vs. Replication
Keeping multiple copies of data is a recurring theme in distributed systems design.
Replication creates multiple authoritative copies of data to improve availability and fault tolerance. Replicas must be kept consistent, and replica failures trigger recovery procedures.
Caching stores temporary, derived copies to reduce latency and load. Caches are expendable; a cache miss simply fetches from the authoritative source.
The key distinction is that losing a cache incurs performance penalties; losing all replicas loses data.
Network Timing Models
Network technologies exhibit different behaviors and provide different latency, bandwidth, and reliability guarantees.
Synchronous networks have a known upper bound on message delivery time, making failure detection straightforward.
Partially synchronous networks have an upper bound that exists but is not known in advance.
Asynchronous networks have no upper bound on message delivery time; this is the model that fits the Internet. It makes it impossible to distinguish between a failed node and a slow one, complicates failure detection, and limits the guarantees protocols can provide.
Security
Security in distributed systems differs from centralized systems because services run on remote machines, communication travels over public networks, and trust boundaries are unclear.
Key issues include authentication (who is making a request), authorization (what they are allowed to do), encryption (protecting data in transit), integrity checking (detecting tampering), and audit logging (recording actions).
Service Architectures
The client-server model has clients send requests to servers. It’s simple, but the server can become a bottleneck or single point of failure. It’s the core mechanism on which most other models are built.
Multi-tier architectures separate concerns into layers (presentation, application logic, data storage) that can be scaled independently.
Microservices architectures decompose applications into small, autonomous services with well-defined interfaces. Flexible but complex, with availability challenges from long dependency chains.
Peer-to-peer (P2P) systems have no central server; all participants communicate directly. Most practical P2P systems use a hybrid P2P model with servers for coordination.
Worker pools (also called processor pools or compute clusters) assign tasks to available computing resources on demand.
Cloud computing provides resources as a network service: IaaS (virtual machines), PaaS (application platforms), and SaaS (complete applications).
Communication Fundamentals
Distributed systems rely exclusively on message passing for coordination.
Network communication is slower, more variable, and less reliable than local computation. Messages may be delayed, duplicated, reordered, or lost, and these behaviors must be handled explicitly.
Internet Design Principles
The Internet is a packet-switched network designed to scale and survive failures. It follows the end-to-end principle, which places complexity at the endpoints rather than inside the network.
The network provides best-effort delivery, meaning packets are attempted but not guaranteed to arrive, to arrive once, to arrive in order, or to arrive within a fixed time. Recovery, reliability, ordering, and security are implemented by software at the endpoints.
Fate sharing keeps communication state at the endpoints rather than in the routers, so failures affect only the participants already involved.
Latency and Throughput
Two core metrics of a network are latency and throughput.
Latency is the time it takes for a single message or request to travel from the sender to the receiver.
Throughput (bandwidth) measures how much data can be transferred per unit time.
Latency and throughput are related but distinct. A system can have high throughput but high latency, or low latency but low throughput.
Many design choices in distributed systems trade latency for throughput. Reliability, ordering, and congestion control can improve throughput for sustained transfers while increasing latency for individual messages.
Layered Networking Model
Networking is structured as a layered stack.
The data link layer handles communication on a local network.
The network layer, implemented by IP, routes packets between machines across different physical networks.
The transport layer provides process-to-process communication using ports.
Higher layers implement application-specific protocols.
IP Networking
IP provides connectionless, unreliable datagram delivery between machines. Each machine is assigned an IP address. Data is broken into variable-size chunks called packets, and each packet contains an IP header that includes source and destination IP addresses.
Each packet is routed independently, with no guarantees of delivery or ordering.
Transport Protocols: TCP and UDP
TCP provides a reliable, ordered byte stream with congestion control and retransmission. It simplifies application development but can increase latency due to ordering constraints.
Head-of-line blocking occurs when delivery of later data is delayed because earlier data has not yet arrived, even if the following data has already been received. This can increase latency even when sufficient network capacity is available.
UDP provides best-effort, unordered datagram delivery with minimal overhead. Reliability and ordering, if needed, must be implemented by the application.
Port numbers in TCP and UDP headers complete the addressing needed for process-to-process communication. While an IP address identifies a machine, a port number identifies a specific socket on that machine. A process may open multiple sockets, each with its own port.
Choosing Between TCP and UDP
TCP is widely used because it provides a simple and powerful abstraction for reliable communication.
UDP is used when low latency matters, when applications can tolerate loss, or when applications want control over reliability and timing.
Protocols such as DNS (domain lookups) and NTP (setting time) use UDP because they send short messages, don’t want the overhead of setting up a connection, and the servers don’t want the overhead of maintaining connection state for potentially a huge number of clients. For NTP, any retries because of lost packets can result in asymmetric send/receive times, throwing off synchronization.
Choosing between TCP and UDP is a design decision about where responsibility for correctness and recovery should reside.
QUIC
QUIC is a modern transport protocol built on UDP. It provides TCP-like reliability and congestion control while supporting multiple independent streams to avoid head-of-line blocking.
QUIC runs in user space and exemplifies the end-to-end principle.
Key Points
Distributed systems trade local simplicity for scalability, availability, and flexibility.
The architecture determines whether adding components improves or harms availability.
Transparency is a design goal, but is rarely fully achievable.
Network timing assumptions affect what guarantees a system can provide.
Networking assumptions, particularly best-effort delivery and endpoint responsibility, shape all distributed system designs.
Understanding communication semantics is essential for reasoning about correctness, performance, and failure.
What You Don’t Need to Study
- Historical system details (SAGE, Sabre specifics)
- Specific processor specs (core counts, transistor counts)
- Dennard scaling/law (this shows how power/heat limited clock speed growth)
- STCO (system technology co-optimization): this is very tangential
- Metcalfe’s Law (understand that network effects exist, but the name/formula isn’t core)
- Specific heterogeneous computing components (GPUs, neural engines, etc.)
- The six specific transparency types (location, migration, replication, concurrency, failure, parallelism); but understand the general concept that distributed systems aim to hide distribution from users
- Specific cloud provider product names
- Probability formulas for series/parallel systems (but understand the implications)
- OSI layer numbers
- Socket API details
Week 2: Remote Procedure Calls and Web Services
The Remote Procedure Call (RPC) provides an abstraction for process-to-process communication that enables calling functions on remote systems without having to handle data formatting and message parsing. RPC transforms network communication from I/O-based read/write interfaces to familiar procedure call semantics.
RPC Operation
The ability to call a remote function is provided by stub functions. Instead of calling a remote function, the client calls a local client stub. On the server side, the server stub receives the request and calls the local function on that machine.
A client stub (proxy) is a local function with the same interface as the remote procedure. It packages the parameters, sends them to the server, waits for a reply, extracts the results, and returns them to the caller. The client calls the stub as if it were a local function – because it is.
A server stub (skeleton) registers the service, awaits incoming requests, extracts data from requests, calls the actual procedure, packages results, and sends responses back to clients. The server’s actual procedure doesn’t know it’s being called remotely.
Static stub generation uses an RPC compiler to read an interface definition and generate stub code before compilation. Dynamic stub generation creates proxy objects at runtime via reflection, eliminating the need for separate compilation. Languages like Java, Python, Ruby, and JavaScript support dynamic generation. Languages like C, C++, Rust, and Go require static generation.
Marshalling is the process of packaging data into a network message. Serialization converts data elements into a flat byte array suitable for transmission. These terms are often used interchangeably in RPC contexts, though marshalling may include additional metadata like function identifiers or version numbers.
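As a concrete illustration, here is a minimal sketch (not any particular framework; names such as add_remote and DISPATCH_TABLE are made up) of what generated stubs and marshalling amount to, with an in-memory call standing in for the network:

```python
import json

# The actual remote procedure, living on the "server".
def add(a, b):
    return a + b

# Server stub (skeleton): unmarshal the request, dispatch, marshal the reply.
def server_stub(request_bytes):
    request = json.loads(request_bytes)
    result = DISPATCH_TABLE[request["proc"]](*request["args"])
    return json.dumps({"result": result}).encode()

DISPATCH_TABLE = {"add": add}

# Client stub (proxy): same signature as the remote procedure.
def add_remote(a, b):
    request = json.dumps({"proc": "add", "args": [a, b]}).encode()  # marshalling
    reply_bytes = server_stub(request)   # stand-in for "send and wait for reply"
    return json.loads(reply_bytes)["result"]

print(add_remote(2, 3))  # the caller cannot tell this was not a local call -> 5
```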
RPC is synchronous by default: the client blocks until the server responds. This matches local procedure call behavior but can be problematic for slow services. Some frameworks offer asynchronous variants where the client continues execution and handles responses later through callbacks or futures.
Challenges in RPC
Partial failure is the fundamental challenge that distinguishes distributed systems from local computing. Local procedure calls either work or the entire process crashes. With remote calls, the server may fail, the network may be slow or unreliable, or responses may be lost. The client cannot tell the difference between a slow server and a failed one.
Parameter passing poses challenges because most parameters in programs are passed by value, which makes remote communication easier by sending the data. However, some parameters are passed by reference. A memory address is meaningless on a remote machine. The common solution is copy-restore (also called copy-in/copy-out): send the referenced data to the remote side, let the server function use a local reference, then send the possibly modified data back to the client. This can be expensive for large data structures.
Data representation differs across machines. Different processors use different byte ordering: big-endian systems store the most significant byte first, while little-endian systems store the least significant byte first. Intel and AMD processors use little-endian. Network protocols traditionally use big-endian (network byte order). Serialization formats must handle these differences so that data can be correctly interpreted on any machine.
Two strategies exist for handling data representation differences:
- A canonical format requires all senders to convert data to a standard format before transmission. ONC RPC used this approach with XDR, which always uses big-endian byte order. The disadvantage is that conversion happens on every message, even when communicating between identical architectures.
- Receiver-makes-right allows the sender to transmit data in its native format, and the receiver converts only if necessary. DCE RPC used this approach with NDR. This avoids unnecessary conversion when machines share the same architecture.
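A small sketch using Python’s struct module to make byte ordering concrete; the two halves correspond to the canonical-format and receiver-makes-right strategies above:

```python
import struct

value = 1

# Canonical format (e.g., XDR): every sender emits big-endian "network byte order".
big = struct.pack(">I", value)      # b'\x00\x00\x00\x01'

# Native little-endian layout, as an Intel/AMD sender would store it in memory.
little = struct.pack("<I", value)   # b'\x01\x00\x00\x00'

# Receiver-makes-right: the receiver checks the sender's declared format and
# converts only if it differs from its own.
sender_format, payload = "<", little
decoded = struct.unpack(sender_format + "I", payload)[0]
print(big, little, decoded)         # different byte layouts, same value 1
```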
Interface Definition and Type Systems
Interface Definition Languages (IDLs) describe service interfaces in a language-neutral way. IDLs enable stub generation for multiple languages and provide a contract between clients and servers.
Schemas define data structure, field types, and encoding rules. They enable validation, code generation from definitions, and structured evolution as interfaces change over time. Common schema formats include Protocol Buffers, JSON Schema, Avro, and Thrift IDL.
Schemas define the data that gets serialized. Serialization is important for marshalling and for data storage in file systems or object stores.
Serialization formats use one of two approaches for encoding data:
- Implicit typing transmits only values without type information, requiring the receiver to know the expected sequence of parameters. This is efficient but does not easily support optional parameters.
- Explicit typing transmits type information with each value, making data self-describing but larger. JSON and XML use explicit typing.
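A hedged sketch contrasting the two approaches: the explicitly typed JSON message is self-describing, while the implicitly typed binary message sends only the values and assumes the receiver already knows the field order and types:

```python
import json
import struct

name, age = "Alice", 30

# Explicit typing: field names and types are recoverable from the message itself.
explicit = json.dumps({"name": name, "age": age}).encode()

# Implicit typing: only values are sent; the receiver must already know the
# message is a length-prefixed UTF-8 string followed by a 32-bit integer.
encoded_name = name.encode()
implicit = struct.pack(">I", len(encoded_name)) + encoded_name + struct.pack(">i", age)

print(len(explicit), len(implicit))  # the implicit encoding is smaller
```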
Protocol Buffers (often called protobuf) is a binary serialization format developed at Google. Each field in a message has a unique numeric tag that identifies it in the binary format. Fields may be omitted and will take default values. Protocol Buffers is compact, efficient, strongly typed, and supports schema evolution because new fields can be added with new tags and old code simply ignores fields it does not recognize. It is typically 3 to 10 times smaller than equivalent JSON and parsed 10 to 100 times faster. Other binary serialization formats exist, such as Apache Avro (used in big data ecosystems) and Apache Thrift.
Versioning allows interfaces to change while maintaining compatibility. Backward compatibility means new code can read old data. Forward compatibility means old code can read new data. Strategies include adding optional fields, avoiding field removal, and maintaining multiple service versions simultaneously.
Versioning is critical in distributed environments because you cannot count on all clients getting updates. For serialized stored data, versioning is critical because new software may encounter older data formats.
Early RPC Systems
ONC RPC (Sun’s Open Network Computing RPC) was one of the first RPC systems to achieve widespread use, thanks to the popularity of Sun workstations and NFS. It introduced several concepts that became standard:
- An interface definition with the rpcgen compiler that creates stubs.
- Program numbers for service identification.
- A name service to allow clients to look up program numbers to find the current port number for the service.
- Versioning support to allow gradual client migration.
- XDR (eXternal Data Representation) as a canonical binary encoding format.
DCE RPC (Distributed Computing Environment RPC), defined by the Open Group, addressed some ONC RPC limitations. It introduced:
- UUIDs (128-bit Universal Unique Identifiers) to replace manually chosen program numbers and eliminate collision risk.
- Cells as administrative groupings of machines with directory servers for location transparency.
- Receiver-makes-right data representation (NDR), which allows the sender to transmit data in its native format so the receiver converts only if necessary, avoiding unnecessary conversion between machines with identical architectures.
DCE RPC became the foundation for Microsoft’s RPC implementation and later DCOM.
Service Discovery
Services must be located before they can be invoked. Service discovery mechanisms include name servers that map service names to network locations, DNS for domain-based resolution, configuration files with hardcoded endpoints, and service meshes that manage service-to-service communication with sidecars.
Security
RPC calls often traverse networks where messages could be intercepted or modified. Security concerns include authentication (verifying the identity of clients and servers) and encryption (preventing eavesdroppers from reading message contents). Modern RPC systems typically use TLS to address both concerns.
Reliability and Failure Handling
Timeouts prevent clients from waiting indefinitely for unresponsive servers. A timeout specifies a duration, such as waiting at most 5 seconds.
Deadlines specify an absolute time by which an operation must complete, such as completing by 10:30:00 UTC. Deadlines are propagated through call chains so downstream services know how much time remains. With deadline propagation, if insufficient time remains, a service can fail fast rather than starting work that it cannot complete.
Cancellation allows clients to terminate in-progress requests when results are no longer needed, freeing resources on both client and server.
Retries handle transient failures but must be used carefully.
- Idempotent operations produce the same result when executed multiple times and are safe to retry. Examples include retrieving the contents of a shopping cart or setting a user’s name to a specific value.
- Non-idempotent operations produce side effects if run multiple times and require careful handling to avoid duplicate processing. Examples include transferring money between accounts or adding an item to a shopping cart.
- RPC frameworks may offer at-least-once or at-most-once semantics for function calls. With at-least-once semantics, the RPC library may resend the request if it does not receive a timely response, so a remote procedure may execute one or more times. With at-most-once semantics, the RPC system tries to ensure the server executes the procedure no more than once, typically by tagging requests with unique IDs and suppressing duplicates. Local procedure calls have exactly-once semantics, but achieving exactly-once for remote calls is extremely difficult because you cannot distinguish “the server never received the request” from “the server processed it, but the response was lost.”
- Idempotency keys provide a solution for non-idempotent operations that must be retryable. The client generates a unique identifier (typically a UUID) for each logical request. If a retry occurs, the same key is sent. The server stores results keyed by this identifier and returns the cached result for duplicate requests, avoiding re-executing the operation. Design challenges include determining how long to store keys and results, ensuring identifiers are never reused, and handling storage persistence across system restarts.
Idempotency keys differ from request IDs in their purpose. An idempotency key ensures an operation executes at most once by deduplicating requests at the application level. A request ID (or correlation ID) is used for observability and debugging, allowing you to trace a single request through multiple services in logs and traces.
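A minimal sketch of server-side deduplication with idempotency keys; the names handle_request and charge_card are hypothetical, and a real system would persist the key-to-result table rather than keeping it in memory:

```python
import uuid

processed = {}  # idempotency key -> stored result (persistent storage in practice)

def charge_card(amount):
    # Non-idempotent side effect; stands in for a real payment call.
    return {"charged": amount, "txn": str(uuid.uuid4())}

def handle_request(idempotency_key, amount):
    if idempotency_key in processed:      # duplicate (a retry): return the cached result
        return processed[idempotency_key]
    result = charge_card(amount)          # first time: actually execute the operation
    processed[idempotency_key] = result
    return result

key = str(uuid.uuid4())                   # generated once per logical request
first = handle_request(key, 25)
retry = handle_request(key, 25)           # retry after a lost response
print(first == retry)                     # True: the charge happened only once
```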
Exponential backoff progressively increases the delay between retry attempts to avoid overwhelming struggling services. The delay adds random jitter to prevent synchronized retry storms in which many clients retry at exactly the same time.
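A sketch of retry with exponential backoff and full jitter; the function name and default limits are illustrative, not taken from any specific library:

```python
import random
import time

def call_with_retries(operation, max_attempts=5, base_delay=0.1, max_delay=5.0):
    """Retry a flaky operation with exponential backoff plus random jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise                               # out of attempts: surface the error
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))    # jitter avoids synchronized retry storms

# Example: an operation that fails the first two times, then succeeds.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(call_with_retries(flaky))  # -> "ok" after two retries
```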
Circuit breakers prevent cascading failures by failing fast when a dependency is unhealthy. A circuit breaker monitors failure rates and trips when failures exceed a threshold, immediately failing subsequent requests without attempting the call. After a cooling-off period, the circuit breaker allows test requests through, and if these succeed, normal operation resumes. This prevents requests from piling up waiting for timeouts and gives the failing service time to recover.
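A simplified circuit-breaker sketch; the threshold, cooldown, and single open/half-open/closed bookkeeping are illustrative, while production libraries typically track failure rates over time windows:

```python
import time

class CircuitBreaker:
    """Fail fast after repeated failures; allow a test request after a cooldown."""
    def __init__(self, failure_threshold=3, cooldown=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None            # None means the circuit is closed (healthy)

    def call(self, operation):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            # Cooldown elapsed: let one test request through ("half-open").
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()   # trip the breaker
            raise
        self.failures = 0                      # success: reset and close the circuit
        self.opened_at = None
        return result
```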
Distributed Objects and Lifecycle Management
Traditional RPC invokes stateless procedures. Distributed objects extend RPC to support object-oriented programming with stateful remote objects. Objects have identity (each instance is distinct), state (maintained between method calls), and lifecycle (objects must be instantiated and destroyed). This introduces challenges around object identity, state management, and lifecycle.
Microsoft DCOM (Distributed Component Object Model) extended COM to support remote objects. It used a binary interface standard supporting cross-language interoperability and tracked object instances via interface pointer identifiers.
DCOM introduced surrogate processes, which are generic host processes on the server that dynamically load components as clients request them. This eliminated the need for developers to write custom server processes and provided fault isolation since component crashes affected only the surrogate, not the client.
DCOM faced challenges with configuration complexity, platform dependence (Windows only), and stateful object management. It evolved into COM+, which added transactions, security, and object pooling.
Java RMI (Remote Method Invocation) is Java’s native approach to distributed objects. Remote objects implement interfaces extending java.rmi.Remote and are registered with an RMI registry. Clients receive stubs that transparently handle marshalling and communication. RMI uses dynamic stub generation through reflection and Java’s native serialization. It remains relevant primarily in legacy enterprise Java applications.
Distributed Garbage Collection
Because objects are created when needed, they must be deleted when no longer needed. In a local program, the garbage collector tracks object references and frees memory when an object has no more references. In a distributed system, references span machine boundaries, and the garbage collector on one machine cannot see references held by another.
Reference counting tracks how many clients hold references to each object and deletes them when the count reaches zero. This fails when clients crash without releasing references, when messages are lost, or during network partitions. Systems like Python’s RPyC use reference counting. Microsoft DCOM also used reference counting but fell back to leases.
Leases provide time-limited references to objects. Clients must periodically renew leases before they expire. If a client crashes or loses connectivity, the lease expires, and the server can safely delete the object. Leases use explicit time bounds: “you own this reference until time T, unless you renew.”
Heartbeats (also called keep-alive pinging) require clients to periodically send messages listing all objects they are using. The server treats silence as abandonment. The key difference from leases is that heartbeats imply continuous proof of liveness: “I will assume you’re dead if I stop hearing from you.” Java RMI uses lease-based garbage collection with explicit dirty/clean messages. Microsoft DCOM/COM+ uses keep-alive pinging, where clients periodically send ping sets listing active objects.
Both approaches accept that objects might occasionally be deleted prematurely if network delays prevent timely renewal, or kept alive slightly longer than necessary. These tradeoffs are generally acceptable given the alternative of memory leaks from crashed clients.
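As a rough sketch of the server-side bookkeeping a lease-based collector might keep (the class and method names are made up for illustration):

```python
import time

class LeaseTable:
    """Server-side bookkeeping for lease-based distributed garbage collection."""
    def __init__(self, lease_duration=60.0):
        self.lease_duration = lease_duration
        self.expiry = {}                        # object id -> lease expiration time

    def grant_or_renew(self, object_id):
        self.expiry[object_id] = time.time() + self.lease_duration
        return self.expiry[object_id]           # client must renew before this time

    def collect_expired(self):
        now = time.time()
        expired = [oid for oid, t in self.expiry.items() if t < now]
        for oid in expired:
            del self.expiry[oid]                # safe to delete: client crashed or lost interest
        return expired
```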
Why Web Services
Traditional RPC systems (ONC RPC, DCE RPC, DCOM) were designed for enterprise networks. They faced challenges on the internet: proprietary binary formats created interoperability problems across organizations, dynamic ports were blocked by firewalls, stateful objects complicated replication and load balancing, and strict request-response patterns could not handle streaming or notifications.
Web services emerged to solve these problems. By using HTTP as the transport, services could work through firewalls and proxies. By using text-based formats such as XML and JSON, services can achieve interoperability across different platforms and languages. The web’s existing infrastructure for security, caching, and load balancing could be reused.
XML-RPC was one of the first web services, marshalling data into XML messages transmitted over HTTP. It was human-readable, used explicit typing, and worked through firewalls. It failed to gain widespread adoption due to limited data types, lack of extensibility, XML’s verbosity, and missing features for interface definitions and schemas.
SOAP (formerly Simple Object Access Protocol) extended XML-RPC with user-defined types, message routing, and extensibility. SOAP supported various interaction patterns: request-response, request-multiple-response, asynchronous notification, and publish-subscribe. WSDL (Web Services Description Language) served as SOAP’s IDL for describing data types, messages, operations, and protocol bindings. SOAP declined due to complexity, heavyweight frameworks, interoperability issues between implementations, and verbosity.
REST
REST (Representational State Transfer) treats the web as a collection of resources rather than remote procedures. Resources are data identified by URLs. HTTP methods operate on resources using CRUD operations (Create, Read, Update, Delete):
- POST creates new resources and is not idempotent because repeating the same POST may create multiple resources
- GET retrieves resources and is idempotent because repeating the same GET returns the same result
- PUT updates entire resources and is idempotent because repeating the same PUT produces the same state
- PATCH updates parts of resources and may or may not be idempotent depending on the implementation
- DELETE removes resources and is idempotent because repeating the same DELETE on a removed resource has no additional effect
REST is stateless, meaning each request contains all information needed to process it. Servers do not maintain client session state between requests. This improves scalability but shifts state management to clients or to the server application using session IDs and data stores.
REST typically uses JSON for data serialization. JSON is human-readable, simple, and has native browser support. However, it lacks schemas, requires parsing overhead, and has larger message sizes than binary formats.
REST emphasizes discoverability, with responses containing links to related resources that allow clients to navigate the API. REST is well-suited for CRUD operations but can be awkward for complex operations that do not map cleanly to resources.
gRPC
gRPC was developed at Google and uses Protocol Buffers for serialization and HTTP/2 for transport. HTTP/2 allows multiple requests and responses to flow over a single TCP connection simultaneously, uses compact binary headers, and supports server-initiated messages.
gRPC supports four communication patterns: unary (traditional request-response), server streaming (one request, multiple responses), client streaming (multiple requests, one response), and bidirectional streaming (multiple requests and responses interleaved).
gRPC includes built-in support for deadlines, cancellation, and metadata propagation. It has become the dominant choice for internal microservice communication.
gRPC is preferred for internal service-to-service communication, performance-critical paths, streaming requirements, and strongly typed contracts. REST over HTTP remains the default for public APIs, browser-based applications, simple CRUD operations, and situations where human readability is important.
Service-Oriented Architecture and Microservices
Service-Oriented Architecture (SOA) is an architectural approach that treats applications as integrations of network-accessible services with well-defined interfaces. Services are designed to be reusable components that communicate through standardized protocols. SOA emerged in the late 1990s and early 2000s alongside SOAP. SOA emphasized loose coupling between services, allowing them to evolve independently.
Microservices are a modern evolution of SOA principles. Both approaches structure applications as collections of services with well-defined APIs. The key differences are in implementation philosophy:
- SOA typically used heavyweight middleware (enterprise service buses) for communication and orchestration, while microservices favor lightweight protocols like REST and gRPC with direct service-to-service communication.
- SOA services were often large and shared databases, while microservices emphasize smaller services where each service owns its own data.
- SOA governance was centralized, while microservices favor decentralized governance where teams choose their own technologies.
Microservices can be viewed as SOA implemented with modern tooling and a bias toward simplicity. The core insight is the same: decompose applications into independent services that can evolve separately.
Benefits of microservices include independent scaling of components, technology flexibility per service, fault isolation, and parallel development by multiple teams. Drawbacks include distributed system complexity, network overhead, complex debugging and testing, operational burden, and eventual consistency challenges.
The modular monolith alternative keeps a single deployable unit while enforcing internal module boundaries. It captures organizational benefits like clearer ownership and cleaner interfaces without paying the cost of distributed communication.
Microservices make sense when independent change is the real problem: different teams, different release schedules, or very different scaling requirements. They are a poor fit when the main problem is product iteration, when teams are small, or when operational overhead is unaffordable.
Observability
Observability is the ability to reconstruct what happened from the signals a system emits. When requests cross multiple services, failures and slowdowns rarely show up where the problem actually is.
Logs record discrete events and are most useful when they include a request ID.
Traces record the path of a single request through the system. A trace consists of spans, where each span represents one operation and includes timing and parent-child relationships. Tracing answers “where did the time go?” across multiple services.
Request IDs (also called correlation IDs or trace IDs) are generated when a request enters the system and propagated through all downstream calls. The ID is included in logs and trace context to reconstruct the full request path during debugging. Request IDs are for observability, helping you understand what happened. They differ from idempotency keys, which ensure operations are not duplicated.
Circuit breakers prevent cascading failures by failing fast when a dependency is unhealthy, rather than letting requests pile up waiting for timeouts.
General Summary
RPC transforms low-level network communication into familiar procedure call interfaces through generated stubs, marshalling, and interface definitions. The evolution from traditional RPC through SOAP and REST to gRPC reflects the changing landscape of distributed computing: from enterprise networks to the internet, from XML to JSON and Protocol Buffers, from simple request-response to streaming.
Communication in distributed systems requires careful attention to reliability and failure handling. Timeouts prevent resource exhaustion. Retries handle transient failures. Idempotency and idempotency keys make operations safe to retry. Circuit breakers prevent cascading failures. Observability with request IDs and tracing enables understanding system behavior.
The right technology choice depends on context: REST for public APIs and simplicity, gRPC for internal communication and performance, with many systems using multiple technologies for different purposes.
Microservices are a tool for organizational scaling, not a technical silver bullet. They make sense when independent deployment is the real problem, but introduce significant operational complexity.
What You Don’t Need to Study
- The differences between big-endian and little-endian (just know that there are differences in encoding data)
- Any specific port numbers or network configurations
- Historical dates or version numbers
- Specific programming language APIs or library names beyond the concepts they illustrate
- Modern RPC frameworks other than gRPC (Apache Thrift, Smithy, Finagle implementation details)
- Binary serialization formats other than Protocol Buffers (Apache Avro, Thrift, Cap’n Proto, FlatBuffers)
- Java RMI API specifics (registry methods, UnicastRemoteObject details, rmic tool)
- DCOM specifics beyond understanding its goals, surrogate processes, and garbage collection approach
- Security implementations (you don’t need to know how TLS works; just that using HTTPS provides a secure transport)
- XML-RPC message format syntax
- SOAP message structure or WSDL syntax
- Specific company examples for RPC or web services deployments (Twilio Segment, Amazon Prime Video migrations)
- Specific HTTP header names or status codes beyond the basic methods (GET, POST, PUT, PATCH, DELETE)
- Details of HTTP/2 implementation (like frame types, flow control mechanisms, HPACK compression); just know it’s binary and multiplexes communication channels
- Traces, Health Checks, Metrics
- GraphQL
- Message queues (these will be covered in later lectures)
Week 3: Clock Synchronization
UTC (Coordinated Universal Time) is the primary time standard used worldwide to track time. Time zones are an offset from UTC. Computers track time and try to stay close to UTC. They accomplish this by synchronizing their clocks with a system that knows the time.
How Computers Keep Time
Computers maintain time using hardware and software components. When a computer boots, the operating system reads time from a battery-powered real-time clock (RTC), a chip designed for low power consumption to survive power outages, not for accuracy. Once the OS is running, it maintains a more accurate system clock by reading high-resolution hardware counters, such as the Intel timestamp counter (TSC) or the ARM Generic Timer.
Most systems represent time as elapsed seconds since an epoch: January 1, 1970, 00:00:00 UTC for Unix systems, or January 1, 1601 for Windows. This representation avoids timezone confusion, daylight saving time ambiguities, and makes time arithmetic simple integer operations. The system clock that applications see is called wall time: it tracks UTC but can jump when synchronization corrections are applied.
Accuracy, Precision, and Resolution
Accuracy measures how close a measurement is to the true value. If your clock shows 12:00:00.005 and true UTC is 12:00:00.000, your clock has 5ms of error.
Resolution is the smallest time increment a clock can represent. A nanosecond-resolution clock can distinguish events 1 nanosecond apart. Higher resolution does not guarantee accuracy.
Precision is the consistency of repeated measurements. A clock consistently 5ms fast is precise but not accurate.
When we say “NTP achieves 10ms accuracy,” we mean clocks are within 10ms of true UTC, not that they measure time in 10ms increments.
Why Physical Clocks Drift
All physical clocks drift. A quartz oscillator’s frequency depends on temperature, manufacturing variations, atmospheric pressure, humidity, and aging. Consumer hardware typically drifts at 50-100 parts per million (ppm), meaning clocks can drift apart by almost nine seconds per day. Without synchronization, distributed systems quickly lose any agreement on time.
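As a quick sanity check on that figure, a clock that runs fast by 100 ppm gains
\[10^{-4} \times 86{,}400\ \text{s/day} \approx 8.6\ \text{seconds per day.}\]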
The Clock Model
A physical clock can be modeled as:
\[C(t) = \alpha t + \beta\]
where:
- \(C(t)\) is the clock’s reading at true time \(t\)
- \(\alpha\) represents the clock rate (ideally 1.0, but drift causes deviation)
- \(\beta\) represents the offset from true time
Drift is the rate error – how fast a clock runs compared to true time. Offset is the instantaneous difference between a clock and true time. Even after perfect synchronization (zero offset), drift causes the offset to grow again. This is why periodic resynchronization is essential.
Clock Adjustment
When synchronization detects an offset, systems prefer slewing over stepping:
Slewing gradually adjusts the clock by making ticks slightly longer or shorter, maintaining monotonic time. This is preferred for small offsets (typically below 128ms).
Stepping instantly jumps to the correct time. This may be used for larger offsets (often ≥ 128ms). Stepping may break applications measuring durations or using timestamps (e.g., software build environments).
Cristian’s Algorithm
The simplest synchronization approach sends a request to a time server and receives a timestamped reply. The challenge is network delay: by the time the response arrives, it no longer reflects the current time. Cristian’s algorithm assumes the delay is symmetric and the timestamp was generated at the midpoint of that delay.
Algorithm:
- Client sends request at local time \(t_0\)
- Server responds with timestamp \(T_S\)
- Client receives reply at \(t_1\)
- Client estimates time as \(T_S + \frac{t_1 - t_0}{2}\)
In reality, the server’s timestamp may have been generated before or after the midpoint of the delay, leading to an error in the time estimate. If we know the best-case (minimum) network transit time, we can bound the error more tightly than the round-trip delay alone allows.
Error bound: If the minimum one-way delay is \(t_{\min}\), the error will be:
\[\epsilon \leq \frac{(t_1 - t_0) - 2t_{\min}}{2}\]
Clients can retry to find the lowest round-trip time, which yields the tightest error bound.
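A small worked sketch of the estimate and error bound (times in seconds; the helper name is made up):

```python
def cristian_estimate(t0, t1, server_time, t_min=0.0):
    """Estimate current time and error bound from one request/response exchange."""
    round_trip = t1 - t0
    estimate = server_time + round_trip / 2      # assume the reply took half the RTT
    error_bound = (round_trip - 2 * t_min) / 2   # tighter if the minimum one-way delay is known
    return estimate, error_bound

# Example: 50 ms round trip, server stamped 200.020, minimum one-way delay 5 ms.
est, err = cristian_estimate(t0=100.000, t1=100.050, server_time=200.020, t_min=0.005)
print(est, err)   # approximately 200.045 with an error bound of +/- 0.020 s
```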
Additive errors: When machines synchronize in chains (A from B, B from C), errors accumulate. A’s total error is \(\epsilon_A + \epsilon_B\). This is why systems generally try to avoid deep synchronization hierarchies.
A limitation of Cristian’s algorithm is that it has a single point of failure: the server.
Network Time Protocol (NTP)
NTP solves the single point of failure problem through a hierarchical architecture:
- Stratum 0: Reference sources (GPS, atomic clocks)
- Stratum 1: Servers synchronizing directly from stratum 0
- Stratum 2: Servers synchronizing directly from stratum 1 servers
- Higher strata: Synchronize from lower strata (maximum 15 levels)
Fault tolerance through multiple sources: NTP encourages systems to query multiple time servers and use statistical techniques to identify and reject outliers. NTP combines the remaining time offset estimates using a weighted average, with more weight given to more reliable servers. NTP tracks each server’s jitter (delay variation) and dispersion (accumulated timing uncertainty), favoring more reliable sources.
Synchronization algorithm uses four timestamps:
- \(T_1\): Client sends request
- \(T_2\): Server receives request
- \(T_3\): Server sends response
- \(T_4\): Client receives response
Offset: \(\theta = \frac{(T_2 - T_1) + (T_3 - T_4)}{2}\)
The network delay is the round-trip time minus the estimate of the processing delay on the server:
Delay: \(\delta = (T_4 - T_1) - (T_3 - T_2)\)
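The same four-timestamp calculation as a small sketch (the helper name is made up; times are in seconds):

```python
def ntp_offset_delay(t1, t2, t3, t4):
    """Compute clock offset and network delay from NTP's four timestamps."""
    offset = ((t2 - t1) + (t3 - t4)) / 2   # positive means the server clock is ahead of ours
    delay = (t4 - t1) - (t3 - t2)          # round trip minus server processing time
    return offset, delay

# Example: client sends at 10.000 and receives the reply at 10.020 by its own (slow)
# clock; the server received at 10.110 and replied at 10.112 by its clock.
print(ntp_offset_delay(10.000, 10.110, 10.112, 10.020))  # offset ~0.101 s, delay ~0.018 s
```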
Clock discipline gradually adjusts the system clock. For small offsets (< 128ms), it slews. For large offsets, it steps. The discipline learns and compensates for drift over time by adjusting the tick frequency of the system clock.
SNTP is a simplified subset suitable for clients that only consume time. It omits sophisticated filtering and clock discipline of full NTP.
Precision Time Protocol (PTP)
PTP achieves sub-microsecond synchronization through hardware timestamping. Network interface cards with PTP support capture packet transmission and receipt timestamps at the physical layer, eliminating millisecond-level variability from software network stacks.
Architecture: A grandmaster clock provides authoritative time. Unlike NTP, where clients initiate requests, PTP is master-initiated: the grandmaster periodically multicasts sync messages.
The Best Master Clock Algorithm (BMCA) automatically selects the most suitable grandmaster based on priority, clock quality, accuracy, and stability.
PTP uses a four-message exchange:
- Sync message at \(T_1\)
- Follow_Up containing \(T_1\)
- Delay_Req from slave at \(T_3\)
- Delay_Resp containing \(T_4\)
The first two messages exist because of hardware limitations: the only purpose of the Follow_Up message is to send the timestamp of the Sync message (\(T_1\)).
Offset: \(\frac{(T_2 - T_1) - (T_4 - T_3)}{2}\)
Cost: Unlike NTP, PTP requires specialized network cards and switches with hardware timestamping support.
When Physical Time Is Not Enough
Even perfectly synchronized physical clocks cannot order events that occur faster than clock resolution. At hundreds of thousands of events per second, many events share the same timestamp. More fundamentally, network delays obscure true ordering: an event timestamped earlier at one machine might arrive at another machine after local events with later timestamps.
For distributed databases, what matters is causality: whether one update could have seen another, not chronology. This leads to logical clocks.
What You Don’t Need to Study
- Specific oscillator frequencies (e.g., 32,768 Hz for RTC) or piezoelectric physics
- ppm values for specific oscillator types
- The Intel TSC or ARM Generic Timer
- Windows epoch date (1601); knowing the Unix epoch (1970) is sufficient
- Any exact thresholds for slewing vs stepping
- Details of adjtimex system call
- Specific accuracy numbers for different NTP configurations
- The formulas for computing NTP or PTP offset and delay
- PTP message format details
Week 3: Logical Clocks
The Limits of Physical Time
Physical clock synchronization cannot solve all ordering problems in distributed systems. Even with perfectly synchronized clocks, events can occur faster than clock resolution. Many events share the same timestamp. More fundamentally, network transmission delays mean that when an event is timestamped at one machine, its timestamp may not reflect when it becomes visible elsewhere.
Consider a distributed database where two clients concurrently update the same record. What matters is not when the updates occurred in absolute time, but whether one update could have seen the other. If the updates are truly concurrent (neither saw the other), the system needs conflict resolution. If one happened after the other, the later one supersedes the earlier. This is a question of causality, not chronology.
The Happened-Before Relationship
Leslie Lamport’s 1978 paper revolutionized distributed systems by recognizing that physical time does not matter. What matters is the potential causal relationship between events. Lamport defined the happened-before relationship, written \(\rightarrow\), as a partial ordering on events. A partial ordering allows some pairs of elements to be incomparable: concurrent events have no ordering. This contrasts with a total ordering where every pair must be ordered consistently.
Definition: For events a and b:
- If a and b occur on the same process and a occurs before b in that process’s execution, then \(a \rightarrow b\)
- If a is the event of sending a message and b is the event of receiving that message, then \(a \rightarrow b\)
- If \(a \rightarrow b\) and \(b \rightarrow c\), then \(a \rightarrow c\) (transitivity)
If neither \(a \rightarrow b\) nor \(b \rightarrow a\), we say a and b are concurrent, written \(a \parallel b\). Concurrent events happened independently with no potential causal influence.
The happened-before relationship captures potential causality: if \(a \rightarrow b\), then information from event a could have reached event b through the system’s communication channels.
Lamport Timestamps
Lamport timestamps assign each event a logical clock value such that if \(a \rightarrow b\), then timestamp(a) < timestamp(b).
Note the direction: we guarantee causally ordered events have increasing timestamps, but the converse is not true. If timestamp(a) < timestamp(b), we cannot conclude that \(a \rightarrow b\). The events might be concurrent.
Algorithm:
- Each process maintains a counter, initially zero
- On an internal event: increment the counter
- When sending a message: increment the counter and include its value in the message
- When receiving a message with timestamp T: set counter = max(counter, T) + 1
This ensures the happened-before property: if \(a \rightarrow b\) through a chain of events and messages, each step increases the counter, so timestamp(a) < timestamp(b).
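A minimal sketch of the algorithm as a counter per process (not tied to any particular framework):

```python
class LamportClock:
    def __init__(self):
        self.counter = 0

    def local_event(self):
        self.counter += 1
        return self.counter

    def send(self):
        self.counter += 1
        return self.counter            # timestamp to attach to the outgoing message

    def receive(self, msg_timestamp):
        self.counter = max(self.counter, msg_timestamp) + 1
        return self.counter

# Two processes: P1 sends to P2 after one local event.
p1, p2 = LamportClock(), LamportClock()
p1.local_event()                       # P1: 1
ts = p1.send()                         # P1: 2, attached to the message
p2.receive(ts)                         # P2: max(0, 2) + 1 = 3
```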
Creating a total ordering: Lamport timestamps only provide a partial ordering: some timestamps may be identical. We can extend them to make each timestamp unique by combining Lamport timestamps with process IDs:
\((t_1, p_1) < (t_2, p_2)\) if \(t_1 < t_2\), or if \(t_1 = t_2\) and \(p_1 < p_2\)
This breaks ties by process ID, giving every pair of events a definite order even if they are concurrent.
Limitation: Lamport timestamps cannot detect concurrency. If you observe timestamp(a) < timestamp(b), you cannot determine whether a causally precedes b or whether they are concurrent.
Vector Clocks
Vector clocks fully capture causal relationships and can detect concurrent events. Instead of maintaining a single counter, each process maintains a vector of counters: one for each process in the system.
For a process \(P_i\), the entry \(V_i[j]\) represents “process i’s knowledge of how many events process j has executed.” Initially all entries are zero.
Algorithm:
- On an internal event at \(P_i\): increment \(V_i[i]\)
- When sending a message from \(P_i\): increment \(V_i[i]\) and include the entire vector in the message
- When receiving a message with vector \(V_{msg}\):
  - For all k: set \(V_i[k] = \max(V_i[k], V_{msg}[k])\)
  - Then increment \(V_i[i]\)

In words, a vector clock increments the count in the vector position representing its own process. When a message is received, the process updates each element of its vector with the corresponding value in the received vector if that value is greater, and then increments its own entry.
This propagates knowledge: when you receive a message, you learn everything the sender knew.
Comparing vector clocks:
Vectors \(V_a\) and \(V_b\) satisfy:
- \(a \rightarrow b\) if \(V_a[i] \leq V_b[i]\) for all i, and \(V_a[j] < V_b[j]\) for at least one j
- Same event if \(V_a[i] = V_b[i]\) for all i
- Concurrent (\(a \parallel b\)) if neither of the above (there exist indices where one is greater and other indices where it is smaller)
Implementation: In practice, vector clocks are implemented as sets of (processID, counter) tuples rather than fixed-size arrays. This handles systems where processes join dynamically and not all processes communicate with all others. You only track processes you have heard from.
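A sketch along those lines, using a dictionary keyed by process ID in place of a fixed-size array; the helper names are made up:

```python
def vc_increment(vc, pid):
    """Local or send event at process pid: bump its own entry."""
    vc = dict(vc)
    vc[pid] = vc.get(pid, 0) + 1
    return vc

def vc_receive(vc, pid, msg_vc):
    """Merge the received vector elementwise, then count the receive event."""
    merged = {k: max(vc.get(k, 0), msg_vc.get(k, 0)) for k in set(vc) | set(msg_vc)}
    return vc_increment(merged, pid)

def vc_compare(a, b):
    keys = set(a) | set(b)
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and not b_le_a:
        return "a -> b"
    if b_le_a and not a_le_b:
        return "b -> a"
    return "equal" if a == b else "concurrent"

a = vc_increment({}, "P1")                      # {"P1": 1}
b = vc_receive({}, "P2", a)                     # {"P1": 1, "P2": 1}
c = vc_increment({}, "P3")                      # {"P3": 1}
print(vc_compare(a, b), "|", vc_compare(a, c))  # a -> b | concurrent
```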
Scalability: Each process maintains O(n) state where n is the number of processes. This works well for dozens to hundreds of processes, but beyond that the size of the vector can dominate the message.
Hybrid Logical Clocks (HLC)
Hybrid logical clocks bridge the gap between physical and logical time. Physical clocks provide real-world time but drift and jump. Logical clocks provide perfect causality but lose connection to wall-clock time. For some applications, like databases or version control systems, it’s useful to have both: human-friendly wall time and the ability to track causal relationships.
An HLC timestamp consists of two components:
- L (logical component): A value close to physical time, representing the maximum physical time seen (from the system time or from a received message, if that was greater)
- C (counter): Distinguishes events within the same clock tick
Together, (L, C) behaves like a logical clock while staying close to physical time.
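A hedged sketch of the usual HLC update rules as commonly described; the class and method names are made up, and physical time is passed in by the caller:

```python
class HybridLogicalClock:
    def __init__(self):
        self.l = 0   # largest physical time seen so far
        self.c = 0   # counter to break ties within the same l

    def now(self, physical_time):
        """Timestamp a local or send event."""
        old_l = self.l
        self.l = max(old_l, physical_time)
        self.c = self.c + 1 if self.l == old_l else 0
        return (self.l, self.c)

    def receive(self, physical_time, msg_l, msg_c):
        """Timestamp a receive event, merging the sender's (l, c)."""
        old_l = self.l
        self.l = max(old_l, msg_l, physical_time)
        if self.l == old_l and self.l == msg_l:
            self.c = max(self.c, msg_c) + 1
        elif self.l == old_l:
            self.c += 1
        elif self.l == msg_l:
            self.c = msg_c + 1
        else:
            self.c = 0
        return (self.l, self.c)

# Timestamps compare lexicographically: (l, c) tuples preserve happened-before.
```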
Properties:
- Preserves happened-before: if \(a \rightarrow b\), then HLC(a) < HLC(b) (lexicographic comparison)
- Stays close to physical time: if clocks are synchronized within ε using NTP, then L stays within ε of true time
- Enables time-based queries while maintaining causality
HLC works with commodity servers that synchronize their clocks (typically via NTP).
Trade-off: HLCs are more complex than pure logical clocks, with timestamps that are slightly larger (two components), but they are an elegant solution for systems that need both causal consistency and time-based queries.
Process Identification
In distributed systems, “process ID” refers to a globally unique identifier for a computation entity, not a local Unix process ID. Common approaches to identify a process are:
- Hostname + local PID: Simple but breaks if a process crashes and restarts
- Node ID: Survives restarts but requires coordination
- IP address + port: Works for long-running services with stable addresses
The reincarnation problem: If a process crashes and restarts, should it be the same process or a new one? Most systems treat it as new to avoid causality violations from lost state.
Choosing a Clock
For most distributed systems, the choice is between vector clocks and hybrid logical clocks:
Use vector clocks when:
- You need to detect concurrent conflicting updates
- Causality is critical for correctness
- Number of processes is moderate (dozens to hundreds)
- Examples: replicated databases, CRDTs, version control
Use hybrid logical clocks when:
- You need both causality and approximate real time
- You want time-based queries with causal consistency
- Examples: distributed databases with MVCC
Lamport timestamps provide the conceptual foundation but are rarely used directly—systems that need ordering usually also need concurrency detection.
Matrix clocks exist for specialized applications requiring common knowledge tracking. They require \(O(n^2)\) space and are rarely used in practice.
What You Don’t Need to Study
- Matrix clock algorithm details or structure
- Years of Lamport’s paper, vector clocks, HLC proposal
- Specific database names beyond understanding representative examples
- How systems implement version control or use clocks
- Space complexity formulas (but understand O(1) for Lamport, O(n) for vector, \(O(n^2)\) for matrix)
Week 4: Group Communication, Mutual Exclusion, and Leader Election
Sending, Receiving, and Delivering
These distinctions apply to any protocol handling reliability or ordering, not just multicast.
Sending is the act of transmitting a message from an application through the communication layer to the network.
Receiving is the act of a machine accepting a message from the network. The message has arrived at the machine but is not yet visible to the application.
Delivering is the act of passing a received message to the application. This is when the application actually sees and processes the message.
Between receiving and delivering, the communication layer may take one of three actions:
- It may deliver the message immediately if no special handling is required.
- It may discard the message if it is a duplicate.
- It may place the message in a holdback queue if it arrived out of order and must wait for earlier messages.
All ordering and reliability guarantees refer to the delivery order, not the receipt order.
IP Multicast
IP Multicast provides network-layer one-to-many delivery. It uses two protocols: IGMP handles communication between hosts and their local routers, while PIM handles the routing of multicast traffic between routers.
IGMP (Internet Group Management Protocol) operates between hosts and their directly connected routers. When a host wants to receive multicast traffic for a group, it sends an IGMP membership report (join message) to its local router. The router then ensures that multicast traffic for that group flows to that network segment. Routers periodically send IGMP queries to discover which groups have active members, and hosts respond with membership reports.
PIM (Protocol Independent Multicast) routes multicast traffic across the network. It is called “protocol independent” because it uses the existing unicast routing table rather than building its own. PIM operates in two modes that reflect different assumptions about how receivers are distributed.
PIM Dense Mode uses a flood-and-prune approach. The protocol initially floods multicast traffic to all routers, and routers with no interested receivers send prune messages upstream to stop receiving traffic. This mode is appropriate when most subnets have receivers interested in the traffic.
PIM Sparse Mode requires receivers to explicitly request traffic. The protocol uses a Rendezvous Point (RP) that serves as a meeting point for sources and receivers. When a host joins a multicast group, routers send Join messages toward the RP, building a shared distribution tree. Sources send their traffic to the RP, which then distributes it down the tree to all receivers. This means a source only needs to send one stream to the RP regardless of how many receivers exist. Sparse mode is appropriate when receivers are sparsely distributed across the network.
IP Multicast works well in controlled environments such as cable TV networks and data center trading systems. However, most ISPs block it at network boundaries due to traffic engineering complexity, billing challenges, and security concerns.
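As a concrete (not examinable) illustration, the sketch below shows how a host joins an IP multicast group using the standard sockets API. The group address 224.1.1.1 and port 5007 are arbitrary example values; the key point is that setting IP_ADD_MEMBERSHIP causes the operating system to issue the IGMP membership report on the host’s behalf.

```python
# Minimal UDP multicast receiver. The IP_ADD_MEMBERSHIP option makes the OS
# send an IGMP membership report for the group, so the local router begins
# forwarding that group's traffic to this subnet.
import socket
import struct

GROUP = "224.1.1.1"   # example multicast group address (224.0.0.0/4 range)
PORT = 5007           # example port

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", PORT))

# Join the group on all interfaces (INADDR_ANY).
mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

while True:
    data, addr = sock.recvfrom(1024)
    print(f"received {data!r} from {addr}")
```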
Application-Level Multicast
Production distributed systems implement application-level multicast over reliable unicast. Reliability and ordering are two independent dimensions that can be combined as needed.
Reliability Levels
Unreliable multicast provides best-effort delivery with no guarantees. Messages may be lost, duplicated, or delivered to only some recipients.
Best-effort reliable multicast guarantees that if the sender completes without crashing, all live recipients receive the message. It also guarantees no duplication and no spurious messages.
The implementation uses timeouts and retransmission: the sender transmits to all recipients using reliable unicast (such as TCP), waits for acknowledgments, and retransmits if an acknowledgment does not arrive within the timeout period. However, this approach does not guarantee consistency if the sender crashes mid-transmission, and messages do not survive system restarts.
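A minimal sketch of the sender-side bookkeeping this implies is shown below: track which recipients have acknowledged and decide whom to retransmit to after a timeout. The class and method names are illustrative, and the underlying reliable unicast transport is assumed to exist outside the sketch.

```python
# Sender-side bookkeeping for best-effort reliable multicast (sketch only):
# remember who has not acked yet and report whom to retransmit to on timeout.
import time

class AckTracker:
    def __init__(self, recipients, timeout=1.0):
        self.timeout = timeout
        self.pending = set(recipients)       # peers that have not acked yet
        self.sent_at = time.time()

    def record_ack(self, peer):
        self.pending.discard(peer)

    def peers_to_retransmit(self):
        """Peers whose ack is overdue; the caller resends and the timer resets."""
        if self.pending and time.time() - self.sent_at > self.timeout:
            self.sent_at = time.time()
            return set(self.pending)
        return set()

    def is_complete(self):
        return not self.pending

# Example: after multicasting a message to {"B", "C", "D"}:
tracker = AckTracker({"B", "C", "D"}, timeout=0.5)
tracker.record_ack("B")
# Once the timeout elapses, peers_to_retransmit() returns {"C", "D"}.
```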
Reliable multicast provides strong consistency guarantees even when the sender crashes during transmission. It guarantees three properties:
- Agreement: If any correct process delivers a message, all correct processes eventually deliver it.
- Integrity: Messages are delivered at most once and only if they were actually sent.
- Validity: A correct sender eventually delivers its own message.
Reliable multicast does not guarantee persistence across system restarts.
Durable multicast adds persistence to reliable multicast. Messages are written to stable storage before being acknowledged, so they survive crashes and restarts. Apache Kafka is an example of a system that provides durable multicast.
Publish-subscribe (pub/sub) systems apply group communication at scale. Publishers send to named topics; subscribers register interest in topics. The topic acts as a named multicast group. We cover this in detail later in the course.
Ordering Levels
Unordered delivery provides no guarantees about message sequence. Messages may arrive in any order at different recipients.
Single source FIFO ordering (SSF) guarantees that messages from the same sender are delivered in the order they were sent.
Formally: if a process sends multicast(G, m) and later sends multicast(G, m′), then every correct process that delivers m′ will have already delivered m.
The implementation uses per-sender sequence numbers: each sender maintains a counter and attaches it to each message, and receivers buffer messages until they can be delivered in sequence.
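The sketch below illustrates that receiver-side logic (illustration only, not exam pseudocode): messages from a sender are held back until every earlier sequence number from that same sender has been delivered.

```python
# Single-source FIFO delivery sketch: hold back out-of-order messages per sender.
from collections import defaultdict

class FifoReceiver:
    def __init__(self, deliver):
        self.deliver = deliver               # callback to the application
        self.expected = defaultdict(int)     # next expected seq number per sender
        self.holdback = defaultdict(dict)    # sender -> {seq: message}

    def on_receive(self, sender, seq, message):
        self.holdback[sender][seq] = message
        # Deliver as many in-order messages from this sender as possible.
        while self.expected[sender] in self.holdback[sender]:
            m = self.holdback[sender].pop(self.expected[sender])
            self.deliver(sender, m)
            self.expected[sender] += 1

# Example: message 1 arrives before message 0 and is held back.
r = FifoReceiver(lambda s, m: print(f"deliver {m} from {s}"))
r.on_receive("A", 1, "second")   # buffered
r.on_receive("A", 0, "first")    # delivers "first", then "second"
```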
Causal ordering guarantees that if message m1 happened-before message m2, then m1 is delivered before m2 at all processes.
Formally: if multicast(G, m) → multicast(G, m′), where → denotes happened-before, then every correct process that delivers m′ will have already delivered m. Causal ordering implies single-source FIFO ordering because messages from the same sender are causally related.
The implementation uses vector clocks (the same data structure from the logical clocks lecture). Each message carries the sender’s vector, and the receiver buffers messages until all causally preceding messages have been delivered.
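A small sketch of that delivery rule follows, assuming the usual causal-multicast convention: a message from sender j carrying vector V is deliverable when V[j] equals the local count for j plus one and every other entry of V is no greater than the local clock.

```python
# Causal delivery sketch using vector clocks (illustrative only).

class CausalReceiver:
    def __init__(self, n, deliver):
        self.clock = [0] * n     # messages delivered so far, counted per sender
        self.deliver = deliver
        self.buffer = []         # messages waiting on causal predecessors

    def _deliverable(self, sender, vec):
        return vec[sender] == self.clock[sender] + 1 and all(
            vec[k] <= self.clock[k] for k in range(len(vec)) if k != sender)

    def on_receive(self, sender, vec, message):
        self.buffer.append((sender, vec, message))
        progress = True
        while progress:          # keep delivering until nothing more is deliverable
            progress = False
            for entry in list(self.buffer):
                s, v, m = entry
                if self._deliverable(s, v):
                    self.buffer.remove(entry)
                    self.deliver(s, m)
                    self.clock[s] += 1
                    progress = True

# Example with 2 processes: P1's second message arrives first and is buffered.
r = CausalReceiver(2, lambda s, m: print(f"deliver {m} from P{s}"))
r.on_receive(1, [0, 2], "m2")    # buffered: [0, 1] from P1 not yet delivered
r.on_receive(1, [0, 1], "m1")    # delivers m1, then m2
```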
Total ordering guarantees that all processes deliver all messages in the same order.
Formally: if a correct process delivers m before m′, then every correct process that delivers both m and m′ delivers m before m′.
Total ordering does not imply causal or single source FIFO ordering; it only requires that everyone agrees on the same order. One implementation uses a central sequencer; another uses distributed agreement where processes propose and converge on sequence numbers.
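The sequencer-based approach can be sketched as follows (names are illustrative): the sequencer stamps each message with the next global sequence number, and every receiver delivers strictly in that numeric order.

```python
# Sequencer-based total ordering sketch: everyone delivers in sequencer order.
import itertools

class Sequencer:
    def __init__(self):
        self.counter = itertools.count()

    def assign(self, message):
        return next(self.counter), message     # (global sequence number, message)

class TotalOrderReceiver:
    def __init__(self, deliver):
        self.deliver = deliver
        self.next_seq = 0
        self.holdback = {}

    def on_receive(self, seq, message):
        self.holdback[seq] = message
        while self.next_seq in self.holdback:  # deliver in global order
            self.deliver(self.holdback.pop(self.next_seq))
            self.next_seq += 1

# Example: the sequencer orders messages from different senders.
seq = Sequencer()
r = TotalOrderReceiver(print)
s1 = seq.assign("from A")        # (0, "from A")
s2 = seq.assign("from B")        # (1, "from B")
r.on_receive(*s2)                # held back until message 0 arrives
r.on_receive(*s1)                # delivers "from A", then "from B"
```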
Synchronous ordering uses a sync primitive (a special message exchange) that acts as a barrier. When a process issues a sync, it blocks until all in-flight messages have been delivered to all recipients.
Formally: if a process sends messages M, then issues a sync, then sends messages M′, every correct process will deliver all messages in M before delivering any message in M′.
This creates logical message groups or epochs with clean boundaries between them. The sync primitive is used for view changes to ensure all old-view messages are delivered before transitioning to a new view.
Real-time ordering would deliver messages in actual physical time order (mirroring exactly when they were sent). This is impossible to implement perfectly because clocks cannot be perfectly synchronized.
Atomic multicast combines reliable multicast with total ordering. It guarantees that all correct processes deliver the same set of messages in the same order. Some systems additionally require the total order to respect causality.
Failure Detection
Failure detection determines which processes have crashed. In asynchronous systems, perfect failure detection is impossible because a slow process is indistinguishable from a crashed one. This observation underlies the FLP impossibility result, which proves that consensus cannot be guaranteed in asynchronous systems where even one process might crash.
Real systems work around FLP by adding timeouts (accepting occasional false suspicions), assuming partial synchrony (eventual timing bounds), or using randomness.
False positives occur when a failure detector incorrectly suspects a live process has crashed. False negatives occur when a failure detector fails to detect that a process has actually crashed.
Heartbeat-based detection uses periodic messages to indicate liveness. There are two approaches:
- In push-based heartbeating, monitored processes periodically send heartbeat messages to monitors.
- In pull-based heartbeating (also called pinging), monitors periodically query processes and expect responses.
A monitor that does not receive heartbeats within a timeout period suspects the process has failed. The choice of timeout involves a tradeoff: shorter timeouts detect failures faster but generate more false positives.
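A minimal sketch of a push-style monitor with a fixed timeout is shown below (illustrative only); the timeout value is an arbitrary example.

```python
# Simple timeout-based failure detector for push heartbeats: suspect any
# process whose silence exceeds the timeout. Shorter timeouts detect crashes
# sooner but misclassify more slow processes as failed.
import time

class TimeoutDetector:
    def __init__(self, timeout=2.0):
        self.timeout = timeout
        self.last_seen = {}

    def on_heartbeat(self, process):
        self.last_seen[process] = time.time()

    def suspected(self):
        now = time.time()
        return {p for p, t in self.last_seen.items() if now - t > self.timeout}
```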
The Phi Accrual Failure Detector
The phi accrual failure detector outputs a continuous suspicion level (φ) rather than a binary alive/dead judgment.
The detector maintains a sliding window of time gaps between consecutive heartbeats and learns what “normal” timing looks like for this particular connection. When a heartbeat is late, the detector calculates how surprising this silence is given the learned distribution.
The φ value is on a logarithmic scale: φ = k means roughly a 10⁻ᵏ probability that this delay is a normal variation. For example, φ = 3 means about a 0.1% (one in a thousand) chance that the process is still alive and the heartbeat is just delayed; φ = 8 means about a 0.000001% chance.
Applications choose a threshold based on their needs. A lower threshold means faster detection but more false positives. Apache Cassandra (a distributed database) uses this detector with a configurable threshold that defaults to 8. The key advantage is that it separates monitoring from decision-making: the detector provides suspicion information, and applications decide when to act.
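You do not need the exact formula, but the following simplified sketch shows the idea, assuming inter-heartbeat gaps are roughly normally distributed (real implementations such as Cassandra’s use variations of this): φ is the negative log10 of the probability that the current silence is just a late heartbeat.

```python
# Simplified phi accrual sketch: learn the distribution of heartbeat gaps and
# report how surprising the current silence is.
import math
import time
from collections import deque

class PhiAccrualDetector:
    def __init__(self, window=100):
        self.gaps = deque(maxlen=window)   # sliding window of inter-arrival gaps
        self.last = None

    def on_heartbeat(self, now=None):
        now = time.time() if now is None else now
        if self.last is not None:
            self.gaps.append(now - self.last)
        self.last = now

    def phi(self, now=None):
        now = time.time() if now is None else now
        if self.last is None or len(self.gaps) < 2:
            return 0.0
        mean = sum(self.gaps) / len(self.gaps)
        var = sum((g - mean) ** 2 for g in self.gaps) / len(self.gaps)
        std = max(math.sqrt(var), 1e-6)
        elapsed = now - self.last
        # P(gap > elapsed) under a normal(mean, std) model of heartbeat gaps.
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        return -math.log10(max(p_later, 1e-30))

# The application compares phi() against its own threshold (e.g., 8).
```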
Virtual Synchrony and Group Membership
Group membership maintains a consistent view of which processes are in a group. The membership is tracked collectively by all members rather than by a central server. Each process runs a group membership service (GMS) layer that monitors other members, participates in view change protocols, and notifies the application when membership changes.
A view is a snapshot of group membership containing a unique identifier (typically a monotonically increasing number) and a list of member processes. All processes in a view agree on its membership.
Message Stability
When a process multicasts a message, the message is not delivered to the application immediately. The protocol works as follows:
- The sender sends the message to all group members.
- Each member receives the message, buffers it, and sends an acknowledgment back to the sender.
- When the sender receives acknowledgments from all members of the current view, the message is stable.
- The sender sends a stability announcement to all members.
- Only after receiving this announcement can members deliver the message to the application.
A message is unstable if it has been received by some members but the sender has not yet confirmed stability (either because acknowledgments are still pending or because the sender crashed before completing the protocol). Unstable messages are held in a buffer and cannot be delivered to applications.
This protocol ensures that if a sender crashes mid-transmission, no member will have delivered a partially-sent message. Either the message becomes stable (all members have it and can deliver), or it remains unstable and is eventually discarded.
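The receiver side of this protocol can be sketched as follows (callback names are illustrative): buffer and acknowledge on receipt, and deliver only when the stability announcement arrives.

```python
# Receiver-side sketch of the stability protocol: a received message is
# buffered and acknowledged, but delivered only after it becomes stable.

class StabilityReceiver:
    def __init__(self, deliver, send_ack):
        self.deliver = deliver            # callback to the application
        self.send_ack = send_ack          # callback to acknowledge the sender
        self.unstable = {}                # msg_id -> message, held back

    def on_receive(self, sender, msg_id, message):
        self.unstable[msg_id] = message   # buffer; do not deliver yet
        self.send_ack(sender, msg_id)

    def on_stable(self, msg_id):
        # The sender collected acks from the whole view: safe to deliver.
        if msg_id in self.unstable:
            self.deliver(self.unstable.pop(msg_id))

    def discard_unstable(self):
        # Used during a view change for messages that can never become stable.
        dropped = list(self.unstable)
        self.unstable.clear()
        return dropped
```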
View Changes
A view change is a protocol that transitions all members from one view to another when membership changes due to a join, leave, or failure. The protocol ensures that all processes agree on which messages were delivered in the old view before transitioning.
The challenge is handling in-flight messages: messages that have been received and buffered by some members but are unstable because the sender crashed or has not yet confirmed stability. These messages are sitting in member buffers, waiting for a stability announcement that may never come.
The view change protocol works in three phases:
- Flush: Processes stop sending new application messages and exchange information about which messages they have buffered but not yet delivered. This propagates any in-flight messages to all surviving members.
- Collect: Each process waits to receive flush messages from all surviving members. After this phase, all survivors have the same set of buffered messages.
- Commit: Processes agree on the new view membership. Messages that all survivors have are marked stable and delivered. Messages that only some members received are discarded. All processes then atomically transition to the new view.
When a process fails, the failure detector notices the unresponsive process, and any detecting process can initiate a view change. The failed process is excluded from the new view. A recovered process must rejoin as a new member.
Recovery and State Transfer
If a failed process recovers, it cannot simply rejoin the group with its old state. While the process was dead, the group continued operating: messages were delivered, state changed, and views advanced. The recovered process has stale state and an outdated view of the world.
To rejoin, the recovered process must perform a state transfer:
- The recovering process contacts an existing group member and requests a state transfer.
- The existing member sends its current state (either a full snapshot or a recent checkpoint plus subsequent updates).
- The recovering process initializes its replica to this state.
The state transfer is treated as an atomic event: no other processing occurs at the recovering process until the transfer is complete. This ensures the process does not attempt to participate in the group with partially updated state.
Once the state transfer completes, the recovered process joins the group as a new member through the normal join protocol, triggering a view change that adds it to the membership.
Virtual synchrony coordinates membership changes with message delivery. The key guarantee is view synchrony: if a message is delivered in a view, it is delivered in that same view at all processes that deliver it. This ensures that all processes agree on what the group membership was when each message was delivered.
Distributed Mutual Exclusion
Distributed mutual exclusion ensures that at most one process is in a critical section at any time without relying on shared memory. The algorithms must satisfy three properties:
- Safety (mutual exclusion): At most one process is in the critical section at any time.
- Liveness (progress): If a process requests the critical section and no process holds it forever, the requester eventually enters.
- Fairness (bounded waiting): There is a bound on how many times other processes can enter before a waiting process is granted access.
Centralized Mutual Exclusion Algorithm
The centralized approach designates one process as the coordinator. When a process wants to enter the critical section, it sends a request message to the coordinator. If the critical section is free, the coordinator sends a grant message back immediately. If the critical section is occupied, the coordinator queues the request. When the process in the critical section finishes, it sends a release message to the coordinator, which then sends a grant to the next process in the queue.
This approach requires only three messages per critical section entry (request, grant, release) and is simple to implement. The drawback is that the coordinator is a single point of failure and a potential performance bottleneck.
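A minimal sketch of the coordinator’s bookkeeping is shown below (illustration only): grant immediately if the critical section is free, otherwise queue the request and grant on release.

```python
# Coordinator logic for the centralized mutual exclusion algorithm (sketch).
from collections import deque

class Coordinator:
    def __init__(self, send_grant):
        self.send_grant = send_grant
        self.holder = None
        self.queue = deque()

    def on_request(self, process):
        if self.holder is None:
            self.holder = process
            self.send_grant(process)        # critical section is free: grant now
        else:
            self.queue.append(process)      # occupied: queue the request

    def on_release(self, process):
        assert process == self.holder
        self.holder = None
        if self.queue:
            self.holder = self.queue.popleft()
            self.send_grant(self.holder)    # pass the grant to the next waiter

# Example: B's request is queued until A releases.
c = Coordinator(lambda p: print(f"grant -> {p}"))
c.on_request("A")      # grant -> A
c.on_request("B")      # queued
c.on_release("A")      # grant -> B
```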
Lamport’s Mutual Exclusion Algorithm
Lamport’s algorithm is fully distributed with no central coordinator. Each process maintains a request queue ordered by Lamport timestamp, with ties broken by process ID to ensure a total order.
When a process wants to enter the critical section, it timestamps its request and sends it to all other processes. When a process receives a request, it adds it to its local queue and sends an acknowledgment back to the requester. A process may enter the critical section when two conditions are met:
- Its own request is at the head of its queue (meaning it has the earliest timestamp), and
- It has received acknowledgments from all other processes.
When a process exits the critical section, it sends a release message to all other processes. Upon receiving a release, each process removes that request from its queue.
The algorithm requires 3(N−1) messages per critical section entry: N−1 requests, N−1 acknowledgments, and N−1 releases. It guarantees fairness because requests are ordered by timestamp. However, it assumes all processes are correct and responsive; if one process crashes and stops sending acknowledgments, other processes will block indefinitely.
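The entry test can be sketched as below (not exam pseudocode): requests are kept in a priority queue ordered by (timestamp, pid), and a process may enter only when its own request is at the head and it holds acknowledgments from every peer.

```python
# Entry condition in Lamport's mutual exclusion algorithm (sketch).
import heapq

class LamportMutex:
    def __init__(self, pid, peers):
        self.pid = pid
        self.peers = set(peers)
        self.queue = []              # heap of (timestamp, pid) requests
        self.acks = set()

    def on_request(self, ts, pid):
        heapq.heappush(self.queue, (ts, pid))

    def on_ack(self, pid):
        self.acks.add(pid)

    def on_release(self, pid):
        self.queue = [(t, p) for (t, p) in self.queue if p != pid]
        heapq.heapify(self.queue)

    def can_enter(self):
        return (bool(self.queue) and self.queue[0][1] == self.pid
                and self.acks >= self.peers)

# Example: P1 requested at timestamp 5 and has acks from both peers.
m = LamportMutex(pid=1, peers={2, 3})
m.on_request(5, 1); m.on_ack(2); m.on_ack(3)
print(m.can_enter())   # True
```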
Ricart-Agrawala Algorithm
The Ricart-Agrawala algorithm optimizes Lamport’s algorithm by eliminating the release messages. Instead of always acknowledging immediately, a process defers its reply if it also wants the critical section and has a higher priority (earlier timestamp).
When a process wants to enter the critical section, it timestamps its request and sends it to all other processes. When a process receives a request, it compares the request’s timestamp to its own. If it does not want the critical section, or if its own request has a later timestamp, it sends a reply immediately. Otherwise, it defers the reply until after it exits the critical section.
A process enters the critical section when it has received replies from all other processes. Upon exiting, it sends all the deferred replies.
This reduces the message count to 2(N−1) per entry: N−1 requests and N−1 replies. Like Lamport’s algorithm, it requires all processes to respond.
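The distinguishing piece of Ricart-Agrawala is the receive-side decision, sketched below with illustrative names: reply immediately unless this process also wants (or holds) the critical section with an earlier (timestamp, pid) pair, in which case the reply is deferred until exit.

```python
# Receive-side rule of Ricart-Agrawala (sketch): reply now or defer.

def handle_request(my_state, my_request, req_ts, req_pid, send_reply, deferred):
    """my_state is 'idle', 'wanted', or 'held'; my_request is (timestamp, pid)."""
    if my_state == 'held' or (
            my_state == 'wanted' and my_request < (req_ts, req_pid)):
        deferred.append(req_pid)          # this process has priority: defer
    else:
        send_reply(req_pid)               # reply immediately

def exit_critical_section(deferred, send_reply):
    # On exit, all deferred replies are released at once.
    while deferred:
        send_reply(deferred.pop())
```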
Token Ring Mutual Exclusion
The token ring algorithm organizes processes in a logical ring and uses a circulating token. Only the process holding the token may enter the critical section.
The token circulates continuously around the ring. When a process receives the token and wants the critical section, it enters. When it finishes (or if it does not want the critical section), it passes the token to its neighbor.
This approach provides bounded waiting and is simple to implement. The drawbacks are that the token generates continuous network traffic even when no process wants the critical section, and if the token is lost due to a crash, a recovery mechanism must regenerate it while avoiding duplicate tokens.
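A participant in the ring can be sketched as follows (callbacks are placeholders for the actual network transport): enter the critical section only while holding the token, then pass it on.

```python
# Token ring participant (sketch): only the token holder may enter.

class RingNode:
    def __init__(self, pid, pass_token, enter_cs):
        self.pid = pid
        self.pass_token = pass_token      # callback: forward token to ring neighbor
        self.enter_cs = enter_cs          # callback: run the critical section
        self.wants_cs = False

    def request(self):
        self.wants_cs = True              # enter the next time the token arrives

    def on_token(self):
        if self.wants_cs:
            self.enter_cs()               # safe: this node holds the only token
            self.wants_cs = False
        self.pass_token()                 # keep the token circulating
```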
Algorithm Comparison
| Algorithm | Messages | Fault Tolerance | When to Use |
|---|---|---|---|
| Centralized | 3 | Coordinator failure blocks progress | When simplicity is important and fast coordinator recovery is available |
| Lamport | 3(N−1) | Requires all processes respond | When fairness and full distribution are needed |
| Ricart-Agrawala | 2(N−1) | Requires all processes respond | Optimal choice for permission-based mutual exclusion |
| Token Ring | 1 to N−1 | Token loss requires recovery | When bounded waiting is important and requests are frequent |
Leader Election
Leader election selects a single coordinator from a group of processes. The elected leader typically has the highest process ID among surviving processes.
Bully Algorithm
The bully algorithm assumes a synchronous model with timeouts for failure detection. It uses three message types: ELECTION (to announce a new election), OK (to acknowledge an election message and tell the sender a higher-ID process is alive), and COORDINATOR (to announce the winner).
When a process P detects that the coordinator has failed, it initiates an election by sending ELECTION messages to all processes with higher IDs. If P receives no OK responses within a timeout, it declares itself coordinator and sends COORDINATOR messages to all other processes. If P does receive an OK response, it knows a higher-ID process will take over and waits for a COORDINATOR message.
When a process receives an ELECTION message from a lower-ID process, it sends an OK response and starts its own election if it has not already. The highest-ID surviving process always wins.
The worst-case message complexity is O(n²), occurring when the lowest-ID process initiates the election. The best case is O(n).
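The initiator’s side of the bully algorithm can be sketched as below; the helper callbacks (send, wait_for_ok, start) are hypothetical placeholders for the messaging layer, not part of any particular system.

```python
# Bully algorithm sketch: challenge higher IDs, win if nobody answers.

def start_election(my_id, all_ids, send, wait_for_ok, timeout=1.0):
    higher = [p for p in all_ids if p > my_id]
    for p in higher:
        send(p, "ELECTION")
    if not higher or not wait_for_ok(timeout):
        # No higher-ID process responded: this process wins.
        for p in all_ids:
            if p != my_id:
                send(p, "COORDINATOR")
        return my_id
    return None   # a higher-ID process took over; await its COORDINATOR message

def on_election(my_id, sender_id, send, election_in_progress, start):
    if sender_id < my_id:
        send(sender_id, "OK")             # tell the lower-ID process to stand down
        if not election_in_progress:
            start()                       # take over with my own election
```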
Ring Election Algorithm
The ring election algorithm organizes processes in a logical ring. When a process notices the coordinator has failed, it creates an election message containing its own ID and sends it clockwise to its neighbor.
When a process receives an election message, it compares the ID in the message with its own:
- If the received ID is larger than its own, the process forwards the message unchanged because a higher-ID process should win.
- If the received ID is smaller and the process has not yet participated in this election, it replaces the ID in the message with its own ID and forwards it, nominating itself.
- If the received ID is smaller but the process has already sent its own election message, it discards the message to avoid continuing a duplicate election.
When a process receives an election message containing its own ID, it knows its message has traveled all the way around the ring and its ID is the largest. It then sends an ELECTED message around the ring to announce itself as the new coordinator.
The worst case is 3N−1 messages, occurring when the process immediately following the highest-ID process initiates the election.
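The per-node rule can be sketched as follows (illustration only; the forward callback stands in for sending to the clockwise neighbor): forward larger IDs, substitute your own ID for smaller ones once, and declare victory when your own ID returns.

```python
# Ring election per-node rule (sketch).

class RingElectionNode:
    def __init__(self, my_id, forward):
        self.my_id = my_id
        self.forward = forward            # callback: send to clockwise neighbor
        self.participated = False

    def on_election(self, candidate_id):
        if candidate_id == self.my_id:
            self.forward(("ELECTED", self.my_id))     # my ID went all the way around
        elif candidate_id > self.my_id:
            self.participated = True
            self.forward(("ELECTION", candidate_id))  # better candidate: pass it on
        elif not self.participated:
            self.participated = True
            self.forward(("ELECTION", self.my_id))    # nominate myself instead
        # else: smaller ID and already participated -> discard (duplicate election)
```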
What You Do Not Need to Study
You do not need to memorize exact pseudocode, PIM protocol details, or the exact phi calculation formula. Focus on understanding the concepts: why IGMP and PIM exist (host membership vs. router distribution), why sparse mode uses an RP (explicit joins vs. flooding), how phi adapts to observed network conditions, and why virtual synchrony coordinates membership with delivery.