Disclaimer: This study guide attempts to touch upon the most important topics that may be covered on the exam but does not claim to necessarily cover everything that one needs to know for the exam. Finally, don’t take the one hour time window in the title literally.
Last update: Sat Mar 28 19:55:23 2026
Week 5: Consensus
The Core Problem
Consensus is the problem of getting a group of nodes to agree on a single value, even when some nodes crash or messages are delayed. It comes up in leader election, transaction commit, log ordering, and configuration management. Every fault-tolerant distributed system is built on top of some form of consensus.
Without proper consensus, a network partition can cause both sides to elect their own leader and accept writes independently. This condition is called split-brain: the system ends up with two divergent versions of state that must be reconciled or rolled back when the partition heals.
Replicated State Machines
A state machine is deterministic: the same inputs in the same order always produce the same state. State machine replication runs identical copies of the state machine on multiple servers, ensuring they all apply the same commands in the same order. A replicated log is the data structure that imposes that order – consensus ensures all servers agree on the contents of the log.
Key property: if every server starts from the same initial state and executes the same log entries in order, they will all reach the same state. This is why log ordering is the central challenge.
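The determinism argument above can be seen in a minimal sketch: two replicas of a tiny key-value state machine (the command set here is illustrative, not from any particular system) converge to identical state when they apply the same log in the same order.

```python
class KVStateMachine:
    """A deterministic state machine: same commands in same order, same state."""
    def __init__(self):
        self.state = {}

    def apply(self, command):
        # Commands are deterministic tuples: ("set", key, value) or ("delete", key).
        op, key, *rest = command
        if op == "set":
            self.state[key] = rest[0]
        elif op == "delete":
            self.state.pop(key, None)

# The replicated log imposes a single order on commands.
log = [("set", "x", 1), ("set", "y", 2), ("delete", "x")]

replica_a, replica_b = KVStateMachine(), KVStateMachine()
for entry in log:
    replica_a.apply(entry)
    replica_b.apply(entry)

assert replica_a.state == replica_b.state == {"y": 2}
```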
Consensus Properties
Three properties required for consensus are:
- Agreement: All non-faulty processes decide on the same value.
- Validity: The decided value must have been proposed by some process.
- Termination: All non-faulty processes eventually decide.
Agreement and validity are safety properties (nothing bad happens). Termination is a liveness property (something good eventually happens). The tension between them is central to the difficulty of consensus.
FLP Impossibility
The FLP Impossibility Result proves that in a purely asynchronous distributed system, no deterministic algorithm can guarantee consensus if even one process may crash.
The core obstacle: in an asynchronous system, there is no way to distinguish a crashed process from a very slow one. This makes it impossible to deterministically break certain undecided states without risking a safety violation.
FLP does not mean consensus is impossible in practice. All real protocols handle it by guaranteeing safety unconditionally but sacrificing liveness under extreme instability (for example, if no leader can be elected because the network is too chaotic). Eventually, when the system stabilizes, progress resumes.
Paxos
Paxos was created by Leslie Lamport around 1989 and finally published in 1998. It is the foundational consensus algorithm in the field.
Paxos has three roles: proposers (initiate proposals), acceptors (vote), and learners (learn the decided value). In practice, servers typically play all three roles.
The protocol relies on a simple but powerful property: any two majorities of acceptors in a group of n share at least one member. That overlap means a new majority cannot form without including at least one acceptor that participated in an earlier majority. Combined with Paxos’s promise and value-selection rules, this prevents two different values from being chosen for the same decision.
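The majority-overlap property can be checked exhaustively for small clusters; a quick sketch:

```python
from itertools import combinations

def majorities(n):
    """All majority subsets (size > n // 2) of n acceptors."""
    quorum = n // 2 + 1
    return [set(c) for c in combinations(range(n), quorum)]

# Every pair of majorities shares at least one acceptor, for each small n.
for n in range(1, 8):
    for q1, q2 in combinations(majorities(n), 2):
        assert q1 & q2, f"disjoint majorities for n={n}"
```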
The algorithm runs in two phases:
- Phase 1 (Prepare/Promise): A proposer sends Prepare(n) to a majority of acceptors. Each acceptor promises to reject any proposal numbered below n and reports the highest proposal it has already accepted.
- Phase 2 (Accept/Accepted): The proposer sends Accept(n, v) to the majority, where v is the value from the highest-numbered previously accepted proposal it learned about (or its own value if none were reported). An acceptor accepts Accept(n, v) only if it has not since promised a higher proposal number; if it accepts, it records (n, v) and replies Accepted.
A value is decided once a majority of acceptors have accepted it.
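The acceptor side of the two phases can be sketched directly from the rules above. This is a minimal single-decree sketch; message formats and field names are illustrative.

```python
class Acceptor:
    """Sketch of a Paxos acceptor's promise and acceptance rules."""
    def __init__(self):
        self.promised = -1    # highest proposal number promised
        self.accepted = None  # (n, v) of the highest-numbered accepted proposal

    def on_prepare(self, n):
        # Phase 1: promise to reject proposals below n, and report the
        # highest proposal accepted so far (if any).
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted)
        return ("reject", None)

    def on_accept(self, n, v):
        # Phase 2: accept unless a higher proposal number has since been promised.
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, v)
            return "accepted"
        return "rejected"

a = Acceptor()
assert a.on_prepare(2) == ("promise", None)
assert a.on_accept(2, "x") == "accepted"
assert a.on_accept(1, "y") == "rejected"       # superseded by promise for 2
assert a.on_prepare(3) == ("promise", (2, "x"))  # reports prior accepted value
```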
Multi-Paxos extends single-decree Paxos to decide a sequence of values (a log) by running a separate Phase 2 per log slot, with a stable leader that skips Phase 1 after the first slot. When the leader changes, the new leader must run Phase 1 again for any log slots it has not yet resolved, which is why leader instability is expensive in Paxos.
Paxos is notoriously difficult to implement correctly. It leaves important practical questions unspecified: conflict resolution between concurrent proposers, cluster membership changes, and recovery from partial failures. Real deployments (Google Chubby, Apache ZooKeeper’s Zab variant, Google Spanner) required substantial engineering beyond the base algorithm.
Raft
Raft was developed to provide the same safety guarantees as Multi-Paxos but is designed to be significantly easier to understand and implement. It is used in many newer systems, including etcd, CockroachDB, TiKV, Consul, and YugabyteDB.
Terms
Raft divides time into terms, numbered with consecutively increasing integers. A term begins with an election. If a candidate wins, it serves as leader for that term. If no one wins (split vote), a new term begins. Terms serve as a logical clock – servers reject messages from older terms and update their term when they see a higher one.
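The term-as-logical-clock rule reduces to a small decision, sketched here (the tuple return shape is illustrative):

```python
def handle_message(current_term, is_leader, msg_term):
    """Returns (new_term, still_leader, accept_message)."""
    if msg_term < current_term:
        return current_term, is_leader, False  # stale term: reject the message
    if msg_term > current_term:
        return msg_term, False, True           # newer term: adopt it, revert to follower
    return current_term, is_leader, True       # same term: process normally

assert handle_message(5, True, 3) == (5, True, False)   # old term rejected
assert handle_message(5, True, 7) == (7, False, True)   # leader steps down
```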
Server States
Every server is in exactly one of three states. A follower is passive: it responds to requests from leaders and candidates but does not initiate any. All servers start as followers. A candidate is a follower that has timed out waiting for a heartbeat and has initiated an election. A leader handles all client requests, replicates log entries to followers, and sends periodic heartbeats to prevent new elections.
Leader Election
Each follower maintains a randomized election timeout (typically 150–300 ms). If it expires without hearing from a leader, the follower starts an election:
- Increment the current term.
- Transition to candidate state and vote for itself.
- Send RequestVote RPCs to all other servers.
- A server grants its vote if it has not already voted this term and the candidate’s log is at least as up-to-date as its own.
- If the candidate receives votes from a majority, it becomes leader and immediately sends heartbeats to suppress new elections.
- If no candidate wins (split vote), the term ends with no leader and a new election begins with a higher term.
“More up-to-date” is defined precisely: a log is more up-to-date if its last entry has a higher term, or if the terms are equal, the longer log wins. This restriction ensures that a candidate cannot win unless its log contains all committed entries. The randomized timeout makes it unlikely that multiple candidates start elections at the same time.
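The up-to-date comparison is a lexicographic check over (last entry term, log length); a minimal sketch:

```python
def candidate_log_ok(cand_last_term, cand_len, my_last_term, my_len):
    """True if the candidate's log is at least as up-to-date as ours."""
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term  # higher last term wins
    return cand_len >= my_len                 # equal terms: longer log wins

assert candidate_log_ok(3, 5, 2, 9)       # higher term beats a longer log
assert not candidate_log_ok(2, 4, 2, 5)   # same term, shorter log loses
```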
Log Replication
Once a leader is elected, it handles all client requests. For each command, the sequence is:
- The leader appends the command to its own log, tagged with the current term and index.
- It sends AppendEntries RPCs to all followers in parallel.
- Once a majority of servers have acknowledged the entry, the leader commits it.
- The leader applies the entry to its state machine and returns the result to the client.
- Subsequent AppendEntries messages inform followers of the commit index, at which point followers apply the committed entries to their own state machines.
Each AppendEntries RPC includes the index and term of the entry immediately preceding the new one. A follower rejects the RPC if its own log does not match at that position. When this happens, the leader backs up and retries from an earlier entry until it finds a point of agreement, then overwrites any conflicting entries from that point forward.
The Log Matching Property guarantees that if two entries in different logs share the same index and term, the logs are identical through that index. This invariant is what makes the consistency check in AppendEntries sufficient to detect and repair divergence.
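The follower-side consistency check and repair can be sketched as follows. Entries are (term, command) tuples and indices are 0-based for simplicity; the real RPC carries more fields.

```python
def append_entries(log, prev_index, prev_term, entries):
    """Follower handling of AppendEntries: check, then overwrite conflicts."""
    # Reject unless our log has an entry at prev_index with the matching term.
    if prev_index >= 0:
        if prev_index >= len(log) or log[prev_index][0] != prev_term:
            return False  # leader will back up and retry from an earlier entry
    # Matched: discard any conflicting suffix and append the new entries.
    del log[prev_index + 1:]
    log.extend(entries)
    return True

# A follower with a stale term-2 entry gets it overwritten once the
# preceding entry (index 1, term 1) matches.
follower = [(1, "a"), (1, "b"), (2, "stale")]
assert append_entries(follower, 1, 1, [(3, "c")])
assert follower == [(1, "a"), (1, "b"), (3, "c")]
```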
Commit Rules
An entry is committed once stored on a majority. However, the leader may not commit entries from previous terms directly – it must first commit an entry from its own current term. Old entries become committed implicitly when the current-term entry is committed. This prevents a subtle safety bug involving overwriting entries that were transiently replicated but not truly committed.
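A sketch of the leader's commit-index rule: advance to the highest index stored on a majority, but only if that entry carries the leader's current term. The function shape is illustrative; `match_indexes` counts the leader's own log too.

```python
def new_commit_index(log, commit_index, match_indexes, current_term):
    """Highest majority-replicated index whose entry is from current_term."""
    for n in range(len(log) - 1, commit_index, -1):
        replicated = sum(1 for m in match_indexes if m >= n)
        if replicated > len(match_indexes) // 2 and log[n][0] == current_term:
            return n
    return commit_index

# Entries are (term, command). A majority holds index 2, but it is from
# term 2, so a term-3 leader may not commit it directly.
log = [(1, "a"), (2, "b"), (2, "c")]
assert new_commit_index(log, 0, [2, 2, 0], current_term=3) == 0
assert new_commit_index(log, 0, [2, 2, 0], current_term=2) == 2
```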
Safety: The Leader Completeness Property
If an entry is committed in a given term, it will appear in the log of every leader in all subsequent terms. This follows from the election restriction: a candidate can only win if its log is at least as up-to-date as any majority, which means it must have all committed entries. Safety in Raft is unconditional – the protocol never allows two servers to commit different entries at the same index, as long as fewer than half the servers fail.
Liveness
Liveness is conditional. Raft requires a stable elected leader to make progress. If elections repeatedly fail due to network instability, the system stalls. In practice, randomized timeouts prevent this from being a persistent problem.
Cluster Membership Changes
Adding or removing servers requires care. Raft uses joint consensus, a two-phase approach: the cluster transitions through a configuration that includes both old and new member sets, requiring majority agreement from both before switching to the new configuration alone. Even when adding or removing one server at a time, the reconfiguration must ensure majority intersection across the transition – joint consensus is how Raft guarantees that property.
Log Compaction
Logs grow indefinitely. Servers periodically take a snapshot of the state machine and discard all log entries before that point. If a follower falls too far behind, the leader sends it the snapshot directly via an InstallSnapshot RPC.
Key Takeaways
Consensus is hard because crash failures make it impossible to distinguish a dead process from a slow one. FLP tells us something fundamental about this: in a fully asynchronous model, no algorithm can be both safe and live if any process can crash. Real protocols choose safety and accept that they may stall temporarily.
Raft improves on Paxos primarily through clarity of design: a single strong leader, randomized election timeouts, and unified handling of log replication and leader completeness. The safety guarantees are equivalent; the difference is in how easy the protocol is to reason about and implement correctly.
What You Don’t Need to Study
- The exact millisecond ranges for election timeouts.
- The publication histories of Paxos or Raft.
- Details of specific Paxos variants like Zab or Multi-Paxos optimizations.
- The internal architecture of specific systems that use Raft or Paxos.
Week 6: Coordination Services and Network-Attached Storage
Coordination Services
Distributed systems frequently need to answer questions like: who is the current leader? Is this lock held? What is the current configuration? Getting these answers wrong can be catastrophic. If two nodes both believe they are the leader, they will independently accept writes and the system state will diverge.
The natural approach is a dedicated replicated coordinator made fault-tolerant through consensus. A coordination service provides that. It is a small, strongly consistent, highly available store that distributed applications use to coordinate decisions and share small amounts of state. Every coordination service we study (Chubby, ZooKeeper, and etcd) uses consensus internally.
What Coordination Services Provide
All three services share a core set of capabilities:
- A persistent, consistent store for small amounts of metadata (configuration data, not general application data), read and written in its entirety
- Leases or session timeouts that automatically expire when a client fails, allowing self-cleaning locks and registrations
- Watches or events that notify clients when data changes, eliminating polling
- Atomic operations (compare-and-swap or conditional transactions) that prevent races when multiple clients contend for the same resource
- Strong consistency guarantees so that all clients see the same current state
Chubby
Chubby is a lock service and configuration store from Google. A Chubby deployment is a cell of five replicas, one of which is the master, elected via Paxos. The other replicas replicate data via consensus but redirect client requests to the master. Three of the five replicas must be alive for the cell to function, allowing two simultaneous failures. For performance, Chubby caches all contents in memory.
Chubby exposes a file system interface: locks, configuration data, and service addresses are all stored as named files in a hierarchical namespace. Locks are advisory and coarse-grained, meaning they are held for long periods and are not enforced by the system if code chooses to ignore them.
When a client opens a file, it receives a lease – a time-bounded guarantee that its cached copy is valid. The master sends cache invalidations to other clients when a file is written. If the master fails, a new one is elected, broadcasts a new epoch, and gives clients a grace period to reconnect. Clients that do not reconnect within the grace period have their sessions and locks released.
ZooKeeper
ZooKeeper was developed at Yahoo and is open-source. Rather than providing locks as a primitive, it provides building blocks from which locks, leader election, barriers, and other coordination patterns can be constructed.
Data is stored in a tree of znodes, each holding a small amount of data. There are two types of znodes:
- Persistent znodes survive client disconnection and remain until explicitly deleted.
- Ephemeral znodes are automatically deleted when the client session that created them ends. This is the key mechanism for failure detection.
Either type can optionally be created as a sequential znode, which causes ZooKeeper to append a monotonically increasing integer to the name. This is essential for implementing locks without causing thundering herd problems.
The thundering herd problem occurs when many clients are all waiting for the same condition, and a single state change wakes them all simultaneously. They all rush to retry at once, creating a burst of load on the coordination service, while only one of them can make progress. In a ZooKeeper lock implementation, the fix is to have each waiting client watch only its immediate predecessor in the queue rather than the lock node itself. When the lock is released, exactly one client is notified instead of all of them.
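The predecessor-watch rule can be sketched as a pure function over the lock znode's children. Sequential znode names sort by their appended sequence number, so each waiter watches only the node just ahead of it; the names below are illustrative.

```python
def watch_target(children, my_node):
    """Which node this client should watch in a ZooKeeper-style lock queue.

    Returns None if my_node holds the lock (lowest sequence number),
    otherwise the name of the immediate predecessor to watch.
    """
    ordered = sorted(children)  # sequence suffixes order the queue
    i = ordered.index(my_node)
    if i == 0:
        return None             # lowest sequence number: we hold the lock
    return ordered[i - 1]       # watch exactly one node, not the lock itself

children = ["lock-0000000003", "lock-0000000001", "lock-0000000002"]
assert watch_target(children, "lock-0000000001") is None
assert watch_target(children, "lock-0000000003") == "lock-0000000002"
```

When "lock-0000000002" is deleted, only the client holding "lock-0000000003" is notified, so a single release wakes a single waiter.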
ZooKeeper uses Zab, a consensus protocol similar to Raft, to replicate writes through a leader in a globally consistent order. Reads can be served by any replica and are sequentially consistent; clients can issue a sync to ensure a replica has caught up to recently committed writes before reading.
Watches are one-shot notifications. A client sets a watch when it reads a znode (checking existence, data, or children), and ZooKeeper delivers an event if the relevant state changes. The watch fires once and is then removed; the client must re-register if it wants continued monitoring. The usual pattern is to treat the event as “something changed,” re-read the current state from ZooKeeper, and re-register the watch. This keeps clients consistent even if multiple changes occur quickly or during a brief disconnection.
etcd
etcd is the authoritative store for Kubernetes cluster state. It uses Raft for consensus. Unlike Chubby and ZooKeeper’s hierarchical namespace, etcd stores a flat key-value map: keys are arbitrary byte strings, and hierarchy is a naming convention, not an enforced structure. Prefix range queries and prefix watches provide directory-like behavior.
etcd routes reads through the leader by default, giving linearizable reads: reads that reflect the most recently committed write, as if the entire system had a single consistent view at that instant. Serializable reads (served locally by any replica) are available as an opt-in for workloads that can tolerate slight staleness. Leases with TTLs provide the same self-cleaning behavior as ZooKeeper’s ephemeral znodes. Transactions with compare-and-swap conditions enable atomic leader election and lock acquisition.
Common Coordination Patterns
These patterns apply to all three coordination services; only the specific primitives differ.
Leader election. Replicas contend for a well-known name in the coordination service. Exactly one wins. The winner’s claim disappears when its session expires, allowing others to compete again.
Distributed locks. Acquiring the lock is a write-through consensus that provides global ordering. Locks built on ephemeral nodes or leases are self-cleaning: a crashed holder’s session expires, and the lock releases automatically. Coordination services are suited to coarse-grained locks: locks held for long periods protecting large resources (a master election, a configuration update). They are not suited to fine-grained locks held for milliseconds to protect individual rows or records. High-frequency lock acquisitions and releases would overwhelm a system built around consensus.
Configuration management. Services store configuration in the coordination service. Updates go through consensus and are applied consistently. Clients watch configuration keys for changes.
Service discovery. A running instance registers its address under a known prefix using an ephemeral key. The list of active instances stays current because dead servers’ keys expire automatically.
Fencing tokens. A monotonically increasing number associated with each lock grant. The protected resource rejects any request carrying a token lower than the highest it has seen, preventing a stale lock holder that woke up after a pause from corrupting shared state.
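Fencing-token enforcement happens at the protected resource, not at the coordination service; a minimal sketch:

```python
class FencedResource:
    """A resource that rejects writes carrying a stale fencing token."""
    def __init__(self):
        self.highest_token = -1
        self.data = None

    def write(self, token, value):
        if token < self.highest_token:
            return False  # stale lock holder (e.g. resumed after a GC pause)
        self.highest_token = token
        self.data = value
        return True

r = FencedResource()
assert r.write(33, "from current holder")
assert not r.write(32, "from stale holder")  # rejected: token 32 < 33
assert r.data == "from current holder"
```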
What Coordination Services Do Not Do
A coordination service is designed to store small amounts of metadata. It is not a database or a message queue, and it is not suitable for large files, frequent access, or high-throughput writes. A useful rule of thumb: if the data is on the critical path of every client request, it does not belong in a coordination service.
More fundamentally, electing a leader through a coordination service does not guarantee that the application as a whole behaves correctly. Coordination serializes decisions; application correctness is still the developer’s responsibility.
Network-Attached Storage
Access Transparency and VFS
The goal of networked file systems is access transparency: applications use standard file system calls (open, read, write, close) against remote files without any awareness that the storage is remote. This is achieved through the Virtual File System (VFS) layer, adopted by every major Unix-derived OS. VFS defines a standard interface that any file system driver must implement. The kernel always talks to this interface; whether the driver beneath it issues disk commands or sends network requests is invisible to applications.
Mount points attach different file systems into a single directory tree. A remote file system client is a VFS driver that translates standard file operations into network requests. When the response arrives, the result passes back through the VFS interface to the application.
Design Dimensions
Every networked file system must navigate three fundamental tradeoffs:
Consistency. Multiple clients may cache the same file. Keeping those caches consistent requires either frequent polling against the server or a protocol where the server pushes invalidations to clients.
State. A stateless server holds no information about client activity between requests. Every request is self-contained. Crash recovery is trivial because there is nothing to recover. But statelessness makes locks, open file tracking, and cache invalidation impossible. A stateful server enables richer semantics at the cost of recovery complexity: after a crash, open files, locks, and cached state must be rebuilt or cleaned up.
Caching. Options range from write-through (immediate server update), to write-behind (delayed batch), to write-on-close / session semantics (send changes only on close). Callbacks, where the server tracks which clients have cached a file and pushes invalidations on modification, require statefulness but eliminate polling.
NFS
NFS was designed to be simple, stateless, and interoperable across any networked system. It was built on openly published RPC and data encoding standards and was ported to many operating systems in both client and server roles.
Because the server is stateless, NFSv2 has no OPEN, CLOSE, LOCK, SEEK, or APPEND procedures. Clients identify files by file handles: opaque server-generated identifiers that persist across server restarts. The server uses UDP as the transport because the stateless, idempotent design makes retries safe.
Key limitations of stateless NFS:
- No native locking. File locking was added through a separate Network Lock Manager (NLM) service, an awkward retrofit whose state was not part of NFS’s crash recovery.
- No safe append. There is no atomic append operation; a client must read the file size then write at that offset, which is a race condition when multiple writers are active.
- No open file reference tracking. A file can be deleted on the server while a client still has it open. NFS clients work around this with “silly renames” before sending a REMOVE.
- Weak security. Original NFS trusted the user ID sent by the client without verification.
NFS caches data in blocks and validates cached data using timestamp comparison: the client checks the file’s modification time on the server when a file is opened, and after a short validity timeout. This gives close-to-open consistency: stale reads are possible between opens.
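The validation logic amounts to a timeout-then-compare check; a sketch under illustrative names and timings (not the actual NFS client code):

```python
def cache_is_valid(cached_mtime, server_mtime, cached_at, now, timeout=3.0):
    """NFS-style cache validation: trust the cache within the timeout,
    then revalidate by comparing modification times with the server."""
    if now - cached_at < timeout:
        return True                      # within timeout: no server round trip
    return server_mtime == cached_mtime  # revalidate against the server

# Within the timeout, the cache is trusted even though the server copy
# changed: this is exactly why stale reads are possible between opens.
assert cache_is_valid(100, 999, cached_at=0.0, now=1.0)
assert not cache_is_valid(100, 101, cached_at=0.0, now=5.0)
```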
AFS
AFS was designed to fix the scalability problem of NFS. Workload measurements showed that most file accesses are reads, files are usually accessed by one user at a time, and most files are small enough to cache entirely. This motivated the upload/download model and whole-file caching: when a file is opened, the entire file is downloaded to the client’s local disk. Reads and writes operate on the local copy. On close, if modified, the file is uploaded back. This gives session semantics: changes are visible to other clients only after the file is closed.
The mechanism that makes aggressive caching safe is the callback promise: when the server delivers a file to a client, it promises to notify the client if the file is modified. When a client uploads a modified file, the server sends callback revocations to all other clients that hold the file. Those clients invalidate their cached copies. Because files are read far more than they are written, most accesses proceed from the local cache with no server interaction at all.
AFS enforces a uniform global namespace: all AFS content appears under /afs on every client machine, with the cell name (e.g., cs.rutgers.edu) as the second path component. The same path resolves to the same file regardless of which client machine the user is on. NFS has no such guarantee; administrators mount remote directories at arbitrary local paths. File system content is organized into volumes that administrators can move between servers transparently via referrals.
Coda
Coda extended AFS to support laptops and mobile workstations that might lose network connectivity.
Coda allows a volume to be replicated across a Volume Storage Group (VSG). The subset of VSG servers reachable at any moment is the Accessible Volume Storage Group (AVSG).
- When a file is opened, the client contacts all accessible servers and compares the file versions. If any server holds an older version, the client initiates a resolution process, telling that server to fetch the latest updates.
- For read operations, the client can read from any accessible server that has the current version of the file.
- For write operations, a copy of the new file is uploaded to all accessible servers. Note that the servers do not handle the replication behind the scenes; the client performs it.
When no server is reachable, the client enters disconnected operation mode and works entirely from its local disk cache.
- Modifications during disconnection are recorded in a client modification log (CML). The CML logs file system operations (store, create, remove, rename) rather than file contents; the actual modified data stays in the local disk cache.
- On reconnection, the client begins reintegration: the CML is replayed in order. If the same file was modified by both the disconnected client and another client during the outage, a conflict is detected and flagged for manual resolution. There is no automatic merge.
- Hoarding allows users to pre-populate the cache with specific files before going offline, ensuring those files are available during disconnection.
AFS and Coda are no longer widely deployed. AFS survives at some universities and research institutions, but its operational complexity and aging authentication model have made it difficult to justify in new deployments. Coda remained a research prototype.
SMB
Microsoft’s Server Message Block protocol was designed with the opposite philosophy from NFS: stateful, connection-oriented, and built to enforce Windows file-sharing semantics. SMB tracks every open file, every lock, and every byte range under lock at the server. This enabled mandatory locking, byte-range locks, and the semantics Windows applications expected. The cost was that server crashes lost all session state.
Opportunistic locks (oplocks) give the server a way to grant clients caching rights. The server monitors file access and sends an oplock break to the caching client when a conflict arises, requiring it to flush writes before the server allows the competing open. This is the same idea as AFS callbacks, applied at finer granularity. Later versions of Windows generalized oplocks into leases with cleaner semantics that can also cover directory metadata.
SMB 2 dramatically modernized the protocol with several performance improvements:
- Pipelining: clients can send multiple requests before receiving responses, removing the need to wait for one reply before issuing the next.
- Compounding: multiple related operations can be packed into a single network message, reducing round trips.
- Durable handles: open file handles survive brief network disconnections, so clients can reconnect without re-establishing every open file and lock.
macOS adopted SMB 2 as its default file sharing protocol (replacing AFP). macOS also supports NFS for Unix-oriented environments, but SMB is the default.
NFSv4
NFSv4 abandoned statelessness. Clients now open and close files explicitly, and the server tracks state. Key improvements over NFSv2/v3:
- Delegations: the server grants a client exclusive caching rights and recalls them when a conflict arises – the NFS equivalent of oplocks.
- Compound RPC: multiple operations can be packed into a single request, reducing the round trips needed for sequences like path lookups.
- Referrals: servers can redirect clients to alternative servers, enabling transparent file system migration.
- Mandatory TCP: UDP is no longer used.
- Strong authentication: Kerberos support is required, closing the trust-the-client-uid security hole.
The Convergence
The key mechanisms that modern NFS and SMB both now provide, starting from very different origins:
| Mechanism | NFS v2/v3 | NFSv4 | SMB 1 | SMB 2+ |
|---|---|---|---|---|
| Stateful server | No | Yes | Yes | Yes |
| Compound/pipelined requests | No | Yes | No | Yes |
| Client caching grants | No | Yes (delegations) | Yes (oplocks) | Yes (oplocks + leases) |
| Server-to-client notification | No | Yes | Yes | Yes |
| Referrals | No | Yes | Yes (via DFS) | Yes (via DFS) |
| Strong authentication | Optional | Mandatory | NTLM/Kerberos | Kerberos/NTLMv2 |
| Transport | UDP or TCP | TCP only | TCP | TCP |
NFS is dominant in Linux, Unix, and HPC environments. SMB is dominant in Windows enterprise environments and is the default on macOS.
Microsoft’s referral support comes via DFS (Distributed File System), a separate namespace service that has worked alongside SMB since the late 1990s. DFS maps logical paths to physical server locations and issues referrals when clients access those paths. It is not specific to SMB 2 or later; it predates SMB 2 and works across SMB versions.
Consistency Semantics Summary
- POSIX (local): Reads always reflect the most recent write. All processes share a single coherent cache.
- NFS (close-to-open): Freshness is checked on open. Stale reads are possible between opens.
- AFS/Coda (session semantics): Changes become visible to other clients only after the file is closed. Last writer wins on conflict.
- NFSv4/SMB with delegations or oplocks: The server manages caching rights and revokes them on conflict. Approaches POSIX consistency in the common case; still not identical.
What You Do Not Need to Memorize
- Specific years (when protocols were released, when papers were published)
- The names of researchers or paper authors
- That AFS grew out of a specific named project or a particular university-industry collaboration
- The specific oplock types in SMB (Level 1, Level 2, Batch, Filter) and their exact caching permissions
- The detailed API differences between Chubby, ZooKeeper, and etcd
- The internal details of Zab or how it differs from Raft
- The list of NFSv2 RPCs
- The details of specific authentication protocols (AUTH_UNIX, RPCSEC_GSS, Kerberos integration)
- The specific non-Unix operating systems to which NFS was ported
- SMB3 features
Week 7: Decentralized Storage
We cover three distinct approaches to distributing storage and lookup across many nodes: data-intensive distributed file systems (GFS and HDFS), distributed hash tables (CAN, Chord, and Dynamo), and DNS as a planet-scale distributed data store.
Distributed File Systems: GFS and HDFS
Goals and Workload
The Google File System (GFS) was designed for a very specific workload: enormous files (multi-GB), sequential reads and writes, a high rate of concurrent appends, and an environment where commodity hardware failures are routine. It prioritizes throughput over latency and fault tolerance by design rather than by exception.
The key insight driving the design is that if you know your workload in advance, you can build a system optimized for it rather than a general-purpose system that handles everything adequately.
Separation of Data and Metadata
GFS separates the control plane (metadata) from the data plane (file content). A single master node holds all metadata in memory: the namespace, the mapping of files to chunks, and chunk replica locations. Clients contact the master only to find out where data lives. All actual data transfer happens directly between clients and chunkservers, bypassing the master entirely.
This design keeps the master out of the data path, allowing it to handle many concurrent clients without becoming a throughput bottleneck.
Chunks and Replication
Files are divided into fixed-size chunks of 64 MB, each identified by a unique 64-bit handle. Each chunk is stored as an ordinary file on a chunkserver’s local disk and replicated on three chunkservers by default. Replication is what makes the system fault-tolerant: if one chunkserver fails, two others still hold the data.
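Chunk addressing is simple arithmetic over the fixed chunk size; a sketch (the function name is illustrative):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the GFS chunk size

def locate(offset):
    """Map a byte offset in a file to (chunk index, offset within chunk).

    The client asks the master for the chunk handle at that index, then
    transfers data directly with a chunkserver holding a replica.
    """
    return offset // CHUNK_SIZE, offset % CHUNK_SIZE

# Byte 200,000,000 lands in the third chunk (index 2).
chunk_index, chunk_offset = locate(200_000_000)
assert chunk_index == 2
assert chunk_offset == 200_000_000 - 2 * CHUNK_SIZE
```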
Master State and Fault Tolerance
The master stores metadata in memory for fast access. Namespace and file-to-chunk mappings are persisted to an operation log that is replicated on remote machines. Chunk locations are not persisted; the master rebuilds them by polling chunkservers at startup.
The operation log is the authoritative record of the file system. The master periodically checkpoints its state so that log replay after a crash is fast.
The Two-Phase Write
Writes in GFS involve two distinct phases, and understanding why they are separated matters.
Phase 1 (data transfer): The client pushes data to all replicas in a pipelined chain, from one chunkserver to the next. No writes occur yet; the data is simply buffered at each replica.
Phase 2 (write request): Once all replicas confirm receipt, the client sends a write request to the primary replica, the one holding the current lease for that chunk. The primary assigns a serial number to the mutation and applies it locally. It then forwards the write request to all secondaries, which apply the mutation in the same order. The primary acknowledges success to the client after all secondaries respond.
Separating data flow from control flow improves network efficiency. The pipelining in phase 1 ensures each network link carries the data exactly once. The primary’s lease serializes concurrent mutations in phase 2, ensuring all replicas apply changes in the same order.
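The two phases can be sketched as a toy in-memory model. All class and method names here are invented for illustration; there are no real RPCs, leases, or failure paths, which is exactly what the real protocol must handle:

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.buffer = {}      # phase 1: data pushed but not yet written
        self.applied = []     # phase 2: mutations applied in serial order

    def push_data(self, data_id, data):   # phase 1: buffer only, no write
        self.buffer[data_id] = data

    def apply(self, serial, data_id):     # phase 2: apply buffered data
        self.applied.append((serial, self.buffer.pop(data_id)))

class Primary(Replica):
    def __init__(self, name, secondaries):
        super().__init__(name)
        self.secondaries = secondaries
        self.next_serial = 0

    def write(self, data_id):
        serial = self.next_serial         # the lease holder serializes mutations
        self.next_serial += 1
        self.apply(serial, data_id)
        for s in self.secondaries:        # forward the same serial number
            s.apply(serial, data_id)
        return serial

# Phase 1: pipeline data to every replica; phase 2: primary orders the write.
secondaries = [Replica("s1"), Replica("s2")]
primary = Primary("p", secondaries)
for r in [primary] + secondaries:
    r.push_data("d1", b"hello")
primary.write("d1")
assert all(r.applied == [(0, b"hello")] for r in [primary] + secondaries)
```

The key point the sketch preserves: data movement and mutation ordering are independent steps, and only the primary assigns serial numbers.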
Additional Operations
GFS adds two operations beyond the standard file interface. Record append lets GFS choose the offset and guarantees that the data appears atomically at least once, even with concurrent appenders; this supports producer-consumer workflows without explicit locking. Snapshot creates a copy of a file or directory tree cheaply using copy-on-write: chunks are shared with the original until one is modified.
Fault Detection
Chunkservers send heartbeats to the master. A chunkserver that stops sending heartbeats is marked as failed; its chunks are re-replicated from surviving copies. Each chunk stores checksums over its data; a chunkserver detects corruption by verifying checksums on read or during background scans.
When granting a lease, the master also increments the chunk version number and notifies all replicas. Any replica that missed the update because it was offline will have a stale version number; the master will not direct clients to it and will schedule re-replication from an up-to-date copy.
Client Caching
GFS clients do not cache file data. They do cache chunk location metadata returned by the master, with a timeout, so they can read from chunkservers repeatedly without re-querying the master for every request.
The Namespace
GFS does not use per-directory data structures. The namespace is a single flat lookup table mapping full pathnames to metadata. There are no hard links or symbolic links.
HDFS: What Carried Over and What Changed
HDFS (Hadoop Distributed File System) is essentially an open-source re-implementation of GFS in Java. The one-to-one mapping is:
- master = NameNode
- chunkserver = DataNode
- chunk = block
The architecture, replication strategy, and fault-detection mechanisms remain the same.
Some key differences are:
- HDFS uses a 128 MB default block size (vs. GFS’s 64 MB).
- The original HDFS NameNode was a single point of failure with no automatic failover; later versions added high availability using ZooKeeper.
- HDFS Federation allows multiple independent NameNodes for different portions of the namespace, addressing the metadata scalability limit of the single-master design.
Beyond GFS
GFS served Google well for nearly a decade. As files became smaller and more numerous, the single master became a metadata bottleneck. Google replaced GFS with Colossus, which distributes the metadata function. HDFS followed the same direction with Federation. Both converged on the same lesson: at sufficient scale, metadata is itself a distributed systems problem.
Dropbox
Dropbox applies the same core principle as GFS (separate data from metadata) to a consumer file synchronization service. Its design decisions are worth understanding because it faced a different workload and scaling problem than internal batch-processing systems.
The original design goal of Dropbox was to keep a designated folder synchronized across all of a user’s devices. Any change on one device should propagate to the server and then to all other connected devices, transparently and in the background.
Unusual read/write ratio: Most content sites (social media, news, streaming) are heavily read-dominated: data is written once and read many times. Dropbox is close to a 1:1 ratio, because stored data is rarely read except to push a change to another device. This means Dropbox must handle an unusually high volume of uploads relative to its user base.
Chunking and deduplication: Dropbox splits files into fixed-size blocks, identifies each block by its SHA-256 hash, and uploads only the blocks the server does not already have. If two users store identical content, only one copy exists on the server. This dramatically reduces storage and upload volume.
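The chunk-and-hash scheme can be sketched as follows. The block size here is an arbitrary illustrative choice, and the server-side "store" is just a dict standing in for the block store:

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # illustrative block size, not Dropbox's actual value

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def sync_upload(data: bytes, server_store: dict) -> int:
    """Upload only blocks the server lacks; return how many were sent."""
    uploaded = 0
    for block in split_into_blocks(data):
        digest = hashlib.sha256(block).hexdigest()
        if digest not in server_store:   # dedup: skip blocks already stored
            server_store[digest] = block
            uploaded += 1
    return uploaded

store = {}
n1 = sync_upload(b"A" * 10_000_000, store)  # first upload sends new blocks
n2 = sync_upload(b"A" * 10_000_000, store)  # identical content sends nothing
assert n1 > 0 and n2 == 0
```

Note that deduplication is content-addressed: identical blocks collapse to one stored copy regardless of which file, or which user, they came from.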
The metadata server: Like GFS, Dropbox separates block data from metadata. Block data (actual file content) lives in a scalable block store. The metadata server stores file names, directory structure, the list of blocks that make up each file, and version history. When a client downloads a changed file, it asks the metadata server for the current block list and then fetches only the blocks it is missing from the block store. The metadata server is never in the data path for actual file content.
Polling vs. notifications: Early Dropbox clients polled the server every few seconds, asking “anything new?” With tens of thousands of clients polling simultaneously, most server responses were just “no.” The solution was a notification server. Clients hold a persistent connection to the notification server. When a change occurs, the server pushes a message to the affected clients, telling them to sync. Clients then contact the metadata server to find out what changed. This shifts the server from constantly answering empty polls to only sending messages when there is actually something to say.
Distributed Hash Tables
A Distributed Hash Table (DHT) is a decentralized system that provides a key-value lookup interface: given a key, find the value. There is no central server. Responsibility for keys is divided among participating nodes, and any node can route a query to the correct node in a bounded number of hops.
The motivation was peer-to-peer networks. Centralized index servers (like Napster) are vulnerable to shutdown. A DHT provides the same lookup functionality without any central authority.
Consistent Hashing
All DHTs rely on consistent hashing. Keys are hashed into a large identifier space, and each node is responsible for a contiguous range of values within that space. Adding a new node takes over a portion of an existing node’s range; removing a node transfers its range to neighboring nodes. In either case, only the keys in the affected range move. This minimizes key movement under membership changes. Different DHTs implement the range assignment differently: CAN uses geometric zones in a multi-dimensional space; Chord uses a one-dimensional ring with successor pointers.
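A minimal consistent-hashing ring in the successor style, kept as a sorted list of node positions (a sketch, not any system's actual implementation):

```python
import bisect
import hashlib

def h(s: str) -> int:
    # Hash into a large identifier space (160-bit SHA-1 here, purely illustrative).
    return int.from_bytes(hashlib.sha1(s.encode()).digest(), "big")

class Ring:
    def __init__(self):
        self.points = []   # sorted (hash, node) positions on the ring

    def add(self, node: str):
        bisect.insort(self.points, (h(node), node))

    def remove(self, node: str):
        self.points.remove((h(node), node))

    def lookup(self, key: str) -> str:
        # Owner is the successor of the key's hash, wrapping around the ring.
        i = bisect.bisect_right(self.points, (h(key), ""))
        return self.points[i % len(self.points)][1]

ring = Ring()
for n in ["node-a", "node-b", "node-c"]:
    ring.add(n)
owner_before = ring.lookup("some-key")
ring.add("node-d")  # only keys falling in node-d's new range change owners
```

The defining property: adding or removing a node changes ownership only for keys in that node's range, which the test below checks by removing a non-owner.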
CAN
CAN (Content Addressable Network) maps keys to points in a d-dimensional Cartesian coordinate space. Each node owns a rectangular zone. A lookup for a key hashes to a point in the space and routes to that point: each hop moves closer. The expected lookup time is O(d · n^(1/d)) hops with O(d) neighbors per node (roughly 2d neighbors for a d-dimensional space). Node joins split an existing zone; departures merge zones with a neighbor.
Chord
Chord organizes nodes and keys on a one-dimensional ring using SHA-1 hashes. A key belongs to its successor: the node with the smallest ID greater than or equal to the key’s hash. Pure successor-pointer routing takes O(n) hops. The finger table at each node stores pointers to nodes at exponentially increasing distances around the ring, enabling O(log n) lookup. Node joins and departures are managed by a stabilization protocol that keeps successor pointers and finger tables up to date.
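The finger-table targets can be computed directly: finger i of node n points at the successor of (n + 2^i) mod 2^m. A sketch in a tiny 8-bit identifier space (real Chord uses 160-bit SHA-1 IDs; the node IDs below are invented):

```python
M = 8  # 8-bit identifier space for illustration

def finger_targets(node_id: int, m: int = M):
    """Ring positions a node's finger table points at: (n + 2^i) mod 2^m."""
    return [(node_id + 2**i) % 2**m for i in range(m)]

def successor(ring_nodes, ident):
    """Smallest node ID >= ident on the ring, wrapping around."""
    nodes = sorted(ring_nodes)
    for n in nodes:
        if n >= ident:
            return n
    return nodes[0]  # wrap past the top of the ring

nodes = [10, 70, 130, 200]
# Node 10's fingers jump exponentially far around the ring, which is what
# turns O(n) successor-chasing into O(log n) routing.
assert finger_targets(10) == [11, 12, 14, 18, 26, 42, 74, 138]
assert [successor(nodes, t) for t in finger_targets(10)] == \
       [70, 70, 70, 70, 70, 70, 130, 200]
```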
Amazon Dynamo
Dynamo is Amazon’s internal key-value store, designed to power shopping cart and similar services with strict availability and latency requirements. It builds on consistent hashing but adds mechanisms that make it a complete production storage system rather than a lookup primitive.
Virtual nodes: Each physical server owns many positions (vnodes) on the ring. This balances the load more evenly and spreads the impact of failures and joins across many existing nodes rather than concentrating it on one.
Replication: Each key is replicated on N consecutive nodes on the ring (N is typically 3). This preference list is known to all nodes.
Quorum reads and writes: A write completes after W replicas acknowledge it; a read completes after R replicas respond. With R + W > N, every read is guaranteed to overlap with the most recent write. With N=3, R=2, W=2, the system tolerates one replica failure without blocking either reads or writes.
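The overlap guarantee is pigeonhole arithmetic: if R + W > N, any read quorum and any write quorum must share at least one replica. A quick check, including a brute-force verification over all quorum pairs for N=3:

```python
from itertools import combinations

def quorum_overlaps(n: int, r: int, w: int) -> bool:
    """R + W > N guarantees every read quorum intersects every write quorum."""
    return r + w > n

# Dynamo's common configuration: N=3, R=2, W=2.
assert quorum_overlaps(3, 2, 2)       # any 2 readers overlap any 2 writers
assert not quorum_overlaps(3, 1, 2)   # R=1, W=2 can read a stale replica

# Exhaustive check: every pair of 2-replica quorums out of 3 intersects.
replicas = {0, 1, 2}
assert all(set(rq) & set(wq)
           for rq in combinations(replicas, 2)
           for wq in combinations(replicas, 2))
```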
Conflict resolution: Dynamo uses eventual consistency. Concurrent writes may produce conflicting versions. Dynamo attaches vector clocks to values so that the system (and application) can detect divergence. Applications are responsible for merging conflicts; for a shopping cart, this means taking the union of both versions.
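Divergence detection with vector clocks reduces to a componentwise comparison: one version happened before another only if its clock is less than or equal in every component; if neither dominates, the writes were concurrent. A sketch with invented helper names, using the shopping-cart union merge described above:

```python
def compare(vc1: dict, vc2: dict):
    """Return 'before', 'after', 'equal', or 'concurrent' for two vector clocks."""
    nodes = set(vc1) | set(vc2)
    le = all(vc1.get(n, 0) <= vc2.get(n, 0) for n in nodes)
    ge = all(vc1.get(n, 0) >= vc2.get(n, 0) for n in nodes)
    if le and ge:
        return "equal"
    if le:
        return "before"
    if ge:
        return "after"
    return "concurrent"   # divergent versions: the application must merge

# Two carts written on different replicas are concurrent; merge by union.
v1, cart1 = {"nodeA": 2, "nodeB": 1}, {"book"}
v2, cart2 = {"nodeA": 1, "nodeB": 2}, {"pen"}
assert compare(v1, v2) == "concurrent"
assert cart1 | cart2 == {"book", "pen"}   # shopping-cart merge: take the union
```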
Hinted handoff: When a preferred replica is unavailable, a write goes to a different node with a “hint” indicating its intended destination. The hinted node delivers the write to the intended replica once it recovers.
Gossip protocol: Membership and failure information propagate through the cluster via gossip: periodic exchanges with randomly selected peers rather than through a central coordinator.
Dynamo vs. Classic DHTs
| Property | Chord | Dynamo |
|---|---|---|
| Purpose | Decentralized routing | Production key-value store |
| Routing | O(log n) hops through intermediaries | One-hop to target (full ring state known) |
| Consistency | Not a data store; n/a | Eventual consistency |
| Replication | Not included | N replicas, quorum (R, W) |
| Deployment | Designed for open internet/churn | Stable data center environment |
| Conflict handling | n/a | Vector clocks + application merge |
Domain Name System (DNS)
DNS maps human-readable domain names to IP addresses. It is a hierarchical, distributed, cacheable database that handles hundreds of billions of queries per day without a central server.
Hierarchy and Delegation
The DNS namespace is a tree. Responsibility is delegated at each level: the root delegates top-level domains (TLDs, like .edu), TLD operators delegate second-level domains (like rutgers.edu), organizations delegate subdomains (like cs.rutgers.edu). Each delegated zone is managed independently by its owner. No single server knows all mappings.
Resolution
When an application looks up a name, it calls a local library function that hands the query to a stub resolver: a minimal client built into the operating system that checks the local hosts file, checks its cache, and if neither has the answer, forwards the query to a recursive resolver.
The recursive resolver (typically provided by an ISP or a public service like 8.8.8.8) does the actual work. It performs iterative resolution by walking down the hierarchy.
- It starts by contacting a root server. Root servers are themselves authoritative servers for the root zone: they do not know the final answer, but they know which servers are authoritative for each TLD and refer the resolver there.
- The resolver then contacts the TLD authoritative server, which refers it to the authoritative server for the specific domain. That server holds the actual record and returns the answer.
- The resolver passes the result back to the stub resolver, which returns it to the application.
Each level in this chain is authoritative for its own zone and only its own zone. No server needs global knowledge; each one knows only who to refer to next.
Caching and TTL
Every DNS response carries a Time To Live (TTL). Resolvers cache responses for the TTL duration, answering subsequent queries without hitting authoritative servers. Caching is aggressive and layered: the recursive resolver, the OS, and the browser all cache. TTL is a tunable trade-off: longer TTL means lower load on authoritative servers but slower propagation of updates; shorter TTL means faster updates but higher query load.
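The cache logic itself is simple: store each answer with an expiry time and treat anything past expiry as a miss. A sketch (the injectable `now` parameter exists only to make the illustration deterministic; the IP address is just example data):

```python
import time

class TTLCache:
    def __init__(self):
        self.entries = {}   # name -> (answer, expiry_time)

    def put(self, name, answer, ttl_seconds, now=None):
        now = time.time() if now is None else now
        self.entries[name] = (answer, now + ttl_seconds)

    def get(self, name, now=None):
        now = time.time() if now is None else now
        entry = self.entries.get(name)
        if entry and now < entry[1]:
            return entry[0]        # cache hit: no authoritative query needed
        return None                # miss or expired: ask upstream

cache = TTLCache()
cache.put("example.com", "93.184.216.34", ttl_seconds=300, now=1000.0)
assert cache.get("example.com", now=1100.0) == "93.184.216.34"  # within TTL
assert cache.get("example.com", now=1400.0) is None             # TTL expired
```

The TTL trade-off is visible in the two asserts: a longer TTL extends the window of cheap hits, and equally extends the window in which a stale answer is served.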
Limitations
DNS is not strongly consistent: clients may hold stale cached answers until the TTL expires. It is read-mostly: updates propagate through TTL expiry, not through a distributed write protocol. It is not searchable: you can look up a name but not query across names. And it was not designed with security: DNS spoofing and cache poisoning are real attacks; DNSSEC adds cryptographic signatures to mitigate them but deployment is incomplete.
DNS as a Design Pattern
DNS illustrates what makes a distributed system scale globally: hierarchical organization, delegation, aggressive caching with bounded staleness, and a workload that is overwhelmingly read-heavy. These properties keep the query load from concentrating on any single server while allowing each zone owner to manage its own data independently.
What You Don’t Need to Study
- The specific block size used by Dropbox
- Specific hash function bit widths or ring sizes used in Chord or Dynamo implementations
- The original GFS, Chord, CAN, or Dynamo paper authors and years
- GFS cluster sizes from the Google paper
- The DNSSEC protocol (secure DNS: adding signatures to DNS responses)
- Specific DNS record types (A, AAAA, MX, CNAME, etc.) beyond understanding that they exist
- The history of HDFS version numbers or release dates
- Any details of Colossus’s internal design (it is not published)
Week 8: Distributed Transactions
Transactions and ACID
A transaction is a sequence of operations treated as a single logical unit of work. A transaction either commits (all changes made permanent) or aborts (all changes rolled back). Transactions are specifically designed to be abortable: aborting leaves no partial state behind. The write-ahead log makes this work: changes are written to a sequential log before being applied to data, allowing recovery to redo committed transactions and undo incomplete ones after a crash.
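The redo/undo idea behind write-ahead logging can be sketched with an in-memory log. This is a toy model (real WALs use log sequence numbers, checkpoints, and careful fsync ordering), but it shows why logging before applying makes both commit and abort recoverable:

```python
def recover(log, data):
    """log: list of ('write', txn, key, old, new) and ('commit', txn) records."""
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    for rec in log:                        # redo pass, in log order
        if rec[0] == "write" and rec[1] in committed:
            data[rec[2]] = rec[4]          # reapply the new value
    for rec in reversed(log):              # undo pass, newest first
        if rec[0] == "write" and rec[1] not in committed:
            data[rec[2]] = rec[3]          # restore the old value
    return data

log = [
    ("write", "T1", "x", 0, 5),
    ("commit", "T1"),
    ("write", "T2", "y", 1, 9),            # T2 never committed before the crash
]
state = recover(log, {"x": 0, "y": 9})     # crash left T2's write on disk
assert state == {"x": 5, "y": 1}           # T1 redone, T2 undone
```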
The ACID properties define correctness for transactions: Atomicity (all-or-nothing), Consistency (valid state to valid state), Isolation (concurrent transactions do not observe each other’s partial results), and Durability (committed changes survive crashes). These properties are straightforward to provide on a single machine. In a distributed setting, each property requires expensive coordination: atomicity needs 2PC, isolation needs distributed locking, and the availability cost multiplies with every participant.
Note that the “consistency” in ACID refers to application-level data integrity constraints. The “consistency” in distributed systems and the CAP theorem refers to what values a read is allowed to return across replicas. The two uses of the word are unrelated.
Concurrency Control
Concurrency control enforces isolation. The standard goal is serializability: concurrent transactions must produce results equivalent to some serial execution. A schedule is a sequence of interleaved operations from concurrent transactions; a serializable schedule is one equivalent to some serial execution.
There are two main approaches: pessimistic concurrency control assumes conflicts are likely and prevents them using locks; optimistic concurrency control assumes conflicts are rare and checks for them only at commit time.
Two-phase locking (2PL) is the standard pessimistic protocol. A transaction has a growing phase (acquires locks, releases none) and a shrinking phase (releases locks, acquires none). This rule prevents the interleavings that produce inconsistent reads. Strict 2PL holds write locks until commit or abort, preventing cascading aborts. SS2PL holds all locks until commit or abort and is implemented by many commercial databases.
Read (shared) locks allow multiple concurrent readers. Write (exclusive) locks grant exclusive access. Multiple read locks can coexist; a write lock conflicts with all other locks.
Optimistic concurrency control (OCC) proceeds in three phases: working (reads and writes go to a private workspace, no locks held), validation (check for conflicts at commit time), and update (apply the workspace if validation passes). OCC is deadlock-free and efficient when conflicts are rare, but transactions may be aborted and restarted after completing all their work.
Multi-Version Concurrency Control (MVCC) maintains multiple versions of each data item. A common design gives each transaction a snapshot at start time and returns the newest committed version visible in that snapshot. This is called snapshot isolation. Different systems implement the exact visibility rules differently, but reads never block because they always draw from a stable snapshot. Write-write conflicts are resolved at commit time by a first-committer-wins rule.
Deadlock
Deadlock arises from locking: a set of transactions each hold locks needed by another in the set, forming a cycle nobody can break. Four conditions must hold simultaneously: mutual exclusion, hold and wait, non-preemption, and circular wait. OCC and MVCC do not hold blocking locks and are therefore deadlock-free.
The wait-for graph (WFG) represents lock dependencies: an edge from T1 to T2 means T1 is waiting for a lock held by T2. A cycle indicates deadlock. In a distributed system, each node sees only its local WFG edges; a deadlock can span multiple machines with no single node seeing the full cycle.
Three practical approaches exist. Ignoring deadlocks relies on application-level timeouts; this is acceptable in some systems but not in transactional databases, where a slow-but-live node triggers the same timeout as a genuinely deadlocked one. Detection finds cycles after they form; prevention makes cycles structurally impossible.
Centralized detection has one node collect all local WFGs and search for cycles. It is simple but produces phantom deadlocks: false positives caused by asynchronous snapshot collection, where an edge appears in the global graph after the underlying lock has already been released. The Chandy-Misra-Haas algorithm avoids the global snapshot by chasing edges with probe messages. A blocked transaction T0 sends a probe to the node holding the resource. The probe propagates along dependency edges; if it returns to T0, a cycle exists. When a deadlock is confirmed, the system aborts at least one transaction, typically the youngest or the one that has done the least work.
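Once a global WFG is assembled, finding a deadlock is ordinary graph search. A sketch using depth-first search with the standard three-color marking (this is the detection step itself, not the snapshot collection or probe protocol):

```python
def has_cycle(wfg: dict) -> bool:
    """Detect a cycle in a wait-for graph given as {txn: [txns it waits for]}."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {t: WHITE for t in wfg}

    def dfs(t):
        color[t] = GRAY                          # on the current DFS path
        for u in wfg.get(t, []):
            if color.get(u, WHITE) == GRAY:      # back edge: cycle found
                return True
            if color.get(u, WHITE) == WHITE and dfs(u):
                return True
        color[t] = BLACK                         # fully explored, no cycle here
        return False

    return any(color[t] == WHITE and dfs(t) for t in wfg)

# T1 waits for T2, T2 waits for T1: a classic two-transaction deadlock.
assert has_cycle({"T1": ["T2"], "T2": ["T1"]})
assert not has_cycle({"T1": ["T2"], "T2": ["T3"], "T3": []})
```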
Timestamp-based prevention assigns each transaction a unique timestamp at start time and uses it to decide who waits and who aborts on a conflict, ensuring that WFG edges always point in the same direction so cycles are impossible. The two standard schemes are wait-die and wound-wait.
Two-Phase Commit (2PC)
Two-Phase Commit ensures all nodes in a distributed transaction either all commit or all abort.
The protocol uses a coordinator-participant model. In Phase 1 (Prepare/Voting), the coordinator sends a PREPARE message to all participants. Each participant checks whether it can commit, writes a prepare record to stable storage, and responds YES or NO. A YES vote is a durable promise: the participant must be able to commit even after a crash. If a participant fails to respond, the coordinator waits and keeps retrying; it does not treat silence as a NO. The protocol assumes a fail-recover model.
In Phase 2 (Commit or Abort), if all participants voted yes, the coordinator writes a commit record and broadcasts COMMIT. If any participant voted no, the coordinator broadcasts ABORT. Participants execute the decision and release their locks.
2PC requires unanimous agreement, not a majority. Any single participant can veto, and any unresponsive participant blocks the protocol indefinitely. This is inherent: the transaction must complete everywhere or nowhere, so no majority shortcut is available.
The critical vulnerability is the uncertain state: a participant that has voted yes but has not received the coordinator’s decision cannot unilaterally commit or abort. If the coordinator fails in this window, all participants block with locks held. 2PC is a blocking protocol. The practical remedy is to replace the single coordinator with a Raft- or Paxos-replicated group.
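The coordinator's decision rule can be sketched as a sequential toy. All names are invented, and the model deliberately omits the network, crashes, and stable-storage logging, which are exactly the parts that make real 2PC hard:

```python
def two_phase_commit(participants) -> str:
    """participants: objects with prepare() -> bool, commit(), and abort()."""
    # Phase 1 (voting): collect votes; a single NO vetoes the transaction.
    votes = [p.prepare() for p in participants]
    decision = "COMMIT" if all(votes) else "ABORT"
    # (A real coordinator durably logs the decision here, before phase 2.)
    # Phase 2: broadcast the unanimous decision.
    for p in participants:
        p.commit() if decision == "COMMIT" else p.abort()
    return decision

class Participant:
    def __init__(self, can_commit: bool):
        self.can_commit, self.state = can_commit, "working"
    def prepare(self):
        # A YES vote is a durable promise to commit if told to.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit
    def commit(self):
        self.state = "committed"
    def abort(self):
        self.state = "aborted"

ps = [Participant(True), Participant(True), Participant(False)]
assert two_phase_commit(ps) == "ABORT"      # one NO vote vetoes everyone
assert all(p.state == "aborted" for p in ps)
```

The uncertain state described above corresponds to a participant stuck in `"prepared"` because the coordinator died between the two phases; nothing in this sketch can resolve it, which is the point.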
Availability cost: 2PC chains availability multiplicatively. At 99.9% per database, a five-database transaction has only 0.999^5 ≈ 99.5% availability, close to two days of downtime per year. This cost is a major driver of BASE-style architectures at internet scale.
Three-Phase Commit (3PC)
Three-Phase Commit was designed to eliminate 2PC’s blocking behavior. It inserts a PreCommit phase between voting and the final commit so a recovery coordinator can determine the intended decision: if any participant has seen PreCommit, commit; if none has, abort.
3PC eliminates blocking under single-node failures but assumes a synchronous network with bounded message delay. A partition during the PreCommit phase can cause split-brain behavior. Because of this, 3PC is not used in practice.
How 2PC Relates to Raft, Paxos, and Virtual Synchrony
Virtual Synchrony provides atomic multicast within a process group and is fast, but cannot survive network partitions. Raft and Paxos are fault-tolerant consensus algorithms that use majority agreement and are useful for making the 2PC coordinator fault-tolerant, but neither has a concept of a participant vetoing a decision and neither can substitute for 2PC itself. 2PC uses unanimous agreement and is designed specifically for transactional atomicity. In practice, Raft or Paxos replicates state within each participant group, and 2PC coordinates across groups.
ACID
ACID formalizes the correctness requirements that concurrency control and 2PC are designed to satisfy.
Atomicity is all-or-nothing; 2PC achieves this across multiple nodes. Consistency ensures valid state transitions; the database enforces integrity constraints by aborting violations. Isolation prevents concurrent interference; locking, OCC, and MVCC enforce this. Durability ensures committed changes survive crashes; the write-ahead log provides this.
Consistency Models
Consistency models define what values a read is allowed to return given a history of writes across replicas. Stronger models give more intuitive guarantees at higher coordination cost.
Linearizability is the strongest practical model. Every operation appears to take effect instantaneously at some point between its invocation and completion, in an order consistent with real time. Two key ideas follow from this:
- Each operation appears atomic (all at once, not partially), and
- if one operation finishes before another begins, the first must appear earlier in the order everywhere.
Linearizability does not require wall-clock timestamps for overlapping operations: if two operations overlap in time, either order is valid; only non-overlapping operations must respect real-time ordering. etcd provides linearizability for all operations; ZooKeeper provides linearizable writes but sequentially consistent reads by default. Linearizability is the definition of “C” in CAP.
Sequential consistency relaxes the real-time requirement of linearizability. There must exist some global total order of all operations consistent with each process’s program order, but that order need not match wall-clock time.
Under linearizability, if a write completes before a read begins, the read must see that write. Under sequential consistency, the system may order them differently as long as each client’s own program order is preserved.
Causal consistency only requires causally related operations to appear in the same order for all processes. Causally independent operations may be seen in different orders. Vector clocks implement this model.
Eventual consistency is the weakest useful model. All replicas will eventually converge if no new updates arrive, but there is no constraint on what reads return in the interim.
Serializability and Linearizability
Serializability is a property of transactions (multi-step, multi-object operations). It requires that concurrent transactions produce results equivalent to some serial execution, with no real-time constraint on which order is chosen.
Linearizability is a property of individual operations on a single object. It requires each operation to appear instantaneous in an order consistent with real time.
The two properties are independent: a database can be serializable without being linearizable, and a key-value store can be linearizable without supporting transactions.
The CAP Theorem
The CAP theorem states that when a network partition occurs, a distributed system cannot simultaneously guarantee both consistency (specifically linearizability) and availability. Partition tolerance is not a design choice; real networks partition. The practical choice is between C and A when a partition actually happens.
- CP systems return errors during a partition rather than serve potentially stale data.
- AP systems continue to serve requests but may return stale data.
CAP is commonly summarized as “you can have at most two of C, A, and P.” That framing is imprecise: partition tolerance is a constraint imposed by the network, not a property you trade away. The C-versus-A trade-off only arises during a partition; when the network is healthy, a well-designed system can provide both.
PACELC
PACELC extends CAP by observing that even during normal operation, there is a trade-off between latency and consistency. Strong consistency requires coordinating writes across a quorum before responding, which adds latency. Eventual consistency allows a local replica response, which is faster. PACELC classifies systems as PA/EL, PA/EC, or PC/EC. Cassandra and Dynamo are PA/EL; Spanner and HBase are PC/EC.
BASE
BASE (Basically Available, Soft State, Eventually Consistent) is the design philosophy adopted by large-scale internet systems as a response to the constraints revealed by CAP and PACELC. If strong consistency imposes availability costs during partitions and latency costs during normal operation, you design systems that accept weaker consistency in exchange for higher availability and lower latency. BASE is not a protocol; it shifts the burden of handling inconsistency from the system to the application.
ACID vs. BASE
The choice is driven by application requirements. Financial transfers and medical records need ACID; social feeds and product catalogs can tolerate BASE semantics. Many modern systems are hybrid, using ACID for core transactional data and BASE for derived or display data.
Key Relationships
| Concept | Notes |
|---|---|
| Concurrency control | Pessimistic (2PL/SS2PL) vs. optimistic (OCC) vs. versioned (MVCC) |
| Commit protocol | 2PC (unanimous, blocking, practical) vs. 3PC (non-blocking but impractical) |
| Consistency model strength | Linearizability → Sequential → Causal → Eventual |
| CAP choice during partition | CP (block rather than serve stale) vs. AP (serve stale rather than block) |
| Design philosophy | ACID (correctness over availability) vs. BASE (availability over consistency) |
You Don’t Need to Study
The following topics are covered in the lecture notes for completeness but will not appear on exams or homework.
- Sagas. A microservices design pattern worth knowing about, but outside the scope of this course.
- The difference between wait-die and wound-wait. You should know that timestamp-based schemes prevent deadlock by construction; you do not need to memorize which rule applies to which case.
- Strong serializability. You should understand that serializability and linearizability are independent properties; their combination is covered in the notes but is not a focus of assessment.
- Sequential consistency. You should know that linearizability requires real-time ordering and that causal consistency requires ordering of causally related operations. Sequential consistency sits between them but is not the target model of any common system you are likely to encounter.
- The distinction between strict 2PL and SS2PL. You should know that holding all locks until commit or abort prevents cascading aborts and is standard practice. You do not need to know the formal naming difference between the two variants.
- CRDTs and Strong Eventual Consistency (SEC). You should know that eventually consistent systems need a strategy for reconciling conflicting concurrent writes. CRDTs and SEC are one principled answer, but the details are beyond the scope of this course.
- Leases. A useful concept in distributed lock management, but not a focus of this course.