Coordination Services
Distributed systems frequently need to answer questions like: who is the current leader? Is this lock held? What is the current configuration? Getting these answers wrong can be catastrophic. If two nodes both believe they are the leader, they will independently accept writes and the system state will diverge.
The natural approach is a dedicated replicated coordinator made fault-tolerant through consensus. A coordination service is exactly that: a small, strongly consistent, highly available store that distributed applications use to coordinate decisions and share small amounts of control-plane state. Every coordination service we study – Chubby, ZooKeeper, and etcd – uses consensus internally. This is not incidental: the problems they solve are precisely the problems that require consensus.
What Coordination Services Provide
All three services share a core set of capabilities:
- A persistent, consistent store for small amounts of metadata (not application data)
- Leases or session timeouts that automatically expire when a client fails, allowing self-cleaning locks and registrations
- Watches or events that notify clients when data changes, eliminating polling
- Atomic operations (compare-and-swap or conditional transactions) that prevent races when multiple clients contend for the same resource
- Strong consistency guarantees so that all clients see the same current state
Chubby
Chubby is a lock service and configuration store from Google. A Chubby deployment is a cell of five replicas, one of which is the master elected via Paxos. The other replicas participate in consensus but redirect client requests to the master. Three of five replicas must be alive for the cell to function, allowing two simultaneous failures.
Chubby exposes a file system interface: locks, configuration data, and service addresses are all stored as named files in a hierarchical namespace. Locks are advisory and coarse-grained, meaning they are held for long periods and are not enforced by the system if code chooses to ignore them.
When a client opens a file it receives a lease – a time-bounded guarantee that its cached copy is valid. The master sends cache invalidations to other clients when a file is written. If the master fails, a new one is elected, broadcasts a new epoch, and gives clients a grace period to reconnect. Clients that do not reconnect within the grace period have their sessions and locks released.
ZooKeeper
ZooKeeper was developed at Yahoo and is open-source. Rather than providing locks as a primitive, it provides building blocks from which locks, leader election, barriers, and other coordination patterns can be constructed.
Data is stored in a tree of znodes, each holding a small amount of data. There are two types of znodes:
- Persistent znodes survive client disconnection and remain until explicitly deleted.
- Ephemeral znodes are automatically deleted when the client session that created them ends. This is the key mechanism for failure detection.
Either type can optionally be created as a sequential znode, which causes ZooKeeper to append a monotonically increasing integer to the name. This is essential for implementing locks without thundering-herd problems.
The thundering herd problem occurs when many clients are all waiting for the same condition and a single state change wakes them all simultaneously. They all rush to retry at once, creating a burst of load on the coordination service, while only one of them can make progress. In a ZooKeeper lock implementation, the fix is to have each waiting client watch only its immediate predecessor in the queue rather than the lock node itself. When the lock is released, exactly one client is notified instead of all of them.
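The watch-your-predecessor lock recipe can be sketched with an in-memory stand-in for ZooKeeper's sequential nodes (the `LockQueue` class and its methods are invented for illustration, not a real ZooKeeper API): each contender takes the next sequence number, and a waiter registers a watch only on the node immediately before it, so a release wakes exactly one client.

```python
# Minimal in-memory sketch of the ZooKeeper lock recipe (not a real client):
# each contender creates a sequential node and watches only its immediate
# predecessor, so a release notifies exactly one waiter, not the whole herd.

class LockQueue:
    def __init__(self):
        self.counter = 0
        self.nodes = []            # sequence numbers of live contenders, in order
        self.watchers = {}         # watched seq -> (waiter seq, callback)

    def contend(self, notify):
        seq = self.counter
        self.counter += 1
        self.nodes.append(seq)
        if seq == self.nodes[0]:
            return seq, True       # lowest sequence number: lock acquired
        pred = max(n for n in self.nodes if n < seq)
        self.watchers[pred] = (seq, notify)   # watch the predecessor only
        return seq, False

    def release(self, seq):
        self.nodes.remove(seq)
        if seq in self.watchers:   # wake exactly one waiter
            waiter_seq, notify = self.watchers.pop(seq)
            notify(waiter_seq)

events = []
q = LockQueue()
a, held_a = q.contend(events.append)
b, held_b = q.contend(events.append)
c, held_c = q.contend(events.append)
q.release(a)                       # only b is notified; c keeps waiting
```

Releasing the lock notifies only the next contender in line; a naive design that watched the lock node itself would have woken both waiters.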
ZooKeeper uses Zab, an atomic broadcast protocol similar to Raft, to replicate writes through a leader in a globally consistent order. Reads can be served by any replica and are sequentially consistent; clients can issue a sync to ensure a replica has caught up to recently committed writes before reading.
Watches are one-shot notifications. A client sets a watch when it reads a znode (checking existence, data, or children), and ZooKeeper delivers an event if the relevant state changes. The watch fires once and is then removed; the client must re-register if it wants continued monitoring. The usual pattern is to treat the event as “something changed,” re-read the current state from ZooKeeper, and re-register the watch. This keeps clients consistent even if multiple changes occur quickly or during a brief disconnection.
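The re-read-and-re-register pattern can be sketched with an in-memory znode (the `Znode` class is an invented stand-in, not a ZooKeeper client API): because the client re-reads the current state on every event and immediately re-registers, it converges on the latest value even when writes arrive back to back.

```python
# Sketch of the one-shot watch pattern: on each event the client re-reads
# the CURRENT state and re-registers the watch, so no update is missed even
# if several writes occur in quick succession. (In-memory stand-in only.)

class Znode:
    def __init__(self, data):
        self.data = data
        self.watches = []          # one-shot callbacks

    def watch(self, cb):
        self.watches.append(cb)

    def set(self, data):
        self.data = data
        fired, self.watches = self.watches, []   # fire once, then remove
        for cb in fired:
            cb()

seen = []
node = Znode(b"v1")

def observe():
    seen.append(node.data)         # re-read the current state...
    node.watch(observe)            # ...then re-register the watch

observe()                          # initial read plus watch registration
node.set(b"v2")
node.set(b"v3")
```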
etcd
etcd was created by CoreOS and is the authoritative store for Kubernetes cluster state. It uses Raft for consensus. Unlike ZooKeeper’s hierarchical namespace, etcd stores a flat key-value map: keys are arbitrary byte strings, and hierarchy is a naming convention, not an enforced structure. Prefix range queries and prefix watches provide directory-like behavior.
etcd routes reads through the leader by default, giving linearizable reads – reads that reflect the most recently committed write, as if the entire system had a single consistent view at that instant. Serializable reads (served locally by any replica) are available as an opt-in for workloads that can tolerate slight staleness. Leases with TTLs provide the same self-cleaning behavior as ZooKeeper’s ephemeral znodes. Transactions with compare-and-swap conditions enable atomic leader election and lock acquisition.
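The election-by-transaction idea can be sketched against an in-memory map (the `KV` class and `txn_create_if_absent` are invented stand-ins; a real etcd transaction would phrase the guard as `create_revision(key) == 0` and attach a lease to the value): whoever creates the leader key first wins, and lease expiry reopens the contest.

```python
# Sketch of compare-and-swap leader election in the style of an etcd
# transaction ("create this key only if it does not exist"), using an
# in-memory dict instead of a real etcd cluster.

class KV:
    def __init__(self):
        self.data = {}

    def txn_create_if_absent(self, key, value):
        # etcd would express this as: if create_revision(key) == 0, put(key, value)
        if key not in self.data:
            self.data[key] = value
            return True            # this client won the election
        return False               # someone else is already leader

kv = KV()
won_a = kv.txn_create_if_absent("/election/leader", "node-a")
won_b = kv.txn_create_if_absent("/election/leader", "node-b")

del kv.data["/election/leader"]    # stand-in for lease expiry after a crash
won_b2 = kv.txn_create_if_absent("/election/leader", "node-b")
```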
Common Coordination Patterns
These patterns apply to all three coordination services; only the specific primitives differ.
Leader election. Replicas contend for a well-known name in the coordination service. Exactly one wins. The winner’s claim disappears when its session expires, allowing others to contend again.
Distributed locks. Acquiring the lock is a write through consensus, giving global ordering. Locks built on ephemeral nodes or leases are self-cleaning: a crashed holder’s session expires and the lock releases automatically. Coordination services are suited to coarse-grained locks: locks held for long periods protecting large resources (a master election, a configuration update). They are not suited to fine-grained locks held for milliseconds to protect individual rows or records. High-frequency lock acquisitions and releases would overwhelm a system built around consensus.
Configuration management. Services store configuration in the coordination service. Updates go through consensus and are applied consistently. Clients watch configuration keys for changes.
Service discovery. A running instance registers its address under a known prefix using an ephemeral key. The list of active instances stays current because dead servers’ keys expire automatically.
Fencing tokens. A monotonically increasing number associated with each lock grant. The protected resource rejects any request carrying a token lower than the highest it has seen, preventing a stale lock holder that woke up after a pause from corrupting shared state.
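The fencing check lives at the protected resource, not the lock service. A minimal sketch (the `Resource` class is an invented stand-in) shows why it works: a holder that was paused and lost its lease wakes up carrying an old token, and the resource simply refuses it.

```python
# Sketch of fencing-token enforcement: the lock service hands out a
# monotonically increasing token with each grant; the protected resource
# rejects anything older than the newest token it has seen.

class Resource:
    def __init__(self):
        self.highest_token = -1
        self.writes = []

    def write(self, token, value):
        if token < self.highest_token:
            return False           # stale lock holder: reject
        self.highest_token = token
        self.writes.append(value)
        return True

r = Resource()
ok_old = r.write(33, "from holder 33")     # first grant, accepted
ok_new = r.write(34, "from holder 34")     # newer grant, accepted
ok_stale = r.write(33, "late write")       # old holder woke up late, rejected
```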
What Coordination Services Do Not Do
A coordination service stores small amounts of metadata. It is not a database, not a message queue, and not suitable for data-plane operations or high-throughput writes. A useful rule of thumb: if the data is on the critical path of every client request, it does not belong in a coordination service.
More fundamentally, electing a leader through a coordination service does not guarantee that the application as a whole behaves correctly. Coordination serializes decisions; application correctness is still the developer’s responsibility.
Network-Attached Storage
Access Transparency and VFS
The goal of networked file systems is access transparency: applications use standard file system calls (open, read, write, close) against remote files without any awareness that the storage is remote. This is achieved through the Virtual File System (VFS) layer, adopted by every major Unix-derived OS. VFS defines a standard interface that any file system driver must implement. The kernel always talks to this interface; whether the driver beneath it issues disk commands or sends network requests is invisible to applications.
Mount points attach different file systems into a single directory tree. A remote file system client is a VFS driver that translates standard file operations into network requests. When the response arrives, the result passes back through the VFS interface to the application.
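The VFS idea can be sketched as a single abstract interface with interchangeable drivers (the class names here are invented Python stand-ins for what are really kernel C structures and function pointers): the caller issues the same operation regardless of which driver sits underneath.

```python
# Sketch of the VFS abstraction: the kernel codes against one interface,
# and whether a driver reads a local disk or sends a network request is
# invisible at the call site.

from abc import ABC, abstractmethod

class VfsDriver(ABC):
    @abstractmethod
    def read(self, path: str) -> bytes: ...

class LocalFs(VfsDriver):
    def __init__(self, blocks):
        self.blocks = blocks       # pretend on-disk blocks

    def read(self, path):
        return self.blocks[path]

class RemoteFs(VfsDriver):
    def __init__(self, server):
        self.server = server       # pretend RPC endpoint

    def read(self, path):
        return self.server[path]   # would be a network request in reality

def cat(fs: VfsDriver, path: str) -> bytes:
    return fs.read(path)           # identical call for either driver
```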
Design Dimensions
Every networked file system must navigate three fundamental tradeoffs:
Consistency. Multiple clients may cache the same file. Keeping those caches consistent requires either frequent polling against the server or a protocol where the server pushes invalidations to clients.
State. A stateless server holds no information about client activity between requests. Every request is self-contained. Crash recovery is trivial because there is nothing to recover. But statelessness makes locks, open file tracking, and cache invalidation impossible. A stateful server enables richer semantics at the cost of recovery complexity: after a crash, open files, locks, and cached state must be rebuilt or cleaned up.
Caching. Options range from write-through (immediate server update), to write-behind (delayed batch), to write-on-close / session semantics (send changes only on close). Callbacks, where the server tracks which clients have cached a file and pushes invalidations on modification, require statefulness but eliminate polling.
NFS
NFS was designed to be simple, stateless, and interoperable across any networked system. It was built on openly published RPC and data encoding standards and was ported to many operating systems in both client and server roles.
Because the server is stateless, NFSv2 has no OPEN, CLOSE, LOCK, SEEK, or APPEND procedures. Clients identify files by file handles: opaque server-generated identifiers that persist across server restarts. The server uses UDP as the transport because the stateless, idempotent design makes retries safe.
Key limitations of stateless NFS:
- No native locking. File locking was added through a separate Network Lock Manager (NLM) service, an awkward retrofit whose state was not part of NFS’s crash recovery.
- No safe append. There is no atomic append operation; a client must read the file size then write at that offset, which is a race condition when multiple writers are active.
- No open file reference tracking. A file can be deleted on the server while a client still has it open. NFS clients work around this with “silly renames” before sending a REMOVE.
- Weak security. Original NFS trusted the user ID sent by the client without verification.
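The append race above can be made concrete with a short sketch (the `racy_append` helper is invented for illustration): two clients each read the file size, both observe the same offset, and the second write overwrites the first instead of following it.

```python
# Sketch of the NFS append race: with no atomic append, each client reads
# the file size and writes at that offset; two concurrent appenders can
# observe the same size, and the later write clobbers the earlier one.

contents = bytearray()

def racy_append(data, size_seen):
    # write at the offset observed earlier (possibly stale by now)
    contents[size_seen:size_seen + len(data)] = data

size_a = len(contents)             # both clients observe size 0...
size_b = len(contents)
racy_append(b"AAAA", size_a)
racy_append(b"BBBB", size_b)       # ...so B's write lands on top of A's
```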
NFS caches data in blocks and validates cached data using timestamp comparison: the client checks the file’s modification time on the server when a file is opened, and after a short validity timeout. This gives close-to-open consistency: stale reads are possible between opens.
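The revalidation logic can be sketched as follows (the `Server` and `Client` classes are in-memory stand-ins, not the NFS wire protocol): the client keeps the mtime it saw along with the cached data, and refetches only when the server's mtime has moved.

```python
# Sketch of NFS-style timestamp validation: cache data plus the mtime seen
# at fetch time, revalidate on open, and refetch only when the server's
# modification time has changed.

class Server:
    def __init__(self):
        self.data, self.mtime = b"v1", 100

class Client:
    def __init__(self, server):
        self.server = server
        self.cached = None
        self.cached_mtime = None
        self.fetches = 0

    def open_and_read(self):
        if self.cached is None or self.cached_mtime != self.server.mtime:
            self.cached = self.server.data        # (re)fetch from the server
            self.cached_mtime = self.server.mtime
            self.fetches += 1
        return self.cached

s = Server()
c = Client(s)
first = c.open_and_read()          # cold cache: fetch
second = c.open_and_read()         # mtime unchanged: served from cache
s.data, s.mtime = b"v2", 101       # another client wrote the file
third = c.open_and_read()          # mtime moved: refetch
```

Between the write on the server and the client's next open, the client would happily serve `b"v1"` from cache; that window is exactly why this is close-to-open rather than POSIX consistency.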
AFS
AFS was designed to fix the scalability problem of NFS. Workload measurements showed that most file accesses are reads, files are usually accessed by one user at a time, and most files are small enough to cache entirely. This motivated the upload/download model and whole-file caching: when a file is opened, the entire file is downloaded to the client’s local disk. Reads and writes operate on the local copy. On close, if modified, the file is uploaded back. This gives session semantics: changes are visible to other clients only after the file is closed.
The mechanism that makes aggressive caching safe is the callback promise: when the server delivers a file to a client, it promises to notify the client if the file is modified. When a client uploads a modified file, the server sends callback revocations to all other clients that hold the file. Those clients invalidate their cached copies. Because files are read far more than they are written, most accesses proceed from the local cache with no server interaction at all.
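The callback mechanism can be sketched with in-memory stand-ins (the `AfsServer` and `AfsClient` classes and their method names are invented for illustration, not the AFS protocol): the server remembers who holds a cached copy and revokes those callbacks when a new version is stored.

```python
# Sketch of the AFS callback promise: the server tracks which clients cache
# a file and, when a modified copy is uploaded, sends callback revocations
# so the other clients invalidate their local copies.

class AfsServer:
    def __init__(self):
        self.file = b"v1"
        self.callbacks = set()     # clients holding a valid cached copy

    def fetch(self, client):
        self.callbacks.add(client) # callback promise issued with the file
        return self.file

    def store(self, writer, data):
        self.file = data
        for client in self.callbacks - {writer}:
            client.invalidate()    # callback revocation
        self.callbacks = {writer}

class AfsClient:
    def __init__(self, server):
        self.server = server
        self.cache = None

    def open(self):
        if self.cache is None:
            self.cache = self.server.fetch(self)  # whole-file download
        return self.cache

    def invalidate(self):
        self.cache = None

srv = AfsServer()
reader, writer = AfsClient(srv), AfsClient(srv)
before = reader.open()             # cached under a callback promise
srv.store(writer, b"v2")           # writer uploads a modified copy on close
after = reader.open()              # reader's cache was revoked: refetch
```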
AFS enforces a uniform global namespace: all AFS content appears under /afs on every client machine, with the cell name (e.g., cs.rutgers.edu) as the second path component. The same path resolves to the same file regardless of which client machine the user is on. NFS has no such guarantee; administrators mount remote directories at arbitrary local paths. File system content is organized into volumes that administrators can move between servers transparently via referrals.
Coda
Coda extended AFS to support laptops and mobile workstations that might lose network connectivity. Key concepts:
- Volumes can be replicated across a Volume Storage Group (VSG). The subset of VSG servers reachable at any moment is the Accessible Volume Storage Group (AVSG).
- When no server is reachable, the client enters disconnected operation mode and works entirely from its local disk cache.
- Modifications during disconnection are recorded in a client modification log (CML). The CML logs file system operations (store, create, remove, rename) rather than file contents; the actual modified data stays in the local disk cache. On reconnection, the CML is replayed in order. If a modified file was evicted from the cache before reconnection, that data is lost.
- On reconnection, the CML is replayed. If the same file was modified by both the disconnected client and another client during the outage, a conflict is detected and flagged for manual resolution. There is no automatic merge.
- Hoarding allows users to pre-populate the cache with specific files before going offline, ensuring those files are available during disconnection.
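The replay-and-detect-conflicts step can be sketched with per-file version numbers (the `replay` function and the version-snapshot representation are invented simplifications of Coda's actual reintegration machinery): an operation replays cleanly only if the server's version still matches what the client saw at disconnection.

```python
# Sketch of Coda CML reintegration: replay logged operations in order; if a
# file's server version changed during the outage, flag a conflict for
# manual resolution instead of merging automatically.

def replay(cml, server_versions, disconnect_versions):
    conflicts = []
    for op, path, data in cml:
        if server_versions.get(path) != disconnect_versions.get(path):
            conflicts.append(path)         # changed on the server too: flag it
        elif op == "store":
            # data would come from the local disk cache; here we just
            # bump the server's version to model a successful replay
            server_versions[path] = server_versions.get(path, 0) + 1
    return conflicts

# versions of each file as seen when the client disconnected
snap = {"/doc": 3, "/notes": 7}
# server state at reconnection: /notes was updated by someone else meanwhile
server = {"/doc": 3, "/notes": 8}

cml = [("store", "/doc", b"local edit"), ("store", "/notes", b"local edit")]
conflicts = replay(cml, server, snap)
```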
AFS and Coda are no longer widely deployed. AFS survives at some universities and research institutions, but its operational complexity and aging authentication model have made it difficult to justify in new deployments. Coda remained a research prototype.
SMB
Microsoft’s Server Message Block protocol was designed with the opposite philosophy from NFS: stateful, connection-oriented, and built to enforce Windows file-sharing semantics. SMB tracks every open file, every lock, and every byte range under lock at the server. This enabled mandatory locking, byte-range locks, and the semantics Windows applications expected. The cost was that server crashes lost all session state.
Opportunistic locks (oplocks) give the server a way to grant clients caching rights. The server monitors file access and sends an oplock break to the caching client when a conflict arises, requiring it to flush writes before the server allows the competing open. This is the same idea as AFS callbacks, applied at finer granularity. Later versions of Windows generalized oplocks into leases with cleaner semantics that can also cover directory metadata.
SMB 2 dramatically modernized the protocol with several performance improvements:
- Pipelining: clients can send multiple requests before receiving responses, removing the need to wait for one reply before issuing the next.
- Compounding: multiple related operations can be packed into a single network message, reducing round trips.
- Durable handles: open file handles survive brief network disconnections, so clients can reconnect without re-establishing every open file and lock.
SMB 3 added high-availability and datacenter features: Transparent Failover lets a client survive the failure of one node in a clustered file server without losing open files or locks, and SMB Multichannel allows a session to use multiple network interfaces simultaneously for throughput and redundancy. SMB 3 also added protocol-level encryption.
macOS adopted SMB 2 as its default file sharing protocol (replacing AFP). macOS also supports NFS for Unix-oriented environments, but SMB is the default.
NFSv4
NFSv4 abandoned statelessness. Clients now open and close files explicitly, and the server tracks state. Key improvements over NFSv2/v3:
- Delegations: the server grants a client exclusive caching rights and recalls them when a conflict arises – the NFS equivalent of oplocks.
- Compound RPC: multiple operations can be packed into a single request, reducing the round trips needed for sequences like path lookups.
- Referrals: servers can redirect clients to alternative servers, enabling transparent file system migration.
- Mandatory TCP: UDP is no longer used.
- Strong authentication: Kerberos support is required, closing the trust-the-client-uid security hole.
The Convergence
The key mechanisms that modern NFS and SMB both now provide, starting from very different origins:
| Mechanism | NFS v2/v3 | NFSv4 | SMB 1 | SMB 2+ |
|---|---|---|---|---|
| Stateful server | No | Yes | Yes | Yes |
| Compound/pipelined requests | No | Yes | No | Yes |
| Client caching grants | No | Yes (delegations) | Yes (oplocks) | Yes (oplocks + leases) |
| Server-to-client notification | No | Yes | Yes | Yes |
| Referrals | No | Yes | Yes (via DFS) | Yes (via DFS) |
| Strong authentication | Optional | Mandatory | NTLM/Kerberos | Kerberos/NTLMv2 |
| Transport | UDP or TCP | TCP only | TCP | TCP |
NFS is dominant in Linux, Unix, and HPC environments. SMB is dominant in Windows enterprise environments and is the default on macOS.
Microsoft’s referral support comes via DFS (Distributed File System), a separate namespace service that has worked alongside SMB since the late 1990s. DFS maps logical paths to physical server locations and issues referrals when clients access those paths. It is not specific to SMB 2 or later; it predates SMB 2 and works across SMB versions.
Consistency Semantics Summary
- POSIX (local): Reads always reflect the most recent write. All processes share a single coherent cache.
- NFS (close-to-open): Freshness is checked on open. Stale reads are possible between opens.
- AFS/Coda (session semantics): Changes become visible to other clients only after the file is closed. Last writer wins on conflict.
- NFSv4/SMB with delegations or oplocks: The server manages caching rights and revokes them on conflict. Approaches POSIX consistency in the common case; still not identical.
What You Do Not Need to Memorize
- Specific years (when protocols were released, when papers were published)
- The names of researchers or paper authors
- That AFS grew out of a specific named project or a particular university-industry collaboration
- The specific oplock types in SMB (Level 1, Level 2, Batch, Filter) and their exact caching permissions
- The detailed API differences between Chubby, ZooKeeper, and etcd
- The NFSv2 procedure numbers
- The internal details of Zab or how it differs from Raft
- The details of specific authentication protocols (AUTH_UNIX, RPCSEC_GSS, Kerberos integration)
- The specific non-Unix operating systems to which NFS was ported