pk.org: CS 417/Lecture Notes

Coordination Services

Distributed Lock Management and Configuration

Paul Krzyzanowski – 2026-03-02

Previously, we looked at distributed mutual exclusion: how processes acquire and release locks on shared resources in a system without shared memory. The centralized algorithm, where one coordinator process grants and revokes locks, is simple and efficient. It requires only three messages per use of a resource (request, grant, release), and the coordinator has a complete view of who holds what. The problem is that the coordinator is a single point of failure. If it crashes, the entire system stalls waiting for locks that will never be granted.

The obvious solution is to replicate the coordinator. But now you have a new problem: if multiple coordinator replicas are running, they must all agree on who holds which locks. That agreement problem is consensus, which we covered last week.

A coordination service is a replicated coordinator made fault-tolerant through consensus. It is a small, highly available, strongly consistent store that distributed applications use to share information and coordinate operations. The use cases are narrow but critical: who is the current leader? What is the latest configuration? Is this lock currently held? Which servers are alive right now?

Getting the wrong answer to any of these questions can be catastrophic. If two nodes both think they are the leader, they will independently accept writes and the system state will diverge. If a node reads a stale configuration, it may route requests to a server that no longer exists. Strong consistency is not optional here, and strong consistency across failures requires consensus.

The coordination services we will study in this lecture, Chubby, ZooKeeper, and etcd, all use consensus under the hood for exactly this reason.


Google Chubby

Chubby was designed at Google and first described publicly in a 2006 paper by Mike Burrows. The Google File System needed a way to elect a primary master, Bigtable needed a way to coordinate the assignment of tablets (pieces of a huge table), and MapReduce needed a way to elect a master job tracker. Each of these systems could have implemented its own ad-hoc coordination mechanism, but that would be fragile and hard to reason about. Chubby was built to give all of them a shared, well-engineered foundation.

The design goal was a highly available and persistent lock service and configuration store for large-scale distributed systems. Chubby was expected to be a dependency of nearly every major system at Google, which meant its failure would cascade broadly. High availability was therefore the top priority.

Architecture

A Chubby deployment is called a Chubby cell. By default, a cell consists of five servers called replicas. One replica is elected the master and serves all client requests. The other four are for fault tolerance: they participate in consensus to keep the replicated log consistent, but they do not serve the authoritative read/write workload. If a client contacts a non-master replica, that replica replies with the identity of the current master.

Paxos is the consensus algorithm used to replicate state across the five servers and to elect a new master when the current one fails. A majority of replicas (three out of five) must be alive for the cell to function. The ability to tolerate two simultaneous failures is a practical choice: in a large data center environment, two concurrent failures are not unusual, but three simultaneous failures are rare.

Chubby typically deploys one cell per data center. Clients within that data center contact their local cell. Every few hours, the entire cell database is also backed up to GFS (Google File System) to protect against catastrophic loss of all replicas in the cell.

The File System Interface

Despite being described as a lock service, Chubby exposes a file system interface. Everything in Chubby is a named node in a hierarchical namespace of files and directories.

A lock is just a file, and its name is its hierarchical path. Any node may contain data, may have children (making it act like a directory that also holds data), and may have a lock associated with it.

Using a file namespace avoided the need to build a separate naming scheme on top of the lock service and provided applications with a convenient place to store small amounts of associated data, such as the address of the current master.

The interface is not a standard POSIX file system. There is no kernel module; client software talks to the Chubby master via Google’s internal RPC system (Stubby, which later inspired the design of gRPC). File operations are intentionally limited. Files can only be read or written in their entirety: there are no byte-range reads or writes, and no seek operation.

When a client opens a file, it downloads the current contents and establishes a lease for that file. The server tracks which clients have cached copies and uses a write-through model: when a client writes a file, it sends the update to the master, which then sends cache invalidations to all other clients that have a cached copy. Combined with lease validity, this ensures that a client’s cached data is never stale as long as its lease is current: either the server has told it the data changed, or the lease has expired and the client must revalidate.
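
The write-through invalidation described above can be sketched as a toy simulation. This is an illustration of the caching model, not Chubby's real protocol or API: the server remembers which clients hold cached copies and invalidates every other copy as part of processing a write.

```python
# Toy sketch of write-through caching with server-driven invalidation.
# The path name and classes here are illustrative, not Chubby's actual API.

class Server:
    def __init__(self):
        self.files = {}      # path -> contents
        self.cachers = {}    # path -> set of clients holding a cached copy

    def read(self, client, path):
        self.cachers.setdefault(path, set()).add(client)
        return self.files.get(path)

    def write(self, writer, path, data):
        # Invalidate every other client's cached copy before acknowledging.
        for client in self.cachers.get(path, set()):
            if client is not writer:
                client.invalidate(path)
        self.cachers[path] = {writer}
        self.files[path] = data

class Client:
    def __init__(self, server):
        self.server = server
        self.cache = {}

    def invalidate(self, path):
        self.cache.pop(path, None)

    def read(self, path):
        if path not in self.cache:              # cache miss: fetch and cache
            self.cache[path] = self.server.read(self, path)
        return self.cache[path]

    def write(self, path, data):
        self.server.write(self, path, data)
        self.cache[path] = data

server = Server()
a, b = Client(server), Client(server)
a.write("/ls/cell/master", "10.0.0.1")
b.read("/ls/cell/master")                       # b now caches the file
a.write("/ls/cell/master", "10.0.0.2")          # server invalidates b's copy
assert b.read("/ls/cell/master") == "10.0.0.2"  # b re-fetches fresh data
```

Because the server invalidates before acknowledging the write, no client can keep serving the old value from its cache while believing it is current.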

Locks

Locks in Chubby are advisory, not mandatory. A process can hold a lock and other processes can still access the underlying data, but well-behaved processes check for lock ownership before proceeding. Locks can be held in two modes: exclusive (one writer) and shared (multiple readers).

Chubby is designed for coarse-grained locking. A coarse-grained lock controls a large resource, such as an entire Bigtable table or a GFS master, and may be held for hours or days. This is very different from a fine-grained lock that might be held for milliseconds to protect a single row in a database. The distinction matters architecturally: a service optimized for coarse-grained locks can serve many more clients because lock operations are infrequent relative to the work being protected.

Events and Watches

Clients can subscribe to events for any open file or directory. Event types include: a file’s contents were modified, a new file or subdirectory was created, and a lock was acquired. This lets services avoid polling. Instead of checking every few seconds whether they are still the leader, a client can simply wait for a callback from Chubby telling it that the lock state has changed.

Leases

Chubby uses leases to manage the relationship between the master and clients. When a client acquires a lock or opens a file, it receives a time-bounded lease. The client must renew this lease periodically. If the client fails to renew before the lease expires, the server considers the client dead and revokes the lease.

Leases create a problem when the master itself fails. A new master is elected via Paxos and has access to the replicated state from the previous master, so it knows what sessions and locks existed. It goes through a recovery protocol: it broadcasts a new master epoch to clients, gives them a grace period to reconnect and re-establish their sessions, and only after that grace period begins serving new requests. Clients that fail to reconnect within the grace period have their sessions and locks released. This ensures a clean handoff without ambiguity about which lock grants are still valid.

Chubby as a Building Block

The most common pattern for using Chubby is leader election. If a group of processes wants to elect a leader, each process opens the same Chubby file and attempts to acquire an exclusive lock on it. Exactly one will succeed, and that process becomes the leader. The leader can store its address in the file so that other processes know how to contact it. When the leader fails, its lease expires, and the lock is released. Other processes can then compete for it again.
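
The leader-election pattern above can be sketched against a toy in-memory lock service. The class, method names, and lock path below are illustrative stand-ins, not Chubby's real interface; the point is the shape of the protocol: everyone tries the same well-known lock, exactly one acquisition succeeds, and the winner publishes its address in the file.

```python
# Toy sketch of Chubby-style leader election via an exclusive lock on a file.

class LockService:
    def __init__(self):
        self.holder = {}     # lock path -> current holder
        self.data = {}       # lock path -> small payload (e.g., leader address)

    def try_acquire(self, path, who):
        if path not in self.holder:
            self.holder[path] = who
            return True
        return False

    def release(self, path, who):    # e.g., lease expiry after the holder crashes
        if self.holder.get(path) == who:
            del self.holder[path]

    def set_data(self, path, value):
        self.data[path] = value

    def get_data(self, path):
        return self.data.get(path)

def run_for_leader(svc, name, address):
    """Every replica contends for the same well-known lock file."""
    if svc.try_acquire("/ls/cell/service/leader", name):
        svc.set_data("/ls/cell/service/leader", address)  # publish my address
        return True
    return False

svc = LockService()
wins = [run_for_leader(svc, n, f"10.0.0.{i}")
        for i, n in enumerate(["a", "b", "c"])]
assert wins.count(True) == 1        # exactly one replica became leader
assert svc.get_data("/ls/cell/service/leader") == "10.0.0.0"
```

When the holder's lease expires, `release` runs and the remaining replicas contend again, exactly as in the failover story above.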

Because Chubby cells are small and serve thousands of clients, all data is stored in memory at the master. For durability, all writes are committed to disk and replicated across the replicas in the cell via Paxos before acknowledging success.


Apache ZooKeeper

ZooKeeper was developed at Yahoo! and contributed to the Apache Software Foundation as an open-source project in 2008. The Chubby paper had been published in 2006 and had significant influence, but Chubby was Google’s internal system and not available to anyone outside Google. Yahoo! was running large-scale systems of its own, including Hadoop and HBase, that needed the same kind of fault-tolerant coordination. ZooKeeper was built as an open-source coordination kernel to fill that role.

ZooKeeper does not simply clone Chubby’s interface. The most important architectural difference is that ZooKeeper does not provide locks as a primitive. Instead, it provides a minimal set of primitives from which locks, leader election, barriers, and other coordination patterns can be built. The philosophy is that a coordination kernel should be as small and general as possible, and that providing locks directly would force design choices onto applications that might not need them.

Data Model

ZooKeeper organizes data as a hierarchical tree of nodes called znodes. Each znode has a path (like a file system path), can hold a small amount of data (a few kilobytes), and can have children. There are two types of znodes that matter most for coordination:

Persistent znodes survive client disconnections. They remain until explicitly deleted.

Ephemeral znodes are automatically deleted when the client session that created them ends. This is the key mechanism for detecting failures: if a process creates an ephemeral znode to signal its presence and then crashes, the znode disappears. Other processes watching that znode are notified.

Either type of znode may additionally be created as sequential. When a client creates a sequential znode, ZooKeeper automatically appends a monotonically increasing integer to the name. Sequential creation is not a third node type but a creation-mode flag (PERSISTENT_SEQUENTIAL or EPHEMERAL_SEQUENTIAL), and it is essential for implementing distributed locks without thundering-herd problems.

Watches

A ZooKeeper watch is a one-shot notification mechanism. A client sets a watch when it asks ZooKeeper about a znode. Later, if the relevant state changes, ZooKeeper sends the client an event. After the watch triggers, it is removed and must be set again if the client wants continued monitoring.

Watches are set as a side effect of read operations: a client passes a watch flag when it calls exists, getData, or getChildren, and the watch fires when the corresponding state (the znode's existence, its data, or its set of children) changes.

Watches are intentionally one-shot. ZooKeeper avoids keeping long-lived subscriptions and pushes complexity to the client: when a watch fires, the client typically re-reads the znode state and re-registers the watch. This pattern helps keep the client’s view consistent even if multiple changes occur quickly or while the client is temporarily disconnected, because the client treats the event as “something changed” and then refreshes state from ZooKeeper.
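
The read-then-re-register loop can be sketched with a toy store that mimics one-shot watch semantics. The classes below are stand-ins for ZooKeeper, not its real client API; the key idea is that the watcher treats every notification as "something changed," re-reads the latest value, and re-registers in the same step.

```python
# Toy sketch of the one-shot watch pattern: read, register a watch,
# and on each notification re-read AND re-register.

class Store:
    def __init__(self):
        self.data = {}
        self.watches = {}    # path -> list of one-shot callbacks

    def get(self, path, watch=None):
        if watch is not None:
            self.watches.setdefault(path, []).append(watch)
        return self.data.get(path)

    def set(self, path, value):
        self.data[path] = value
        for cb in self.watches.pop(path, []):   # each watch fires exactly once
            cb(path)

class ConfigWatcher:
    def __init__(self, store, path):
        self.store, self.path = store, path
        self.value = None
        self._refresh()

    def _refresh(self):
        # Re-read and re-register in one step; the event only says that
        # something changed, so we always fetch the freshest value.
        self.value = self.store.get(self.path, watch=lambda _: self._refresh())

store = Store()
store.set("/config/timeout", "30")
w = ConfigWatcher(store, "/config/timeout")
assert w.value == "30"
store.set("/config/timeout", "45")   # watch fires once; watcher refreshes
assert w.value == "45"
store.set("/config/timeout", "60")   # still works: the watch was re-registered
assert w.value == "60"
```

Note that if two updates happened between the notification and the re-read, the watcher would simply skip the intermediate value, which is exactly the "refresh, don't replay" behavior described above.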

Consistency Model

ZooKeeper uses a variant of consensus called Zab (ZooKeeper Atomic Broadcast). Like Raft, Zab elects a leader and replicates writes through the leader. All writes go through the leader and are applied in order across all replicas. This gives ZooKeeper linearizable writes: every write completes in a globally consistent order.

Reads are different. By default, a client can read from any ZooKeeper replica, not just the leader. This means a read might return slightly stale data if the replica has not yet applied the latest writes. Reads are sequentially consistent: each client sees writes in order, but a follower may not yet have applied the latest writes from the leader.

Clients that need fresher data can issue a sync operation, which forces the server handling that client session to catch up to the leader’s committed state (as of when the sync is processed) before the read proceeds.

This is a practical tradeoff. Most coordination reads (checking a configuration value, watching for leader changes) can tolerate brief staleness. The rarer cases that need strict freshness pay the extra cost of a sync.

Building Locks with ZooKeeper

Since ZooKeeper provides no lock primitive, let’s see how you build one. The standard recipe is:

  1. To acquire the lock, create a sequential ephemeral znode under a lock directory, e.g., /locks/my-lock/lock-0000000042.

  2. List all children of /locks/my-lock. If your znode has the lowest sequence number, you hold the lock.

  3. If not, watch the znode with the next-lowest sequence number below yours. When it is deleted (because that client released the lock or crashed), re-evaluate.

The sequential znode ensures that locks are granted in arrival order. The watch on the predecessor rather than on all children prevents the thundering herd problem: when a lock is released, only one waiter is notified rather than all of them.
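
The three-step recipe can be sketched as a single-process simulation. This imitates the logic (sequential names, lowest-sequence-wins, a deletion watch on the predecessor only); a real implementation would use ZooKeeper's client API and ephemeral nodes so a crashed holder's node disappears automatically.

```python
# Toy sketch of the ZooKeeper lock recipe.

class LockDir:
    def __init__(self):
        self.seq = 0
        self.children = {}    # znode name -> owner
        self.watches = {}     # znode name -> callbacks fired on deletion

    def create_sequential(self, owner):
        # The service appends a monotonically increasing sequence number.
        name = f"lock-{self.seq:010d}"
        self.seq += 1
        self.children[name] = owner
        return name

    def delete(self, name):    # release, or session death for an ephemeral node
        del self.children[name]
        for cb in self.watches.pop(name, []):
            cb()

def try_lock(dir_, my_node, on_granted):
    names = sorted(dir_.children)
    if names[0] == my_node:                    # lowest sequence number wins
        on_granted()
        return
    predecessor = names[names.index(my_node) - 1]
    # Watch only the predecessor: when it is deleted, re-evaluate.
    dir_.watches.setdefault(predecessor, []).append(
        lambda: try_lock(dir_, my_node, on_granted))

dir_ = LockDir()
granted = []
for client in ["a", "b", "c"]:
    node = dir_.create_sequential(client)
    try_lock(dir_, node, lambda c=client: granted.append(c))

assert granted == ["a"]          # only the first requester holds the lock
dir_.delete("lock-0000000000")   # holder releases (or its session dies)
assert granted == ["a", "b"]     # exactly one waiter was woken, in order
```

Deleting a node wakes only the single client watching it, which is the thundering-herd avoidance the recipe is designed for.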

ZooKeeper and Chubby Compared

The key conceptual difference is design philosophy. Chubby is higher-level: it gives you locks, events, and a file store, all integrated. ZooKeeper is a toolkit: it gives you the minimal primitives needed to build those things yourself.

Both use consensus to replicate state. Both use leases. Both support watches and events. In practice, the coordination patterns you build on both systems look very similar. ZooKeeper’s main advantages over Chubby were that it was open-source, available outside Google, and designed from the start to be used as a general coordination primitive rather than a lock service with a configuration store attached.


etcd

etcd was created in 2013 by CoreOS [1] as part of the infrastructure for their container-centric operating system. The immediate need was storing cluster configuration for CoreOS machines. ZooKeeper was available, but it required a JVM, had a complex operational model, and its API dated from an era before RESTful services were ubiquitous. CoreOS wanted something simpler to deploy and operate, with an HTTP/JSON API that any language could talk to without a special client library.

etcd quickly became the authoritative store for Kubernetes cluster state. Every Kubernetes object, including pods, services, secrets, and configuration maps, is stored in etcd. If etcd fails, the Kubernetes control plane cannot function.

Architecture and Consistency

etcd uses the Raft consensus algorithm, which we covered in detail last week. Raft’s log-based replication maps cleanly onto etcd’s key-value model, and Raft’s emphasis on understandability made it easier to reason about correctness during development.

etcd provides linearizable reads by default. Unlike ZooKeeper, which serves reads from any replica by default, etcd routes all reads through the leader (or performs a quorum read) to guarantee linearizability without requiring a separate sync call. You can opt into stale reads from followers for better performance if your use case permits it.

A linearizable read is a read that is guaranteed to reflect the effect of the most recent completed write in the system, as if all operations occurred in a single real-time order. In other words, once a client sees a write succeed, any later read (by any client) must return that write or something newer, not an older value from a lagging replica.

Data Model

etcd stores a flat key-value map rather than an explicit directory tree. This differs from ZooKeeper and Chubby, where the namespace is hierarchical: in ZooKeeper, a parent znode must exist before a child can be created, and clients can list a node’s children; in Chubby, paths are explicitly modeled as files and directories.

In etcd, a key is an arbitrary byte string. Applications often choose path-like key names such as /config/... or /services/..., but that naming convention is not enforced by etcd: there is no parent object to create, and there are no “child” objects in the data model. Instead, etcd provides two building blocks that let you treat a prefix as if it were a directory: range queries over a key interval (typically all keys with a given prefix) and watches over that same range.

etcd also provides a watch API that is more capable than ZooKeeper’s. A watch in etcd can monitor a key or an entire key prefix, and it delivers a stream of change events rather than a single one-shot notification. This is more convenient for long-running watchers.

Leases

etcd supports leases with a mechanism very similar to ZooKeeper’s ephemeral znodes. A client creates a lease with a time-to-live (TTL), then associates keys with that lease. If the client stops renewing the lease (via heartbeats), all keys associated with the lease are automatically deleted. Services use this for presence detection: a healthy server maintains a lease and stores its address in etcd under that lease. If the server crashes, the lease expires and the key vanishes, and watchers are notified.
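
The lease mechanism can be sketched with a toy store driven by a logical clock instead of real time. The class and method names loosely mirror the concepts (grant, keepalive), but this is an illustration of the behavior, not etcd's real API: a key attached to a lease vanishes once the lease's TTL lapses without renewal.

```python
# Toy sketch of lease-based presence detection, using a logical clock.

class LeaseStore:
    def __init__(self):
        self.now = 0
        self.leases = {}    # lease id -> expiry time
        self.keys = {}      # key -> (value, lease id or None)

    def grant(self, lease_id, ttl):
        self.leases[lease_id] = self.now + ttl

    def keepalive(self, lease_id, ttl):     # heartbeat: push the expiry forward
        self.leases[lease_id] = self.now + ttl

    def put(self, key, value, lease_id=None):
        self.keys[key] = (value, lease_id)

    def tick(self, dt=1):
        self.now += dt
        expired = {l for l, t in self.leases.items() if t <= self.now}
        # Expiring a lease deletes every key attached to it.
        self.keys = {k: v for k, v in self.keys.items() if v[1] not in expired}
        for l in expired:
            del self.leases[l]

    def get(self, key):
        entry = self.keys.get(key)
        return entry[0] if entry else None

store = LeaseStore()
store.grant("server-7", ttl=5)
store.put("/services/payments/instance-7", "10.1.2.3:443", lease_id="server-7")
store.tick(3)
store.keepalive("server-7", ttl=5)   # healthy server keeps renewing
store.tick(4)
assert store.get("/services/payments/instance-7") == "10.1.2.3:443"
store.tick(5)                        # server crashed: no more renewals
assert store.get("/services/payments/instance-7") is None
```

Watchers on the key (omitted here) would be notified of the deletion, which is how the rest of the system learns the server is gone.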

Transactions

etcd supports multi-key transactions with a compare-and-swap structure. A transaction specifies a set of conditions (e.g., the version of a key is what I expect), a set of operations to apply if the conditions hold, and a fallback set of operations if they do not. This is used to implement distributed locks and leader election without race conditions.
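
The if/then/else transaction shape can be sketched over a toy versioned key-value map. The structure mirrors the description above (compare conditions, then-branch operations, else-branch operations), while the API itself is illustrative rather than etcd's. A key that has never been written has version 0, so "acquire only if unheld" becomes "compare version against 0."

```python
# Toy sketch of a compare-and-swap transaction over versioned keys.

class KV:
    def __init__(self):
        self.store = {}    # key -> (value, version)

    def version(self, key):
        return self.store.get(key, (None, 0))[1]

    def put(self, key, value):
        self.store[key] = (value, self.version(key) + 1)

    def txn(self, compares, then_ops, else_ops):
        # compares: list of (key, expected version); all must hold.
        if all(self.version(k) == v for k, v in compares):
            ops, ok = then_ops, True
        else:
            ops, ok = else_ops, False
        for key, value in ops:    # the chosen branch is applied atomically
            self.put(key, value)
        return ok

kv = KV()
# Acquire a "lock": succeed only if the key has never been written (version 0).
assert kv.txn([("/lock/leader", 0)], [("/lock/leader", "node-a")], []) is True
# A competing acquire now sees version 1, takes the else branch: no overwrite.
assert kv.txn([("/lock/leader", 0)], [("/lock/leader", "node-b")], []) is False
assert kv.store["/lock/leader"][0] == "node-a"
```

Because the compare and the write happen in one atomic step inside the store, two contenders cannot both pass the check, which is exactly what removes the race condition from lock and election implementations.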

etcd and ZooKeeper Compared

etcd replaced ZooKeeper in most new infrastructure projects primarily for operational reasons, not correctness. Both are strongly consistent and use consensus internally. etcd uses Raft while ZooKeeper uses Zab, but both achieve equivalent safety and liveness guarantees.

The practical differences are about developer experience. etcd exposes a native HTTP/gRPC API that any language can talk to directly. ZooKeeper requires a dedicated client library and carries the operational overhead of the JVM. etcd’s watch API delivers a persistent stream of change events, while ZooKeeper’s watches are one-shot and must be re-registered after each notification. For teams building modern cloud infrastructure, etcd’s operational simplicity tipped the scales.


Common Coordination Patterns

Whether you use Chubby, ZooKeeper, or etcd, the coordination patterns built on top of them are the same. Here are the most important ones.

Leader Election

The scenario: you have N replicas of a service and exactly one must act as the primary at a time.

The shared idea across all these systems is that replicas contend for a well-known name in the coordination service. A replica becomes the leader only if it can acquire that name atomically, and its leadership remains valid only while it maintains a liveness condition (a session or a lease).

If the leader fails and its session or lease expires, the coordination service removes the leader’s claim, and the remaining replicas contend again. The exact mechanism varies: Chubby uses a lock in a file-system namespace, ZooKeeper often uses ephemeral sequential znodes with predecessor watches, and etcd uses a key created under a TTL lease via an atomic transaction.

Distributed Locks

A distributed lock grants one process at a time exclusive access to a shared resource. The coordination service provides the serialization point: acquiring the lock is a write that goes through consensus, so it is globally ordered. Locks built on ephemeral nodes or leases are self-cleaning: a crashed lock holder’s lease expires and the lock is released automatically.

Configuration Management

Services store their configuration as values in the coordination service. When configuration changes, the update goes through consensus and is applied consistently across all replicas of the service. Clients watch the configuration keys and are notified when values change. This replaces the old model of modifying config files on each server individually.

Service Discovery

A running service instance registers its address by writing to the coordination service under a known prefix (e.g., /services/payments/instance-7), typically using an ephemeral key with a lease. Clients discover available instances by listing that prefix. Because ephemeral keys are deleted when the server fails, the list in the coordination service is always an accurate view of what is currently alive.

Fencing Tokens

Fencing is a subtle but important software design pattern. Consider a leader that acquires a lock and then experiences a stop-the-world garbage collection (GC) pause for thirty seconds. During the pause, its lease expires, a new leader acquires the lock, and then the old leader wakes up and tries to write to a shared resource, thinking it still holds the lock.

The solution is a fencing token: a monotonically increasing number associated with each lock grant. Every time the lock is acquired (or re-acquired after a failure), the coordination service increments the token. The shared resource (e.g., a database, a storage server) is told to reject any request with a token lower than the highest it has seen. The old leader wakes up with a stale token and its writes are rejected. The new leader’s writes, with a higher token, succeed.
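
The fencing mechanism can be sketched in a few lines. Both classes below are toy stand-ins (a lock service handing out monotonically increasing tokens, and a storage server that rejects anything older than the highest token it has seen):

```python
# Toy sketch of fencing tokens: stale lock holders get rejected downstream.

class LockService:
    def __init__(self):
        self.token = 0

    def acquire(self):
        # Every grant, including re-grants after an expired lease,
        # gets a strictly higher token.
        self.token += 1
        return self.token

class StorageServer:
    def __init__(self):
        self.highest_seen = 0
        self.data = None

    def write(self, token, value):
        if token < self.highest_seen:
            return False               # stale holder: request is fenced off
        self.highest_seen = token
        self.data = value
        return True

locks, storage = LockService(), StorageServer()
old = locks.acquire()   # old leader gets token 1, then pauses (e.g., a long GC)
new = locks.acquire()   # its lease expires; new leader gets token 2
assert storage.write(new, "from new leader") is True
assert storage.write(old, "from old leader") is False   # token 1 < 2: rejected
assert storage.data == "from new leader"
```

The safety comes entirely from the resource enforcing monotonicity; the old leader never needs to know it lost the lock for its writes to be blocked.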

Raft uses a closely related idea internally. Each election increments a term (an epoch number), and servers reject requests from leaders with older terms. This prevents an old leader from continuing to act as the leader within the Raft cluster. A fencing token applies the same monotonic-number idea to resources outside the consensus group: the database or storage system rejects requests from an old leader, even if that leader still “thinks” it is in charge.

Fencing tokens are essential any time the leaseholder can be paused (by garbage collection, swap activity, or I/O delays) or partitioned from the network. A lock without a fencing mechanism provides only weak safety.


What Coordination Services Do Not Give You

It is worth being explicit about the limits of coordination services. They store small amounts of data and are not suitable for storing megabytes of application data.

They are built for small, coordination-oriented updates, not for heavy data ingestion. Coordination services are a good fit for writes such as “who is the leader,” “what configuration version is current,” or “which services are registered,” but they are a poor fit for logging, metrics, or any workload that involves a constant stream of large writes. For that reason, etcd guidance sizes a cluster based on expected request rate, database size, and latency goals, rather than giving one universal “writes per second” limit.

A good rule of thumb: if the data is on the critical path of every client request, it does not belong in a coordination service. Coordination services are not a replacement for a database or a message queue.

More fundamentally, a coordination service gives you a place to agree on who is in charge, but it does not guarantee that the system as a whole behaves correctly. A leader that is elected by ZooKeeper can still have bugs, can still crash mid-operation, and can still leave the application state in an inconsistent condition. The coordination service serializes leadership decisions; your application must handle the rest.


The Crucial Role of Consensus

A theme that should be clear by now: every coordination service we have discussed, Chubby, ZooKeeper, and etcd, uses consensus at its core. This is not an accident. The problems coordination services solve are exactly the problems that require consensus: multiple independent processes must agree on a single authoritative answer, even when some processes or network links fail.

Without consensus, you cannot safely elect a leader: two processes could both believe they won. Without consensus, you cannot implement a lock: two processes could both believe they hold it. Without consensus, you cannot safely store configuration: different clients could read different values.

The cost of consensus is latency and complexity. Every write must go through the leader and be replicated to a quorum before the leader acknowledges success. This is why coordination services are used sparingly, for metadata and control-plane decisions, rather than for data-plane operations that require high throughput.

When you need agreement across failures, you pay the cost of consensus. When you can tolerate some inconsistency, you use something cheaper. The systems we will study in coming weeks will keep revisiting this tradeoff.

References


  1. CoreOS was a company and a family of container-focused Linux projects. CoreOS created etcd and Container Linux, and was acquired by Red Hat in 2018; its OS lineage continues as Fedora CoreOS and Red Hat Enterprise Linux CoreOS (used by OpenShift).