Where We Left Off
The previous class covered several mechanisms for coordinating distributed processes: mutual exclusion, leader election, and virtual synchrony. Before discussing consensus, it is useful to understand what each of those gives you and, more importantly, what they do not.
The mutual exclusion algorithms we studied – centralized, token ring, Lamport’s algorithm, and Ricart-Agrawala – all solve a narrow problem: ensuring that at most one process enters a critical section at a time. The centralized approach is simple but makes the coordinator a single point of failure. The distributed approaches (Lamport’s and Ricart-Agrawala) eliminate that single point of failure but require O(N) messages per entry and assume that all processes are reachable and that the set of participants is stable. If a process crashes mid-protocol, these algorithms stall. They have no mechanism for making progress in the presence of failures.
The leader election algorithms – the Bully algorithm and the Chang-Roberts ring election – give you a way to designate a single coordinator, but they have their own limitations. The Bully algorithm assumes a synchronous model: it relies on timeouts to declare a process dead, which means it can elect the wrong leader if a slow process is mistaken for a crashed one. The ring algorithm assumes a stable logical ring topology and offers no protection if a node fails mid-election. Neither algorithm provides any guarantee about what happens if the network partitions: each partition may independently elect its own leader.
Virtual synchrony addresses group membership and message ordering. It gives you a clean model for delivering messages consistently across a changing group. But it is a communication abstraction, not a decision-making protocol. It tells you which messages were delivered in which view, but it does not tell you how the group should agree on a specific value or decision when members disagree. It also does not solve what happens when the network partitions: each side of a partition can form its own view and continue making decisions independently. If both sides keep operating, the system diverges, and virtual synchrony provides no mechanism to prevent that or to reconcile the two histories when the partition heals.
The gap these mechanisms share is this: none of them handle the situation where the network splits into two or more isolated groups, each of which can still make progress independently. That is the problem consensus is designed to solve.
Split-Brain and Quorum
When a network partition occurs, one group of machines loses contact with the other. If your coordination protocol allows either side to independently elect a leader, accept writes, or make decisions, you get split-brain: two separate parts of the system simultaneously believe they are authoritative, each accepting operations the other does not see. When communication is restored, you can end up with two incompatible histories that must be reconciled or, in the worst case, one of them must be discarded.
The standard solution is to require a quorum before any decision can be made.
A quorum is a decision threshold, usually chosen so that any two quorums overlap. The most common choice is a majority quorum: more than half of the nodes. This guarantees overlap even if partitions change over time. Today’s “majority” might be a different set of machines tomorrow. Because any two majorities overlap, any future quorum must include at least one node that participated in the earlier quorum. That shared node carries the earlier decision forward (from stable storage) and prevents the system from later committing a conflicting decision.
With majority quorums, it is impossible for two isolated groups to both form a quorum. If one side of a partition has a majority, the other side does not, and the side without a quorum cannot proceed. The minority side simply stalls until connectivity is restored.
For example, in a five-server cluster, a majority is any three servers. If the network splits into a group of three and a group of two, only the group of three can form a quorum. The group of two cannot elect a leader, cannot commit writes, and cannot make any binding decisions. The cluster as a whole remains consistent.
Requiring a quorum means that a cluster of n servers can tolerate at most ⌊(n−1)/2⌋ failures. A five-server cluster tolerates two failures; a three-server cluster tolerates one. If too many servers fail or are unreachable, the cluster stops making progress rather than risk inconsistency. This is a deliberate design choice: it is better to be unavailable than to be wrong.
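The quorum arithmetic is simple enough to state directly. A minimal sketch (the function names are mine, not from any library):

```python
def majority(n):
    """Smallest quorum size such that any two quorums must overlap."""
    return n // 2 + 1

def max_failures(n):
    """Failures a cluster of n servers tolerates: floor((n - 1) / 2)."""
    return (n - 1) // 2

# Five servers: quorum of 3, tolerates 2 failures.
assert majority(5) == 3 and max_failures(5) == 2
# Three servers: quorum of 2, tolerates 1 failure.
assert majority(3) == 2 and max_failures(3) == 1
# Any two majorities of the same cluster must share at least one node:
assert majority(5) + majority(5) > 5
```

Note that even cluster sizes buy nothing: `majority(4)` is 3, so a four-server cluster tolerates only one failure, the same as a three-server cluster.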
The Problem We’re Solving
Distributed systems exist to avoid single points of failure. But the moment you have multiple servers cooperating on a task, you face a new problem: how do they agree on anything? If two servers can both accept writes, which one wins? If the network partitions, how do you avoid an inconsistent state developing among the partitions? If a server crashes mid-operation, how do you prevent a system ending up in an unknown intermediate state?
These questions all reduce to a single fundamental problem: consensus. At its core, consensus is the problem of getting a set of nodes to agree on a single value, even when some nodes fail or messages are delayed or lost.
Consensus shows up everywhere in distributed systems. You need it to elect a leader, to decide whether a transaction committed, to agree on the order of log entries in a replicated database. It is the mechanism that makes reliable distributed software possible.
What Does “Value” Mean?
Consensus is formally described as getting nodes to agree on a single value. The word “value” sounds abstract – like a number. In practice, a value is any decision that a distributed system needs to make. Examples:
- “The next entry in the replicated log is SET account_balance = 500.”
- “This transaction is committed.”
- “Server 3 is the new leader for this epoch.”
- “The new cluster configuration includes servers 1, 2, and 4.”
- “Grant lock L to client C.”
The value agreed upon in a single round of consensus is whatever the system needs all participants to agree on at that moment.
The Goal: Replicated State Machines
Before diving into how consensus works, it helps to understand what we’re building it for.
A state machine is a deterministic system: given the same starting state and the same sequence of inputs, you always end up with the same result. A simple key-value store is a state machine. So is a database, a lock manager, or a configuration registry. Most of the programs we write are state machines: fed the same sequence of inputs from the same starting state, they store the same data and produce the same results.
We introduce replication in distributed systems for both scale (load balancing) and fault tolerance (having access to data even if some nodes die).
The idea behind state machine replication is that if you run the same state machine on multiple servers and guarantee that every server processes exactly the same commands in the same order, then all servers will eventually apply the same sequence of committed commands, so the committed state is consistent across replicas. If one crashes, another can take over without losing committed data.
The challenge is the ordering guarantee. In a distributed system with no shared clock and with unreliable networks, you cannot assume that commands arrive in the same order at every server. Consensus is the mechanism that imposes that order.
If clients send a command to any server, the system needs to behave like a single machine: a single total order of commands, applied once, producing a single result.1 Consensus on the sequence of commands gives us that total order.
The typical design is a replicated log. Each command from a client gets appended to a log. Conceptually, the system runs consensus repeatedly: one agreement per log index, producing a single, totally ordered sequence of commands. Every server builds a log identical to that of every other server. Once a command is committed to the log, every server executes it in log order. The state machine’s output is deterministic, so all servers end up in the same state.
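The replicated state machine idea can be sketched in a few lines. This is an illustrative toy (the class name and the command tuples are mine): given the same committed log, every replica deterministically reaches the same state.

```python
class KVStateMachine:
    """A deterministic key-value state machine: the same command sequence
    from the same starting state always yields the same result."""

    def __init__(self):
        self.state = {}

    def apply(self, command):
        op, key, value = command
        if op == "SET":
            self.state[key] = value

# The committed log, as agreed on by the consensus layer (hypothetical commands):
log = [("SET", "account_balance", 500), ("SET", "owner", "alice")]

# Every replica applies the same entries in the same order...
replicas = [KVStateMachine() for _ in range(3)]
for replica in replicas:
    for entry in log:
        replica.apply(entry)

# ...so every replica ends in the same state.
assert all(r.state == replicas[0].state for r in replicas)
```

The hard part, which this sketch assumes away, is producing that single agreed-upon `log` in the first place.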
This architecture underlies etcd (a key-value store), ScyllaDB (a high-performance, low-latency distributed database), CockroachDB (a cloud-native, distributed SQL database), TiKV (another cloud-native distributed database), Facebook’s LogDevice, and many other systems. The beauty of it is that the application does not need to worry about replication at all – it just reads and writes to what looks like a single state machine. The consensus layer handles the distribution invisibly.
Defining Consensus
A consensus protocol must satisfy three properties.
Here, “non-faulty” means a process that has not crashed (it may be slow or temporarily unreachable due to the network, but it continues executing the protocol).
Agreement: All non-faulty processes decide on the same value.
Validity: The decided value must be a value that was proposed by some process. (You cannot just decide on an arbitrary default.)
Termination: All non-faulty processes eventually decide.
The first two are safety properties: they say that bad things cannot happen. The third is a liveness property: it says that good things eventually do happen. As you will see, the tension between safety and liveness is at the heart of why consensus is hard.
Model: We assume crash failures (a server can stop and later restart) and network faults (delay, loss, reordering, and partitions). We do not assume Byzantine behavior. Safety must hold under arbitrary network behavior. Liveness requires some eventual stability (a long-enough period where messages get through within a bounded delay so a leader can remain a leader).
Why Consensus Is Hard
Consensus sounds simple. Why not just have every server broadcast its preferred value, collect the responses, and pick the majority? The problem is failures. What happens if a server crashes after broadcasting its value but before you collect all responses? What if a network partition means some servers never receive certain messages? What if a “crashed” server was actually just slow, recovers, and now sends stale messages into a new round of voting?
The real-world consequences of getting this wrong are severe. Consider a distributed database that uses a naive agreement protocol. During a network partition, both sides of the partition elect their own leader and accept writes. When the connection is restored, you have two different versions of the truth. Resolving that conflict may require application-level intervention or, worse, data loss.
Real systems have experienced split-brain when leader election and quorum rules were misconfigured or implemented incorrectly. During a partition, different sides can believe they are the only live group and accept writes, and the system then needs rollbacks or reconciliation when the partition heals.2
Impossibility Results and the Limits of Coordination
Distributed systems rely on message passing, but message passing comes with uncertainty. A missing message might be lost, delayed indefinitely, or never sent because a process crashed, and those cases can look identical.
Two classic results capture the consequences. The Two Armies Problem shows why message loss prevents the common knowledge needed for guaranteed coordinated action. The FLP Impossibility Result shows why, in a purely asynchronous system with even one possible crash, no deterministic consensus protocol can guarantee both safety and termination.
The Two Armies Problem
Before diving into formal impossibility results, it is worth considering a simpler problem that illustrates why distributed agreement is fundamentally hard.
The Two Armies Problem asks whether two armies, camped on opposite sides of a valley, can coordinate a simultaneous attack on an enemy in the middle. The armies cannot see each other, can only communicate by sending messengers, and any messenger might be captured or delayed indefinitely.
Army 1 proposes an attack time and sends a messenger to Army 2. If Army 2 receives the message, it agrees and sends a confirmation back. But Army 1 cannot be sure the confirmation arrived. So it sends an acknowledgment of the confirmation. Now Army 2 cannot be sure that acknowledgment arrived. No matter how many messages are exchanged, there is always a “last message,” and the sender of that message can never know whether it was delivered.
If we require safety, namely that neither army will attack unless it is certain the other will attack, then no protocol can guarantee a coordinated attack over an unreliable channel. The obstacle is not a poor algorithm but that message loss prevents the parties from establishing common knowledge, the infinite “I know that you know that I know…” chain needed for coordinated action.
This might seem counterintuitive. After all, TCP is designed to provide reliable, in-order delivery. However, TCP cannot create common knowledge either. If a connection fails (or appears to fail due to timeouts), one side may not know whether the last message was delivered before the failure, and the other side may not know what the first side concluded. Reliable delivery of bytes does not eliminate the fundamental uncertainty at the boundary between “delivered” and “not delivered” when failures are possible.
The FLP Impossibility Result
In 1985, Michael Fischer, Nancy Lynch, and Michael Paterson published a result that changed how the computer science field thinks about consensus. The FLP Impossibility Result proves that in a purely asynchronous distributed system, no deterministic algorithm can guarantee consensus if even a single process might crash.
The key word is asynchronous. In an asynchronous system, there is no bound on how long a message might take to arrive. There is no timeout you can safely use to declare a process dead, because a slow process is indistinguishable from a crashed one.
The proof works by showing that for any consensus algorithm, you can always construct an execution where the system cannot safely decide. The key observation is that along any run there is a “critical” message delivery: a message such that, depending on whether it is delivered now or delayed, the execution can be extended to a run that decides 0 or a run that decides 1.
If that message is delayed, the other processes cannot tell whether the sender crashed or is simply slow, nor whether receiving the message would have changed what a correct decision should be. If they decide without it, they risk agreement: there is an indistinguishable execution in which receiving that message would have led (correctly) to the other value. If they wait for it, they risk termination: in executions where the sender really did crash, they wait forever.
FLP does not say that consensus is impossible in practice. It says that in a purely asynchronous model with crash failures, no deterministic protocol can guarantee both:
- Safety (agreement and validity), and
- Termination (everyone eventually decides).
If you insist on always preserving safety, then there exist executions where the protocol cannot guarantee termination.
In practice, every real consensus protocol relaxes one of the constraints by adding some synchrony assumption. Paxos and Raft guarantee safety unconditionally but sacrifice liveness when the system is too unstable. They may block if the network keeps changing or no leader can stay a leader long enough. This is acceptable because real systems usually have periods of stability during which progress can be made.
Both the Two Armies Problem and the FLP Impossibility Result hinge on indistinguishability: when delay or loss is unbounded, you cannot tell whether the missing information is merely late or will never arrive, and that is enough to prevent guaranteed coordination (Two Armies) or guaranteed termination of consensus (FLP).
Paxos: The Foundation
History
Paxos was developed by Leslie Lamport in 1989 while working at DEC SRC3. The algorithm was described in a paper titled “The Part-Time Parliament,” named after a fictional Greek legislature on the island of Paxos whose members occasionally wandered in and out of the chamber. Lamport submitted the paper to ACM Transactions on Computer Systems in 1990, but it did not move forward in the usual way: the editors wanted a rewrite in a more conventional style, and reviewers complained that the narrative framing was too whimsical. Lamport did not want to strip out the allegory, so the manuscript sat in limbo for years. It was finally published, largely intact, in 1998, and went on to become one of the most influential (and challenging) papers in the history of distributed systems.
Lamport published a more direct presentation in 2001 titled “Paxos Made Simple,” which dropped the allegory and presented the algorithm in plain terms. The joke, by that point, was that even the “simple” version was notoriously difficult to implement correctly.
The Mechanism
Paxos has three roles: proposers initiate proposals, acceptors vote on them, and learners learn the chosen value4. In a real system, a single server typically plays all three roles.
The algorithm solves the problem of getting a set of acceptors to agree on a single value.
The protocol relies on a simple but powerful property: any two majorities of acceptors in a group of n share at least one member. That overlap matters over time: if one majority accepted something earlier, any later majority must include at least one of the same acceptors. Combined with Paxos’s promise and value-selection rules, this prevents two different values from being chosen for the same decision.
The algorithm runs in two phases.
Phase 1 (Prepare/Promise): A proposer selects a unique proposal number n and sends a Prepare(n) message to a majority of acceptors. An acceptor responds with a promise: it will not accept any proposal numbered less than n, and it reports the highest-numbered proposal it has already accepted, if any.
Phase 2 (Accept/Accepted): If the proposer receives promises from a majority, it sends an Accept(n, v) message. The value v is chosen as follows:
- If any of the promises it received reported a previously accepted proposal, the proposer must use the value from the highest-numbered one.
- If none of those promises reported a previously accepted proposal, the proposer is free to use any value it wants, typically the value it originally wanted to propose.
An acceptor accepts Accept(n, v) only if it has not since promised a higher proposal number; if it accepts, it records (n, v) and replies Accepted.
Once a majority of acceptors have accepted the same (n, v), the value v is chosen (decided), and learners can safely output it. Learners discover the decision by collecting Accepted messages or by querying acceptors.
The critical insight is the constraint on choosing v in Phase 2. By forcing a new proposer to adopt the value of any prior accepted proposal it hears about, Paxos ensures that if a value was already decided in a previous round, future rounds will decide the same value. Safety is maintained even if multiple proposers are active concurrently.
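The acceptor’s side of the two phases can be sketched compactly. This is an in-memory illustration under stated assumptions (the class and method names are mine; a real acceptor must write `promised_n` and `accepted` to stable storage before replying, or a crash and restart could break its promises):

```python
class Acceptor:
    """Single-decree Paxos acceptor: the promise and accept rules."""

    def __init__(self):
        self.promised_n = -1    # highest proposal number promised so far
        self.accepted = None    # (n, v) of the highest-numbered accepted proposal

    def prepare(self, n):
        """Phase 1: promise to ignore proposals numbered below n, and report
        the highest-numbered proposal already accepted, if any."""
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted)
        return ("reject", None)

    def accept(self, n, v):
        """Phase 2: accept unless a higher-numbered promise was made since."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted = (n, v)
            return "accepted"
        return "reject"

a = Acceptor()
a.prepare(1)
a.accept(1, "x")    # proposal (1, "x") is accepted

# A later proposer running Phase 1 with n = 2 learns about (1, "x") and,
# by the value-selection rule, must re-propose "x" rather than its own value.
status, prior = a.prepare(2)
```

The returned `prior` is exactly what forces the new proposer to carry a possibly-decided value forward; the stale proposer with number 1 is now locked out.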
Multi-Paxos
Basic Paxos decides a single value. A replicated log requires deciding a sequence of values, one per log slot (index).
Multi-Paxos extends single-decree Paxos by reusing a long-lived leader. The leader runs Phase 1 once, using a proposal number that covers all future log slots, to establish itself as the proposer for the entire log. After that, for each new log slot, the leader can usually skip Phase 1 and go straight to Phase 2, sending Accept(n, v) for the next slot and waiting for Accepted from a majority. That reduces the common case to one round trip per log entry instead of two.
When the leader changes, the optimization resets. A new leader must run Phase 1 to establish a new proposal number, and it must also recover any slots where the previous leader might have partially progressed. In practice, that means extra messages and extra waiting, which is why frequent leader changes make Multi-Paxos slow.
Challenges with Paxos
Paxos has a reputation for being difficult to implement correctly, and the reputation is earned.
The protocol as described in Lamport’s papers leaves many practical questions unresolved: How do you handle leader conflicts? How do you recover from partial failures during Phase 2? How do you reconfigure the set of acceptors (change cluster membership)? What happens when a leader crashes between phases? Each of these requires careful engineering, and the answers are not obvious from the algorithm description.
Chubby, Google’s distributed lock service, is one of the most prominent Paxos deployments. Google’s engineers published a 2007 paper, “Paxos Made Live,” describing what it actually took to build a production Paxos system. The paper documents a long list of practical complications that the original algorithm does not address, including disk corruption, Byzantine behaviors from broken hardware, and the subtleties of correctly implementing membership changes. Their conclusion was essentially that the gap between understanding Paxos and building a correct, production-grade Paxos implementation is enormous.
Other systems that use Paxos or Paxos variants include Apache ZooKeeper (which uses a closely related protocol called Zab), Google Spanner, and Microsoft Azure’s Service Fabric.
Raft: A More Understandable Consensus Algorithm
Motivation
By the mid-2000s, consensus had become a fundamental building block, but Paxos’s reputation for being hard to understand, teach, and implement correctly had created a real problem. Many systems built their own ad-hoc replication protocols that were subtly wrong. Even engineers who understood Paxos in principle struggled to implement it without bugs.
Diego Ongaro and John Ousterhout set out to design a consensus algorithm that was explicitly optimized for understandability. The result was Raft, published in 2014 in a USENIX ATC paper titled “In Search of an Understandable Consensus Algorithm.” The algorithm has since become the consensus mechanism of choice for new systems.
Raft is widely deployed in real systems, including etcd (the configuration store at the heart of Kubernetes), CockroachDB, TiKV (the storage layer of TiDB), Consul, InfluxDB, and YugabyteDB, among others. Many projects use “Raft-style” replicated logs even when they do not expose Raft directly.
Design Overview
Raft decomposes the consensus problem into three relatively independent subproblems:
- Leader election: select one server to serve as leader at any time.
- Log replication: the leader accepts log entries from clients and replicates them to followers.
- Safety: ensure that no two servers disagree on which log entry occupies any given position.
The design decision to use a strong leader simplifies the protocol considerably. In Paxos, any node can propose a value, which creates complexity around conflicting proposals. In Raft, all writes go through the leader, which serializes decisions naturally.
Terms
Raft divides time into terms, numbered with consecutive integers. A term begins with an election. If a candidate wins, it serves as leader for the rest of the term. If no candidate wins (a split vote), the term ends with no leader and a new term begins immediately.
Terms serve as a logical clock. Every message in Raft includes the sender’s current term. If a server receives a message with a higher term than its own, it updates its term and reverts to follower status. If it receives a message with a lower term, it rejects the message. This mechanism allows servers to detect stale messages from former leaders.
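The term rule is small enough to write down directly. A sketch under stated assumptions (the function name and tuple return shape are mine; state is one of "follower", "candidate", "leader"):

```python
def on_message(my_term, my_state, msg_term):
    """Raft's term rule: adopt a higher term and step down to follower;
    reject messages carrying a lower (stale) term; otherwise process
    the message normally. Returns (new_term, new_state, action)."""
    if msg_term > my_term:
        return (msg_term, "follower", "process")   # newer term: step down
    if msg_term < my_term:
        return (my_term, my_state, "reject")       # stale sender
    return (my_term, my_state, "process")

# A deposed leader at term 3 hearing from the term-5 leader steps down:
assert on_message(3, "leader", 5) == (5, "follower", "process")
# A stale message from a former leader at term 2 is rejected:
assert on_message(3, "follower", 2) == (3, "follower", "reject")
```

This single rule is what guarantees that at most one leader can act with authority in any given term.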
Server States
Every server is in exactly one of three states:
Follower: A follower is passive. It responds to RPCs from leaders and candidates but does not initiate client requests or log replication. All servers start as followers.
Candidate: A follower that has timed out waiting for a heartbeat from a leader. It initiates an election by requesting votes.
Leader: The elected coordinator. It handles all client requests, appends log entries, replicates them to followers, and sends periodic heartbeats to prevent new elections.
              times out               wins election
 Follower ------------> Candidate ------------------> Leader
    ^                       |                            |
    |   discovers current leader or higher term          |
    +-----------------------+----------------------------+
Leader Election
Followers expect to hear from a leader periodically in the form of heartbeat messages (empty AppendEntries RPCs). Each follower maintains an election timeout – a random duration, typically in the range of 150 to 300 milliseconds. If a follower reaches its timeout without hearing from a leader, it assumes the leader has failed and starts an election.
To start an election, the follower:
- Increments its current term.
- Transitions to candidate state.
- Votes for itself.
- Sends a RequestVote RPC to every other server.
A RequestVote RPC includes the candidate’s current term, its identity, and information about the last entry in its log (the index and term of that entry).
A server grants its vote if:
- It has not yet voted in the current term, and
- The candidate’s log is at least as up-to-date as the voter’s own log.
“At least as up-to-date” is defined precisely: a log is more up-to-date than another if its last entry has a higher term; if the last terms are equal, the longer log is more up-to-date. This restriction ensures that a candidate cannot win an election unless its log contains all committed entries.
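The comparison is over the (term, index) of each log’s last entry. A minimal sketch (the function name is mine):

```python
def at_least_as_up_to_date(candidate_last, voter_last):
    """Raft's election restriction, comparing (last_term, last_index) pairs:
    a higher last term wins; on a term tie, the longer log wins."""
    c_term, c_index = candidate_last
    v_term, v_index = voter_last
    if c_term != v_term:
        return c_term > v_term
    return c_index >= v_index

# A shorter log with a higher last term beats a longer log:
assert at_least_as_up_to_date((3, 5), (2, 9))
# On a term tie, the longer log wins:
assert not at_least_as_up_to_date((3, 4), (3, 5))
```

Note the asymmetry with raw log length: a log with nine term-2 entries still loses to a log with five entries ending in term 3, because the higher term proves the shorter log saw a more recent leader.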
A candidate wins the election if it receives votes from a majority of the cluster (including its own vote). It immediately sends heartbeats to all servers to establish its authority and prevent new elections.
If no candidate wins (for example, because the vote splits evenly), all candidates time out and start a new election with a new, higher term. The randomized election timeout makes it unlikely that multiple candidates will start elections simultaneously in the next round, allowing one to win.
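The effect of randomization can be sketched in a few lines (the server names and function name are hypothetical):

```python
import random

def new_election_timeout(low_ms=150.0, high_ms=300.0):
    """Each follower draws a fresh random timeout in the 150-300 ms range."""
    return random.uniform(low_ms, high_ms)

# Three followers draw independent timeouts; whichever expires first becomes
# a candidate and can usually win before the others even start an election.
timeouts = {name: new_election_timeout() for name in ("S1", "S2", "S3")}
first = min(timeouts, key=timeouts.get)
```

With continuous random draws, ties are vanishingly unlikely, so in most rounds exactly one server becomes a candidate first, wins, and suppresses the others with its heartbeats.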
Example: A Three-Server Election
Suppose servers S1, S2, and S3 are all followers at term 1 when the leader crashes.
S2 times out first (its random timeout was shortest). It increments to term 2, votes for itself, and sends RequestVote to S1 and S3.
S1 and S3 have not yet voted in term 2. Both respond with their vote for S2 (assuming S2’s log is at least as current as theirs).
S2 now has three votes (itself, S1, S3) out of three – a majority. S2 becomes the leader for term 2 and immediately sends heartbeats to S1 and S3.
S3, which was also about to time out, receives a heartbeat from S2 with term 2. Since term 2 is current, S3 recognizes S2 as the legitimate leader, resets its timeout, and stays a follower.
Log Replication
Once a leader is elected, it begins accepting client requests. Each request contains a command to be applied to the state machine. The leader:
- Appends the command to its own log as a new entry, tagged with the current term and the next available index.
- Sends AppendEntries RPCs to all followers in parallel, containing the new entry.
- Once a majority of servers have acknowledged the entry (written it to their logs), the leader commits the entry.
- The leader applies the entry to its state machine and returns the result to the client.
- Future AppendEntries messages (including heartbeats) inform followers of the highest committed index, at which point followers apply the committed entries to their own state machines.
A log entry is identified by its index (position in the log) and term (the term number when the leader created the entry). These two identifiers together uniquely identify an entry.
Raft guarantees the Log Matching Property: if two logs have an entry with the same index and term, then the logs are identical in all entries up through that index. This is enforced by a consistency check in AppendEntries: along with the new entry, the leader includes the index and term of the immediately preceding log entry. A follower rejects the AppendEntries if its own log does not have a matching entry at that position. The leader then backs up and tries sending an earlier entry, repeating until it finds the point where the follower’s log agrees with its own. From that point, the leader overwrites any conflicting entries with its own.
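The follower’s side of this consistency check can be sketched as follows. This is a simplification under stated assumptions (1-based indices as in the Raft paper; log entries as (term, command) pairs; it omits the commit-index bookkeeping a real follower also does):

```python
def append_entries(log, prev_index, prev_term, entries):
    """Follower side of the AppendEntries consistency check (a sketch).
    Rejects if the follower's log has no matching entry at prev_index;
    otherwise appends the new entries, truncating at the first conflict."""
    # Consistency check: the entry just before the new ones must match.
    if prev_index > 0:
        if len(log) < prev_index or log[prev_index - 1][0] != prev_term:
            return False          # leader will back up and retry earlier
    # Append, overwriting only from the first conflicting entry onward.
    for offset, entry in enumerate(entries):
        index = prev_index + offset            # 0-based position in `log`
        if index < len(log):
            if log[index][0] != entry[0]:      # term conflict: truncate here
                del log[index:]
                log.append(entry)
        else:
            log.append(entry)
    return True

follower = [(1, "a"), (1, "b"), (2, "c")]
# Matches at index 3 (term 2), so index 4 is appended:
append_entries(follower, 3, 2, [(2, "d")])
```

A mismatched call such as `append_entries(follower, 4, 3, ...)` returns False, which is the signal that makes the leader back up one entry and try again.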
Example: Log Replication
Index:    1    2    3    4
Leader:  [T1] [T1] [T2] [T2]   (all entries committed up to index 4)
S2:      [T1] [T1] [T2]        (missing index 4 -- will be replicated)
S3:      [T1] [T1] [T2] [T2]   (matches leader)
Suppose the leader receives a new command, creating an entry at index 5. It sends AppendEntries to S2 and S3. S3 appends immediately. S2 also appends (its log is consistent through index 3, and index 4 is now included in this message). The leader commits as soon as a majority holds the entry: once either follower acknowledges, two of the three servers (that follower plus the leader itself) have index 5, and the leader commits it.
Commit Rules and Leader Completeness
An entry is committed once the leader has stored it on a majority of servers. This is the basic rule, but there is an important subtlety.
Consider the scenario where a leader replicates an entry from an old term (for example, after a crash and re-election). Raft does not allow the new leader to commit entries from previous terms directly by counting replicas. Instead, the leader commits entries from previous terms indirectly: by appending a new entry from its current term and committing that. When the current-term entry is committed (by a majority), all preceding entries are also considered committed by the Log Matching Property.
This rule exists to prevent a subtle bug. Without it, a new leader could “commit” an entry from a prior term, only for a server with a more complete log to win a subsequent election and overwrite that entry. The restriction ensures that committed entries are never overwritten.
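The leader’s commit rule can be sketched as follows. This is a simplification under stated assumptions (the function name is mine; `log` is a list of (term, command) pairs with 1-based indices, `match_index` maps each follower to the highest index known replicated on it, and the leader’s own log holds every entry):

```python
def try_advance_commit(current_term, log, match_index, commit_index, n_servers):
    """Find the highest index replicated on a majority, but only commit it
    by counting replicas if its entry is from the current term; earlier
    entries then commit transitively via the Log Matching Property."""
    majority = n_servers // 2 + 1
    for index in range(len(log), commit_index, -1):
        replicas = 1 + sum(1 for m in match_index.values() if m >= index)
        if replicas >= majority and log[index - 1][0] == current_term:
            return index
    return commit_index

log = [(1, "a"), (2, "b")]
# The term-1 entry at index 1 is on a majority, but the term-2 leader
# refuses to commit it directly by counting replicas:
assert try_advance_commit(2, log, {"S2": 1, "S3": 0}, 0, 3) == 0
# Once the term-2 entry at index 2 reaches a majority, both entries commit:
assert try_advance_commit(2, log, {"S2": 2, "S3": 0}, 0, 3) == 2
```

The term check on the committing entry is exactly the restriction described above: without it, the old-term entry at index 1 would appear committable and could later be overwritten.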
Safety and Liveness
Raft’s safety guarantee is the Leader Completeness Property: if a log entry is committed in a given term, it will be present in the logs of all leaders for all higher-numbered terms. This follows directly from the election restriction (candidates must have logs at least as up-to-date as any majority) combined with the commit rule (entries are only committed once stored on a majority).
Safety is unconditional in Raft (under the crash-failure model with persistent logs): the protocol never allows two different entries to be committed at the same log index. Progress, however, requires a majority of servers to be up and able to communicate.
Liveness – the guarantee that clients eventually get responses – is conditional. Raft requires a stable leader to make progress. If the network is so unstable that elections keep failing (for example, because every candidate’s RequestVote messages arrive just after another candidate has already won a vote), the system can stall. In practice, the randomized election timeout makes this scenario extremely rare.
Cluster Membership Changes
One practical issue that comes up in any real deployment is changing the set of servers in the cluster – adding a new server, removing a failed one, or replacing hardware. You cannot simply shut down the cluster, reconfigure, and restart; production systems need to change membership without downtime.
Raft handles this through a mechanism called joint consensus. During a configuration change, the cluster transitions through a joint configuration that includes both the old and new sets of servers. Decisions during this period require a majority of both the old and new configurations. Once the joint configuration is committed, the cluster transitions to the new configuration alone. This two-phase approach ensures that no two majorities – one from the old config and one from the new – can make independent decisions during the transition.
In production, teams often perform membership changes one server at a time (add one, wait for it to catch up, then remove one) to reduce risk. But safety still depends on using a reconfiguration mechanism that guarantees a majority intersection across the transition. Joint consensus is Raft’s standard way to ensure that property.
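The double-majority rule of joint consensus is simple to state in code. A sketch, assuming configurations are represented as sets of server IDs and `acks` is the set of servers that have acknowledged an entry (names are illustrative):

```python
def joint_majority(acks, old_config, new_config):
    """During joint consensus, a decision requires a majority of BOTH the
    old and the new configuration; neither majority alone suffices."""
    def majority(config):
        return len(acks & config) > len(config) // 2
    return majority(old_config) and majority(new_config)
```

Since every decision during the transition needs acknowledgments from both configurations, no old-only majority and new-only majority can commit conflicting entries independently.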
Log Compaction and Snapshots
As a server runs, its log grows without bound. Replaying a long log on restart would take an unacceptably long time, and storing it forever is wasteful. Raft handles this through snapshots. Each server periodically captures the current state of its state machine and records the last log index and term included in the snapshot. It can then discard all earlier log entries.
When a follower falls too far behind (perhaps because it was offline for an extended period), the leader may not have the log entries the follower needs; they may have been compacted. In that case, the leader sends the follower its current snapshot directly, using an InstallSnapshot RPC. The follower replaces its state with the snapshot and resumes normal log replication from that point.
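The bookkeeping for compaction is small: a snapshot must record the index and term of the last entry it covers, so that the leader's AppendEntries consistency check still works at the snapshot boundary. A sketch under the same hypothetical representation as before (`log` as a 1-based list of `(term, command)` pairs):

```python
def compact_log(log, snapshot_index):
    """Discard log entries covered by a snapshot (sketch). Returns the
    metadata Raft keeps with the snapshot -- the last included index and
    term -- plus the truncated log."""
    last_included_term = log[snapshot_index - 1][0]
    remaining = log[snapshot_index:]
    return (snapshot_index, last_included_term), remaining
```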
Raft vs. Paxos
The differences between Raft and Multi-Paxos are as much about clarity as about mechanism. First, the similarities:
- Both protocols rely on a stable leader and a majority quorum to commit entries.
- Both tolerate the crash of up to f nodes in a cluster of 2f+1 nodes.
- Both guarantee safety unconditionally and liveness only under a sufficiently stable network.
The key differences are:
- Raft uses randomized timeouts for leader election, which is simpler than Paxos's approach of competing proposers with proposal numbers. Raft's election mechanism makes it clear which server is the leader at any moment.
- Raft handles log replication and leader election as a unified mechanism: the leader's log is always the ground truth, and elections are specifically designed to preserve that truth. In Paxos, Phase 1 (establishing leadership for a log slot) and Phase 2 (replicating a value) are separated in a way that can obscure the full picture.
- Raft explicitly specifies cluster membership changes and log compaction as part of the protocol. Paxos leaves these to implementors, which is part of why Paxos implementations vary so widely.
Further Reading
- Fischer, M., Lynch, N., Paterson, M. "Impossibility of Distributed Consensus with One Faulty Process." JACM, 1985.
- Lamport, L. "Paxos Made Simple." ACM SIGACT News, 2001.
- Chandra, T., Griesemer, R., Redstone, J. "Paxos Made Live: An Engineering Perspective." PODC, 2007. Describes the experience of what it took to implement Paxos correctly.
- Ongaro, D., Ousterhout, J. "In Search of an Understandable Consensus Algorithm." USENIX ATC, 2014. Far easier to read than the Paxos papers.
- Ongaro, D. "Consensus: Bridging Theory and Practice." PhD dissertation, Stanford, 2014. The most complete reference for Raft.
- Raft Visualization, thesecretlivesofdata.com. A step-by-step interactive visualization of how Raft works, covering leader election, log replication, and log matching.
Videos
- Ousterhout, J. and Ongaro, D. "Implementing Replicated Logs with Paxos," August 2013. A clear discussion of how Paxos works, why it does what it does to address different failure modes, and how it is used to build replicated logs.
- Ousterhout, J. and Ongaro, D. "Raft: A Consensus Algorithm for Replicated Logs," August 2013. A really clear and thorough discussion of how Raft works and tackles different failure cases. You should watch this.
- Lamport, L. "The Paxos Algorithm or How to Win a Turing Award," October 2024. Contains links to videos and slides of a lecture by Leslie Lamport explaining Paxos. It is rigorous, a bit abstract, and uses precise formalisms. Watch this if you like theory.
- A solution such as a global sequence number generator doesn't solve the problem, since the coordinator becomes a single point of failure; this pushes the consensus problem down to building replicas for the coordinator, which itself requires consensus. ↩
- A widely discussed real-world example involved MongoDB replica sets where configuration choices and network partitions led to multiple nodes believing they were primary, followed by rollbacks when the partition healed. ↩
- The Systems Research Center of the Digital Equipment Corporation, a research laboratory. DEC was a pioneer in minicomputers, eventually acquired by Compaq, which was later acquired by Hewlett-Packard. ↩
- In the Paxos literature, a value that has been decided is often described as "chosen." ↩