Where We Left Off
The previous class covered several mechanisms for coordinating distributed processes: mutual exclusion, leader election, and virtual synchrony. Before discussing consensus, it is useful to understand what each of those gives you and, more importantly, what they do not.
The mutual exclusion algorithms we studied – centralized, token ring, Lamport’s algorithm, and Ricart-Agrawala – all solve a narrow problem: ensuring that at most one process enters a critical section at a time. The centralized approach is simple but makes the coordinator a single point of failure. The distributed approaches (Lamport’s and Ricart-Agrawala) eliminate that single point of failure but require O(N) messages per entry and assume that all processes are reachable and that the set of participants is stable. If a process crashes mid-protocol, these algorithms stall. They have no mechanism for making progress in the presence of failures.
The leader election algorithms – the Bully algorithm and the Chang-Roberts ring election – give you a way to designate a single coordinator, but they have their own limitations. The Bully algorithm assumes a synchronous model: it relies on timeouts to declare a process dead, which means it can elect the wrong leader if a slow process is mistaken for a crashed one. The ring algorithm assumes a stable logical ring topology and offers no protection if a node fails mid-election. Neither algorithm provides any guarantee about what happens if the network partitions: each partition may independently elect its own leader.
Virtual synchrony addresses group membership and message ordering. It gives you a clean model for delivering messages consistently across a changing group. But it is a communication abstraction, not a decision-making protocol. It tells you which messages were delivered in which view, but it does not tell you how the group should agree on a specific value or decision when members disagree. It also does not solve what happens when the network partitions: each side of a partition can form its own view and continue making decisions independently. If both sides keep operating, the system diverges, and virtual synchrony provides no mechanism to prevent that or to reconcile the two histories when the partition heals.
The gap these mechanisms share is this: none of them handle the situation where the network splits into two or more isolated groups, each of which can still make progress independently. That is the problem consensus is designed to solve.
Split-Brain and Quorum
When a network partition occurs, one group of machines loses contact with the other. If your coordination protocol allows either side to independently elect a leader, accept writes, or make decisions, you get split-brain: two separate parts of the system simultaneously believe they are authoritative, each accepting operations the other does not see. When communication is restored, you can end up with two incompatible histories that must be reconciled or, in the worst case, one of them must be discarded.
The standard solution is to require a quorum before any decision can be made.
A quorum is a decision threshold, usually chosen so that any two quorums overlap. The most common choice is a majority quorum: more than half of the nodes. This guarantees overlap even if partitions change over time. Today’s “majority” might be a different set of machines tomorrow. Because any two majorities overlap, any future quorum must include at least one node that participated in the earlier quorum. That shared node carries the earlier decision forward (from stable storage) and prevents the system from later committing a conflicting decision.
With majority quorums, it is impossible for two isolated groups to both form a quorum. If one side of a partition has a majority, the other side does not, and the side without a quorum cannot proceed. The minority side simply stalls until connectivity is restored.
For example, in a five-server cluster, a majority is any three servers. If the network splits into a group of three and a group of two, only the group of three can form a quorum. The group of two cannot elect a leader, cannot commit writes, and cannot make any binding decisions. The cluster as a whole remains consistent.
Requiring a quorum means that a cluster of n servers can tolerate at most ⌊(n−1)/2⌋ failures. A five-server cluster tolerates two failures; a three-server cluster tolerates one. If too many servers fail or are unreachable, the cluster stops making progress rather than risk inconsistency. This is a deliberate design choice: it is better to be unavailable than to be wrong.
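The quorum arithmetic is simple enough to state directly. A minimal sketch (the function names are mine, not from any library):

```python
def majority(n):
    """Smallest quorum size such that any two quorums must overlap."""
    return n // 2 + 1

def max_failures(n):
    """Failures a cluster of n servers tolerates: floor((n - 1) / 2)."""
    return (n - 1) // 2

# Five servers: quorum of 3, tolerates 2 failures.
assert majority(5) == 3 and max_failures(5) == 2
# Three servers: quorum of 2, tolerates 1 failure.
assert majority(3) == 2 and max_failures(3) == 1
# Any two majorities of the same cluster must share at least one node:
assert majority(5) + majority(5) > 5
```

Note that even cluster sizes buy nothing: `majority(4)` is 3, so a four-server cluster tolerates only one failure, the same as a three-server cluster.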
The Problem We’re Solving
Distributed systems exist to avoid single points of failure. But the moment you have multiple servers cooperating on a task, you face a new problem: how do they agree on anything? If two servers can both accept writes, which one wins? If the network partitions, how do you avoid an inconsistent state developing among the partitions? If a server crashes mid-operation, how do you prevent a system ending up in an unknown intermediate state?
These questions all reduce to a single fundamental problem: consensus. At its core, consensus is the problem of getting a set of nodes to agree on a single value, even when some nodes fail or messages are delayed or lost.
Consensus shows up everywhere in distributed systems. You need it to elect a leader, to decide whether a transaction committed, to agree on the order of log entries in a replicated database. It is the mechanism that makes reliable distributed software possible.
What Does “Value” Mean?
Consensus is formally described as getting nodes to agree on a single value. The word “value” sounds abstract – like a number. In practice, a value is any decision that a distributed system needs to make. Examples:
- “The next entry in the replicated log is SET account_balance = 500.”
- “This transaction is committed.”
- “Server 3 is the new leader for this epoch.”
- “The new cluster configuration includes servers 1, 2, and 4.”
- “Grant lock L to client C.”
The value agreed upon in a single round of consensus is whatever the system needs all participants to agree on at that moment.
The Goal: Replicated State Machines
Before diving into how consensus works, it helps to understand what we’re building it for.
A state machine is a deterministic system: given the same starting state and the same sequence of inputs, you always end up with the same result. A simple key-value store is a state machine. So is a database, a lock manager, or a configuration registry. Most of the programs we write are state machines: fed the same sequence of inputs from the same starting state, they store the same data and produce the same results.
We introduce replication in distributed systems for both scale (load balancing) and fault tolerance (having access to data even if some nodes die).
The idea behind state machine replication is that if you run the same state machine on multiple servers and guarantee that every server processes exactly the same commands in the same order, then all servers will eventually apply the same sequence of committed commands, so the committed state is consistent across replicas. If one crashes, another can take over without losing committed data.
The challenge is the ordering guarantee. In a distributed system with no shared clock and with unreliable networks, you cannot assume that commands arrive in the same order at every server. Consensus is the mechanism that imposes that order.
If clients send a command to any server, the system needs to behave like a single machine: a single total order of commands, applied once, producing a single result.1 Consensus on the sequence of commands gives us that total order.
The typical design is a replicated log. Each command from a client gets appended to a log. Conceptually, the system runs consensus repeatedly: one agreement per log index, producing a single, totally ordered sequence of commands. Every server builds a log identical to that of every other server. Once a command is committed to the log, every server executes it in log order. The state machine’s output is deterministic, so all servers end up in the same state.
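The replicated state machine idea can be sketched in a few lines. This is an illustrative toy (the class name and the command tuples are mine): given the same committed log, every replica deterministically reaches the same state.

```python
class KVStateMachine:
    """A deterministic key-value state machine: the same command sequence
    from the same starting state always yields the same result."""

    def __init__(self):
        self.state = {}

    def apply(self, command):
        op, key, value = command
        if op == "SET":
            self.state[key] = value

# The committed log, as agreed on by the consensus layer (hypothetical commands):
log = [("SET", "account_balance", 500), ("SET", "owner", "alice")]

# Every replica applies the same entries in the same order...
replicas = [KVStateMachine() for _ in range(3)]
for replica in replicas:
    for entry in log:
        replica.apply(entry)

# ...so every replica ends in the same state.
assert all(r.state == replicas[0].state for r in replicas)
```

The hard part, which this sketch assumes away, is producing that single agreed-upon `log` in the first place.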
This architecture underlies etcd (a key-value store), ScyllaDB (a high-performance, low-latency distributed database), CockroachDB (a cloud-native, distributed SQL database), TiKV (another cloud-native distributed database), Facebook’s LogDevice, and many other systems. The beauty of it is that the application does not need to worry about replication at all – it just reads and writes to what looks like a single state machine. The consensus layer handles the distribution invisibly.
Defining Consensus
A consensus protocol must satisfy three properties.
Here, “non-faulty” means a process that has not crashed (it may be slow or temporarily unreachable due to the network, but it continues executing the protocol).
Agreement: All non-faulty processes decide on the same value.
Validity: The decided value must be a value that was proposed by some process. (You cannot just decide on an arbitrary default.)
Termination: All non-faulty processes eventually decide.
The first two are safety properties: they say that bad things cannot happen. The third is a liveness property: it says that good things eventually do happen. As you will see, the tension between safety and liveness is at the heart of why consensus is hard.
Model: We assume crash failures (a server can stop and later restart) and network faults (delay, loss, reordering, and partitions). We do not assume Byzantine behavior. Safety must hold under arbitrary network behavior. Liveness requires some eventual stability (a long-enough period where messages get through within a bounded delay so a leader can remain a leader).
Why Consensus Is Hard
Consensus sounds simple. Why not just have every server broadcast its preferred value, collect the responses, and pick the majority? The problem is failures. What happens if a server crashes after broadcasting its value but before you collect all responses? What if a network partition means some servers never receive certain messages? What if a “crashed” server was actually just slow, recovers, and now sends stale messages into a new round of voting?
The real-world consequences of getting this wrong are severe. Consider a distributed database that uses a naive agreement protocol. During a network partition, both sides of the partition elect their own leader and accept writes. When the connection is restored, you have two different versions of the truth. Resolving that conflict may require application-level intervention or, worse, data loss.
Real systems have experienced split-brain when leader election and quorum rules were misconfigured or implemented incorrectly. During a partition, different sides can believe they are the only live group and accept writes, and the system then needs rollbacks or reconciliation when the partition heals.2
Impossibility Results and the Limits of Coordination
Distributed systems rely on message passing, but message passing comes with uncertainty. A missing message might be lost, delayed indefinitely, or never sent because a process crashed, and those cases can look identical.
Two classic results capture the consequences. The Two Armies Problem shows why message loss prevents the common knowledge needed for guaranteed coordinated action. The FLP Impossibility Result shows why, in a purely asynchronous system with even one possible crash, no deterministic consensus protocol can guarantee both safety and termination.
The Two Armies Problem
Before diving into formal impossibility results, it is worth considering a simpler problem that illustrates why distributed agreement is fundamentally hard.
The Two Armies Problem asks whether two armies, camped on opposite sides of a valley, can coordinate a simultaneous attack on an enemy in the middle. The armies cannot see each other, can only communicate by sending messengers, and any messenger might be captured or delayed indefinitely.
Army 1 proposes an attack time and sends a messenger to Army 2. If Army 2 receives the message, it agrees and sends a confirmation back. But Army 1 cannot be sure the confirmation arrived. So it sends an acknowledgment of the confirmation. Now Army 2 cannot be sure that acknowledgment arrived. No matter how many messages are exchanged, there is always a “last message,” and the sender of that message can never know whether it was delivered.
If we require safety, namely that neither army will attack unless it is certain the other will attack, then no protocol can guarantee a coordinated attack over an unreliable channel. The obstacle is not a poor algorithm but that message loss prevents the parties from establishing common knowledge, the infinite “I know that you know that I know…” chain needed for coordinated action.
This might seem counterintuitive. After all, TCP is designed to provide reliable, in-order delivery. However, TCP cannot create common knowledge either. If a connection fails (or appears to fail due to timeouts), one side may not know whether the last message was delivered before the failure, and the other side may not know what the first side concluded. Reliable delivery of bytes does not eliminate the fundamental uncertainty at the boundary between “delivered” and “not delivered” when failures are possible.
The FLP Impossibility Result
In 1985, Michael Fischer, Nancy Lynch, and Michael Paterson published a result that changed how the computer science field thinks about consensus. The FLP Impossibility Result proves that in a purely asynchronous distributed system, no deterministic algorithm can guarantee consensus if even a single process might crash.
The key word is asynchronous. In an asynchronous system, there is no bound on how long a message might take to arrive. There is no timeout you can safely use to declare a process dead, because a slow process is indistinguishable from a crashed one.
The proof works by showing that for any consensus algorithm, you can always construct an execution where the system cannot safely decide. The key observation is that along any run there is a “critical” message delivery: a message such that, depending on whether it is delivered now or delayed, the execution can be extended to a run that decides 0 or a run that decides 1.
If that message is delayed, the other processes cannot tell whether the sender crashed or is simply slow, nor whether receiving the message would have changed what a correct decision should be. If they decide without it, they risk agreement: there is an indistinguishable execution in which receiving that message would have led (correctly) to the other value. If they wait for it, they risk termination: in executions where the sender really did crash, they wait forever.
FLP does not say that consensus is impossible in practice. It says that in a purely asynchronous model with crash failures, no deterministic protocol can guarantee both:
- Safety (agreement and validity), and
- Termination (everyone eventually decides).
If you insist on always preserving safety, then there exist executions where the protocol cannot guarantee termination.
In practice, every real consensus protocol relaxes one of the constraints by adding some synchrony assumption. Paxos and Raft guarantee safety unconditionally but sacrifice liveness when the system is too unstable. They may block if the network keeps changing or no leader can stay a leader long enough. This is acceptable because real systems usually have periods of stability during which progress can be made.
Both the Two Armies Problem and the FLP Impossibility Result hinge on indistinguishability: when delay or loss is unbounded, you cannot tell whether the missing information is merely late or will never arrive, and that is enough to prevent guaranteed coordination (Two Armies) or guaranteed termination of consensus (FLP).
Paxos: The Foundation
History
Paxos was developed by Leslie Lamport in 1989 while working at DEC SRC3. The algorithm was described in a paper titled “The Part-Time Parliament,” named after a fictional Greek legislature on the island of Paxos whose members occasionally wandered in and out of the chamber. Lamport submitted the paper to ACM Transactions on Computer Systems in 1990, but it did not move forward in the usual way: the editors wanted a rewrite in a more conventional style, and reviewers complained that the narrative framing was too whimsical. Lamport did not want to strip out the allegory, so the manuscript sat in limbo for years. It was finally published, largely intact, in 1998, and went on to become one of the most influential (and challenging) papers in the history of distributed systems.
Lamport published a more direct presentation in 2001 titled “Paxos Made Simple,” which dropped the allegory and presented the algorithm in plain terms. The joke, by that point, was that even the “simple” version was notoriously difficult to implement correctly.
The Mechanism
Paxos has three roles: proposers initiate proposals, acceptors vote on them, and learners learn the chosen value4. In a real system, a single server typically plays all three roles.
The algorithm solves the problem of getting a set of acceptors to agree on a single value.
The protocol relies on a simple but powerful property: any two majorities of acceptors in a group of n share at least one member. That overlap matters over time: if one majority accepted something earlier, any later majority must include at least one of the same acceptors. Combined with Paxos’s promise and value-selection rules, this prevents two different values from being chosen for the same decision.
The algorithm runs in two phases.
Phase 1 (Prepare/Promise): A proposer selects a unique proposal number n and sends a Prepare(n) message to a majority of acceptors. An acceptor responds with a promise: it will not accept any proposal numbered less than n, and it reports the highest-numbered proposal it has already accepted, if any.
Phase 2 (Accept/Accepted): If the proposer receives promises from a majority, it sends an Accept(n, v) message. The value v is chosen as follows:
- If any of the promises it received reported a previously accepted proposal, the proposer must use the value from the highest-numbered one.
- If none of those promises reported a previously accepted proposal, the proposer is free to use any value it wants, typically the value it originally wanted to propose.
An acceptor accepts Accept(n, v) only if it has not since promised a higher proposal number; if it accepts, it records (n, v) and replies Accepted.
Once a majority of acceptors have accepted the same (n, v), the value v is chosen (decided), and learners can safely output it. Learners discover the decision by collecting Accepted messages or by querying acceptors.
The critical insight is the constraint on choosing v in Phase 2. By forcing a new proposer to adopt the value of any prior accepted proposal it hears about, Paxos ensures that if a value was already decided in a previous round, future rounds will decide the same value. Safety is maintained even if multiple proposers are active concurrently.
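The acceptor’s side of the two phases can be sketched compactly. This is an in-memory illustration under stated assumptions (the class and method names are mine; a real acceptor must write `promised_n` and `accepted` to stable storage before replying, or a crash and restart could break its promises):

```python
class Acceptor:
    """Single-decree Paxos acceptor: the promise and accept rules."""

    def __init__(self):
        self.promised_n = -1    # highest proposal number promised so far
        self.accepted = None    # (n, v) of the highest-numbered accepted proposal

    def prepare(self, n):
        """Phase 1: promise to ignore proposals numbered below n, and report
        the highest-numbered proposal already accepted, if any."""
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted)
        return ("reject", None)

    def accept(self, n, v):
        """Phase 2: accept unless a higher-numbered promise was made since."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted = (n, v)
            return "accepted"
        return "reject"

a = Acceptor()
a.prepare(1)
a.accept(1, "x")    # proposal (1, "x") is accepted

# A later proposer running Phase 1 with n = 2 learns about (1, "x") and,
# by the value-selection rule, must re-propose "x" rather than its own value.
status, prior = a.prepare(2)
```

The returned `prior` is exactly what forces the new proposer to carry a possibly-decided value forward; the stale proposer with number 1 is now locked out.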
Multi-Paxos
Basic Paxos decides a single value. A replicated log requires deciding a sequence of values, one per log slot (index).
Multi-Paxos extends single-decree Paxos by reusing a long-lived leader. The leader runs Phase 1 once, using a proposal number that covers all future log slots, to establish itself as the proposer for the entire log. After that, for each new log slot, the leader can usually skip Phase 1 and go straight to Phase 2, sending Accept(n, v) for the next slot and waiting for Accepted from a majority. That reduces the common case to one round trip per log entry instead of two.
When the leader changes, the optimization resets. A new leader must run Phase 1 to establish a new proposal number, and it must also recover any slots where the previous leader might have partially progressed. In practice, that means extra messages and extra waiting, which is why frequent leader changes make Multi-Paxos slow.
Challenges with Paxos
Paxos has a reputation for being difficult to implement correctly, and the reputation is earned.
The protocol as described in Lamport’s papers leaves many practical questions unresolved: How do you handle leader conflicts? How do you recover from partial failures during Phase 2? How do you reconfigure the set of acceptors (change cluster membership)? What happens when a leader crashes between phases? Each of these requires careful engineering, and the answers are not obvious from the algorithm description.
Chubby, Google’s distributed lock service, is one of the most prominent Paxos deployments. Google’s engineers published a 2007 paper, “Paxos Made Live,” describing what it actually took to build a production Paxos system. The paper documents a long list of practical complications that the original algorithm does not address, including disk corruption, Byzantine behaviors from broken hardware, and the subtleties of correctly implementing membership changes. Their conclusion was essentially that the gap between understanding Paxos and building a correct, production-grade Paxos implementation is enormous.
Other systems that use Paxos or Paxos variants include Apache ZooKeeper (which uses a closely related protocol called Zab), Google Spanner, and Microsoft Azure’s Service Fabric.
Raft: A More Understandable Consensus Algorithm
Motivation
By the mid-2000s, consensus had become a fundamental building block, but Paxos’s reputation for being hard to understand, teach, and implement correctly had created a real problem. Many systems built their own ad-hoc replication protocols that were subtly wrong. Even engineers who understood Paxos in principle struggled to implement it without bugs.
Diego Ongaro and John Ousterhout set out to design a consensus algorithm that was explicitly optimized for understandability. The result was Raft, published in 2014 in a USENIX ATC paper titled “In Search of an Understandable Consensus Algorithm.” The algorithm has since become the consensus mechanism of choice for new systems.
Raft is widely deployed in real systems, including etcd (the configuration store at the heart of Kubernetes), CockroachDB, TiKV (the storage layer of TiDB), Consul, InfluxDB, and YugabyteDB, among others. Many projects use “Raft-style” replicated logs even when they do not expose Raft directly.
Design Overview
Raft decomposes the consensus problem into three relatively independent subproblems:
- Leader election: select one server to serve as leader at any time.
- Log replication: the leader accepts log entries from clients and replicates them to followers.
- Safety: ensure that no two servers disagree on which log entry occupies any given position.
The design decision to use a strong leader simplifies the protocol considerably. In Paxos, any node can propose a value, which creates complexity around conflicting proposals. In Raft, all writes go through the leader, which serializes decisions naturally.
Terms
Raft divides time into terms, numbered with consecutive integers. A term begins with an election. If a candidate wins, it serves as leader for the rest of the term. If no candidate wins (a split vote), the term ends with no leader and a new term begins immediately.
Terms serve as a logical clock. Every message in Raft includes the sender’s current term. If a server receives a message with a higher term than its own, it updates its term and reverts to follower status. If it receives a message with a lower term, it rejects the message. This mechanism allows servers to detect stale messages from former leaders.
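The term rule is small enough to write down directly. A sketch under stated assumptions (the function name and tuple return shape are mine; state is one of "follower", "candidate", "leader"):

```python
def on_message(my_term, my_state, msg_term):
    """Raft's term rule: adopt a higher term and step down to follower;
    reject messages carrying a lower (stale) term; otherwise process
    the message normally. Returns (new_term, new_state, action)."""
    if msg_term > my_term:
        return (msg_term, "follower", "process")   # newer term: step down
    if msg_term < my_term:
        return (my_term, my_state, "reject")       # stale sender
    return (my_term, my_state, "process")

# A deposed leader at term 3 hearing from the term-5 leader steps down:
assert on_message(3, "leader", 5) == (5, "follower", "process")
# A stale message from a former leader at term 2 is rejected:
assert on_message(3, "follower", 2) == (3, "follower", "reject")
```

This single rule is what guarantees that at most one leader can act with authority in any given term.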
Server States
Every server is in exactly one of three states:
Follower: A follower is passive. It responds to RPCs from leaders and candidates but does not initiate client requests or log replication. All servers start as followers.
Candidate: A follower that has timed out waiting for a heartbeat from a leader. It initiates an election by requesting votes.
Leader: The elected coordinator. It handles all client requests, appends log entries, replicates them to followers, and sends periodic heartbeats to prevent new elections.
              times out               wins election
 Follower ------------> Candidate ------------------> Leader
    ^                       |                            |
    |   discovers current leader or higher term          |
    +-----------------------+----------------------------+
Leader Election
Followers expect to hear from a leader periodically in the form of heartbeat messages (empty AppendEntries RPCs). Each follower maintains an election timeout – a random duration, typically in the range of 150 to 300 milliseconds. If a follower reaches its timeout without hearing from a leader, it assumes the leader has failed and starts an election.
To start an election, the follower:
- Increments its current term.
- Transitions to candidate state.
- Votes for itself.
- Sends a RequestVote RPC to every other server.
A RequestVote RPC includes the candidate’s current term, its identity, and information about the last entry in its log (the index and term of that entry).
A server grants its vote if:
- It has not yet voted in the current term, and
- The candidate’s log is at least as up-to-date as the voter’s own log.
“At least as up-to-date” is defined precisely: a log is more up-to-date than another if its last entry has a higher term; if the last terms are equal, the longer log is more up-to-date. This restriction ensures that a candidate cannot win an election unless its log contains all committed entries.
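The comparison is over the (term, index) of each log’s last entry. A minimal sketch (the function name is mine):

```python
def at_least_as_up_to_date(candidate_last, voter_last):
    """Raft's election restriction, comparing (last_term, last_index) pairs:
    a higher last term wins; on a term tie, the longer log wins."""
    c_term, c_index = candidate_last
    v_term, v_index = voter_last
    if c_term != v_term:
        return c_term > v_term
    return c_index >= v_index

# A shorter log with a higher last term beats a longer log:
assert at_least_as_up_to_date((3, 5), (2, 9))
# On a term tie, the longer log wins:
assert not at_least_as_up_to_date((3, 4), (3, 5))
```

Note the asymmetry with raw log length: a log with nine term-2 entries still loses to a log with five entries ending in term 3, because the higher term proves the shorter log saw a more recent leader.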
A candidate wins the election if it receives votes from a majority of the cluster (including its own vote). It immediately sends heartbeats to all servers to establish its authority and prevent new elections.
If no candidate wins (for example, because the vote splits evenly), all candidates time out and start a new election with a new, higher term. The randomized election timeout makes it unlikely that multiple candidates will start elections simultaneously in the next round, allowing one to win.
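The effect of randomization can be sketched in a few lines (the server names and function name are hypothetical):

```python
import random

def new_election_timeout(low_ms=150.0, high_ms=300.0):
    """Each follower draws a fresh random timeout in the 150-300 ms range."""
    return random.uniform(low_ms, high_ms)

# Three followers draw independent timeouts; whichever expires first becomes
# a candidate and can usually win before the others even start an election.
timeouts = {name: new_election_timeout() for name in ("S1", "S2", "S3")}
first = min(timeouts, key=timeouts.get)
```

With continuous random draws, ties are vanishingly unlikely, so in most rounds exactly one server becomes a candidate first, wins, and suppresses the others with its heartbeats.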
Example: A Three-Server Election
Suppose servers S1, S2, and S3 are all followers at term 1 when the leader crashes.
S2 times out first (its random timeout was shortest). It increments to term 2, votes for itself, and sends RequestVote to S1 and S3.
S1 and S3 have not yet voted in term 2. Both respond with their vote for S2 (assuming S2’s log is at least as current as theirs).
S2 now has three votes (itself, S1, S3) out of three – a majority. S2 becomes the leader for term 2 and immediately sends heartbeats to S1 and S3.
S3, which was also about to time out, receives a heartbeat from S2 with term 2. Since term 2 is current, S3 recognizes S2 as the legitimate leader, resets its timeout, and stays a follower.
Log Replication
Once a leader is elected, it begins accepting client requests. Each request contains a command to be applied to the state machine. The leader:
- Appends the command to its own log as a new entry, tagged with the current term and the next available index.
- Sends AppendEntries RPCs to all followers in parallel, containing the new entry.
- Once a majority of servers have acknowledged the entry (written it to their logs), the leader commits the entry.
- The leader applies the entry to its state machine and returns the result to the client.
- Future AppendEntries messages (including heartbeats) inform followers of the highest committed index, at which point followers apply the committed entries to their own state machines.
A log entry is identified by its index (position in the log) and term (the term number when the leader created the entry). These two identifiers together uniquely identify an entry.
Raft guarantees the Log Matching Property: if two logs have an entry with the same index and term, then the logs are identical in all entries up through that index. This is enforced by a consistency check in AppendEntries: along with the new entry, the leader includes the index and term of the immediately preceding log entry. A follower rejects the AppendEntries if its own log does not have a matching entry at that position. The leader then backs up and tries sending an earlier entry, repeating until it finds the point where the follower’s log agrees with its own. From that point, the leader overwrites any conflicting entries with its own.
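The follower’s side of this consistency check can be sketched as follows. This is a simplification under stated assumptions (1-based indices as in the Raft paper; log entries as (term, command) pairs; it omits the commit-index bookkeeping a real follower also does):

```python
def append_entries(log, prev_index, prev_term, entries):
    """Follower side of the AppendEntries consistency check (a sketch).
    Rejects if the follower's log has no matching entry at prev_index;
    otherwise appends the new entries, truncating at the first conflict."""
    # Consistency check: the entry just before the new ones must match.
    if prev_index > 0:
        if len(log) < prev_index or log[prev_index - 1][0] != prev_term:
            return False          # leader will back up and retry earlier
    # Append, overwriting only from the first conflicting entry onward.
    for offset, entry in enumerate(entries):
        index = prev_index + offset            # 0-based position in `log`
        if index < len(log):
            if log[index][0] != entry[0]:      # term conflict: truncate here
                del log[index:]
                log.append(entry)
        else:
            log.append(entry)
    return True

follower = [(1, "a"), (1, "b"), (2, "c")]
# Matches at index 3 (term 2), so index 4 is appended:
append_entries(follower, 3, 2, [(2, "d")])
```

A mismatched call such as `append_entries(follower, 4, 3, ...)` returns False, which is the signal that makes the leader back up one entry and try again.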
Example: Log Replication
Index:    1    2    3    4
Leader:  [T1] [T1] [T2] [T2]   (all entries committed up to index 4)
S2:      [T1] [T1] [T2]        (missing index 4 -- will be replicated)
S3:      [T1] [T1] [T2] [T2]   (matches leader)
Suppose the leader receives a new command, creating an entry at index 5. It sends AppendEntries to S2 and S3. S3 appends immediately. S2 also appends (its log is consistent through index 3, and index 4 is now included in this message). The leader commits as soon as a majority holds the entry: once either follower acknowledges, two of the three servers (that follower plus the leader itself) have index 5, and the leader commits it.
Commit Rules and Leader Completeness
An entry is committed once the leader has stored it on a majority of servers. This is the basic rule, but there is an important subtlety.
Consider the scenario where a leader replicates an entry from an old term (for example, after a crash and re-election). Raft does not allow the new leader to commit entries from previous terms directly by counting replicas. Instead, the leader commits entries from previous terms indirectly: by appending a new entry from its current term and committing that. When the current-term entry is committed (by a majority), all preceding entries are also considered committed by the Log Matching Property.
This rule exists to prevent a subtle bug. Without it, a new leader could “commit” an entry from a prior term, only for a server with a more complete log to win a subsequent election and overwrite that entry. The restriction ensures that committed entries are never overwritten.
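The leader’s commit rule can be sketched as follows. This is a simplification under stated assumptions (the function name is mine; `log` is a list of (term, command) pairs with 1-based indices, `match_index` maps each follower to the highest index known replicated on it, and the leader’s own log holds every entry):

```python
def try_advance_commit(current_term, log, match_index, commit_index, n_servers):
    """Find the highest index replicated on a majority, but only commit it
    by counting replicas if its entry is from the current term; earlier
    entries then commit transitively via the Log Matching Property."""
    majority = n_servers // 2 + 1
    for index in range(len(log), commit_index, -1):
        replicas = 1 + sum(1 for m in match_index.values() if m >= index)
        if replicas >= majority and log[index - 1][0] == current_term:
            return index
    return commit_index

log = [(1, "a"), (2, "b")]
# The term-1 entry at index 1 is on a majority, but the term-2 leader
# refuses to commit it directly by counting replicas:
assert try_advance_commit(2, log, {"S2": 1, "S3": 0}, 0, 3) == 0
# Once the term-2 entry at index 2 reaches a majority, both entries commit:
assert try_advance_commit(2, log, {"S2": 2, "S3": 0}, 0, 3) == 2
```

The term check on the committing entry is exactly the restriction described above: without it, the old-term entry at index 1 would appear committable and could later be overwritten.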
Safety and Liveness
Raft’s safety guarantee is the Leader Completeness Property: if a log entry is committed in a given term, it will be present in the logs of all leaders for all higher-numbered terms. This follows directly from the election restriction (candidates must have logs at least as up-to-date as any majority) combined with the commit rule (entries are only committed once stored on a majority).
Safety is unconditional in Raft (under the crash-failure model with persistent logs): the protocol never allows two different entries to be committed at the same log index. Progress, however, requires a majority of servers to be up and able to communicate.
Liveness – the guarantee that clients eventually get responses – is conditional. Raft requires a stable leader to make progress. If the network is so unstable that elections keep failing (for example, because every candidate’s RequestVote messages arrive just after another candidate has already won a vote), the system can stall. In practice, the randomized election timeout makes this scenario extremely rare.
Cluster Membership Changes
One practical issue that comes up in any real deployment is changing the set of servers in the cluster – adding a new server, removing a failed one, or replacing hardware. You cannot simply shut down the cluster, reconfigure, and restart; production systems need to change membership without downtime.
Raft handles this through a mechanism called joint consensus. During a configuration change, the cluster transitions through a joint configuration that includes both the old and new sets of servers. Decisions during this period require a majority of both the old and new configurations. Once the joint configuration is committed, the cluster transitions to the new configuration alone. This two-phase approach ensures that no two majorities – one from the old config and one from the new – can make independent decisions during the transition.
In production, teams often perform membership changes one server at a time (add one, wait for it to catch up, then remove one) to reduce risk. But safety still depends on using a reconfiguration mechanism that guarantees a majority intersection across the transition. Joint consensus is Raft’s standard way to ensure that property.
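The double-majority rule of joint consensus is simple to state in code. A sketch, assuming configurations are represented as sets of server IDs and `acks` is the set of servers that have acknowledged an entry (names are illustrative):

```python
def joint_majority(acks, old_config, new_config):
    """During joint consensus, a decision requires a majority of BOTH the
    old and the new configuration; neither majority alone suffices."""
    def majority(config):
        return len(acks & config) > len(config) // 2
    return majority(old_config) and majority(new_config)
```

Since every decision during the transition needs acknowledgments from both configurations, no old-only majority and new-only majority can commit conflicting entries independently.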
Log Compaction and Snapshots
As a server runs, its log grows without bound. Replaying a long log on restart would take an unacceptably long time, and storing it forever is wasteful. Raft handles this through snapshots. Each server periodically captures the current state of its state machine and records the last log index and term included in the snapshot. It can then discard all earlier log entries.
When a follower falls too far behind (perhaps because it was offline for an extended period), the leader may not have the log entries the follower needs; they may have been compacted. In that case, the leader sends the follower its current snapshot directly, using an InstallSnapshot RPC. The follower replaces its state with the snapshot and resumes normal log replication from that point.
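The bookkeeping for compaction is small: a snapshot must record the index and term of the last entry it covers, so that the leader's AppendEntries consistency check still works at the snapshot boundary. A sketch under the same hypothetical representation as before (`log` as a 1-based list of `(term, command)` pairs):

```python
def compact_log(log, snapshot_index):
    """Discard log entries covered by a snapshot (sketch). Returns the
    metadata Raft keeps with the snapshot -- the last included index and
    term -- plus the truncated log."""
    last_included_term = log[snapshot_index - 1][0]
    remaining = log[snapshot_index:]
    return (snapshot_index, last_included_term), remaining
```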
Raft vs. Paxos
The differences between Raft and Multi-Paxos are as much about clarity as about mechanism. First, the similarities:
- Both protocols rely on a stable leader and a majority quorum to commit entries.
- Both tolerate the crash of up to f nodes in a cluster of 2f+1 nodes.
- Both guarantee safety unconditionally and liveness only under a sufficiently stable network.
The key differences are:
- Raft uses randomized timeouts for leader election, which is simpler than Paxos's approach of competing proposers with proposal numbers. Raft's election mechanism makes it clear which server is the leader at any moment.
- Raft handles log replication and leader election as a unified mechanism: the leader's log is always the ground truth, and elections are specifically designed to preserve that truth. In Paxos, Phase 1 (establishing leadership for a log slot) and Phase 2 (replicating a value) are separated in a way that can obscure the full picture.
- Raft explicitly specifies cluster membership changes and log compaction as part of the protocol. Paxos leaves these to implementors, which is part of why Paxos implementations vary so widely.
Further Reading
- Fischer, M., Lynch, N., Paterson, M. "Impossibility of Distributed Consensus with One Faulty Process." JACM, 1985.
- Lamport, L. "Paxos Made Simple." ACM SIGACT News, 2001.
- Chandra, T., Griesemer, R., Redstone, J. "Paxos Made Live: An Engineering Perspective." PODC, 2007. Describes the experience of what it took to implement Paxos correctly.
- Ongaro, D., Ousterhout, J. "In Search of an Understandable Consensus Algorithm." USENIX ATC, 2014. Far easier to read than the Paxos papers.
- Ongaro, D. "Consensus: Bridging Theory and Practice." PhD dissertation, Stanford, 2014. The most complete reference for Raft.
- Raft Visualization, thesecretlivesofdata.com. A step-by-step interactive visualization of how Raft works, covering leader election, log replication, and log matching.
Videos
- Ousterhout, J. and Ongaro, D. "Implementing Replicated Logs with Paxos," August 2013. A clear discussion of how Paxos works, why it does what it does to address different failure modes, and how it is used to build replicated logs.
- Ousterhout, J. and Ongaro, D. "Raft: A Consensus Algorithm for Replicated Logs," August 2013. A really clear and thorough discussion of how Raft works and tackles different failure cases. You should watch this.
- Lamport, L. "The Paxos Algorithm or How to Win a Turing Award," October 2024. Contains links to videos and slides of a lecture by Leslie Lamport explaining Paxos. It is rigorous, a bit abstract, and uses precise formalisms. Watch this if you like theory.
- A solution such as a global sequence number generator doesn't solve the problem, since the coordinator becomes a single point of failure; this pushes the consensus problem down to building replicas for the coordinator, which itself requires consensus. ↩
- A widely discussed real-world example involved MongoDB replica sets where configuration choices and network partitions led to multiple nodes believing they were primary, followed by rollbacks when the partition healed. ↩
- The Systems Research Center of the Digital Equipment Corporation, a research laboratory. DEC was a pioneer in minicomputers, eventually acquired by Compaq, which was later acquired by Hewlett-Packard. ↩
- In the Paxos literature, a value that has been decided is often described as "chosen." ↩