Group Communication, Mutual Exclusion, and Leader Election

Message Handling Fundamentals

Sending: The act of transmitting a message from an application through the communication layer to the network.
Receiving: The act of a machine accepting a message from the network; the message has arrived but is not yet visible to the application.
Delivering: The act of passing a received message to the application; this is when the application actually sees and processes the message.
Holdback queue: A buffer where received messages are held when they cannot be delivered immediately, such as when waiting for earlier messages to maintain ordering guarantees.
Multicast: One-to-many communication where a single message is delivered to a specific group of processes.
Unicast: One-to-one communication where a message is sent to a single recipient.
Broadcast: One-to-all communication where a message is sent to every process on the network.
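The difference between receiving and delivering can be sketched with a small holdback queue. This is a minimal illustration, not a definitive implementation: the class name FifoReceiver and the choice to start per-sender sequence numbers at 0 are assumptions made for the example.

```python
class FifoReceiver:
    """Holds back out-of-order messages and delivers them in per-sender order."""

    def __init__(self):
        self.next_seq = {}   # sender -> next sequence number expected (assumed to start at 0)
        self.holdback = {}   # (sender, seq) -> payload: received but not yet delivered
        self.delivered = []  # what the application has actually seen

    def receive(self, sender, seq, payload):
        """Called when a message arrives from the network."""
        expected = self.next_seq.setdefault(sender, 0)
        if seq < expected or (sender, seq) in self.holdback:
            return  # duplicate: already delivered or already buffered
        self.holdback[(sender, seq)] = payload
        # Deliver every consecutive message that is now available.
        while (sender, self.next_seq[sender]) in self.holdback:
            msg = self.holdback.pop((sender, self.next_seq[sender]))
            self.delivered.append((sender, msg))
            self.next_seq[sender] += 1


r = FifoReceiver()
r.receive("A", 1, "second")        # arrives early, so it is held back
held_after_first = len(r.holdback)
r.receive("A", 0, "first")         # fills the gap, so both are delivered
print(r.delivered)
```

After the first call the message has been received but sits in the holdback queue; only the second call makes both messages visible to the application.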

IP Multicast

IP Multicast: Network-layer multicast using UDP, IGMP, and PIM; works well in controlled environments but is blocked by most ISPs on the public internet.
Internet Group Management Protocol (IGMP): Protocol operating between hosts and local routers that allows hosts to join and leave multicast groups dynamically through membership reports.
Protocol Independent Multicast (PIM): Multicast routing protocol that distributes traffic between routers using the existing unicast routing table for reverse path forwarding.
Reverse Path Forwarding (RPF): Technique where routers accept multicast traffic only if it arrives on the interface used to reach the source, preventing loops.
PIM Dense Mode (PIM-DM): PIM mode using a flood-and-prune approach; it floods multicast traffic everywhere, then prunes branches that have no receivers. Appropriate when most subnets have receivers.
PIM Sparse Mode (PIM-SM): PIM mode requiring explicit joins to a Rendezvous Point; routers send Join messages toward the RP to build a shared distribution tree. Appropriate when receivers are sparsely distributed.
Rendezvous Point (RP): In PIM Sparse Mode, a designated router that serves as a meeting point where sources send traffic and receivers join to receive it.
Prune message: In PIM Dense Mode, a message sent upstream by routers with no interested receivers to stop receiving multicast traffic for a group.
Join message: In PIM Sparse Mode, a message sent toward the Rendezvous Point to join a multicast group and build the distribution tree.
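The reverse path forwarding check can be sketched as a lookup against the unicast routing table. The routing table below, the interface names, and the function names are hypothetical values chosen for illustration; a real router performs this check in its forwarding plane.

```python
import ipaddress

# Hypothetical unicast routing table: prefix -> interface used to reach it.
ROUTES = {
    ipaddress.ip_network("10.1.0.0/16"): "eth0",
    ipaddress.ip_network("10.1.2.0/24"): "eth1",
    ipaddress.ip_network("0.0.0.0/0"): "eth2",
}


def interface_toward(source):
    """Longest-prefix match: the interface this router would use to reach source."""
    addr = ipaddress.ip_address(source)
    matches = [net for net in ROUTES if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return ROUTES[best]


def rpf_accept(source, arrival_interface):
    """Accept a multicast packet only if it arrived on the interface toward its source."""
    return interface_toward(source) == arrival_interface


print(rpf_accept("10.1.2.5", "eth1"))  # arrived on the reverse path
print(rpf_accept("10.1.2.5", "eth0"))  # arrived elsewhere, so it would be dropped
```

Because the check reuses the unicast table, the multicast routing protocol does not need its own topology database, which is the sense in which PIM is protocol independent.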

Multicast Reliability Levels

Unreliable multicast: Best-effort delivery with no guarantees; messages may be lost, duplicated, or delivered to only some recipients.
Best-effort reliable multicast: Multicast guaranteeing delivery to all live recipients if the sender completes without crashing; does not handle sender failures during transmission.
Reliable multicast: Multicast guaranteeing agreement (if any correct process delivers, all correct processes eventually do), integrity (at most once, only if sent), and validity (sender delivers to itself).
Agreement (multicast property): Guarantee that if any correct process delivers a message, all correct processes eventually deliver it.
Integrity (multicast property): Guarantee that messages are delivered at most once and are identical to what was sent.
Validity (multicast property): Guarantee that if a correct process multicasts a message, it will eventually deliver that message to itself.
Durable multicast: Reliable multicast with persistence; messages are written to stable storage before they are acknowledged, so they survive crashes and restarts.
Publish-subscribe (pub/sub): Communication pattern where publishers send to named topics and subscribers register interest in topics; decouples senders from receivers. The topic acts as a named multicast group.
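One common way to obtain the agreement property on top of a best-effort multicast is for every process to relay a message to the whole group the first time it receives it. The sketch below simulates that idea in a single program; the class and method names are hypothetical, the only_to parameter exists only to simulate a sender that crashes partway through sending, and the technique is offered as an illustration rather than as the method any particular system uses.

```python
class Process:
    def __init__(self, name, group):
        self.name = name
        self.group = group    # shared dict: name -> Process
        self.seen = set()     # message IDs already received
        self.delivered = []   # payloads handed to the application

    def b_multicast(self, msg_id, payload, only_to=None):
        """Best-effort send; only_to simulates a sender crashing partway through."""
        targets = only_to if only_to is not None else list(self.group)
        for name in targets:
            self.group[name].receive(msg_id, payload)

    def receive(self, msg_id, payload):
        if msg_id in self.seen:
            return            # duplicate: deliver at most once (integrity)
        self.seen.add(msg_id)
        # Relay before delivering so the message outlives a crashed sender.
        self.b_multicast(msg_id, payload)
        self.delivered.append(payload)


group = {}
for name in ("p1", "p2", "p3"):
    group[name] = Process(name, group)

sender = Process("s", group)  # reaches only p1, then "crashes"
sender.b_multicast("m1", "hello", only_to=["p1"])
print({name: p.delivered for name, p in group.items()})
```

Even though the sender reached only p1, the relay step carries the message to p2 and p3. The cost is on the order of N² messages per multicast, which is why practical systems usually rely on acknowledgments or negative acknowledgments instead.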

Multicast Ordering Levels

Unordered delivery: No guarantees about message sequence; messages may arrive in any order at different recipients.
Single source FIFO ordering (SSF): Guarantee that messages from the same sender are delivered in the order they were sent. Formally: if a process sends m then m′, every process that delivers m′ will have already delivered m. Implemented using per-sender sequence numbers.
Causal ordering: Guarantee that if message m1 happened-before message m2, then m1 is delivered before m2 at all processes; implies single source FIFO ordering.
Vector timestamp: A vector clock attached to a message, used to implement causal ordering; the receiver buffers messages until all causally preceding messages have been delivered.
Total ordering: Guarantee that all processes deliver all messages in the same order; does not imply causal or single source FIFO ordering.
Agreement property: The key property of reliable multicast; if any correct process delivers a message, then all correct processes eventually deliver it. Provides “all or nothing” semantics.
Sequencer: A designated process that assigns global sequence numbers to achieve total ordering in multicast; it is a single point of failure and a potential bottleneck.
Atomic multicast: Reliable multicast with total ordering; also called atomic broadcast or ABCAST. Equivalent in power to consensus.
Synchronous ordering: A barrier primitive (sync) that blocks until all in-flight messages have been delivered, creating logical groups or epochs of messages with clean boundaries between them.
Sync primitive: A barrier operation that blocks until all previously sent messages have been delivered to all recipients; used to create well-defined message epochs, particularly for view changes.
Real-time ordering: Hypothetical ordering where messages would be delivered in actual physical time order; impossible to implement perfectly due to clock synchronization limits.
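Causal delivery with vector timestamps can be sketched as follows. The example assumes each process increments only its own vector entry when it multicasts, and the names deliverable and CausalReceiver are hypothetical. A message from sender j is deliverable when it is the next message expected from j and the receiver has already delivered everything the sender had delivered when it sent the message.

```python
def deliverable(msg_vec, sender, local_vec):
    """True if the message is the next one from sender and no causal predecessor is missing."""
    return msg_vec[sender] == local_vec[sender] + 1 and all(
        msg_vec[k] <= local_vec[k] for k in range(len(local_vec)) if k != sender
    )


class CausalReceiver:
    def __init__(self, n):
        self.local = [0] * n  # number of messages delivered from each process
        self.holdback = []    # received messages that are not yet deliverable
        self.delivered = []

    def receive(self, sender, vec, payload):
        self.holdback.append((sender, vec, payload))
        progress = True
        while progress:       # one delivery may unblock other held messages
            progress = False
            for item in list(self.holdback):
                s, v, p = item
                if deliverable(v, s, self.local):
                    self.holdback.remove(item)
                    self.delivered.append(p)
                    self.local[s] = v[s]
                    progress = True


# P0 multicasts m1; P1 delivers m1 and then multicasts m2, so m1 happened-before m2.
p2 = CausalReceiver(3)
p2.receive(1, [1, 1, 0], "m2")  # arrives first but depends on m1, so it is held back
p2.receive(0, [1, 0, 0], "m1")  # delivering m1 unblocks m2
print(p2.delivered)
```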

Failure Detection

Failure detector: A distributed oracle that provides information about which processes have crashed; imperfect in asynchronous systems.
FLP impossibility: The result by Fischer, Lynch, and Paterson proving that no deterministic algorithm can guarantee consensus in an asynchronous system if even one process might crash.
False positive: An error where a failure detector incorrectly suspects a live process has crashed.
False negative: An error where a failure detector fails to detect that a process has crashed.
Heartbeat: A periodic message sent by a process to indicate it is alive.
Push-based heartbeating: Failure detection where monitored processes send heartbeats to monitors.
Pull-based heartbeating: Failure detection where monitors periodically query (ping) processes and expect responses.
Phi accrual failure detector: A failure detector that learns normal heartbeat timing patterns and outputs a continuous suspicion level (φ) on a logarithmic scale, where φ = k means there is a probability of roughly 10^(−k) that the observed delay is just normal variation.
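The φ value can be sketched as the negative base-10 logarithm of the probability that a heartbeat would arrive later than the time already elapsed. The sketch below assumes heartbeat intervals are roughly normally distributed; the function name, the sample intervals, and the floor values that guard against division by zero and log(0) are all assumptions made for illustration.

```python
import math
import statistics


def phi(elapsed, intervals):
    """Suspicion level given time since the last heartbeat and past inter-arrival times."""
    mean = statistics.mean(intervals)
    std = max(statistics.pstdev(intervals), 1e-3)  # assumed floor to avoid dividing by zero
    # Probability that a heartbeat arrives later than `elapsed` under a normal model.
    p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
    p_later = max(p_later, 1e-300)                 # assumed floor to avoid log(0)
    return -math.log10(p_later)


intervals = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95]  # hypothetical heartbeat gaps in seconds
for elapsed in (1.0, 1.2, 1.5, 2.0):
    print(elapsed, round(phi(elapsed, intervals), 2))
```

The application picks a threshold on φ rather than a fixed timeout, trading detection speed against the false positive rate.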

Group Membership and Virtual Synchrony

Group membership service (GMS): A layer within each process that monitors other members using failure detection, participates in view change protocols, and notifies the application when membership changes.
View: A snapshot of group membership containing a unique identifier (typically a monotonically increasing number) and a list of member processes; all processes in a view agree on its membership.
Stable message: A message that has been received by all current group members. Stability is confirmed when the sender receives acknowledgments from all members. Only stable messages can be delivered to applications.
Message stability: The property that a message has been received by all group members. Essential for view changes: only stable messages are delivered before transitioning to a new view.
View change: A protocol that transitions all group members from one view to another when membership changes, ensuring agreement on which messages were delivered in the old view.
Flush message: In the view change protocol, a message exchanged by processes containing message IDs or stability summaries to ensure consistency before transitioning to a new view.
View leader (coordinator): A designated member that drives the view change protocol; not a single point of failure since a new leader is elected if the current one fails.
Virtual synchrony: A model developed by Ken Birman that makes group membership changes appear to happen synchronously with message delivery, even in asynchronous systems.
View synchrony: The guarantee that if a message is delivered in some view, it is delivered in that same view at all processes that deliver it.
ISIS: A distributed programming toolkit developed at Cornell in the 1980s that introduced virtual synchrony; used in production at NYSE, Swiss Stock Exchange, French ATC, and US Navy AEGIS.
GBCAST: The barrier primitive in ISIS used to coordinate group membership changes, ensuring all messages from the old view are delivered before transitioning to a new view.
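Message stability can be sketched as acknowledgment bookkeeping at the sender. The class below is a hypothetical illustration; it tracks which members of the current view have acknowledged each message, and the list of unstable messages is the kind of information a flush exchange would need to reconcile before a view change.

```python
class StabilityTracker:
    def __init__(self, view_members):
        self.members = set(view_members)  # membership of the current view
        self.acks = {}                    # msg_id -> members that have acknowledged it

    def sent(self, msg_id):
        self.acks[msg_id] = set()

    def ack(self, msg_id, member):
        # Ignore acknowledgments for unknown messages or from non-members.
        if msg_id in self.acks and member in self.members:
            self.acks[msg_id].add(member)

    def is_stable(self, msg_id):
        """Stable once every member of the view has acknowledged the message."""
        return msg_id in self.acks and self.members <= self.acks[msg_id]

    def unstable(self):
        return sorted(m for m in self.acks if not self.is_stable(m))


tracker = StabilityTracker(["p1", "p2", "p3"])
tracker.sent("m1")
tracker.sent("m2")
for member in ("p1", "p2", "p3"):
    tracker.ack("m1", member)
tracker.ack("m2", "p1")
print(tracker.is_stable("m1"), tracker.unstable())
```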

Distributed Mutual Exclusion

Distributed mutual exclusion: Ensuring that at most one process is in a critical section at any time in a distributed system without shared memory.
Critical section: A code region that must be executed by at most one process at a time.
Safety (mutual exclusion): The property that at most one process is in the critical section at any time.
Liveness (mutual exclusion): The property that if a process requests the critical section and no process holds it forever, the requester eventually enters.
Fairness (mutual exclusion): The property that there exists a bound on the number of times other processes may enter the critical section before a waiting process is granted access.
Centralized mutual exclusion: A coordinator-based approach requiring 3 messages per entry (request, grant, release); simple, but the coordinator is a single point of failure.
Lamport’s mutual exclusion algorithm: A distributed algorithm using Lamport timestamps to order requests; each process maintains a request queue and enters the critical section when its own request is at the head of the queue and it has received acknowledgments from all other processes. Requires 3(N−1) messages per entry.
Ricart-Agrawala algorithm: An optimization of Lamport’s algorithm that eliminates release messages by deferring replies to lower-priority requesters until exiting the critical section. Requires 2(N−1) messages.
Token ring mutual exclusion: An algorithm where a token circulates among processes in a logical ring; only the token holder may enter the critical section. Provides bounded waiting but requires token recovery if lost.
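The reply rule in the Ricart-Agrawala algorithm can be sketched as a small state machine. Names such as RAProcess and on_request are hypothetical, and the sketch assumes ties between equal Lamport timestamps are broken by process ID. A process defers its reply while it holds the critical section, or while it wants the section and its own request has priority.

```python
RELEASED, WANTED, HELD = "released", "wanted", "held"


class RAProcess:
    def __init__(self, pid):
        self.pid = pid
        self.state = RELEASED
        self.my_request = None  # (lamport_timestamp, pid) while WANTED or HELD
        self.deferred = []      # requesters still waiting for our reply

    def request(self, timestamp):
        self.state = WANTED
        self.my_request = (timestamp, self.pid)

    def on_request(self, timestamp, sender):
        """Return True to reply immediately, False to defer the reply."""
        theirs = (timestamp, sender)
        if self.state == HELD or (self.state == WANTED and self.my_request < theirs):
            self.deferred.append(sender)
            return False
        return True

    def release(self):
        """Leave the critical section; return the processes that now get a reply."""
        self.state = RELEASED
        self.my_request = None
        ready, self.deferred = self.deferred, []
        return ready


p1, p2 = RAProcess(1), RAProcess(2)
p1.request(5)
p2.request(5)                      # same timestamp, so the lower process ID wins
p1_replies = p1.on_request(5, 2)   # p1 has priority and defers
p2_replies = p2.on_request(5, 1)   # p2 yields and replies at once
p1.state = HELD                    # p1 now has every reply and enters
released_to = p1.release()
print(p1_replies, p2_replies, released_to)
```

The deferred reply doubles as the release notification, which is how the algorithm gets from 3(N−1) messages down to 2(N−1).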

Leader Election

Leader election: The process of selecting a single coordinator from a group of distributed processes.
Coordinator: A designated process responsible for sequencing operations, making decisions, or managing a shared resource.
Bully algorithm: A leader election algorithm where the process with the highest ID becomes coordinator; uses ELECTION, OK, and COORDINATOR messages. Assumes a synchronous model with timeouts. Worst case O(n²) messages.
ELECTION message: A message sent to higher-ID processes to initiate a new leader election.
OK message: In the bully algorithm, a response indicating that a higher-ID process is alive and will take over the election.
COORDINATOR message: A message announcing the winner of a leader election to all processes.
Ring election algorithm: A leader election algorithm where an election message circulates around a logical ring, collecting the highest process ID; the process receiving its own ID wins. Also called Chang-Roberts algorithm.
Chang-Roberts algorithm: The ring-based election algorithm where election messages circulate clockwise; each process forwards larger IDs or substitutes its own. Worst case 3N−1 messages when a single process starts the election.
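The ring election message count can be checked with a short simulation. This sketch assumes unique process IDs and a single process starting the election, so it omits the participant flag that suppresses redundant messages from concurrent elections. It also assumes the winner's announcement travels once around the whole ring. The function name is hypothetical.

```python
def ring_election(ids, starter):
    """Simulate one election on a ring listed in clockwise order.

    Returns (leader_id, total_messages).
    """
    n = len(ids)
    pos = starter
    candidate = ids[starter]
    messages = 0
    while True:
        pos = (pos + 1) % n        # send the election message clockwise
        messages += 1
        if ids[pos] == candidate:
            break                  # a process received its own ID, so it wins
        # Forward a larger ID unchanged, or substitute this process's own ID.
        candidate = max(candidate, ids[pos])
    messages += n                  # the winner's announcement circulates once
    return candidate, messages


ring = [5, 1, 2, 3, 4]             # highest ID sits just before the starter
print(ring_election(ring, 1))      # worst case for N = 5
print(ring_election(ring, 0))      # the highest ID starts the election
```

In the worst case the election message needs N−1 hops to reach the highest ID, N more hops for that ID to come back around, and N hops for the announcement, giving 3N−1.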
