Virtual Synchrony

Paul Krzyzanowski

October 2016

Virtual Synchrony

Fault tolerance

We often deploy distributed systems in an attempt to achieve both high availability and high scalability. High availability refers to the fraction of time that the system is up and running, capable of doing whatever it is supposed to do. High scalability refers to the ability to expand the system to handle higher amounts of traffic. Both of these goals are often achieved via redundancy; adding replicated components to the system. Redundant components can step in when some components break, thus helping with availability. Active-passive replication means that the redundant components are generally not servicing requests but are ready to take over for a failed component. Active-active replication means that the workload is distributed among all components, thus helping with scalability (each component only handles a fraction of the workload) as well as availability.

Faults may be either fail-silent or Byzantine. With fail-silent faults, the failed system is effectively dead: it takes no input and produces no output. With Byzantine faults, the system continues to run but runs in a faulty manner, producing incorrect output. With a fail-silent system, the system can remain silent. This is called fail-stop operation. A process that exited or a computer that was shut off are examples of a fail-stop system. Alternatively, we may have an environment where the system comes back online. For example, the failed computer may restarted or the dead process may be relaunched. This is called fail-restart (or fail-recover) operation. The consideration with fail-restart, as we’ll shorty see, is that the recovered process may be unaware of what transpired in the overall system during the time it was dead.

Communication systems may be synchronous or asynchronous. A synchronous system is one where there is a known time limit to delivering, receiving, and processing a message. A USB link is an example of a synchronous system. An asynchronous system, on the other hand, is one where there is no time bound on when a message arrives and gets processed. For example, the IP network is an example of an asynchronous system. Arbitrary routes and queueing delays destroy any assurance of fixed time limits. Moreover, process scheduling provides another level of uncertainty,

In distributed systems, we generally assume that processes are concurrent, asynchronous, and failure prone. In this environment, we assume we have good processes. That is, they exhibit fail-silent behavior. If they are working, they are working correctly and produce good data. We use mechanisms such as error-detecting codes to detect possible network-based corruption and assume that we have fail-silent, asynchronous communications.

We have a fundamental problem in detecting failure in this environment. The two army problem provides an illustration.

Suppose there are two divisions, A and B, of allied forces that plan to attack the enemy. Both must attack in order to win; if only one division attacks it will be vanquished. The divisions are far apart and communication is via messenger (this story is set before radio).

Division A sends a messenger to B to say “let’s attack at dawn.” The communication line is faulty as the messenger may be killed or captured along the route. Therefore, the A’s general requests that the messenger return to ensure that the message was delivered (this is a message acknowledgement). The messenger does indeed return. Can the divisions attack with confidence that both Have the message? Not really.

The general of A knows that division B received the message. However. The general of B does not know if the messenger returned successfully to A. If the messenger didn’t make it back to A then A will not attack and B, fighting on its own, will lose. To fix this issue, B requests the messenger return back to B (acknowledging the acknowledgement). Now B knows that A received the acknowledgement. Are we good?

Unfortunately not. Now A does not know if the messenger made the trip back to B. If B does not receive the message that A got the acknowledgement, it will not attack. We can continue this process indefinitely, sending more and more levels of acknowledgements. Although, we may have a high level of confidence that the messages are being received, we will never know for sure.

The two-army problem demonstrates that it is impossible to achieve consensus with asynchronous faulty processes. We have no foolproof way of determining whether a process has truly failed or whether it is alive but either not communicating or not communicating quickly enough. This becomes a fact of life in distributed systems.

State machine replication

It is usually not sufficient to simply have extra hardware with installed software. We want the state of the software to be synchronized among all the replicas. For example, if a database field, file, or setting is modified on one system, it should be modified on all the replicas. This will ensure that client requests can be processed identically regardless of which server the client contacts.

The set of modifications on any system can be viewed as a state machine. A state machine refers to any program that takes inputs, produces outputs, and maintains internal state (which may affect the outputs it yields). A replicated state machine is a set of identical state machines (programs) that are running. On currently on several servers. If each one of them is fed the same set of inputs in the same sequence, each will update its internal data (e.g., file changes or variables) in the same way and produce the same outputs.

We refer to the set of replicated processes running on a set of servers as a process group. The process group enables load balancing as queries can be sent to any server. Modifications will, of course, have to be sent to all replicas but read operations far outnumber writes in many applications. The process group also enables fault tolerance since it is acceptable for some replicas to die as long as others are still running and can handle incoming requests.

For the process group to function correctly, it is important for the replicas to keep the same state. To do this, each of the processes in the process group needs to receive inputs in the same order. More precisely, causally-related messages should be processed in the same order. Ordering of messages among concurrent client processes should not affect the state of outputs produced by the replicas.

We achieve fault tolerance because we have replicated systems – all processing the same inputs and making the same set of changes. However, if one of the replica processes is dead, it will not receive any updates. When it restarts (for example, the system is reboots), its data (state) will not be up to date; the system will be in a state prior to the updates. Clearly, this is not good as the overall system is now in an inconsistent state.

The two-army problem and our environment of asynchronous, faulty processes gives us the situation where we cannot reliably detect a failed process. For instance, the process may be alive but it might be slow to respond and we may time out waiting for a response and assume it is dead. Even if the process is dead, it may recover (i.e., be restarted) at a later time and have inconsistent state.

We can deal with this problem by propagating with the entire group the knowledge that we think a process failed. If we suspect that a process is dead, we will simply take it out of the group. If it recovers (or was just slow to respond earlier), it will have to re-join the group.

Virtual synchrony

Virtual synchrony gives us a model for managing a group of replicated processes (aka state machines) and coordinating communication with that group. With virtual synchrony, a process within a group (one of the replicated servers) can join or leave a group – or be evicted from the group. Any process can send a messag to the group and the virtual synchrony software will implement an atomic multicast to the group. Recall that an atomic multicast is the strongest form of reliable delivery, ensuring that a message is either delivered to all processes in the group or to none. This all-or-nothing property is central to keeping all members of the group synchronized. Message ordering requirements can often be specified by the programmer but we nominally expect causal ordering of any messages that effect changes so that we can ensure consistency among the replicas.

Views

Virtual synchrony defines a group view as the set of processes that are currently in the group. These are live processes that are capable of communication. Each group multicast message is associated with a group view and every process in the system has the same group view. There will never be the case where one process will see a different set of group members than another.

Group membership and view changes
Group membership and view changes

Whenever a process joins or leaves a group – or is forced out of a group, the group view changes. This change information has to be propagated to all group members. After all, we need to ensure that everyone has the same group view. Group view information is shared with a view change message. This is a multicast message that announces a process joining or leaving a group.

Since we cannot reliably detect process failure, we rely on timeouts to assume that a process is dead. If any process times out waiting for a response from a process, a group membership change takes place and that non-responding process is removed from the group. Since it is no longer in the group, it will not receive any messages sent to the group.

Group membership events

Group members may receive any of three types of events:

  1. Regular message. This message is simply treated as an input to the program (the state machine), although its delivery to the application may be delayed based on message ordering policy and our ability to be sure that other group members have received the message.

  2. View change. A view change is a special message that informs the process of a group membership change. This will affect any multicasts that are generated by the process from this point forward. We will discuss the view change in more detail shortly.

  3. Checkpoint request. If a process joins a group, it needs to bring its state (internal state as well as stored data) up to date so that it contains the latest version and is synchronized with the other replicas. To do this, a process may send a checkpoint request message to any other process in the group. That process will send its current state to the requesting process.

View changes

A view change is a bit more complex than simply informing a process that there are more or fewer members in the group. Suppose we have a view change because a new process is joining the group. At the same time, we have some regular messages being multicast to the group. We cannot allow the condition where some messages may be delivered to the old group and some other messages may be delivered to the new group because that sender already received the view change message.

We need to guarantee atomicity: all or nothing behavior. A message m must be delivered to all processes in a a group view G before any of the processes in the group processes the view change message. Alternatively, a message m must not be delivered to any process in group view G. Reliable processes with this property are virtually synchronous. Any multicast must take place within a specific group view and cannot straddle two views. Hence, a view change acts as a barrier.

Recall the distinction between receiving a message and delivering the message to the application. A system cannot control when a message is received: that’s up the the whims of the network. However, the network stack or messaging API in the application can control when any message is presented, or delivered, to the application.

Isis toolkit

The Isis toolkit, created at Cornell University, is an example of a fault-tolerant distributed system that provides virtual synchrony. It is designed to provide high membership update rates and demonstrated an ability to handle hundreds of thousands of events per second on commodity hardware back in 2009.

Isis provides distributed consistency. Applications can create and join groups and send multicasts to group members. All applications see the same events in an equivalent order (“equivalent order” generally means a causal order, although that can be configured in the toolkit API). Group members can also update their group state (i.e., process view change requests in a consistent and fault tolerant manner.

Isis has been used by the New York Stock Exchange, the Swiss Exchange, the U.S. Navy Aegis Weapon System, and many other environments. Systems that are conceptually similar to Isis include Microsoft’s scalable cluster service, IBM’s Distribution and Consistency Services (DCS), CORBA (Common Object Request Broker Architecture; a feature-rich RPC framework), and Apache Zookeeper (a configuration, synchronization, and naming service).

Isis implementation

The Isis toolkit is designed to work with commodity hardware and commodity networks. Message transmission is assumed to be asynchronous (for example, IP). Messages may also be received in a different order than they were sent.

To achieve virtual synchrony, the toolkit must preserve the illusion that events take place in the same order at all replicas. Isis uses TCP to achieve reliable point-to-point delivery. Multicasting is implemented through multiple unicasts: a sender sends a distinct message to each group member. Even though TCP provides reliable delivery thanks to retransmitting lost or damaged packets, we still do not have assure that all group members will receive a multicast message since there is a chance that the sender may fail before the multicast send has completed.

Group Membership Service

A central component of the Isis toolkit is the Group Membership Service (GMS). This is a network service that maintains the systemwide view of group members. Whenever any process p suspects a failed process q, it reports it to the GMS. The GMS reports the view change to every process that had a connection with q and removes q from the process group. If q is alive (or restarts), it will need to rejoin the group. The GMS ensures that the system has a consistent view of group memberships and propagates that information to all members.

State transfer and view change

When a new or restarted member joins a group, it will need to update itself to have the current state of the system. It does this by sending a checkpoint request message to any existing member of the group. This initiates a state transfer where the group member sends a dump of its state to the new member.

A state transfer is part of a group-wide view change. Even though the state transfer may take some time to complete, it — as well as the overall view change — has to be treated as an instantaneous event by the system. We have to ensure that any messages not yet delivered to any non-faulty processes get delivered before the view change is complete since those earlier messages were sent to members in the previous group. More formally, we have to guarantee that all messages sent to view Gi are delivered to all non-faulty processes in G{i} before the view change to Gi+1.

Ensuring message delivery

A process will loop through its list of group members and send a message to each member reliably using TCP. However, these received messages cannot be delivered to the application unless the process is certain that every group member has received the message. In cases where the sending process does not fail (most of the time!), a subsequent message demonstrates that the previous message has been successfully transmitted to all recipients (this is a form of piggyback acknowledgements, where we avoid sending a separate acknowledgement message bur rather include the acknowledgement in the next message that we transmit). At that time, a process can deliver the received message so it can be processed by the application. If a process has a received message and knows that all group members have received the message, that message is considered to be stable. Until that time, the received message is considered to be unstable. Stable messages can be delivered to applications; unstable messages cannot be delivered.

If a process died either partway through multicasting a message (so that only some group members received it) or before it was able to inform the group members that the message was successfully received by every group member, we may have unstable messages sitting in the holdback queue at some processes in the group.

During a view change, we have to handle this situation and figure out how to turn unstable messages into stable messages. When a process receives a view change message, it will multicast all unstable messages to the entire group (the group prior to the new group defined by the view change). Effectively, each process takes over the delivery from the original, failed, sender. Note that we may have a flurry of activity with several processes sending identical messages to all group members. Each message needs to be uniquely identified (e.g., a sequence number) and each process must discard duplicate messages.

When each process is done transmitting its unstable messages, it sends a flush message to each group member and waits for an acknowledgement. An acknowledgement means that the receiver has delivered all messages to the application. The view change is complete when the flush message from each group member has been acknowledged. At this time we know that there are no undelivered messages that were sent during the previous group view. Any messages sent from this point on will be sent to the new group.

Summary

Virtual synchrony provides a highly efficient way to send group messages with atomic delivery, ensuring that all group members are consistently replicated. It is not a transactional system, which would require resource locking and one-message-at-a-time processing. Message ordering policies can be configured in the framework and are generally causal within a view, thus ensuring that related events are consistently ordered at all replicas. A view change acts as a barrier so that all messages that were sent in an earlier view will be delivered within that view.

A group membership service (GMS) provides a consistent view of group membership. If any process suspect a failed process (e.g., because of a timeout), it informs the GMS. If any process wants to join a group, it contacts the GMS service as well. Whenever the group membership changes, the GMS initiates a view change.

Every process sends messages to all group members via a reliable multicast (using TCP and looping through a list of group members). Each successive message indicates that the previous message has been received by all group members. A message that has been received by all group members is considered stable and can be delivered to the application. If a sending process died partway through a multicast, any messages that it has sent are unstable. During a view change, each process sends unstable messages to all group members and waits for acknowledgements. Any messages that a process receives that are not duplicates are considered stable and are delivered to the application. Finally, each process sends a flush message to the group. A group member acknowledges the flush when it has delivered all messages to the application. When all flushes are acknowledged, the view change is complete.