Exam 1 study guide

The one-hour study guide for exam 1

Paul Krzyzanowski

Latest update: Tue Oct 16 22:36:14 EDT 2018

Disclaimer: This study guide attempts to touch upon the most important topics that may be covered on the exam but does not claim to necessarily cover everything that one needs to know for the exam. Finally, don't take the one hour time window in the title literally.

Introduction

Go here for lecture notes

Why are distributed systems more interesting now than they may have been one or two dozen years ago? Several advances in various areas of computing technology had a profound effect on the value and design of distributed systems. Since computer networking went mass market in the 1980s, local area network speeds increased by a factor of a thousand and wide-area (Internet) speeds by even more. Connectivity within a local area network (LAN) moved from shared to switched networking, allowing the network to scale without increasing congestion. On the wide area, Internet access has become available to the population at large, not just to researchers on Department of Defense projects. In 1965, Intel co-founder Gordon Moore predicted that the number of transistors on integrated circuits, and hence processor performance, would double approximately every two years. This prediction, known as Moore’s Law has turned out to hold (approximately) true for the past fifty years. Processor performance, system memory, and disk capacity has also increased by more than a thousandfold over the past couple of decades.

Even though these improvements make it easier to store, process, and move data between systems, what are the motivations for distributing computing? There are several. Performance does not scale linearly with an increase in price with a single computer; with a collection of computers this scaling may be possible. Secondly, distributing systems makes sense in certain environments: databases may reside in different locations than the user, for example. Thirdly, we may need to interact with other people or remote data that are geographically dispersed. Metcalfe’s “Law” states that the value of a telecommunications network is proportional to the square of the number of connected users of the system. As with Moore’s Law, this isn’t a real law, of course. The “square” part comes from the fact that the number of edges in a fully-connected graph is proportional to the square of the number of vertices. A vertex represents a person and an edge represents the communication path from one person to another. Simply put, there’s a lot of value in being able to communicate with a lot of people. Without it, services such as Skype, Google, Twitter, eBay, Instagram, Facebook, and countless others would not be nearly as useful.

Taxonomy

One way of classifying system architectures is via Flynn’s taxonomy, proposed by Michael J. Flynn way back in 1966. He categorized computers based on the number of concurrent instruction streams and the number of data streams.

SISD
(single instruction stream, single data stream) refers to conventional single processor systems.
SIMD
(single instruction stream, multiple data streams) refers to single processor computers where each instruction may process a collection of data. Vector and array processors fall into this category. SIMD includes graphics processors, cell processors, Intel’s Streaming SIMD Extensions (SSE4) in Intel’s Core microarchitecture and AMD’s K10 family of processors, Intel’s Advanced Vector Extensions, (AVX), the PowerPC’s AltiVec instructions, and the ARM® NEONTM SIMD engine.
MISD
(multiple instruction streams, single data stream) does not really make a lot of sense since it implies that multiple processors all process the same data. The term has occasionally been used to refer to used to replicated fault-tolerant systems.
MIMD
(multiple instruction streams, multiple data streams) refer to any computers with multiple processors, where each processor operates on its own stream of data. This category covers both parallel (multiprocessor) and distributed systems.

MIMD can be further categorized by identifying whether the system has shared memory or not. Systems with shared memory are known as multiprocessor systems. Examples are conventional PCs with multiple processors on a single system bus or multi-core systems.

An architecture where multiple identical processors communicate with a single shared memory is called a multiprocessor. Systems without shared memory are collections of separate computers, each with its own memory. They have to rely on a network to communicate and are sometimes referred to as networked computers or multicomputers.

Multiprocessor systems are characterized by three features: (1) the processors all share the same memory, (2) they all share the same clock, and (3) they exhibit an all-or-nothing property to system failure. What this last item means is that if the system is dead, none of the processors are working. With a multicomputer system, it is certainly possible to have one computer functioning while another is not.

The most common architecture for a multiprocessor system is symmetric multiprocessing, or SMP. In this kind of system, all processors are connected to the same shared memory and run the same operating system. No one processor has faster or prioritized access to memory or system peripherals than any other.

Multicomputers

When we switch focus away from multiprocessors to multicomputers, the fundamental difference is that the processing elements no longer have access to the same shared memory system. Since computers need to communicate, an alternate communication mechanism is needed. This mechanism is the network interconnect. As with multiprocessors, the network interconnect may be bus-based or switched. A bus-based interconnect means that all systems are connected to the same communications bus and can see all the traffic on the network. The original design of the Ethernet was bus-based. Today, it is generally not seen on local area networks as switches are cost-effective. A switched interconnect allows any pair of computers to communicate without affecting the bandwidth of other systems: it provides scalable bandwidth. Ethernet switches simulate some the behavior of a bus-based network to allow things like network broadcasts and multicasts to work.

Fault tolerance

While we do not expect our personal computers to fail at any given time, failure is a fact of life in distributed systems. It’s simply a matter of statistics: if you have a collection of thousands of systems, it is very likely that on any given day something goes wrong: a computer or disk dies, a switch goes bad, a network cable is unplugged, or something loses power. Should our local computer die, it is an all-or-nothing failure: the entire system is dead. In distributed systems, one system may fail while others continue to work. This is a partial failure. Partial failures can be insidious. If one computer sends a message to another and does not get a reply, what happened? It could be any or all of several things: the system is slow to respond, the receiving process failed, the entire receiving computer failed, a network connection was broken, or a routing issue incurred huge delays.

Handling failure is one of the central themes of distributed system design. We need to be able to handle detection, recovery, and restart.

Detection
Identify the cause of the failure
Recovery
Services - the distributed algorithms - need to work around the failure and continue to function properly. This might involve starting a service on a different system, electing a new coordinator, or stopping any attempts at communicating with the failed system.
Restart
At some point, the failed element may be brought back into the distributed system. It may be moments later, when a network cable is reinserted, or a far longer time if, for example, a failed processor needs to be replaced. Regardless of the elapsed time. the restarted system needs to reintegrate itself into the whole system. It may have missed messages and its knowledge of the world is outdated. For example, it may have hosted a replicated object store and missed getting information about new, updated, and deleted objects.

Fault tolerance needs to address both availability and reliability. Availability refers to the fraction of time that the system as a whole is usable. Since individual systems may fail, we tend to achieve high availability via redundancy: deploying duplicate, triplicate (and more) systems. The design of redundant systems has to consider all of the three aforementioned points. Failure to properly address the restart of a system may create consistency problems, where one replicated server returns different data than another one. Reliability deals with the integrity of data. We can have systems that appear to be functioning well but transmit garbled data. Or we might have malicious interference where an intruder is sending messages to confuse the systems. We will can address message integrity with error detection but will also need to address issues of message authentication.

Failure can manifest itself in different ways:

Fail-stop
With fail-stop failure, the failed component simply stops functioning. Ideally, it will be able to detect its own failure and notify other members of the system first but this is most often not possible. Halting refers to the explicit case where a component stops without any notice. We can try to detect failed components by sending messages over a network and setting timeouts for a response. Unfortunately, this is not foolproof because network latency is variable and the response might arrive after our timeout. Moreover, we can have problems with network connectivity between some hosts.
Fail-restart
Fail-restart is when a component restarts after a failure. The restart may be nearly instantaneous, so other systems didn’t notice, or it may be after a long interval. As we discussed earlier, the danger is stale state. The restarted component may have missed messages and hence has a view of the world that is obsolete.
Omission
Omission failure deals with networking. It is the failure to send or receive messages. This can be due to data corruption, queue overflows in routers, or overflows in the receive buffer in the operating system. An omission failure may cause a query or its response to get dropped, resulting in one system assuming that another one has failed.
Timing
With asynchronous networks such as IP, messages may take longer to arrive than we might expect. This can lead us to assume that a system is not responding and hence not functioning when it acturally is operating. Another problem that is based on timing is that each system has its own clock and hence its own concept of time of day. This can create undesirable behavior with process coordination, message ordering, and system logs.
Partition
A network of computers may be working but a link between two groups of systems may fail. For example, an Ethernet switch connecting two racks may fail or a cable may be disconnected. In this case, the network effectively fragments into two or more sub-networks that cannot communicate with each other. Each group of systems thinks the other group is dead.
Byzantine
Byzantine failures cover any failures where a component does not cease to function but instead produces faulty data. This can be due to bad hardware, software logic errors, network problems or it can be due to malicious interference. To a large extent, we will address byzantine failures on the network with the use of cryptography.

Regardless of the type of failure, a basic goal in distributed systems is to design a system that avoids a single point of failure. This is the case where one component is crucial to the functioning of the entire system. For example, we might have one process on one system that serves as a coordinator to dispatch and check on computation taking place on thousands of other systems (or keeps track where various blocks of data live in a distributed storage system). A failure of the coordinator effectively causes the entire system to fail.

Global state

In a distributed environment, it helps helps for one process to know what other systems are doing. For instance, a process may need to know the currently active members of a group of processes that hold replicated data. A problem with distributed systems design is that nobody has the true global state of a system. Because we lack shared memory, we cannot instantaneously see the data or liveness of other processes. Any data that changes over the execution of a program is referred to as state. For example, state may be lists of live processes, group members, contents of a database, computation progress, lists of processes that have remote files open, etc.

A process obviously knows its own state. It can periodically report its state to other processes via network messages and it may receive updates from other processes over the network. However, these updates are neither continuous nor instantaneous. A process will only know the last reported state of other processes, which may not be equivalent to the current state of those processes.

One type of state is not just information about group membership or the status of processes but the data stored by network file systems, databases, and object stores. This data is shared among systems that are designed to act as replicas – redundant systems that will give us instantaneous backups in case one system dies or allow us to load balance requests among multiple systems. Here we also have to deal with the fact that not all processes will be updated instantaneously.

A restricted form of replication is a cache. A cache is simply local storage of frequency-accessed data to reduce access latency. Instead of making a network request, a process can have a stored copy of the results. For example, a process may store the result set of common database queries. Caching can do a wonderful job in improving performance but also poses the risk of stale data – cached copies of data that are no longer valid since the original data has been modified.

Software

One goal in some distributed systems software is providing a single-system image. This is software that makes a collection of independent computers appear as a single system to the users. This involves hiding the fact that system or computation is distributed. The software should “just work.” A few areas when we want to provide transparency include:

Location
The user should not be aware of where the software is actually running or where resources reside.
Migration
The user should not be aware of the fact that the location of resources may have moved from one place to another or that a process was restarted on another computer, possibly in another data center.
Replication
The user should not be aware that data might be replicated for fault tolerance or for proximity in order to provide faster access.
Concurrency
The user should not be aware that multiple processes might be accessing resources at approximately the same time. Results should appear as if all the processes ran one after another in some order. This means that some processes might be temporarily locked from being able to access a set of resources while another process is using them.

Service models

In software design, we often turn to layered architectures, where we break up application functionality into multiple layers of abstraction. each layer presents well-defined interfaces and hides the specifics of its implementation. For example, a typical computer system has an operating system that provides well-defined access to system resources, middleware that is linked to the application as a set of libraries that abstract things such as message encoding, communication, encryption, and database access, and various layers of abstraction created by the application designer.

With network systems, we often experience similar layers of abstraction but this time across systems. When our network-based software architecture mimics a layered design, we use autonomous processes that communicate with each other via a network interface rather than procedure calls. Each such layer of abstraction is known as a tier in a multi-tier model. It is a generalization of a client-server model.

Centralized
The original, non-networking, computing model is a centralized one, where all computing takes place on a single system.
Client-server
This is the dominant model of interaction in a networked system. One application, called the client (and usually run by the end user), requests something from another application, called a server. The server provides a service. Examples of this are a web browser (client) requesting a web page from a web server, aa mail application (client) accessing a mail server to get mailbox contents, or a print server being given content to print. In this model, clients communicate with the server and not with other clients. The model can be enhanced with multiple “layers,” or services, to mimic a layered architecture, resulting in a build a multi-tier system.
Peer-to-peer
A peer-to-peer architecture employs a collection of applications, any of which can talk to any other.These applications are peers and are generally run by a collection of end users rather than some service provider. The name peer implies that there is no leader: applications all have equal capabilities. An appealing aspect of a peer to peer design is self-scalability. As more and more computers join the collection of peers, the system has more peers to do the work and can hence handle a large workload. Examples of peer-to-peer architectures are BitTorrent and Skype.
Hybrid
A difficulty with peer-to-peer architectures is that one often needs to do things such as keep track of peers, identify which system can take on work or has specific content, and handle user lookup and authentication. This led to a variation of the peer-to-peer model where a coordinator, a central server, is in place to deal with these centralized needs. However, the peers still handle all the bandwidth-intensive or compute-intensive work.

Networking

Go here for lecture notes

Goal: Enable computers to communicate with each other; create the illusion of machine-to-machine and process-to-process communication channels.

Without shared memory, we need a way for collections of systems (computers or other endpoint devices) to communicate. To do so, they use a communication network. If every communicating pair of hosts would have a dedicated physical connection between them, then there would be no sharing of the network infrastructure and we have a true physical circuit. This is not practical since it limits the ability for arbitrary computers to talk with each other concurrently. It is also incredibly wasteful in terms of resource use: the circuit would be present even if no data is flowing.

What is needed is a way to share the network infrastructure among all connected systems. The challenge in doing so is to allow these systems to talk but avoid collisions, the case when two nodes on a network transmit at the same time, on the same channel, and on the same connection. Both signals then get damaged and data does is not transmitted, or is garbled. This is the multiple access problem: how do you coordinate multiple transmitters on a network to ensure that each of them can send their messages?

There are three broad approaches that enable us to do this:

  1. Channel partitioning divides the communication channel into “slots”. If the network is divided into short, fixed-length time slots, we have Time Division Multiplexing, or TDM. Each host must communicate only during its assigned time slot. Routers may connect multiple networks together. When two hosts need to communicate, they establish an end-to-end route called a virtual circuit. It is called a virtual circuit because the setup of this route configures routers and assigns communication slots. This provides the illusion of a true circuit switched network in that all messages are guaranteed to arrive in order with constant latency and a guaranteed bandwidth. The switched telephone network is an example of virtual circuit switching, providing a maximum delay of 150 milliseconds and digitizing voice to a constant 64 kbps data rate.

    If the network is partitioned into frequency bands, or channels, then we have Frequency Division Multiplexing, or FDM. This defines a broadband network. Cable TV is an example of a broadband network, transmitting many channels simultaneously, each in using a well-defined frequency band.

    The problem with a channel partitioning approach is that it is wasteful. Network bandwidth is allocated even if there is nothing to transmit.

  2. Taking turns requires that we create some means of granting permission for a system to transmit. A polling protocol uses a master node to poll other nodes in sequence, offering each a chance to transmit their data. A token passing protocol circulates a special message, called a token, from machine to machine. When a node has a token, it is allowed to transmit and must then pass the token to its neighbor.

    The problem with the taking turns approach is that a dead master or lost token can bring the network to a halt. Handling failure cases is complex. This method was used by networks such as IBM’s Token Ring Network but is largely dead now.

  3. A random access protocol does not use scheduled time slots and allows nodes to transmit at arbitrary times in variable size time slots. This technique is known as packet switching. Network access is accomplished via statistical multiplexing. A data stream is segmented into multiple variable-size packets. Since these packets will be intermixed with others, each packet must be identified and addressed. Packet switched networks generally cannot provide guaranteed bandwidth or constant latency. Ethernet is an example of a packet-switched network.

Packet switching is the dominant means of data communication today. The packets in a packet-switched network are called datagrams and are characterized by unreliable delivery with no guarantees on arrival time or arrival order. Each datagram is fully self-contained with no reliance on previous or future datagrams. This form of communication is also known as connectionless service. There is no need to set up a communication session and hence no concept of a connection. Neither routers nor endpoints need to maintain any state as they have to do with a virtual circuit; there is no concept of where a system is in its conversation.

OSI reference model

Data networking is generally implemented as a layered stack of several protocols — each responsible for a specific aspect of networking. The OSI reference model defines seven layers of network protocols. Some of the more interesting ones are: the network, transport, and presentation layers.

1. Physical
Deals with hardware, connectors, voltage levels, frequencies, etc.
2. Data link
Sends and receives packets on the physical network. Ethernet packet transmission is an example of this layer. Connectivity at the link layer defines the local area network (LAN).
3. Network
Relays and routes data to its destination. This is where networking gets interesting because we are no longer confined to a single physical network but can route traffic between networks. IP, the Internet Protocol, is an example of this layer.
4. Transport
Provides a software endpoint for networking. Now we can talk application-to-application instead of machine-to-machine. TCP/IP and UDP/IP are examples of this layer.
5. Session
Manages multiple logical connections over a single communication link. Examples are SSL (Secure Sockets Layer) tunnels, remote procedure call connection management, and HTTP 1.1.
6. Presentation
Converts data between machine representations. Examples are data representation formats such as XML, JSON, XDR (for ONC remote procedure calls), NDR (for Microsoft COM+ remote procedure calls), and ASN.1 (used for encoding cryptographic keys and digital certificates).
7. Application
This is a catch-all layer that includes every application-specific communication protocol. For example, SMTP (sending email), IMAP (receiving email), FTP (file transfer), HTTP (getting web pages).

Data link layer

Ethernet and Wi-Fi (the 802.11 family of protocols) are the most widely used link-layer technologies for local area networks. Both Wi-Fi and Ethernet use the same addressing format and were designed to freely interoperate at the link layer.

Ethernet provides packet-based, in-order, unreliable, connectionless communications. It occupies layers one and two of the OSI model. There is no acknowledgement of packet delivery. This means that a packet may be lost or mangled and the sender will not know. Communication is connectionless, which means that there is no need to set up a path between the sender and receiver and no need for either the sender or receiver to maintain any information about the state of communications; packets can be sent and received spontaneously. Messages are delivered in the order they were sent. Unlike IP-based wide-area networking, there are no multiple paths that may cause a scrambling of the sequence of messages.

Interfaces communicating at the link layer must use link-layer addressing. A MAC address (for example, an Ethernet address) is different from, and unrelated to, an IP address. An Ethernet MAC address is globally unique to a device and there is no expected grouping of such addresses within a local area network. IP addresses on a LAN, on the other hand, will share a common network prefix.

Network layer: IP Networking

The Internet Protocol (IP) is a network layer protocol that handles the interconnection of multiple local and wide-area networks and the routing logic between the source and destination. It is a logical network whose data is transported by physical networks (such as Ethernet, for example). The IP layer provides unreliable, connectionless datagram delivery of packets between nodes (e.g., computers).

The key principles that drove the design of the Internet are:

  1. Support the interconnection of networks. The Internet is a logical network that spans multiple physical networks, each of which may have different characteristics. IP demands nothing of these underlying networks except an ability to try to deliver packets.

  2. IP assumes unreliable communication. That does not mean that most packets will get lost! It means that delivery is not guaranteed. If reliable delivery is needed, software on the receiver will have to detect lost data and ask the sender to retransmit it. Think of mail delivery: most mail gets to its destination but once in a while, a letter gets lost or takes a really long time to arrive.

  3. Routers connect networks together. A router is essentially a dedicated computer with multiple network links. It receives packets from one network and decides which outgoing link to send the packet.

  4. No central control of the network. The precursor of the Internet was the ARPAnet, built to connect companies and universities working on Department of Defense projects. As such, it was important that there wouldn’t be a single point of failure – a key element that could be taken out of service to cause the entire network to stop functioning.

Since IP is a logical network, any computer that needs to send out IP packets must do so via the physical network, using the data link layer. Often, this is Ethernet, which uses a 48-bit Ethernet address that is completely unrelated to a 32-bit IP address (or a 128-bit IPv6 address). To send an IP packet out, the system needs to identify the link layer destination address (MAC, or Media Access Control address) on the local area network that corresponds to the desired IP destination (it may be the address of a router if the packet is going to a remote network). The Address Resolution Protocol, or ARP, accomplishes this. It works by broadcasting a request containing an IP address (the message asks, do you know the corresponding MAC address for this IP address?) and then waiting for a response from the computer with the corresponding IP address. To avoid doing this for every outgoing packet, ARP maintains a cache of most recently used addresses.

Transport layer: TCP and UDP

IP is responsible for transporting packets between computers. The transport layer enables applications to communicate with each other by providing logical communication channels so that related messages can be abstracted as a single stream at an application.

There are two transport-layer protocols on top of IP: TCP and UDP.

TCP (Transmission Control Protocol) provides reliable byte stream (connection-oriented) service. This layer of software ensures that packets arrive at the application in order and lost or corrupt packets are retransmitted. The transport layer keeps track of the destination so that the application can have the illusion of a connected data stream.

UDP (User Datagram Protocol) provides datagram (connectionless) service. While UDP drops packets with corrupted data, it does not ensure in-order delivery or reliable delivery.

Port numbers in both TCP and UDP are used to allow the operating system to direct the data to the appropriate application (or, more precisely, to the communication endpoint, or socket, that is associated with the communication stream).

TCP tries to give a datagram some of the characteristics of a virtual circuit network. The TCP layer will send packet sequence numbers along with the data, buffer received data in memory so they can be presented to the application in order, acknowledge received packets, and request a retransmission of missing or corrupt packets. The software will also keep track of source and destination addresses (this is state that is maintained at the source and destination systems). We now have the illusion of having a network-level virtual circuit with its preset connection and reliable in-order message delivery. What we do not get is constant latency or guaranteed bandwidth. TCP also implements flow control to ensure that the sender does not send more data than the receiver can receive. To implement this, the receiver simply sends the amount of free buffer space it has when it sends responses. Finally, TCP tries to be a good network citizen and implements congestion control. If the sender gets notification of a certain level of packet loss, it assumes that some router’s queue must be getting congested. It then lowers its transmission rate to relieve the congestion.

The design of the Internet employs the end-to-end principle. This is a design philosophy that states that application-specific functions should, whenever possible, reside in the end nodes of a network and not in intermediary nodes, such as routers. Only if the functions cannot be implemented “completely and correctly,” should any logic migrate to the network elements. An example of this philosophy in action is TCP. TCP’s reliable, in-order delivery and flow control are all is implemented via software on the sender and receiver: routers are blissfully unaware of any of this.

A related principle is fate sharing, which is also a driving philosophy of Internet design. Fate sharing states that it is acceptable to lose the state information associated with an entity if, at the same time, the entity itself is lost. For example, it is acceptable to lose a TCP connection if a client or server dies. The argument is that the connection has no value in that case and will be reestablished when the computer recovers. However, it is not acceptable to lose a TCP connection if a router in the network dies. As long as alternate paths for packet delivery are available, the connection should remain alive.

Acknowledgements

To achieve reliable delivery on an unreliable network, we rely on detecting lost or corrupted packets and requesting retransmissions.

The simplest possible mechanism is to send a packet and wait for the receiver to acknowledge it … then send the next one and wait for that to get acknowledged. This, unfortunately, is horribly inefficient since only a single packet is on the network at any time. It is more efficient to use pipelining and send multiple packets before receiving any acknowledgements. Acknowledgements can arrive asynchronously and the sender needs to be prepared to retransmit any lost packets.

It would be a waste of network resources for the TCP layer to send back a packet containing nothing an acknowledgement number. While this is inevitable in some cases, if the receiver happens to have data to transmit back to the sender, the acknowledgement number is simply set in the TCP header of the transmitted segment, completely avoiding the need to send a separate acknowledgement. Using an outgoing data segment to transmit an acknowledgement is known as a piggybacked acknowledgement.

TCP also uses cumulative acknowledgements. Instead of sending an acknowledgement per received message, TCP can acknowledge multiple messages at once.

Sockets

Sockets are a general-purpose interface to the network provided to applications by the operating system. By this, we mean that they were not designed to support one specific network but rather provide a generic mechanism for inter-process communication. They are the only way that an application can interact with the network.

They are created with the socket system call and assigned an address and port number with the bind system call. For connection-oriented protocols (e.g., TCP), a socket on the server can be set to listen for connections with the listen system call. The accept call blocks until a connection is received, at which point the server receives a socket dedicated to that connection. A client establishes a connection with the connect system call. The “connection” is not a a configuration of routers as with virtual circuits; it is just state that is maintained by the transport layer of the network stack in the operating system at both endpoints. After this, sending and receiving data is compatible with file operations: the same read/write system calls can be used. When communication is complete, the socket can be closed with the shutdown or close system calls.

With sockets that use a connectionless protocol (e.g., UDP), there is no need to establish a connection or to close one. Hence, there is no need for the connect, listen, or shutdown system calls. The sendto and recvfrom system calls were created to send and receive datagrams since the read and write system calls do not enable you to specify the remote address. sendto allows you to send a datagram and specify its destination. recvfrom allows you to receive a datagram and identify who sent it.

Protocol encapsulation

We saw that if we want to send an IP packet out on an Ethernet network (IP is a logical network, so there is no physical IP network), we needed to send out an Ethernet packet. The entire IP packet becomes the payload (data) of an Ethernet packet. Similarly, TCP and UDP have their own headers, distinct from IP headers (they need a port number, for example). A TCP or UDP packet is likewise treated as data by the IP layer. This wrapping process is known as protocol encapsulation.

Remote Procedure Calls

Go here for lecture notes

Goal: Provide a layer of abstraction for process-to-process communication that enables a process on one system to invoke a function, or service, on a another system without having to deal with the problems of formatting data and parsing the request.

One problem with the interface offered by sockets was that it offered a send-receive model of interaction. However, most programs use a functional (procedure call) model. Remote procedure calls are a programming language construct (something provided by the compiler, as opposed to an operating system construct such as sockets). They provide the illusion of calling a procedure on a remote machine. During this time, execution of the local thread stops until the results are returned. The programmer is alleviated from packaging data, sending and receiving messages, and parsing results.

The illusion of a remote procedure call is accomplished by generating stub functions. On the client side, the stub (sometimes known as a proxy) is a function with the same interface as the desired remote procedure. Its job is to take the parameters, marshal them into a network message, send them to the server, await a reply, and then unmarshal the results and return them to the caller. On the server side, the stub (sometimes known as a skeleton) is responsible for being the main program that registers the service and awaits incoming requests for running the remote procedure. It unmarshals the data in the request, calls the user’s procedure, and marshals the results into a network message that is sent back to the recipient.

There are a few hurdles to overcome in implementing remote procedure calls:

Parameter passing
Most parameters in our programs are passed by value. That is easy to do remotely: just send the data in a network message. Some parameters, however, are passed by reference. A reference is a memory address of the parameter. The problem with passing this is that memory is local and the memory address passed from a client to a server will now refer to memory on the server rather than to the contents on the client. There is no good solution to this except to understand the data that is being referenced, send it to the remote side (pass by value), where it will be placed in some temporary memory. A local reference can then be passed to the server function. Because the contents might have been modified by the function, the data will need to be sent back to the calling client and copied back to its original location.
Marshaling
All data that is sent needs to be represented as a series of bytes that can be placed into one or more network messages. This is known as marshaling. Not only must any data structure be sent in a serialized format: a sequence of bytes with no pointers, but the format of this marshaled data must be standardized between the client and server so that the server can make sense of the data it receives and vice versa. Different processors and languages may use different conventions for integer sizes, floating point sizes and formats, placement of most significant bytes, and alignment of data.

The marshaled data that is sent over the network may contain data that makes it self-describing: identifying the individual parameters by name and type. Such a format is known as explicit typing. JSON and XML formats are examples of this. In contrast, implicit typing does not send such information and requires the remote side to know the precise expected sequence of parameters.

Generating stubs
Since most languages do not support remote procedure calls natively, something has to generate client and server stubs. That is often a stand-alone program known as an RPC compiler. The RPC compiler takes an interface specification as input and generates client-side stubs (proxies) and a server-side proxy (skeleton). The interface specification is written in an interface definition language (IDL) and defines remote classes, methods, and data structures. It contains all the information that the RPC compiler needs to generate stub functions.
Looking up services
A client process needs to find out how to set up a network connection to an appropriate service that hosts the remote procedures: which host and port to use. An RPC name server is a network service that a client can communicate with to query the host and port of a desired remote interface. The client sends the server a name identifying the interface. That “name” is often a number that uniquely identifies the service that hosts a set of functions on the server. The server returns a host and port number for the service that implements the functions. In many cases, the name server resides on the machine where the remote procedures run and the server will return only the port number. When a service starts up, it will register its interfaces with the RPC name server.
Handling failures
We don’t have a concept of local procedure calls not working. With remote calls, however, problems can arise. The server can stop working or network connectivity may break or experience unexpected delays. These may prevent or delay requests reaching the server or responses reaching the client. To combat this, RPC libraries may attempt to retransmit requests if a response is not received in time. This may have the side-effect of invoking a procedure more than once if the network is slow. In some cases, no harm is done in doing this. Functions that may be run multiple times without undesirable side-effects are called idempotent functions. Functions that have undesirable side-effects if run multiple times (e.g., transfer $500 from my checking account to my savings account) are called non-idempotent functions. Most RPC systems offer at least once semantics, in which case a remote procedure will be executed one or more times (if there are network delays) or at most once semantics, in which case a remote procedure library will avoid resending the procedure request even if it does not get a timely response. Software that uses remote procedure calls has to be prepared to deal with errors that can arise from problems in contacting or getting a response from a remote procedure.

Remote Procedure Calls: case studies

Go here for lecture notes

Sun (ONC) RPC

Sun’s RPC, formally called ONC (Open Network Computing) RPC was one of the first RPC systems to achieve widespread use, thanks to the early popularity of Sun workstations, servers, and the Network File System (NFS). It is still in use on virtually all UNIX-derived systems (Linux, macOS, *BSD, SunOS). It uses an RPC compiler called rpcgen that takes input from an interface definition language (IDL) file. This is a file that defines the interfaces to the remote procedures. The interfaces are a set of functions that could be called by clients, including their parameters and return types. The programmer can also define multiple versions of an interface. This is useful since services may evolve over time but one cannot always count on all clients getting updated; some clients may still call functions on the old interfaces, which might take different parameters or have different names. From this IDL file, rpcgen creates client stub functions and a server stub program. These can be compiled and linked with the client and server functions, respectively.

A programmer must assign a program number to each interface, that is, each set of server functions in the IDL file. This is a 32-bit number that must be a unique ID on that server. When the server starts up, it binds a socket to any available port and registers that port number and interface’s program number with a name server, known as the portmapper, running on the same machine. A client, before it can invoke any remote procedure calls, contacts the portmapper on the specified server with the program number to find the port to which it needs to send its requests.

The choice of transport protocol, UDP or TCP, can be specified at run-time. All incoming and return parameters are marshaled into a standard format called XDR, or eXternal Data Representation.

DCE RPC

The Distributed Computing Environment, defined by the Open Group, created its own flavor of RPC which was similar to Sun RPC. They also had the programmer specify an interface in an IDL, which they called the Interface Definition Notation (IDN).

To avoid the problem of picking a unique 32-bit identifier for the interface, DCE RPC provides the programmer with a program called uuidgen. This generates a unique universal ID (UUID) – a 128-bit number that is a function of the current time and ethernet address.

The Distributed Computing Environment also introduced the concept of a cell, which is an administrative grouping of machines. Each cell has a cell directory server that maintains information about the services available within the cell.

Each computer in the cell knows how to contact its cell directory server. When a server program starts up under DCE RPC, it registers its port and the interface’s UUID with a local name server (the DCE host dæmon, dced, which is similar to the portmapper we find on Linux and BSD systems). It also registers the UUID-to-host mapping with the cell directory server. This allows for location transparency for services: a client does not need to know what machine a service lives on a priori. The client queries the cell server with the UUID of the interface to find the system on which the desired service is running. Then it contacts the local name server to get the port number and transport type on which to make requests.

A standard way of encapsulating data (marshaling) is crucial since encodings may differ between different machines. Sun defined a standard format called XDR (eXternal Data Representation). Every participating system must convert (marshal) data into this format. DCE defined a format called NDR (Network Data Representation). However, instead of creating a single set of definitions, NDR defines a set of data representations that can be used. The hope is that the sender can find a format that will require minimal or no data conversion (hence, greater efficiency). If the client uses the same system architecture, it also will not need to convert data. In the worst case, only one side will need to convert data. This is known as a multi-canonical approach to data conversion.

As object oriented languages gained popularity in the late 1980s and 1990s, RPC systems like Sun’s and DCE’s proved incapable of handling some object oriented constructs, such object instantiation or polymorphism (different functions sharing the same name, with the function distinguished by the incoming parameters). Creating objects, in particular, requires a need for memory allocation on the server and cleanup when these remote objects are no longer needed. This is called distributed garbage collection. A new generation of RPC systems dealt with these issues.

Microsoft COM+/DCOM & ORPC (MS-RPC)

Microsoft already had a mechanism in place for dynamically loading software modules, called components, into a process. This was known as COM, the Component Object Model and provided a well-defined mechanism for a process to identify and access interfaces within the component. The same model was extended to invoke remotely-located components and became the Distributed Component Object Model (DCOM), later fully merged with COM and called COM+. Because remote components cannot be loaded into the local process space, they have to be loaded by some process on the remote system. This process is known as a surrogate process. It runs on the server (under the name dllhost.exe), accepting remote requests for loading components and invoking operations on them.

COM+ is implemented through remote procedure calls. The local COM object simply makes RPC requests to the remote implementation. Microsoft enhanced the DCE RPC protocol slightly to create what they called Object RPC (ORPC). This is essentially DCE RPC with the addition of support for an interface pointer identifier (IPID). The IPID provides the ability to identify a specific instance of a remote class. Interfaces are defined via the Microsoft Interface Definition Language (MIDL) and compiled into client and server side stubs. The client-side stub becomes the local COM object that is loaded on the client when the object is activated by the client program. Like DCE, ORPC supports multi-canonical data representation. The remote, server-side, COM object is loaded by the server’s surrogate process when first requested by the client.

Since objects can be instantiated and deleted remotely, the surrogate process needs to ensure that there isn’t a build-up of objects that are no longer needed by any client. COM+ accomplishes this via remote reference counting. This is an explicit action on the part of the client where the client sends requests to increment or decrement a reference count on the server. When the reference count for an object drops to zero, the surrogate process deletes that object. To guard against programming errors or processes that terminated abnormally, a secondary mechanism exists, called pinging. The client must periodically send the server a ping set – a list of all the remote objects that are currently active. If the server does not receive this information within a certain period, it deletes the objects. This is a form of leasing, where the object expires if a lease is not renewed periodically. However, there is a subtle difference. With leasing, the lifetime of an object is generally renewed whenever an object is accessed, so there is no need for the client to ping the server to renew a lease unless it has not accessed the object for the duration of the lease.

Java RMI

When Java was created, it was designed to be a language for deploying downloadable applets. In 1995, Sun extended Java to support Remote Method Invocation (RMI). This allows a programmer to invoke methods on objects that are resident on other JVMs.

Since RMI is designed for Java, there is no need for OS, language, or architecture interoperability. This allows RMI to have a simple and clean design. Classes that interact with RMI must simply play by a couple of rules. All parameters to remote methods must implement the serializable interface. This ensures that the data can be serialized into a byte stream (marshaled) for transport over the network. Serialization is a core aspect of marshaling: converting data into a stream of bytes so that it can be sent over a network or stored in a file or database. All remote classes must extend the remote interface. A remote interface is one whose methods may be invoked from a different Java virtual machine. Any class that is defined as extends Remote can be a remote object. Remote methods within that class must be capable of throwing a java.rmi.RemoteException. This is an exception that the client RMI library will throw if there is a communication error in calling the remote method.

RMI provides a naming service called rmiregistry to allow clients to locate remote object references. These objects are given symbolic names and looked up via a URI naming scheme (e.g., rmi://cs.rutgers.edu:2311/testinterface).

Java’s distributed garbage collection is somewhat simpler than Microsoft’s COM+. Instead of reference counting, it uses a form of leased-based garbage collection. There are two operations that a client can send to the server: dirty and clean. When the first reference to a remote object is made, the client JVM sends a dirty message to the server for that object. As long as local references exist for the object, the client will periodically send dirty messages to the server to renew the lease on the object. When the client’s JVM’s garbage collector detects that there are no more references to the object, it sends a clean message for that object to the server. Should the client exit abnormally, it will not renew its lease by refreshing the dirty call, so the lease will expire on the server and the server will destroy the object.

Web Services

In the late 1990s, the web browser became the dominant model for user interaction on the Internet. It used HTTP, the Hypertext Transfer Protocol, over TCP for requests and responses and formatted web pages with HTML, the Hypertext Markup Language. While this created a decent user experience, dealing with fully-formatted pages was not good for programmatic access to that data The user interface content (tables, images, text formatting) was a major component of the content. Approaches such as site scraping were often employed, where a program would request and receive an HTML page and then parse through tons of formatting directives to access the data it needed.

What we wanted was remotely-hosted services that programs, not users, can access. This enables machine-to-machine (m2m) communication. Remote procedure calls would seem to be a natural choice for this but they also had some problems when used on the Internet outside of an organizations’ LAN:

1. Because of the convenience of “you don’t need to pick a port”, RPC solutions typically ran services over an arbitrary range of ports where the operating system selected an unused port and the service registered it with an RPC name server. This led to an administrative nightmare where an administrator could not set a firewall rule to allow or block a specific port.

  1. Even though some RPC solutions were designed to support multiple languages and operating systems, most RPC systems were really deployed with a limited set of environments in mind. Sun RPC did not work on IBM’s flavor of UNIX and DCE RPC did not work on Sun or Linux systems. Microsoft’s services were difficult to use outside of the Microsoft ecosystem. Some cross-platform solutions, such as COBRA, were sold by multiple vendors and were not always interoperable.

  2. It turns out that we often need more than RPC’s request-response style of interaction. For example, we might want to implement a subscribe-publish interface where a client requests to be informed when a certain event takes place. The server will, at future times, send messages informing of these events. In many cases, we just want to send and receive general-purpose messages.

  3. RPC systems were designed with local area networks in mind. This meant that cleints expected low latency to server. The high latency to remote, loaded servers could lead to excessive retries (which generates even more server load) as well as clients giving up and returning a failure because a response was too slow to arrive.

  4. Finally, state management was somewhat of an issue. Although RPC does not require that servers store client state, a distributed object model makes this the norm. Large-scale deployments could get bogged down by the memory use of objects created to service requests that have not yet been garbage collected.

Web services are a set of protocols by which services can be published, discovered, and used over the Internet in a technology neutral form. This means that they are designed to be language and architecture independent. Applications will typically invoke multiple remote services across different systems, sometimes offered by different organizations.

The general principles of web services are:

  • Use text-based payloads, called documents, marshaling all data into formats such as XML or JSON. This ensures that the marshaling format does not favor a specific processor architecture or programming language. It also ensures that content-inspecting firewalls do not cause problems.

  • Use HTTP over TCP/IP for transport. This allows us to use existing infrastructure: web servers, firewalls, and load balancers.

  • Tolerate high latency. Servers are likely not to be on a local area network and may be slow to respond either due to their own load or due to network latency. This means that, where possible, programmers should strive for asynchronous interactions: dispatch a request and see if you can do something else useful before you get the response.

  • Tolerate a large number of clients. Applications that interact with web services tend to be more loosely-coupled than those that use distributed objects (remote procedure calls). Services may be run by different organizations and servers cannot count on well-behaved clients or a small number of them. When possible, the ideal design is a stateless one where each client request contains all the necessary data and the server does not have to store any state between requests. Documents, the unit of message exchange in web services, tend to be self-describing. That is, they will identify and itemize all the parameters (explicit typing) and also describe any additional state needed for the interaction.

The use of web services leads to a programming model called Service Oriented Architecture (SOA). Under SOA, an application is the integration of network-accessible services where each service has a well-defined interface. These services, or components, are unassociated and loosely coupled. By unassociated we mean that neither service depends on the other one; they are all mutually independent. By loosely coupled we mean that neither service needs to know about the internal structure of other services. To maintain this independence, web services generally forgo distributed garbage collection.

Functionally, you can do anything with web services that you can with distributed objects (RPC). The differences are usually philosphical. Web services focus on document exchange and are designed with high latency in mind. Document design is central to web services. Distributed objects tend to look at the world in a way where interfaces are the key parts of the design. The data structures (“documents”) passed in these interfaces just package the data for use by them.

XML RPC

XML-RPC was created in 1998 as a simple protocol that marshals all requests and responses into XML messages. It is essentially a marshaling protocol and the standard does not define an IDL or stub function generator. There are a lot of libraries to support XML RPC and some languages implement it more transparently than others. XML-RPC is just a messaging format. Nothing in the spec has support for remote objects, object references, or garbage collection. The protocol was designed when the dominant vision of web services was that of RPC-like interactions. This turned out not to always be the case. For example, one might want to implement the subscribe-publish model we mentioned earlier, where a client would subscribe to receive published notifications of specific events from a server.

SOAP

XML RPC took an evolutionary fork and, with the support of companies such as Microsoft and IBM, evolved into something known as SOAP, the Simple Object Access Protocol. The acronym has since been deprecated since SOAP is neither simple nor confined to accessing objects. XML RPC is a subset of SOAP. In addition to remote procedure calls, SOAP added support for general purpose messaging (sending, receiving, and asynchronous notification of messages). SOAP invocations are always XML messages that are usually sent via an HTTP protocol. However, HTTP transport is not a requirement; you can send a SOAP message via email and SMTP (Simple Mail Transport Protocol).

SOAP services can be described via a Web Services Description Language (WSDL) document. This is another XML document that essentially serves as an interface definition and defines all the names, operations, parameters, destination, and format of requests. WSDL is somewhat complex for human consumption. Typically, one creates an interface definition in a language such as Java and then uses a tool to translate that definition into a WSDL document. That WSDL document can then be fed to another tool (often by another programmer) to generate code that can be used to invoke remote functions or send remote messages.

In addition to WSDL, SOAP was also augmented with a directory service for storing and serving information about web services. The service is called UDDI, for Universal Description, Discovery, and Integration and uses SOAP to communicate, providing WSDL documents as a result. UDDI never really achieved widespread popularity or deployment.

JAX-WS: Java Web Services

As web services became popular, quite a few services to support them were created for the Java platform. One of the more popular ones, and supported by the Oracle (the owner of Java), is JAX-WS (Java API for XML Web Services). The goal of JAX-WS is to invoke Java web services using Java RMI. Unlike traditional Java RMI, interoperability is important since the remote side (client or server) may not necessarily use Java. JAX-WS uses SOAP and WSDL to achieve platform independence.

REST

REST (REpresentational State Transfer) is a departure from the approach of SOAP. Instead of using HTTP simply as a conduit for sending and receiving XML messages where everything you need is contained in the body, REST incorporates itself into the HTTP protocol. In particular, the URL incorporates the request and list of parameters. The protocol intself defines the core nature of the operation:

  • HTTP PUT: create something
  • HTTP GET: read something
  • HTTP POST: update something
  • HTTP DELETE: delete something

The body of the message will contain the document, which will be formatted data and not, as in the case, a structure that also identifies the operations to be performed on the document.

Marshaling formats

When web services were first developed, the obvious marshaling format to use was XML. This was, roughly, what HTML used for describing the content of web pages (HTML was not particularly strict about proper structure). Its use was adopted for XML-RPC and SOAP and it remains heavily in use. However, it turned out to be a rather text-heavy protocol that was complex to parse. A lightweight alternative that gained much popularity is JSON (JavaScript Object Notation). Despite the JavaScript in the name, it was designed to be language-independent and easy to parse. It was touted as the “fat-free alternative to XML.” Even more efficient is the use of Google Protocol Buffers. This is a binary marshaling protocol and is not always suited for web services over HTTP but is phenomenally efficient for local services and for storing serialized data (e.g., saving objects in a file system).

Clock synchronization

Go here for lecture notes

Goal: Enable clocks on multiple machines to be synchronized to the same time.

We would like our computers to have a knowledge of the current time of day. By this, we mean UTC time, Coordinated Universal Time (or Temps Universel Coordonné), formerly called Greenwich Mean Time[1]. This is the international standard time of day. From this, we can present time in the user’s local time zone and adjust for daylight saving time. Note that clients or users accessing a computer could be in a different time zone. We want a consistent time so we can make sense of timestamps on file data, mail messages, databases, and system logs. These timestamps can be used for software development environments (know what still needs to be compiled), versioning tools, and database queries.

Keeping track of time is difficult. No two clocks tick in perfect synchrony with each other. Quartz oscillators, which drive the timekeeping mechanisms of clock circuits, are not consistent over time and no two tick at exactly the same rate. The difference between two clocks at any given instant is the clock offset. The rate at which the clocks are drifting is the clock drift. The variation of network delay is called jitter. Jitter is useful for assessing the consistency of our interaction with servers.

We can correct for drift simply by changing the value of a system’s clock to reflect that of UTC time. However, we do not want to provide the illusion of time moving backward to any process that is observing the clock. A linear compensation function adjusts the rate at which time is measured on a computer (e.g., number of ticks that make up a second).

Cristian’s algorithm sets the time on a client to the time returned by the server plus an offset that is one half of the transit time between the request and response messages: Tclient = Tserver + ½(Treceived - Tsent).

It also allows one to compute the maximum error of the new time stamp. The error is ± ½[(round-trip time) - (best-case round-trip time)]. Errors are additive. If you incur an error of ±50 msec and the server’s clock source has an error of ±80 msec, your clock’s error is now ±130 msec.

The Berkeley algorithm does not assume the presence of a server with an accurate time (i.e., one that keeps track of UTC time). Instead, one system is chosen to act as a coordinator. It requests the time from all systems in the group (including itself) and computes a fault-tolerant average (an arithmetic average, dismissing systems whose time values differ by more than a certain amount). It then sends each machine an offset by which to adjust its clock.

The Network Time Protocol, NTP, was created to allow a large set of machines to synchronize their clocks. A set of machines act as time servers. This collection of machines is known as the synchronization subnet. The subnet is hierarchical, with the time server’s stratum defined as the number of hops it is from a machine that synchronizes from a direct time source. Machines that are directly connected to a time source (e.g., a GPS receiver) are at stratum 0. Machines that synchronize from a system at stratum 0 are at stratum one, and so on. The Simple Network Time Protocol, SNTP, is a restricted form of NTP that does not support peer-to-peer synchronization of time servers. The peer-to-peer mode maintains state on drift and synchronization time and is intended for time servers to synchronize with each other. SNTP is essentially the same as Cristian’s algorithm. The formula for NTP and SNTP is time_offset = ½ (T2 - T1 + T3 - T4) where T1 is the time the message left the client, T2 is the time it arrived at the server, T3 is the time the response left the server, and T4 is the time that it arrived at the client. If we let TS = ½(T2+T3) then we arrive at Cristian’s formula.

NTP encourages clients to try several servers. For each server contacted, the protocol computes

Offset:
The difference between the client’s clock and the server’s. This is how much the client needs to offset its clock to set it to the server’s time.
Delay:
This is an estimate of the time spent sending or receiving the message. It is one half of the round-trip time minus the estimate of time spent on the server.
Jitter:
Jitter measures the variation in delay among multiple messages to the server. It gives us an idea of how consistent the latency is between the client and server.
Dispersion:
Dispersion is the estimated maximum error for the computed offset. It takes into account the root delay (total delay, not just to the server but the delay from the server to the ultimate time source), estimated server clock drift, and jitter.

Given the choice of several NTP servers, NTP picks the server with the lowest dispersion. That is, the server from which it can get the most consistent and accurate time. If there is a tie, it then picks one with the lowest stratum; that is, one closest to the master time source. Note that this may result having a client synchronize from a higher-stratum server even if it can contact a lower-stratum one (one that is closer to the time source). This may happen if the client can access that server more reliably and with less delay and jitter than a “better” server.

The Precision Time Protocol (PTP, an IEEE standard), was designed with different goals than NTP. While NTP was designed for synchronizing time with machines across the Internet and remote servers, PTP was designed for LANs: networks with low jitter and low latency. PTP is intended to achieve sub-microsecond precision among cooperating machines. Clock synchronization is initiated by a PTP master (unlike NTP, where the client host initiates the sync). The master announces its time to clients via a sync message. If a client is then interested in synchronizing its clock, it must communicate back to the server in order to discern the delay in communication. The client sends a delay request message and notes the time it was sent. The server sends back a delay response message containing the time of arrival of the delay request message. With this data, and the assumption that uplink and downlink latency is the same, the client can compute the server’s timestamp with an adjustment for the network transit delay.


  1. The abbreviation UTC is a compromize between the English Coordinated Universal Time and the French Temps Universel Coordonné.  ↩

Logical clocks

Go here for lecture notes

Goal: Allow processes on different systems to identify and causal relationships and their ordering among events, particularly among messages between different systems.

Lamport clocks allow one to assign sequence numbers (“timestamps”) to messages and other events so that all cooperating processes can agree on the order of related events. There is no assumption of a central time source and no concept of total ordering (when events took place). The central concept with logical clocks is the happened-before relation: a→b represents that event a occurred before event b. This order is imposed upon consecutive events at a process and also upon a message being sent before it is received at another process. Beyond that, we can use the transitive property of the relationship to determine causality: if a→b and b→c then a→c. If there is no causal relationship between two events (e.g., they occur on different processes that do not exchange messages or have not yet exchanged messages), the events are concurrent.

Lamport’s algorithm states that every event is timestamped (assigned a sequence number) and each message carries a timestamp of the sender’s clock (sequence number). A message comprises two events: (1) at the sender, we have the event of sending the message and (2) at the receiver, we have the event of receiving the message. The clock is a process-wide counter (e.g., a global variable) and is always incremented before each event. When a message arrives, if the receiver’s clock is less than or equal to the message’s timestamp, the clock is set to the message timestamp + 1. This ensures that the timestamp marking the event of a received message will always be greater than the timestamp of that sent message.

One result of Lamport timestamps is that multiple events on different processes may all be tagged with the same timestamp. We can force each timestamp to be unique by suffixing it with a globally unique process number. While these new timestamps will not relate to real time ordering, each will be a unique number that can be used for consistent comparisons of timestamps among multiple processes (e.g., if we need to make a decision on who gets to access a resource based on comparing two timestamps).

A second deficiency with Lamport timestamps is that, by looking at timestamps, one cannot determine whether there is a causal relationship between two events. For example, just because event a has a timestamp of 5 and event b has a timestamp of 6, it does not imply that event a happened before event b.

Vector clocks

A way to create timestamps that allow us to discern causal relationships is to use a vector clock. A vector clock is no longer a single value but rather a vector of numbers, with each element corresponding to a process. The rules are as follows:

  1. Before affixing a vector timestamp to an event, a process increments the element of its local vector that corresponds to its position in the (for example, process 0 increments element 0 of its vector; process 1 increments element 1 of its vector).

  2. When a process receives a message, it also first increments its element of the vector (i.e., it applies the previous rule). it then sets the vector of the received event to a set of values that contains the higher of two values when doing and element-by-element comparison of the original event’s vector and the vector received in the message.

This new vector timestamp becomes the new per-process vector from which future timestamps for events on that process will be generated. Think of a vector timestamp as a set of version numbers, each representing a different author.

For example, in an environment of four processors, P0, … P3, P1 will “own” the second element of the vector. If one event on P1 is (2, 4, 0, 1) then the next event will be (2, 5, 0, 1). If the event after that is the receipt of a message with a timestamp of (3, 2, 9, 8) then the timestamp will be created by setting an element-by-element maximum of (2, 6, 0, 1) and (3, 2, 9, 8), which is (3, 6, 9, 8). We can illustrate this with the following pseudocode where e is the system’s vector counter and r is the received vector timestamp:

/* receive message with vector timestamp r */
e[myproc]++;    /* increment our process' index */
for (i=0; i < num_procs; i++) {
    if (e[i] < r[i])
        e[i] = r[i];
}

Two events are concurrent if one vector timestamp is neither greater than or equal nor less than or equal to the other element when doing an element-by-element comparison. For example, events (2, 4, 6, 8) and (3, 4, 7, 9) are not concurrent (i.e., they are causally related) because every element of the first vector is less than or equal to the corresponding element of the second vector. The vectors (2, 4, 6, 8) and (1, 5, 4, 9) represent concurrent events. Neither vector is less than or equal to the other. For instance, 2 ≥ 1 (first element of the first vector is greater than the first element of the second vector) but 4 ≤ 5 (second element of the first vector is less than the second element of the second vector).

Note that in an environment where the overall set of processes is not known by each process, a vector timestamp may consist of a set of <process, number> tuples. In such a case, the timestamping and comparison process is exactly the same: numbers are compared with those of matching processors and any missing process is treated as having a value of 0 in that vector.

Group communication

Goal: Provide mechanisms for a process to send a message to a group of processes instead of to one process. Consider the issues of reliability and message ordering.

Point-to-point communication is known as unicast. This is what we generally use to communicate a single client and server. There are other modes of communication. Broadcast is the sending of data to every host on the network. Anycast is point-to-point communication (as in unicast) but the receiver is the nearest one of receivers with the same address (for example, IPv6 uses this to allow a host to update the routing table of the nearest host). Anycast is a special purpose form of unicast that will will not discuss here.

An alternative to point-to-point communication is group communication, or point-to-multipoint messaging. Group communication is known as multicast and provides the abstraction of allowing a process to send a single message that is delivered to all group members. True multicast is handled by the network and offers the efficiency of avoiding replicated messages. The sender sends a single message on a special multicast address. Interested parties listen on that address and the network takes care of the delivery. If multicast is not available, it can be simulated by invoking multiple unicasts, one to each recipient.

There are two considerations in group communication (multicast): reliability and message ordering.

Message receipt versus message delivery

We talk about a process receiving a message, which means that it is received from the network and handled by a multicast receiving algorithm. This algorithm decides when to deliver the message to the application logic.

Since receivers cannot control the order in which messages are received over the network, a layer of software is responsible for transmitting a multicast message and, at the receiver, receiving it and deciding when and if to make it available to the application (i.e., deliver it). When a message is received, the multicast receiving algorithm may take one of three actions:

  1. Discard the message. This may be done if the receiver is no longer a member of the group or if the message is a duplicate.

  2. Deliver the message. The message is placed into a FIFO (first-in, first-out) queue from which the application reads incoming multicast messages.

  3. Hold the message. The message is not ready to be delivered to the application. Most likely, this is because it has not arrived in the expected order and the algorithm must first receive an earlier message. Alternatively, it may need to be held to get feedback that all other members have received it before passing it to the application. In cases like this, the message is placed on a hold-back queue. When the next message is received, the algorithm may check the hold-back queue to determine whether any held messages can now be delivered or discarded.

Reliability

An atomic multicast requires that a message must reach all group members (if one host cannot get the message, no others can process it). This multicast must survive machines going down – both the sender and/or receivers. If a recipient is not certain that a message has been received by all group members, it cannot deliver it to the application. Because of this, it requires the most overhead in implementation, often employing persistent logs.

A reliable multicast is a multicast where the sender tries its best to ensure that messages get to group members. The sender sends a message and waits for an acknowledgement. If it doesn’t receive the acknowledgement in time, it will retransmit the message. It will try this again and again until, eventually, after a longer interval of time, it will give up and assume the receiver is dead.

An unreliable multicast doesn’t wait for acknowledgements and will generally use whatever underlying multicast mechanism is provided.

Message ordering

The issue in message ordering is that multiple hosts can be sending messages to the entire group at the same time. Will each host receive all the messages in the exact same order? Will each host receive all the messages in the order they were sent?

Global time ordering requires that all messages arrive in the exact order they were sent: if host A sent a message 5 nanoseconds before host B did then all hosts should receive the message from host A first. This is impossible to implement (clocks can’t be that perfectly synchronized and the chance of a tie is always present; also, networks will not be able to guarantee this ordering if routing is involved). A more practical approach is total ordering. This requires that all messages arrive in the same order at all hosts. This can be easily achieved by providing a group-wide mechanism for obtaining a totally sequenced message ID: for example, from a sequence number server.

Causal ordering means that messages that are causally related (according to Lamport’s happened before definition) will be delivered in order to all group members. Concurrent messages may be delivered in any order. One way of implementing causal ordering is by having each process keep a precedence vector and sending it along with every message that is sent to the group. A precedence vector is similar to a vector timestamp with the exception that an event counter is not incremented for received messages. The vector applies only to events that relate to sending messages to a specific group. Each entry in the precedence vector represents the latest message sequence number that the corresponding process (group member) knows about.

When a process Psender sends a message, it increments the element of the vector that corresponds to its own entry:

Vsender[sender] = Vsender[sender] + 1

This vector is sent along with the message.

When a process Preceiver receives a message from Psender, it checks two conditions:

(1) The message must be the very next message from Psender. That is, the value of the sequence number in the received vector that corresponds to Preceiver must be exactly one greater than the one we have for that process in our vector:

(Vsender[i] == Vreceiver[i] + 1) ?

This tells us that we are receiving messages in sequnce from Psender. If the value is more than one greater, it means that we missed one or more messages from that sender. This message will have to go on the hold-back queue until the the earlier messages arrive.

(2) The message should not be causally dependent on another message that the receiver has not yet seen. This means that every other element of the vector has to be less than or equal to the corresponding element of the vector at the receiver.

∀i, i ≠ sender: (Vsender[i] ≤ Vreceiver[i]) ?

If the vector from the sender contains some sequnce number that is greater than a corresponding sequence number in the receiver’s vector, that means that the sending process has seen a message from some other process that the receiver has not yet processed. Since the precedence vector is used for group communication and the reciver is part of the group, that means the receiver needs to wait for that message to come.

If both conditions are satisfied, then the received message can be delivered to the application immediately. Otherwise, it is placed in the hold-back queue until the conditions can be satisfied. Causal ordering has the advantage that there is no need for a global sequencer as in total ordering. Note that the precedence vector technique requires reliable message delivery. Otherwise, the receiver will not know if a message was lost or if a message has just not yet arrived.

Sync ordering requires a special type of message — a sync message. When this is issued, any messages that were already sent have to be processed (and acknowledged to the senders). The sync assures that any messages sent before the sync will be processed before any messages after the sync (globally – for all hosts).

FIFO (first in, first out) ordering on messages ensures that messages from each source are delivered in order but messages from multiple sources may be interleaved in any order at the receiver. For example, if host A sends messages m1, m2, m3 and host B sends messages n1, n2, n3, it is valid for host C to receive the sequence m1, m2, m3, n1, n2, n3 and for host D to receive the sequence m1, n1, n2, m2, n3, m3.

Finally, an unordered multicast doesn’t impose any message ordering among messages from other hosts.

IP multicast

IP multicasting is designed, like IP, to span multiple physical networks. Membership is dynamic: a machine can join or leave a multicast group at any time. Moreover, there is no central coordinator and no restriction on the number of hosts that can be in a group. Multicasting provides network efficiency. Packets in a multicast stream only need to be replicated when a router needs to send them to multiple network links. Only one stream of packets is needed on any network segment regardless of the number of receivers.

An IP multicast address (also known as a class D address) is an IP address that starts with 1110 and contains a 28-bit multicast address. A host may join this address and receive messages addressed to that multicast ID. Within a LAN, an IP class D address is mapped onto an Ethernet multicast address by copying the least-significant 23 bits of the address onto an Ethernet multicast address. Within a LAN, an ethernet chip is programmed to to perform an exact match on a small set of addresses or to accept addresses that hash to particular values. The ethernet driver will need to remove any unneeded addresses that pass through. The ethernet chip can also be set to multicast promiscuous mode, where it will accept all multicast ethernet packets.

Routers have to get involved to support multicasting beyond the local area network. Two protocols are used to implement multicasting. The Internet Group Management Protocol (IGMP) is designed for hosts to inform the routers on their LAN that they are interested in joining a multicast group. The Protocol Independent Multicast (PIM) protocol enables routers to tell their neighboring routers that they are, or are no longer, interested in receiving packets for a particular multicast group.

A host uses IGMP to send a multicast join message (also known a a membership report) to join a specific group. A multicast-aware router will get this message and now know that the the link on which the message arrived needs to receive any packets addressed to that multicast group.

Periodically, a router will send a membership query message to all hosts on the LAN. If any node is still interested in the group, it must re-send a join message. In version 1 of the protocol, if no join messages are received, then the router would stop responding to join messages and the LAN will no longer receive packets for that group. With IGMP v2, a leave message was added to avoid having to wait for the timeout to realize that nobody is interested in a group. That avoids needlessly sending multicast traffic on a LAN where no hosts are interested in receiving the messages anymore.

A lingering problem was that multicast IP uses no centralized coordinator and anyone can send multicast messages to any multicast addresses. IGMP v3 adds the ability for a host that joins a multicast group to specify the source address (originator) of the multicast. The router will then not forward packets originating from unwanted source addresses onto the LAN.

IGMP allows edge routers (those routers connected to LANs) to be told what multicast groups the nodes on its connected LANs are interested in receiving PIM, Protocol Independent Multicast, is responsible for conveying multicast membership information among routers within the wide area Internet. It assumes the presence of other protocols to know the network topology and which routers are connected together. There are two basic approaches to multicasting on the WAN (wide-area network): dense mode (flooding) and sparse-mode multicast.

Dense Mode multicast, also known as flooding, originates from the multicast sender. The message is duplicated and sent to all connected routers. Each of those routers, in turn, duplicates and sends the message to all of its connected routers, and so on. To avoid routing loops, each router uses reverse path forwarding (RFP). A received packet is forwarded only if it was received via the link that the router knows is the shortest path back to the sender (it finds this by checking its forwarding table, which is what it would use if it was sending a packet to that address). PIM Dense Mode floods the entire network of connected multicast-aware routers.

If an edge router receives this multicast packet and is not interested the data stream (i.e., it has not received IGMP join messages), it will send a prune message to the router that delivered that packet. If that router receives prune messages from all interfaces, it will in turn send a prune message to the router that is sending it the multicast messages. A router sends prune messages if it is getting redundant traffic from another link or if its downstream router or LAN connections are not interested in the stream. If a node on a LAN joins a multicast group at a later time, sending an IGMP message to a router, that router would then send a PIM Graft message to its connected routers to state interest in the stream. Dense mode only makes sense when there are receivers spread through most locations covered my multicast-aware routers. It is rarely used.

In contradistinction to Dense Mode, PIM Sparse Mode starts with requests from multicast receivers rather than flooding the network with traffic from the sender. Each router must send a Join message to its connected routers in order to request multicast traffic (and a Prune message when it no longer is). This causes multicast packets to only go to the routers where it is needed. The trick to getting this to work is that a multicast group must be associated with a router known as a rendezvous point (RP). The RP acts as a central point that senders know how to contact to register that they are transmitting multicast streams and for receivers to contact to join multicast streams. Join messages initially are routed to the RP to avoid flooding the network. From there, they are routed to participating routers – routers that expressed an interest in that multicast group.

A sender that transmits multicast packets will simply have those packets routed only to the rendezvous point (this is effectively a unicast stream). The RP then multicasts the packet to any routers from which it has received Join messages.

To ensure that the RP does not become a bottleneck, after a the aggregate bandwidth exceeds a defined threshold, routers closer to the receiver will try to join routers that are more directly connected to the source since they have seen the multicast traffic and know the source address.

Sparse mode is ideal when the receivers are concentrated among a few network segments.

State Machine Replication and Virtual Synchrony

Go here for lecture notes

Goal: Create a software framework that gives everyone the same view of live group members and provides atomic multicasting of messages to all of these members. There should never be a case where only a subset of a group receives a message.

State Machine Replication

We often deploy distributed systems in an effort to achieve both high availability and high scalability. Availability refers to the fraction of time that the overall system is up and running, capable of doing whatever it is supposed to do. A highly availabile services means the service is designed to survive computer or network failures.

Scalability refers to the ability to handle an increasing number of requests by adding more servers to the system. Both of these goals are often achieved via redundancy; by adding replicated components to the system. These replicas can step in immediately when some components break, thus helping with availability.

Active-passive replication means the replicated systems do not process requests when the main system is up but are standing by in case it fails. Google’s Chubby lock manager/file system (we will study it later in the course) is an example of this, where requests must always go to the master. Any changes at the master get propagated (replicated) to the replicas. With active-passive systems, we get fault tolerance by having standby systems but achieve no scalability of performance since one process has to handle all requests.

Active-active replication means that the workload is distributed among the set of replicas, thus helping with scalability (each component only handles a fraction of the workload) as well as availability.

Faults

Faults may be fail-silent or byzantine. With fail-silent faults, the faulty process is dead: it does not accept or transmit communication. With byzantine faults, a process is running but producing erroneous data. Byzantine faults are more difficult to handle and, for our discussion, we will assume that our processes can detect faulty messages (e.g., checksums, digital signatures, encryption, and time stamps will be used and we’ll examine these techniques in a future lecture) and disregard them. With fail-silent faults, a process may remain dead. In that case, it exhibits fail-stop behavior. If a failed process recovers, it exhibits fail-recover behavior. Recovery is good, of course, but we need to realize that the process has not received any updates to the state of the group during the time it was dead.

The two-army problem (also known as the two generals problem) demonstrates that reliable communication can never be achieved with asynchronous and possibly unreliable communication lines. The story goes like this:

Two armies, A & B, need to decide whether to attack the enemy. If both attack, they win. If only one attacks, it will die. A sends a messenger to B: “let’s attack in the morning”. But A needs to know that B received the message so it asks for an acknowledgement as a return message. If the messenger bearing the acknowledgement does not arrive, A does not know if the return messenger didn’t make it and B got the message or if B never received the message. If A receives a response, B isn’t sure if the messenger made it to A or not. B can ask for an acknowledgement to the acknowledgement but then A will not be sure if the second acknowledgement made it to B. We can do this indefinitely…

In asynchronous networks, such as the IP network, there is no time limit by which one can reliably expect a packet to arrive. The two-army problem shows us that we can never achieve 100% certainty that a process is truly dead (fail-silent) just because we do not hear from it. This is something we have to live with. Because of this, one process cannot reliably detect that another one has failed. Instead, a process may suspect the another process has failed and share that knowledge with all other members of the group, taking the “failed” process out of the group even if it is actually alive. Virtual synchrony deals with managing this type of group-wide list of active group members. As with active-passive systems, changes at one process must be propagated to all others. However, consistency becomes a consideration: if all replicas do not get the updates at the same time (atomically), then there is a risk that clients may read inconsistent or old data depending on which server handles their requests.

For both scalability and high availability, all replicas should share the same software and same data. This data reflects the state of the system. It would be confusing if data in a file sometimes looks different because it is accessed from different servers that received write requests in a different order, or if only some systems were told that a lock has been released.

To accomplish this synchronization among replicas, processes across different systems must see the same sequence of events (events that modify the state of the system). These events can include lease/lock information, file or object writes, or any inputs into a process that is running on a machine.

Virtual synchrony

Virtual synchrony is a technique used to provide applications with the abstraction of multicasting a message to a group of processes and ensuring that all data is replicated to all processes in the group in such a manner that it is indistinguishable from using a single non-faulty system. This involves managing group membership and ensuring that causal dependencies on messages from multiple systems are taken into account.

Virtual synchrony focuses on process groups. Each process group is a collection of processes. When a process sends a message to the group, it must be received by all current group members. Processes may join groups, leave groups, and send messages to groups. The processes that send messages to the group are often, but not necessarily, outside of the group (e.g., think of systems communicating with a fault tolerant server).

A group view is the set of processes that are currently in a given process group. Every multicast is associated with a group view and this group view must be identical for every process that will send multicasts to the group.

The virtual synchrony framework tracks group membership. Over time, the membership of a process group may change. Processes may be added and processes might leave (e.g., a computer may crash). This change of membership is called a view change. Any processes in the system should be made aware of this change since future multicasts will have to be sent to the new group view. Message delivery should not span a view change. That is, if a view change takes place while a message was being multicast, that message should be delivered to all the processes in the group before the view change message is delivered to any member.

Although our discussion is general, we will focus on the Isis framework, which is deployed in places such as the New York Stock Exchange and the Swiss Exchange. Similar services are also used by IBM’s Distribution and Consistency Services, Microsoft’s Cluster Service, CORBA remote objects, and many others.

Group Membership Service

A Group Membership Service (GMS) keeps track of which processes are in the group. If a process, p, reports another process, q, as faulty, the GMS will take the faulty process out of the group and tell every process with a connection to q that q is no longer in the group. With asynchronous networks, we cannot be 100% sure that the process was really faulty; it might have just been slow to respond. However, if any process suspects that process is dead, we will just take it out of the process group and nobody will send messages to it. Once it is taken out of the group, it will have to rejoin just as if was a new process.

When a process joins a group, it first needs to import the current state of the synchronized group. This is called a state transfer. The process contacts an existing member and transfers the state to initialize itself to the most recent checkpoint. During this time, it is not processing any requests. After the state transfer, it is part of the group.

Reliability and message ordering

An enhanced version of Isis, Isis2, supports four types of multicast protocols depending on the order of delivery that is needed: unordered, causally ordered, total ordered, and sync ordered (barriers). Isis2 also includes a SafeSend API that is a full Paxos implementation, which is a relatively recent addition to the software (we will look at Paxos, a consensus protocol, in a future lecture).

Systems can think of the process group and virtual synchrony as one abstract entity. Isis uses reliable TCP point-to-point links and the client library iterates through group members to deliver the message to each member. Upon receiving a message, each receiver adds the message to its queue (but does not yet deliver to the application). At this point, the message is marked unstable at the receiver. The receiver then sends an acknowledgement to the sender. When the sender receives acknowledgements from all members, it sends a confirmation to all processes in the group, who then mark the message as stable and deliver it to the application. To improve performance, not every message is actually acknowledged. A receiver may collect a set of messages, each with a per-source sequence number and periodically send a cumulative acknowledgement containing the highest sequence number it received. Likewise, the sender can send a cumulative confirmation with a sequence number that tells each receiver that it is safe to deliver all messages with a lower sequence number. If a multicast sender dies before it is able to confirm delivery of messages to the entire group, these messages will remain in an unstable state at the receivers and are not delivered to applications. When the death of the sender is eventually detected (either by the Group Membership Service or by a process that then alerted the GMS) or a new member joins, the GMS instantiates a view change since the group membership has now changed. The view change will ensure that unstable messages are received by all current group members, thus enforincing atomic multicasting (the all or none property).

Processing a view change

If some process P receives a view change message because another process wants to join the group or a process has left the group, P will forward a copy of any unstable messages to every process in the group. This will ensure that the messages will become stable and all current group members will get to deliver them to their applications. Process P then needs to be sure that all group members are ready and have no unstable message of their own. To do this, P sends a flush message to the group. Upon receiving a flush message, a process will do the same thing that P did: multicast all unstable messages (if it has not done so yet) and then acknowledge the flush. When a process has received a flush message from every other process in the group, it knows that there are no unstable messages left in the system and can switch to the new group view. We thus achieve atomicity of multicasts. The flush is the implementation of a barrier: all outstanding messages must be delivered to applications before the barrier allows new message traffic to continue.