Clusters

Combining systems for high performance and/or high availability

Paul Krzyzanowski

April 30, 2021

Goal: Combine computers to create high-performing and/or highly reliable systems that provide users with a single system image.

Clustering is the aggregation of multiple independent computers that work together to provide a single system offering increased reliability and/or performance. It is a realization of the single system image that we discussed at the start of the semester. Clusters are generally built from off-the-shelf computers connected to a local area network that lets them communicate with the other computers in the cluster. A cluster may be a collection of tens of thousands (or more) of computers (e.g., a Google cluster) or just a backup computer ready to take over for a failed web server or database.

In the general sense, clustering is about hiding distribution from end users and making it easy to deploy, execute, and track programs across a large collection of machines. Many frameworks have been developed to support this. Some target only a specific computation model. For instance, Apache Hadoop MapReduce was built around a master process that receives requests from a client, dispatches and tracks the execution of the job across the map and reduce workers, restarting any of them if needed, and returns the result to the client.

General purpose clustering frameworks are not tied to a specific computation framework. We will examine two such frameworks: YARN and Mesos.

Apache Hadoop YARN (Yet Another Resource Negotiator)

YARN is a key component of the Hadoop ecosystem, responsible for resource management and job scheduling. It manages the distribution of compute tasks across the Hadoop cluster.

YARN workflow

The YARN environment consists of one master node and many worker nodes (which typically also serve as HDFS Data Nodes). The master node runs a Resource Manager; each worker node runs a Node Manager.

  1. Client Submission
    The process begins with a client submitting an application (like a MapReduce job) to YARN, including the code and necessary metadata.

  2. Resource Manager
    The application is received by the YARN Resource Manager, the master daemon. The Resource Manager tracks resources for the cluster of machines. It consists of the Scheduler, which allocates resources to applications, and the Applications Manager, which accepts or rejects job submissions and negotiates the initial container for the job. The Applications Manager finds a worker node that can handle the job and contacts the Node Manager at that node.

  3. Node Manager
    Each worker node in the cluster runs a Node Manager, which is responsible for allocating the resources needed to execute a job on that machine. It does this by creating a container, a designated environment for running a YARN application that is essentially a reserved pool of memory (and, optionally, CPU). For stronger control and security, the container can be configured as an isolated environment with set limits on its resource usage, enforced through Linux cgroups. The system can also be configured to use Docker containers, which provide a more robust isolation mechanism and ensure that YARN applications run in a strictly controlled environment.

  4. Application Master
    The Node Manager launches the Application Master inside a container. The Application Master is specific to the framework in use (e.g., a dedicated Application Master for MapReduce jobs) and is responsible for executing and monitoring all tasks associated with the job. The Application Master assesses whether additional resources are needed, for example, whether there are enough map and reduce workers available. If more resources are needed, it sends the Resource Manager a Resource Request that specifies the resources required along with any location constraints, such as proximity to the data.

When additional tasks are spawned across multiple nodes as a result of this resource negotiation, each node sends regular heartbeats to the Application Master so that it remains updated on the state and health of task execution across the nodes. The Application Master, in turn, reports the status of the job execution back to the Resource Manager, maintaining a coherent flow of job execution information within the YARN infrastructure.

  5. Distribution and Execution
    The application code is passed as a pathname in HDFS. Node Managers retrieve it from HDFS and execute it in the allocated containers.

  6. Data Locality Optimization
    YARN optimizes for data locality, attempting to execute tasks on nodes where data is already present.

  7. Monitoring and Reporting
    The Application Master monitors task progress and reports back to the Resource Manager. It can request new containers for retrying failed tasks.
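
The resource negotiation above can be pictured with a small model. The following Python sketch is purely illustrative and is not the real YARN API: the ResourceRequest and Container classes and the toy allocate() scheduler are simplified assumptions that mirror the concepts described here (per-container memory and cores, a container count, and locality hints).

    # Illustrative model of a YARN-style resource request (not the real YARN API).
    # An Application Master would send requests like these to the Resource Manager,
    # which replies with container allocations on specific worker nodes.
    from dataclasses import dataclass, field

    @dataclass
    class ResourceRequest:
        memory_mb: int                 # memory needed per container
        vcores: int                    # virtual CPU cores needed per container
        num_containers: int            # how many containers of this size
        locality: list = field(default_factory=list)   # preferred nodes (data locality)

    @dataclass
    class Container:
        node: str                      # worker node hosting the container
        memory_mb: int
        vcores: int

    def allocate(request, free_resources):
        """Toy greedy scheduler: prefer the nodes named in the locality hint."""
        granted = []
        preferred = request.locality + [n for n in free_resources if n not in request.locality]
        for node in preferred:
            if node not in free_resources:
                continue
            mem, cores = free_resources[node]
            while (len(granted) < request.num_containers
                   and mem >= request.memory_mb and cores >= request.vcores):
                mem -= request.memory_mb
                cores -= request.vcores
                granted.append(Container(node, request.memory_mb, request.vcores))
            free_resources[node] = (mem, cores)
        return granted

    # Example: ask for three 2 GB / 1-core containers, preferring node-1 (where the data lives).
    free = {"node-1": (4096, 4), "node-2": (8192, 8)}
    req = ResourceRequest(memory_mb=2048, vcores=1, num_containers=3, locality=["node-1"])
    print(allocate(req, free))         # two containers on node-1, one on node-2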

Apache Mesos

Apache Mesos is a cluster manager designed to handle workloads in distributed environments. It uses dynamic resource sharing and isolation, making it well suited to large-scale clusters. Mesos abstracts the nodes of a cluster into a single pool of resources, streamlining the allocation process and improving resource utilization.

Key Components of Mesos Architecture

Mesos Master

The Mesos Master is the central coordinator of the cluster. It tracks the resources reported by agents, offers them to frameworks, hosts the user interface that displays available resources, and manages the data related to task execution. For high availability, standby masters can take over if the active master fails.

Mesos Agent

The Mesos Agent is responsible for container management, acting as a liaison between the executor and the Mesos Master, and ensuring task status updates are communicated to the schedulers.

Mesos Framework

The Mesos Framework includes the Scheduler and the Executor. The Scheduler registers with the Mesos Master and launches tasks when resource offers match its requirements. The Executor runs those tasks and reports their status.

Operational Workflow

An agent announces its available resources to the Mesos Master, which then offers them to the frameworks. Frameworks specify tasks and their resource requirements, prompting the Master to assign tasks to agents, which allocate resources to the executors that run them.
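
The offer cycle can be modeled in a few lines. This Python sketch is illustrative only and is not the Mesos API; the offer dictionaries and the accept/decline logic are simplified assumptions.

    # Toy model of Mesos-style resource offers (not the real Mesos API).
    # Agents report resources to the master; the master offers them to a framework's
    # scheduler, which either launches a task on an offer or declines it.

    def make_offers(agents):
        """Master: turn each agent's advertised resources into an offer."""
        return [{"agent": name, "cpus": cpus, "mem_mb": mem}
                for name, (cpus, mem) in agents.items()]

    def scheduler(offer, task):
        """Framework scheduler: accept the offer if it satisfies the task's needs."""
        if offer["cpus"] >= task["cpus"] and offer["mem_mb"] >= task["mem_mb"]:
            return {"launch": task["name"], "on": offer["agent"]}
        return None    # decline; the master may re-offer these resources to another framework

    agents = {"agent-1": (2, 2048), "agent-2": (8, 16384)}
    task = {"name": "web-crawler", "cpus": 4, "mem_mb": 8192}

    for offer in make_offers(agents):
        decision = scheduler(offer, task)
        if decision:
            print(decision)            # {'launch': 'web-crawler', 'on': 'agent-2'}
            break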

Mesos’ Unification Approach

Mesos is distinctive in its approach to resource management: it aggregates multiple physical resources into a single virtual resource. This is the inverse of virtualization, which divides one physical resource into multiple virtual ones, and it provides more flexibility and better resource utilization.

Cluster architectures

There are two focused classes of cluster architectures:

Supercomputing
Also known as high-performance computing, or HPC. The goal of an HPC cluster is to create a computing environment that resembles that of a supercomputer.
High Availability (HA)
The goal of this cluster is to ensure maximum availability by providing redundant systems for failover.

We also have two additional forms of clustering that are widely used:

Load Balancing
A load balancing cluster distributes requests among a collection of computers. In doing so, it addresses both scalability and high availability.
Storage
Storage clustering is a way to ensure that systems can all access the same storage. It is also a way to make vast amounts of storage available to a computer without having to put it inside the computer (where it will not be available if the computer fails).

The boundaries between these cluster types are often fuzzy and many clusters will use elements of several of these cluster types. For example, any cluster may employ a storage cluster. High availability is important in supercomputing clusters since the likelihood of any one computer failing increases with an increasing number of computers. There is no such thing as a “standard” cluster.

Cluster components

A cluster needs to keep track of its cluster membership: which machines are members of the cluster. Related to this are the configuration and service management components. The configuration system runs on each node (computer) in the cluster and manages the setup of each machine while the service management component identifies which nodes in the cluster perform which roles (e.g., standby, active, running specific applications).

A quorum is the number of nodes in a cluster that have to be alive for the cluster to function. Typically, a majority is required. This provides a simple way of avoiding split-brain due to network partitioning where one group of computers cannot talk to another and two instances of the cluster may be created. With a majority quorum, a minority of systems will never create their own cluster.
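
As a sketch (in Python, with illustrative function and variable names), the majority-quorum rule is just a comparison:

    # Majority quorum: a partition may act as "the" cluster only if it can see
    # more than half of the cluster's total membership.
    def has_quorum(reachable_nodes, cluster_size):
        return reachable_nodes > cluster_size // 2

    # In a 5-node cluster split 3/2 by a network partition, only the 3-node side
    # has quorum, so two competing clusters (split-brain) can never form.
    print(has_quorum(3, 5))   # True
    print(has_quorum(2, 5))   # False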

A cluster interconnect is the network that allows computers in a cluster to communicate with each other. In most cases, this is just an Ethernet local area network. For performance, bandwidth and latency are the considerations. Communicating outside of a rack incurs longer cable runs and the overhead of an extra switching stage, and communicating outside of a data center incurs even longer latency. For maximum performance, we would like computers that communicate frequently to be physically close together. For maximum availability, however, we would like them to be distant: if a rack switch or an entire data center loses power, it is good to have a working replica elsewhere.

For high-performance applications within a local area, a dedicated network is often used as a cluster interconnect. This is known as a System Area Network (SAN). A high-performance SAN provides low-latency, highly reliable, switched communication between computers. By using a SAN, the software overhead of running the TCP/IP stack, with its requisite fragmentation, buffer management, timers, acknowledgments, and retransmissions, is largely eliminated. Remote DMA (RDMA) allows data to be copied directly into the memory of another processor. SANs are often used for HPC clusters, with SAN/RDMA communication incorporated into the Message Passing Interface (MPI) library, which is commonly used in high-performance computing applications. Examples of SAN interconnects are InfiniBand, Myrinet, and 10 Gbps Ethernet with Data Center Bridging. They are generally used to connect a relatively small number of computers together.

A heartbeat network is the mechanism that is used to determine whether computers in the cluster are alive or dead. A simple heartbeat network exchanges messages between computers to ensure that they are alive and capable of responding. Since a local area network may go down, one or more secondary networks are often used as dedicated heartbeat networks in order to distinguish failed computers from failed networks. Asynchronous networks, such as IP, make the detection of a failed computer problematic: one is never certain whether a computer failed to send a message or whether the message is delayed beyond a timeout value.
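
A minimal timeout-based failure detector might look like the following Python sketch. The timeout value and function names are illustrative assumptions, and, as noted above, a missed heartbeat only marks a node as suspected rather than definitely dead.

    # Timeout-based failure detection (a sketch). Because the network is asynchronous,
    # a silent node is only *suspected* of failure; the timeout value is our choice.
    import time

    SUSPECT_TIMEOUT = 3.0       # seconds of silence before we suspect a node

    last_seen = {}              # node name -> time its last heartbeat arrived

    def record_heartbeat(node):
        last_seen[node] = time.monotonic()

    def suspected_nodes():
        now = time.monotonic()
        return [n for n, t in last_seen.items() if now - t > SUSPECT_TIMEOUT]

    # Example: node-b goes silent and is eventually suspected; node-a keeps sending.
    record_heartbeat("node-a")
    record_heartbeat("node-b")
    time.sleep(SUSPECT_TIMEOUT + 0.1)
    record_heartbeat("node-a")
    print(suspected_nodes())    # ['node-b']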

Storage clusters

Storage in a clustered computer system can be provided in a variety of ways. Distributed file systems, such as NFS, SMB, or AFS, can be used. These provide file-level remote access operations over a network.

A Storage Area Network (SAN, not to be confused with a System Area Network) is a dedicated network for connecting computers to dedicated disk systems (storage arrays). Common SAN interconnect technologies include iSCSI, which carries the SCSI protocol over Ethernet, and Fibre Channel. Computers access this remote storage at the block level (read a specific block, write a specific block), just as they would access local storage. With a SAN, however, access to the same storage can be shared among multiple computers. This environment is called shared disk. A distributed lock manager, or DLM, manages mutual exclusion by controlling access to key resources on the shared disk so that, for example, two computers will not try to write to the same disk block at the same time.

A clustered file system is a file system that is built on top of a shared disk. Unlike a distributed file system (NFS, SMB, et al.), which uses remote access at the file level, each computer’s operating system implements a full file system and makes requests at the block level. Examples of such file systems include the Oracle Cluster File System for Linux (OCFS2), Red Hat’s Global File System (GFS2), and Microsoft’s Cluster Shared Volumes (CSV). The DLM is used to ensure that critical shared file system data structures, such as bitmaps of free blocks, inode structures, and file lock tables, are accessed exclusively and that caching is coherent. It operates at the level of the file system implementation rather than at the level of high-level file system services, as in distributed file systems. As such, it differs from something like the NFS lock daemon, which keeps track of file locks requested by applications rather than the block-level locks needed to keep a file system coherent.
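
To see why block-level locking matters, consider two nodes allocating blocks from the same free-block bitmap on a shared disk. The Python sketch below is a hypothetical illustration, not the interface of any real DLM (OCFS2 and GFS2 use their own kernel lock managers): a thread lock stands in for a cluster-wide lock, and an in-memory bitmap stands in for on-disk metadata.

    # Hypothetical illustration of why a clustered file system needs a DLM:
    # nodes updating the free-block bitmap on a shared disk must serialize their
    # read-modify-write, or allocations will be lost or handed out twice.
    from contextlib import contextmanager
    from threading import Lock

    _locks = {}   # stand-in for a cluster-wide lock service (here: in-process only)

    @contextmanager
    def dlm_lock(resource_name):
        lock = _locks.setdefault(resource_name, Lock())
        lock.acquire()                 # in a real DLM this is a network round trip
        try:
            yield
        finally:
            lock.release()

    free_block_bitmap = [True] * 8     # True = block is free

    def allocate_block():
        with dlm_lock("free-block-bitmap"):          # exclusive access across the cluster
            for i, free in enumerate(free_block_bitmap):
                if free:
                    free_block_bitmap[i] = False     # read-modify-write done under the lock
                    return i
        return None

    print(allocate_block(), allocate_block())        # 0 1 -- no block handed out twice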

A shared nothing cluster architecture is one where each system is independent and there is no single point of contention in the system, such as competing for access to a shared disk. Because there is no contention, there is no need for a DLM. In this environment, any data that is resident on a system’s disk can only be obtained by sending a request to the computer that owns the disk. If the computer dies, the data is generally unavailable but may be replicated on other nodes. An alternative design that uses a SAN can allow disk access to be switched to another computer but ensure that only one computer accesses the file system at any time.

To make disks themselves highly available, RAID (redundant array of independent disks) is often employed. RAID 1 is disk mirroring: anything that is written to one disk is also written to a second disk, so if one fails you still have the other. RAID 5 and RAID 6 stripe the data across several disks and add parity (error-correcting) blocks so that lost data can be reconstructed from the remaining disks if a disk dies. RAID 5 tolerates the failure of a single disk; RAID 6 adds a second parity block and tolerates two failed disks.
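
The parity idea behind RAID 5 can be shown with XOR in a few lines of Python; the block contents are made up, and a real array works on fixed-size stripes rather than tiny byte strings.

    # RAID 5-style parity: the parity block is the XOR of the data blocks.
    # If any single disk is lost, its contents are the XOR of everything that survives.
    def xor_blocks(blocks):
        """XOR together equal-length byte strings."""
        result = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                result[i] ^= byte
        return bytes(result)

    data_blocks = [b"AAAA", b"BBBB", b"CCCC"]     # data blocks on three disks
    parity = xor_blocks(data_blocks)              # parity block stored on a fourth disk

    lost = data_blocks[1]                         # suppose disk 1 fails
    survivors = [data_blocks[0], data_blocks[2], parity]
    print(xor_blocks(survivors) == lost)          # True: the lost block is rebuilt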

High-Performance Computing (HPC)

High-performance computing (HPC) clusters are generally custom efforts, but a number of components are common across many implementations. HPC clusters are designed for traditional supercomputing applications that perform a large amount of computation on large data sets. These applications are designed to be partitioned into multiple communicating processes. The Message Passing Interface (MPI) is a popular programming interface for sending and receiving messages; it handles point-to-point and group communication and provides support for barrier-based synchronization. It is sometimes used together with the Parallel Virtual Machine (PVM), a layer of software that provides an interface for creating tasks, managing global task IDs, and managing groups of tasks on arbitrary collections of processors. PVM is in many ways similar to MPI but was designed to be more dynamic and to support heterogeneous environments. However, its performance never reached the levels of MPI and its popularity is waning.

Beowulf and Rocks Cluster are examples of HPC clusters based on Linux. Microsoft offers high-performance clustering via the Microsoft HPC Pack. There are many other HPC systems as well. The common thread among them is that they provide a front-end server for scheduling jobs and monitoring processes and offer an MPI library for programming.
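
To make the programming model concrete, here is a minimal MPI program using the Python mpi4py bindings (assuming mpi4py and an MPI runtime are installed; MPI programs are just as commonly written in C or Fortran). Rank 0 hands a piece of work to each of the other ranks, collects their results, and all ranks synchronize at a barrier.

    # Minimal MPI example using mpi4py: point-to-point messages plus a barrier.
    # Run with something like: mpirun -n 4 python mpi_demo.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()          # this process's ID within the communicator
    size = comm.Get_size()          # total number of processes

    if rank == 0:
        # Rank 0 hands each other rank a piece of work, then collects results.
        for dest in range(1, size):
            comm.send({"task": f"chunk-{dest}"}, dest=dest, tag=0)
        results = [comm.recv(source=src, tag=1) for src in range(1, size)]
        print("collected:", results)
    else:
        task = comm.recv(source=0, tag=0)
        comm.send(f"rank {rank} finished {task['task']}", dest=0, tag=1)

    comm.Barrier()                  # all ranks wait here before exiting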

Batch Processing: Single-Queue Work Distribution

Single queue work distribution is a form of high performance computing that does not rely on communication between computing nodes. Where traditional HPC applications usually involve large-scale array processing and a high level of cooperation among processing elements, the work distribution approach is used for applications such as render farms for computer animation, where a central coordinator (dispatcher) sends job requests to a collection of computers. When a system completes a job (e.g., “render frame #4,178”), the dispatcher will send it the next job (e.g., “now render frame #12,724”). The dispatcher will have the ability to list jobs, delete jobs, dispatch jobs, and get notified when a job is complete. The worker nodes have no need to communicate with each other.
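
The dispatch loop is simple enough to sketch. In the following Python sketch, threads stand in for worker machines and a queue stands in for the dispatcher; the job names are made up.

    # Single-queue work distribution (a sketch): the dispatcher holds one job queue
    # and each worker simply pulls the next job when it finishes the previous one.
    # Threads stand in for worker machines here; workers never talk to each other.
    import queue
    import threading

    jobs = queue.Queue()
    for frame in range(1, 11):
        jobs.put(f"render frame #{frame}")

    def worker(name):
        while True:
            try:
                job = jobs.get_nowait()     # ask the dispatcher for the next job
            except queue.Empty:
                return                      # no work left
            print(f"{name}: {job}")
            jobs.task_done()

    workers = [threading.Thread(target=worker, args=(f"node-{i}",)) for i in range(3)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()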

Load Balancing

Web-services load balancing is a conceptually simple but very widely used technique for distributing the load of many network requests among a collection of computers, each of which is capable of processing the request. Load balancing serves three important functions:

  1. Load balancing. It enables scalability by distributing requests among multiple computers.

  2. High availability (failover). If a computer is dead, the requests will be distributed among the remaining live computers.

  3. Planned outage management. If a computer needs to be taken out of service temporarily (for example, to upgrade software or replace hardware), requests will be distributed among the remaining live computers.

The simplest form of load balancing is to have all requests go to a single computer that then returns an HTTP redirect (a 3xx response such as 302 Found). The redirect is part of the HTTP protocol and leads the client to re-issue the request to the computer named in the response’s Location header.
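
A redirect-based front end can be sketched with Python’s standard http.server module; the back-end host names below are placeholders, and the front end simply answers every GET with a 302 pointing at the next back end in round-robin order.

    # Redirect-based load balancing (a sketch using Python's standard library).
    # The front end answers every request with an HTTP redirect to a back-end
    # server, chosen round-robin. The back-end host names are placeholders.
    import itertools
    from http.server import BaseHTTPRequestHandler, HTTPServer

    backends = itertools.cycle([
        "http://backend1.example.com",
        "http://backend2.example.com",
        "http://backend3.example.com",
    ])

    class RedirectingHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(302)                      # "Found": temporary redirect
            self.send_header("Location", next(backends) + self.path)
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("", 8080), RedirectingHandler).serve_forever()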

Another approach, and the most popular, is to use a load-balancing router to map incoming requests to one of several back-end computers.

For load balancing across data centers, DNS-based load balancing may be used: a DNS query for the service’s name returns the IP addresses of machines at different data centers.

High Availability

High-availability clusters strive to provide a high level of system uptime by taking into account the fact that computers may fail. When this happens, applications running on those computers will resume on other computers that are still running. This is called failover.

Low-level software to support high-availability clustering includes facilities to access shared disks and support for IP address takeover, which enables a computer to listen on multiple IP addresses so that IP packets that were sent to a failed machine can reach the backup system instead.

Mid-layer software includes distributed elections to pick a coordinator, propagation of status information, and figuring out which systems and applications are alive. Higher-layer software includes the ability to restart applications, let a user assign applications to computers, and let a user see what’s going on in the system as a whole.

An active/passive configuration is one where one or more backup (passive) systems are waiting to step in for a system that died. An active/active configuration allows multiple systems to handle requests. Requests may be load balanced across all active systems and no failover is needed; the dead system is simply not sent any requests.

Failover can be implemented in several ways:

Cold failover
This is an application restart — the application is started afresh from the beginning. An example is starting up a web server on a backup computer because the primary web server died. There is no state transfer.
Warm failover
Here, the application is checkpointed periodically. It can then be restarted from the last checkpoint (see the sketch after this list). Many cluster libraries provide the ability for a process to checkpoint itself (save its memory image). Pregel is an example of a software framework that relies on periodic checkpointing so that a graph computation does not have to restart from the beginning.
Hot failover
Here, a replica application is always kept synchronized with the active application on another computer. An example of this is a replicated state machine. Chubby servers, for example, implement hot failover: if the Chubby master fails, any other machine in the Chubby cluster can step in.
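
As a sketch of warm failover, the loop below (illustrative Python; the file name, state layout, and checkpoint interval are assumptions) checkpoints its progress periodically so that a restarted instance resumes from the last checkpoint instead of from the beginning.

    # Warm failover (a sketch): the application periodically checkpoints its state;
    # after a failure, a replacement instance restarts from the last checkpoint
    # rather than from the beginning. File name and state layout are illustrative.
    import pickle

    CHECKPOINT_FILE = "app.ckpt"       # would live on shared or replicated storage

    def save_checkpoint(state):
        with open(CHECKPOINT_FILE, "wb") as f:
            pickle.dump(state, f)

    def load_checkpoint():
        try:
            with open(CHECKPOINT_FILE, "rb") as f:
                return pickle.load(f)
        except FileNotFoundError:
            return {"next_item": 0}    # cold start: no checkpoint exists yet

    state = load_checkpoint()
    for item in range(state["next_item"], 1000):
        pass                           # ... do the real work for this item ...
        state["next_item"] = item + 1
        if item % 100 == 0:
            save_checkpoint(state)     # periodic checkpoint: at most ~100 items are redone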

Cascading failover refers to the ability of an application to fail over even after it already has failed over in the past. Multi-directional failover refers to the ability to restart applications from a failed system on multiple available systems instead of a specific computer that is designated for use as a standby system.

An annoying malfunction is a Byzantine failure. In this case, the failed process or computer continues to communicate but communicates with faulty data. Related to this is the problem of fail-restart behavior, where a process may restart but not realize that the data it is working with is obsolete (e.g., a transaction coordinator might restart and not realize that the transaction has already been aborted). Fencing is the use of various techniques to isolate a node from the rest of the cluster. Power fencing shuts off power to the node to ensure that the node does not restart. SAN fencing disables the node from accessing shared storage, avoiding possible file system corruption. Other fencing techniques may block network messages from the node or remove processes from a replication group (as done in virtual synchrony).
