Cluster Categories and Motivation
- Cluster
- A group of independent computers that cooperate so closely that, from the outside, they look like a single system.
- Single system image
- The illusion that a user or operator sees one logical resource even though many physical machines back it.
- High availability cluster (HA)
- A cluster built to mask machine failures from clients, typically using a primary that serves requests and a standby that takes over on failure.
- High performance computing cluster (HPC)
- A cluster built to make a single large computation finish faster, with nodes cooperating tightly through a fast interconnect.
- Load-balancing cluster
- A cluster of identical servers fronted by a dispatcher that spreads incoming requests across the fleet.
- Storage cluster
- A cluster that pools disk capacity from many machines into a single namespace, with replication or erasure coding so that disk and machine failures do not cause data loss.
- Scheduling cluster
- A cluster that treats the whole fleet as a pool of resources and accepts work submissions from many users at once, with a scheduler deciding which job runs on which machine.
- Beowulf
- The original commodity-PC scientific cluster, built at NASA in 1994, that established the idea of running parallel scientific code on ordinary hardware connected by Ethernet.
The Commodity Hardware Bet
- Commodity hardware
- Ordinary, mass-produced server components used in place of expensive specialized hardware, on the assumption that software will compensate for the lower per-machine reliability.
- Performance per dollar
- A figure of merit that measures how much useful work a system delivers for each unit of cost; it favors many small machines over a few fast ones when the workload can be split.
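The performance-per-dollar trade-off can be made concrete with a little arithmetic. The sketch below compares one expensive server against a fleet of commodity machines; all prices and throughput numbers are hypothetical, chosen only to illustrate why the metric favors many small machines when the workload splits.

```python
# Illustrative performance-per-dollar comparison (all numbers hypothetical).
def perf_per_dollar(requests_per_sec: float, cost_dollars: float) -> float:
    """Useful work delivered per unit of cost."""
    return requests_per_sec / cost_dollars

# One high-end server vs. one commodity machine.
big_iron = perf_per_dollar(requests_per_sec=50_000, cost_dollars=100_000)  # 0.5 req/s per $
commodity = perf_per_dollar(requests_per_sec=4_000, cost_dollars=5_000)    # 0.8 req/s per $

# If the workload splits cleanly, twenty commodity machines deliver
# 80,000 req/s for the same $100,000 -- more work per dollar, provided
# software copes with the extra machines and their failures.
print(big_iron, commodity)  # 0.5 0.8
```

The comparison only holds when the workload parallelizes; a job that cannot be split sees none of the fleet's aggregate throughput.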
Cluster Networks
- Top-of-rack switch (ToR)
- The switch at the top of a server rack, connected by short copper cables to every server in the rack and uplinked to the rest of the cluster network.
- Spine-leaf fabric
- A two-level switching topology, also called a Clos fabric, in which every leaf switch connects to every spine switch so that any two racks are exactly two hops apart.
- Leaf switch
- In a spine-leaf fabric, the lower-level switch (a top-of-rack switch) that connects directly to servers and uplinks to every spine.
- Spine switch
- In a spine-leaf fabric, the upper-level switch that connects every leaf to every other leaf.
- Bisection bandwidth
- The minimum bandwidth across any cut that splits the network into two equal halves; it bounds the worst-case capacity available for all-to-all communication.
- East-west traffic
- Traffic between servers inside a cluster, generated when a single user request fans out into many internal service calls.
- North-south traffic
- Traffic that enters or leaves the data center, between an external client and a front-end server.
- NIC offload
- A feature that moves selected packet-processing work from the CPU to the network interface card, reducing CPU overhead without replacing the operating system’s TCP/IP stack.
- Remote direct memory access (RDMA)
- A technology that lets one machine read or write a region of another machine’s memory without involving the remote CPU, operating system, or TCP/IP stack, by having the NIC place data directly into user buffers.
- InfiniBand
- A networking technology distinct from Ethernet, designed for low-latency, high-throughput cluster communication, with native support for RDMA and very small switching latency.
- RoCE (RDMA over Converged Ethernet)
- A protocol that brings RDMA semantics to Ethernet and depends on the fabric being configured as nearly lossless.
- Priority Flow Control (PFC)
- A switch mechanism that pauses selected traffic classes instead of dropping packets when buffers fill, used to make Ethernet behave as a nearly lossless fabric for RoCE.
- NVLink
- A scale-up GPU interconnect that connects GPUs with much higher bandwidth and lower latency than PCIe.
- NVSwitch
- A switch fabric for NVLink that lets many GPUs in a server, or in a tightly coupled rack, communicate as if they were on one fabric.
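Bisection bandwidth in a spine-leaf fabric can be estimated with a back-of-the-envelope formula. In the sketch below, the fabric sizes are invented for illustration; the point is that cutting the racks into two halves forces cross-half traffic onto the leaf uplinks, so the cut is bounded by the uplink capacity of half the leaves.

```python
# Back-of-the-envelope bisection bandwidth for a two-level spine-leaf
# fabric (illustrative parameters, not from any real deployment).
def bisection_bandwidth_gbps(leaves: int, spines: int, uplink_gbps: float) -> float:
    # Split the racks into two equal halves.  Every cross-half packet
    # travels leaf -> spine -> leaf, so the cut is bounded by the uplink
    # capacity of half the leaves: (leaves/2) leaves, each with one
    # uplink per spine.
    return (leaves / 2) * spines * uplink_gbps

# 32 racks, 4 spines, 100 Gb/s uplinks:
print(bisection_bandwidth_gbps(32, 4, 100))  # 6400.0 Gb/s across the cut
```

Doubling the spines doubles this figure, which is why fabrics are scaled out by adding spine switches rather than faster individual links.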
High Availability and Failure Handling
- Heartbeat
- A small message that a machine sends to peers at a fixed interval; missing some number of consecutive heartbeats is treated as evidence that the sender has failed.
- Failover
- The mechanism by which a standby takes over from a failed primary, including any work needed to claim resources, replay logs, or elect a new leader.
- Fencing
- The act of forcibly stopping a suspected-dead primary, by cutting power, disabling storage access, or revoking a lease, before the standby is allowed to take over.
- Quorum
- A majority of nodes in a replicated system whose agreement is required to commit any change; it prevents both halves of a partitioned cluster from making progress at the same time.
- Split brain
- A condition in which a network partition leaves two halves of a cluster each believing it should take over, with both accepting writes; quorum prevents it.
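The heartbeat mechanism above can be sketched as a small failure detector. This is a minimal illustration: real systems add jitter tolerance, and they fence the suspected machine and confirm with a quorum before acting, precisely to avoid split brain. The class name, interval, and miss limit here are all invented.

```python
import time

# Minimal heartbeat-based failure detector (a sketch; production systems
# add fencing and quorum checks before acting on a suspicion).
class FailureDetector:
    def __init__(self, interval_s: float, missed_limit: int):
        self.interval_s = interval_s      # expected heartbeat period
        self.missed_limit = missed_limit  # consecutive misses => suspect dead
        self.last_seen: dict[str, float] = {}

    def heartbeat(self, node: str, now: float) -> None:
        self.last_seen[node] = now

    def is_suspected(self, node: str, now: float) -> bool:
        last = self.last_seen.get(node)
        if last is None:
            return True  # never heard from this node at all
        # Suspect the node once more than missed_limit intervals have
        # passed without a heartbeat.
        return (now - last) > self.interval_s * self.missed_limit

fd = FailureDetector(interval_s=1.0, missed_limit=3)
fd.heartbeat("node-a", now=100.0)
print(fd.is_suspected("node-a", now=102.0))  # False: within 3 intervals
print(fd.is_suspected("node-a", now=104.5))  # True: more than 3 missed beats
```

Note that a suspicion is only evidence, not proof: the node may be alive but partitioned, which is exactly why failover must be gated by fencing or quorum.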
Borg
- Cell
- A unit of Borg deployment, typically on the order of ten thousand machines in a single data center, managed as one resource pool.
- Borgmaster
- The replicated control plane component of a Borg cell that holds cell state, accepts work from users, and decides where each task should run.
- Borglet
- The local agent that runs on every machine in a Borg cell to start and stop tasks, monitor their resource usage, and report machine status.
- Link shard
- A Borg component between the Borgmaster and the borglets that fans out updates to a slice of the cell, keeping the master from being a network bottleneck.
- Job
- A user-submitted unit in Borg consisting of a collection of identical tasks with shared resource requests.
- Task
- One process or process group running inside a container as part of a Borg job.
- Alloc
- A reserved slice of a machine’s resources in which one or more related tasks can run together; the reservation outlives any individual task.
- Alloc set
- A group of allocs across many machines, treated as a unit in the same way that a job groups tasks.
- Priority
- An integer attached to a Borg task that determines whether it can preempt other tasks or is itself preempted when resources are scarce.
- Priority band
- A named group of priorities, such as monitoring, production, batch, or best-effort, that captures the workload’s expected behavior and protections.
- Preemption
- The killing of a lower-priority task to free resources for a higher-priority one, with the preempted task returned to the pending queue.
- Quota
- A per-user or per-team limit on how much of each priority band may be consumed; it controls who is allowed to enter a contest for resources.
- Feasibility checking
- The first step of Borg scheduling, in which machines that cannot satisfy a task’s requirements are filtered out.
- Scoring
- The second step of Borg scheduling, in which feasible machines are ranked using a hybrid score that combines leftover capacity, the cost of preemption, and predicted actual usage.
- Resource reclamation
- The Borg mechanism that measures the gap between requested and used resources and offers the difference to lower-priority work, raising overall utilization.
- cgroups (control groups)
- A Linux kernel mechanism that places each task into a control group and enforces CPU, memory, disk I/O, and network bandwidth limits on that group.
- Out-of-memory killer
- The Linux kernel mechanism that terminates a task whose memory use exceeds its limit, since memory cannot be slowed the way CPU can.
Kubernetes
- Declarative configuration
- A model in which the user describes the desired state of the system, and a control loop works to make actual state match.
- Controller
- A background loop that watches a piece of cluster state, compares desired state to actual state, and takes action to close the gap.
- Reconciliation loop
- The continuous observe-diff-act cycle that a controller runs to drive actual state toward desired state.
- Pod
- The smallest deployable unit in Kubernetes, consisting of one or more containers that share a network namespace, an IP address, and storage volumes.
- Deployment
- A Kubernetes object that describes a set of identical pods and a desired count, with a controller that creates and destroys pods to maintain the count and to perform rolling updates.
- Service
- A Kubernetes object that exposes a stable virtual IP and DNS name routing to the current set of pods backing it, decoupling clients from specific pod addresses.
- Service discovery
- The mechanism by which a stable name is resolved to the current set of healthy endpoints, so clients can address logical services without tracking individual machines.
- API server
- The only Kubernetes component that reads or writes persistent cluster state; every other component reaches state by talking to it.
- etcd
- The Raft-replicated key-value store behind the Kubernetes API server, holding all cluster state.
- Scheduler (Kubernetes)
- The component that watches for unassigned pods, filters and scores nodes, and writes the chosen node back to the pod’s record.
- Controller manager
- The Kubernetes process that runs many controllers (deployment, node, endpoint, and others), each implementing one reconciliation loop.
- Kubelet
- The per-node Kubernetes agent that watches the API server for pods assigned to its node, asks the container runtime to start them, and reports status.
- Container runtime
- The software on each node that pulls images, sets up cgroups and namespaces, and starts container processes; it exposes a standard interface, the CRI, to the kubelet.
- kube-proxy
- The Kubernetes component that programs each node’s networking so that requests to a service’s virtual IP are forwarded to one of the pods behind the service.
- TLS termination
- The point at which an encrypted client connection ends, allowing a load balancer to read the request, make a routing decision, and open a separate connection to the backend.
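The reconciliation loop at the heart of these controllers can be shown with a toy replica controller. The "cluster" here is just a list of pod names and the loop runs inline; a real controller watches the API server and reacts to events, but the observe-diff-act shape is the same.

```python
# Toy reconciliation loop in the style of a Kubernetes controller:
# observe actual state, diff against desired state, act to close the gap.
def reconcile(desired_replicas: int, running_pods: list[str]) -> list[str]:
    actual = len(running_pods)
    if actual < desired_replicas:
        # Too few: create the missing pods.
        new = [f"pod-{i}" for i in range(actual, desired_replicas)]
        return running_pods + new
    if actual > desired_replicas:
        # Too many: delete the surplus.
        return running_pods[:desired_replicas]
    return running_pods  # already converged; nothing to do

pods: list[str] = []
for _ in range(3):  # each pass is one observe-diff-act cycle
    pods = reconcile(desired_replicas=2, running_pods=pods)
print(pods)  # ['pod-0', 'pod-1']
```

Because each pass compares full desired state against full actual state, the loop is self-healing: if a pod dies between passes, the next pass simply recreates it, with no special-case failure handling.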
Machine Learning Clusters
- All-reduce
- The communication pattern in which every participant computes a partial result and all participants exchange and combine these results before the next step begins.
- Gang scheduling
- A scheduling discipline in which all of a job’s components must be available at the same time, or none of them run; required for tightly coupled workloads such as model training.
- Checkpoint
- A periodic snapshot of a long-running computation’s state, written so that a failure can restart the job from the most recent snapshot rather than from the beginning.
- Data-parallel training
- A training strategy in which every worker holds a copy of the model and processes a different shard of the data, exchanging gradients between steps.
- Model-parallel training
- A training strategy in which the model itself is split across workers, with intermediate activations or partial results moving between them during each step.
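The all-reduce semantics can be shown with a naive simulation: every worker contributes a partial gradient vector, and every worker ends up holding the element-wise sum. Real systems use ring or tree all-reduce to spread the bandwidth across the interconnect; this sketch (with integer "gradients" for readability) only illustrates what the operation computes.

```python
# Naive all-reduce simulation: sum each element across workers, then
# give every worker a copy of the combined result.
def all_reduce(per_worker_grads: list[list[int]]) -> list[list[int]]:
    combined = [sum(vals) for vals in zip(*per_worker_grads)]
    return [combined[:] for _ in per_worker_grads]

grads = [[1, 2],   # worker 0's partial gradients
         [3, 4],   # worker 1
         [5, 6]]   # worker 2
print(all_reduce(grads))  # [[9, 12], [9, 12], [9, 12]]
```

The operation is a barrier in disguise: no worker can take its next training step until every worker's contribution has arrived, which is why gang scheduling and a slow straggler both matter so much for training jobs.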
Load Balancing
- Load balancer
- A component that spreads incoming requests across a fleet of identical backends, removes failed backends from rotation, and sometimes routes by client locality.
- Layer 4 load balancer (L4)
- A load balancer that operates on TCP or UDP connections using IP addresses and ports, without parsing the request payload.
- Layer 7 load balancer (L7)
- A load balancer that understands the application protocol (almost always HTTP), allowing it to route by URL, header, or cookie and to perform retries and rewrites.
- Round robin
- A load-balancing algorithm that picks each backend in turn; it gives an even distribution when backends are identical and requests are uniform, but it does not adapt to load.
- Least connections
- A load-balancing algorithm that picks the backend with the fewest active connections, adapting to backends of different speeds and to long-lived connections.
- Power of two choices
- A load-balancing algorithm that picks two backends at random and sends the request to the less loaded of the two, performing close to picking the best of all backends without tracking every backend’s load.
- Consistent hashing
- A routing technique that hashes both backends and requests onto a ring and assigns each request to the next backend clockwise, so adding or removing a backend moves only its share of requests.
- Session affinity
- A routing rule, also called sticky sessions, that sends a client’s requests to the same backend for the lifetime of a session, usually by hashing the client’s IP or by setting a cookie.
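Consistent hashing's key property, that removing a backend only moves the requests it owned, can be demonstrated with a small ring. Real implementations place many virtual nodes per backend to even out load; one point per backend, as below, is enough to show the mechanics. The class and backend names are invented for illustration.

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    # A stable hash placing backends and request keys on the same ring.
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, backends: list[str]):
        # Sorted (position, backend) points around the ring.
        self.points = sorted((ring_hash(b), b) for b in backends)

    def route(self, request_key: str) -> str:
        h = ring_hash(request_key)
        i = bisect.bisect(self.points, (h, ""))
        # Next backend clockwise, wrapping past the end of the ring.
        return self.points[i % len(self.points)][1]

ring = ConsistentHashRing(["backend-a", "backend-b", "backend-c"])
before = {k: ring.route(k) for k in ("user1", "user2", "user3")}

# Remove backend-c: only keys that backend-c owned change hands.
smaller = ConsistentHashRing(["backend-a", "backend-b"])
moved = [k for k in before
         if before[k] != "backend-c" and smaller.route(k) != before[k]]
print(moved)  # []: keys on surviving backends kept their assignment
```

A plain modulo hash (`hash(key) % num_backends`) would instead reshuffle almost every key when the backend count changes, which is exactly what consistent hashing avoids.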
Cross-Cluster Distribution
- Anycast
- The practice of advertising the same IP address from many data centers, with BGP routing each client to whichever data center is closest by network distance.
- GeoDNS
- A DNS-based technique that returns different IP addresses to clients in different regions, sending each to the closest cluster’s load balancer.
- DNS-based load balancing
- A coarse balancing technique in which a DNS server hands out backend IP addresses in a controlled rotation; cheap and protocol-agnostic, but slow to react to backend changes because of DNS caching.