Clusters are groups of independent machines that cooperate to look like one larger system. The goals fall into a small set of categories: surviving failures, finishing one big computation faster, spreading user requests across many servers, pooling storage, and pooling compute for a scheduler to assign. Modern production clusters combine several of these goals on the same hardware. Three eras of cluster design are worth recognizing: HPC clusters built from commodity machines (Beowulf), web-era fleets that prioritized availability at scale, and AI training clusters that look more like supercomputers because of the synchronous, all-to-all nature of model training.
The commodity hardware bet
Google’s design philosophy, described in the 2003 Web Search for a Planet paper by Barroso, Dean, and Hölzle, set the direction for modern data centers: build clusters from large numbers of ordinary PCs rather than a few expensive servers. The figures of merit are performance per dollar and performance per watt, not single-machine peak performance. The consequence is that hardware failure is normal at scale, so software is designed to detect and mask it rather than relying on enterprise-grade components to keep machines alive.
Cluster types
Five kinds of clusters appear repeatedly. Know what each one is for.
- A high availability (HA) cluster masks machine failures from clients using a primary and one or more standby nodes with shared or replicated state.
- A high performance computing (HPC) cluster makes a single large computation finish faster by running tightly coupled tasks across many nodes connected by a fast interconnect.
- A load-balancing cluster fronts a fleet of identical servers with a dispatcher that spreads requests.
- A storage cluster pools disks across many machines into a single namespace, replicating or erasure-coding data so failures do not lose it.
- A scheduling (or batch) cluster pools the entire fleet's CPU, memory, disk, and network and accepts work submissions that the scheduler assigns to machines.
Single system image
A cluster aims to present a single system image: one logical resource even though many physical machines are involved. Submitting a job to Kubernetes, opening https://example.com, and listing files in a distributed storage namespace are all examples. The illusion is maintained by software that hides which machine is doing the work.
Cluster networking
Two facts about cluster networks dominate the design.
Servers are organized in racks of around forty machines. Each rack has a top-of-rack (ToR) switch that connects every server in the rack and uplinks to the rest of the data center. Above the ToR switches sits a spine layer, with every ToR (acting as a leaf) connected to every spine switch. This is called a spine-leaf or Clos topology. Any two ToR switches are exactly two hops apart (up to a spine and back down), and every spine offers an equally good path, so the network can spread flows evenly and avoid hot spots.
The relevant capacity metric is bisection bandwidth: the bandwidth available across an imaginary cut that splits the cluster in half. A spine-leaf design scales bisection bandwidth by adding spines, without changing the leaves.
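The arithmetic can be sketched as follows (the function name is illustrative, and it assumes each leaf has exactly one uplink to every spine and that the spines themselves are non-blocking):

```python
def bisection_bandwidth_gbps(num_leaves: int, num_spines: int, uplink_gbps: int) -> int:
    """Rough bisection bandwidth of a leaf-spine fabric.

    A cut splitting the racks in half severs every uplink of the leaves
    on one side: half the leaves, times one uplink per spine each.
    """
    return (num_leaves // 2) * num_spines * uplink_gbps

# Doubling the spines doubles bisection bandwidth without touching the leaves:
print(bisection_bandwidth_gbps(8, 4, 400))  # 6400
print(bisection_bandwidth_gbps(8, 8, 400))  # 12800
```

This is why the spine layer is the scaling knob: leaves (and their servers) stay as they are while cross-fabric capacity grows.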
Server-to-server (east-west) traffic now dominates the network, not request-and-response (north-south) traffic between the cluster and the outside world. A single inbound user request can fan out to hundreds of internal calls, and the network is built for that pattern.
High-speed interconnects
Standard Ethernet is too slow for the inner loop of HPC and large machine learning workloads. Be familiar with these technologies and what each one does.
- NIC offloads: move selected packet-processing work from the CPU to the NIC. Common examples are checksum offload, TCP segmentation offload, receive-side scaling, and large receive offload. They reduce CPU overhead without replacing the OS TCP/IP stack.
- RDMA (Remote Direct Memory Access): one machine reads or writes another machine's memory without involving the remote CPU, operating system, or TCP/IP stack. Microsecond latency, near-zero CPU cost.
- InfiniBand: a separate networking technology designed for low-latency, high-throughput cluster communication; supports RDMA natively. Common in HPC and AI training.
- RoCE (RDMA over Converged Ethernet): RDMA semantics on Ethernet. Ordinary Ethernet is lossy, and packet loss badly disrupts RDMA latency. RoCE deployments therefore configure the fabric to behave as nearly lossless using Priority Flow Control (PFC), which lets switches pause selected traffic classes instead of dropping packets, often combined with ECN.
- NVLink and NVSwitch: scale-up GPU interconnects with much higher bandwidth and lower latency than PCIe. In many systems they connect GPUs inside one server; in newer rack-scale systems NVLink Switch extends the fabric across multiple boards or nodes inside a tightly coupled GPU rack. They are distinct from the scale-out Ethernet or InfiniBand network that connects servers across racks.
- 400 Gbps and 800 Gbps Ethernet: now common at the spine and ToR layers.
Failure handling
Three problems matter for cluster availability.
Failure detection uses heartbeats at multiple levels (link, IP, application) and consults more than one peer. Tuning the timeout balances quick reaction against false positives.
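A minimal sketch of timeout-based detection (the class and method names are illustrative; real detectors also run at several levels and consult multiple peers before declaring a node dead):

```python
import time

class HeartbeatDetector:
    def __init__(self, timeout_s: float):
        # Tuning this value trades quick reaction against false positives.
        self.timeout_s = timeout_s
        self.last_seen: dict[str, float] = {}

    def heartbeat(self, node: str) -> None:
        # Called whenever a heartbeat arrives from `node`.
        self.last_seen[node] = time.monotonic()

    def suspected_down(self, node: str) -> bool:
        # A node is suspected if it has never reported or has gone quiet
        # for longer than the timeout.
        last = self.last_seen.get(node)
        return last is None or time.monotonic() - last > self.timeout_s
```

A too-short timeout declares slow-but-alive nodes dead and triggers needless failovers; a too-long one leaves clients talking to a dead primary.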
Failover transfers work from a failed primary to a standby. The hardest part is fencing, which forcibly stops a suspected-dead primary before the standby starts serving, to prevent both nodes from writing simultaneously and corrupting state.
Quorum prevents split brain. Recall from the Paxos and Raft lecture that a system using consensus requires a majority of nodes to agree before committing. The same idea applies here: when a network partition splits the cluster, only the side with a majority can make progress. The minority side cannot elect a leader and cannot do harm. Cluster components that need consensus run with an odd number of replicas (usually three or five) to give a clean majority.
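The majority rule is one line of arithmetic (a sketch):

```python
def majority(cluster_size: int) -> int:
    # Smallest number of nodes that constitutes a quorum.
    return cluster_size // 2 + 1

# A 5-node cluster partitioned 3 | 2: only the 3-node side has a quorum.
print(majority(5))  # 3
# An even size buys nothing: 4 replicas need 3 for quorum and tolerate
# one failure, the same as 3 replicas - hence the odd replica counts.
print(majority(4))  # 3
print(majority(3))  # 2
```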
Borg
Borg is Google’s cluster management system, running since the early 2000s and described publicly in 2015. Most of the modern cluster manager vocabulary comes from Borg.
A Borg cell is on the order of ten thousand machines managed as one resource pool by a single replicated control plane. The Borgmaster is replicated across five machines using Paxos and is the cell’s brain: it accepts user submissions and decides where work runs. A borglet runs on every machine in the cell, starts and stops tasks locally, and reports machine status to the Borgmaster. Link shards sit between the Borgmaster and the borglets and fan out updates to keep the master from becoming a network bottleneck.
A user submits a job, which is a set of identical tasks. A task is one process or process group inside a container. An alloc is a reserved slice of a machine’s resources in which one or more related tasks can run together; the reservation outlives any single task, so a restarted task can reuse its local state, and helper tasks can share the alloc with a primary task. The Kubernetes pod is the direct descendant of this idea. An alloc set groups allocs across many machines.
Each task has a priority. Borg organizes priorities into bands corresponding roughly to monitoring, production, batch, and best-effort. Quota is orthogonal: each team has a quota for how much they may consume in each band. Priority decides who wins a contest for resources; quota decides who is allowed to enter the contest. Higher-priority tasks may preempt lower-priority ones when capacity is scarce.
The scheduler runs in two steps. Feasibility checking filters out machines that cannot run the task (insufficient resources, missing devices, policy restrictions). Scoring ranks the surviving machines and picks the best, using a hybrid score that combines leftover capacity, expected actual usage, and the cost of preempting other tasks.
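The two steps can be sketched like this (the field names and the single scoring signal are illustrative; Borg's real score mixes leftover capacity, expected usage, and preemption cost):

```python
def schedule(task: dict, machines: list):
    # Step 1: feasibility - discard machines that cannot run the task.
    feasible = [m for m in machines
                if m["free_cpu"] >= task["cpu"] and m["free_mem"] >= task["mem"]]
    if not feasible:
        return None  # task stays pending until capacity appears
    # Step 2: scoring - rank the survivors and pick the best. Here the
    # score is simply total headroom; a real scorer is a weighted blend.
    return max(feasible, key=lambda m: m["free_cpu"] + m["free_mem"])

machines = [{"name": "a", "free_cpu": 2, "free_mem": 4},
            {"name": "b", "free_cpu": 8, "free_mem": 16}]
print(schedule({"cpu": 4, "mem": 8}, machines)["name"])  # b
```

The split matters for scale: feasibility is a cheap filter that prunes most of the cell, so the expensive scoring runs on a short list.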
Tasks are isolated using Linux cgroups (control groups), which apply CPU shares, CPU quotas, memory limits, disk I/O limits, and network bandwidth limits per task. Filesystem isolation comes from chroot and namespaces; network isolation comes from a private network namespace per task. CPU is time-shared, so a task that exceeds its share runs slower. Memory is space-shared, so a task that exceeds its memory limit triggers the kernel’s out-of-memory killer, which terminates the task.
Resource reclamation addresses the gap between requested and actually used capacity. Users over-request for safety; Borg measures actual usage and offers the slack to lower-priority tasks willing to be preempted if the original task’s demand rises. This raises utilization without harming high-priority workloads.
The Borgmaster’s state is persisted in a Paxos-replicated store across the five replicas. Borglet state is not replicated because it is reconstructable: a restarted borglet rescans its machine and re-reports running tasks.
The lessons that carry forward are: declarative configuration scales better than imperative control (describe desired state, let the system reconcile), mixing latency-sensitive and batch workloads on shared machines raises utilization, and a strongly central control plane with replicated state scales further than the textbook intuition suggests.
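The declarative model boils down to a reconciliation loop: observe actual state, compare it with desired state, and act to close the gap. A sketch (names and action tuples are illustrative):

```python
def reconcile(desired_replicas: int, running: list):
    """One pass of a reconciliation loop for a replicated workload."""
    if len(running) < desired_replicas:
        return [("start", desired_replicas - len(running))]
    if len(running) > desired_replicas:
        return [("stop", t) for t in running[desired_replicas:]]
    return []  # actual state matches desired state: nothing to do

# The system reruns this whenever observed state changes, so a crashed
# replica is replaced without any operator issuing commands.
print(reconcile(3, ["task-1"]))  # [("start", 2)]
```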
Kubernetes
Kubernetes is the open-source system most directly influenced by Borg, released in 2014 by people who had worked on Borg and Omega. It carries Borg’s declarative model out of Google and onto any infrastructure.
The main workload abstractions are the pod (one or more containers sharing a network namespace, IP, and storage; the descendant of Borg’s alloc), the deployment (a desired number of identical pods, with rolling update support), and the service (a stable virtual IP and DNS name routed to a dynamic set of pods, with the EndpointSlice controller keeping the backing addresses current).
Service discovery is the bridge between the workload abstractions and clients. A pod should not contain hard-coded backend IP addresses, because pods come and go. Kubernetes provides stable service names through DNS (such as orders.default.svc.cluster.local), and the cluster keeps the name-to-endpoint mapping current. Applications address logical services, not individual machines.
The control plane has a small number of components.
- API server: the only component that reads or writes persistent cluster state; every other component talks to it.
- etcd: a Raft-replicated key-value store that holds all cluster state. Plays the role inside Kubernetes that Chubby plays inside Google.
- Scheduler: assigns pending pods to nodes, using feasibility filtering and scoring.
- Controller manager: runs the controllers that reconcile desired state with actual state for deployments, nodes, endpoints, and other resources.
- Cloud controller manager: bridges to a specific cloud provider for load balancers, block storage, and node metadata.
Each worker node runs three components.
- Kubelet: per-node agent; watches the API server for assigned pods, asks the runtime to start them, reports status.
- Container runtime: actually runs containers (containerd or CRI-O), exposing the Container Runtime Interface (CRI) to the kubelet.
- Kube-proxy: programs node networking, commonly through iptables or IPVS, so that requests to a service's virtual IP reach one of the backing pods with no user-space proxying on the data path.
Machine learning clusters
AI training is one large distributed computation: thousands of GPUs repeatedly compute partial results, exchange data, and synchronize before the next step. The dominant communication pattern is all-reduce: every GPU contributes a partial gradient at the end of every step and waits for the global sum before continuing. Four implications follow. The network must be fast and engineered to avoid packet loss on the training path (RDMA over InfiniBand or RoCE, 400 or 800 Gbps links). GPUs in the same server use NVLink and NVSwitch for much higher intra-server bandwidth than the data center network can provide. Jobs are gang scheduled: either every GPU the job needs is available at once, or the job does not start. Long training runs survive failures via periodic checkpoints rather than per-task restart.
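The all-reduce pattern, stripped of the network, is an element-wise sum that every worker receives (a sketch; real implementations use bandwidth-optimal algorithms such as ring all-reduce):

```python
def all_reduce(partials: list) -> list:
    """Every worker contributes a partial gradient; every worker
    receives the same global sum. This is the synchronization barrier:
    no worker proceeds until the slowest contribution has arrived."""
    total = [sum(vals) for vals in zip(*partials)]
    return [total[:] for _ in partials]

# Two workers, two-element gradients:
print(all_reduce([[1.0, 2.0], [3.0, 4.0]]))  # [[4.0, 6.0], [4.0, 6.0]]
```

Because the step completes at the pace of the slowest participant, one lossy link or one straggling GPU slows the whole job, which is why the fabric is engineered to be lossless and the job is gang scheduled.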
Load balancing
Once a service runs on many backends, requests need to be spread across them. Four things to remember.
Layer 4 balancers operate on TCP or UDP connections and see only IP addresses and ports. They are fast, protocol-agnostic, and used at the edge. Layer 7 balancers parse HTTP and route on path, header, or cookie. To route HTTPS traffic they must perform TLS termination: decrypt the request at the balancer (which holds the service certificate and key), make the routing decision, and open a fresh connection to the backend. The common pattern is L4 in front of an L7 ingress.
The standard algorithms are round robin (each backend in turn), least connections (the backend with the fewest active connections), power of two choices (pick two at random, send to the less loaded), and consistent hashing (recall from the CDN lecture; hash backends and requests onto a ring). Power of two choices gives near-optimal load with very little state.
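Power of two choices fits in a few lines (a sketch; `loads` holds each backend's active connection count):

```python
import random

def pick_backend(loads: list) -> int:
    # Sample two distinct backends at random and send the request to
    # the less loaded of the pair.
    a, b = random.sample(range(len(loads)), 2)
    return a if loads[a] <= loads[b] else b

# With backends at loads [0, 10], the only possible pair is (0, 1),
# so the idle backend always wins:
print(pick_backend([0, 10]))  # 0
```

The appeal is the state requirement: round robin needs a shared counter and least connections needs a view of every backend, while this needs only two load lookups per request.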
At larger scale, requests are steered across data centers. Anycast and GeoDNS (both covered in the CDN lecture) are the standard tools. Anycast advertises the same IP from many data centers and lets BGP route each client to the closest. GeoDNS and DNS-based load balancing return different IP addresses to clients in different regions, sending each to the closest cluster.
Session affinity routes a client’s requests to the same backend for the lifetime of a session, usually by hashing the client’s IP or by setting a cookie. Affinity simplifies in-memory session state but complicates failover. The modern alternative is to keep the service stateless and put session state in a shared store such as Redis.
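IP-hash affinity in miniature (a sketch; note that changing the backend list reshuffles clients, one reason failover is awkward and a shared session store is preferred):

```python
import hashlib

def affinity_backend(client_ip: str, backends: list) -> str:
    # Same client IP -> same backend, for as long as the backend list
    # is stable.
    digest = hashlib.sha256(client_ip.encode()).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]

backends = ["b1", "b2", "b3"]
# Repeated calls for the same client always land on the same backend:
print(affinity_backend("203.0.113.9", backends)
      == affinity_backend("203.0.113.9", backends))  # True
```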
What you don’t need to study
You do not need to memorize:
- The author lists of the Borg or Google cluster architecture paper, or the year Beowulf was built and by whom.
- Specific examples used to illustrate cluster categories, such as HDFS, Ceph, or S3 as storage clusters.
- The exact size of a Borg cell, the number of replicas used by the Borgmaster or by etcd, or the rough number of machines per rack.
- Borg's internal subdivisions beyond jobs, tasks, allocs, and pods; link shards and alloc sets are background.
- The exact list of signals used by Borg's scheduler when scoring candidate machines.
- The specific names of NIC offloads (checksum offload, TCP segmentation offload, receive-side scaling, large receive offload).
- The values of Borg's priority bands; the band names (monitoring, production, batch, best-effort) and what preemption does are enough.
- Any specific link speeds quoted for modern data center networks (400 Gbps, 800 Gbps).
- The exact mechanisms used for filesystem and network isolation (chroot, Linux namespaces); knowing that isolation exists and that cgroups enforce CPU and memory limits is enough.
- Whether kube-proxy uses iptables or IPVS, or whether the container runtime is containerd or CRI-O.
- Specific Kubernetes object names beyond pod, deployment, and service (such as EndpointSlice), or the exact form of cluster DNS names.
- The acronyms PFC and ECN beyond knowing that RoCE depends on a near-lossless fabric.
- The names of Borg's contemporaries (Mesos, YARN, Nomad, Omega) or the analogy between etcd and Chubby; these are context.
- Any mathematical bound for power-of-two-choices load balancing; the algorithm and the fact that it performs close to picking the best of all backends are enough.
- The internals of Raft or Paxos beyond the consensus lecture.
Focus your study on why each component exists, what problem it solves, and how it interacts with the others.