Clusters are groups of independent machines that cooperate to look like one larger system. The goals fall into a small set of categories: surviving failures, finishing one big computation faster, spreading user requests across many servers, pooling storage, and pooling compute for a scheduler to assign. Modern production clusters combine several of these goals on the same hardware. Three eras of cluster design are worth recognizing: HPC clusters built from commodity machines (Beowulf), web-era fleets that prioritized availability at scale, and AI training clusters that look more like supercomputers because of the synchronous, all-to-all nature of model training.
The commodity hardware bet
Google’s design philosophy, described in the 2003 Web Search for a Planet paper by Barroso, Dean, and Hölzle, set the direction for modern data centers: build clusters from large numbers of ordinary PCs rather than a few expensive servers. The figures of merit are performance per dollar and performance per watt, not single-machine peak performance. The consequence is that hardware failure is normal at scale, so software is designed to detect and mask it rather than relying on enterprise-grade components to keep machines alive.
Cluster types
Five kinds of clusters appear repeatedly. Know what each one is for.
- A high availability (HA) cluster masks machine failures from clients using a primary and one or more standby nodes with shared or replicated state.
- A high performance computing (HPC) cluster makes a single large computation finish faster by running tightly coupled tasks across many nodes connected by a fast interconnect.
- A load-balancing cluster fronts a fleet of identical servers with a dispatcher that spreads requests.
- A storage cluster pools disks across many machines into a single namespace, replicating or erasure-coding data so failures do not lose it.
- A scheduling (or batch) cluster pools the entire fleet's CPU, memory, disk, and network and accepts work submissions that the scheduler assigns to machines.
Single system image
A cluster aims to present a single system image: one logical resource even though many physical machines are involved. Submitting a job to Kubernetes, opening https://example.com, and listing files in a distributed storage namespace are all examples. The illusion is maintained by software that hides which machine is doing the work.
Cluster networking
Two facts about cluster networks dominate the design.
Servers are organized in racks of around forty machines. Each rack has a top-of-rack (ToR) switch that connects every server in the rack and uplinks to the rest of the data center. Above the ToR switches sits a spine layer, with every ToR (acting as a leaf) connected to every spine switch. This is called a spine-leaf or Clos topology. Any two ToR switches are exactly two hops apart (up to a spine and back down), and every spine offers an equally good path, so the network can spread flows evenly and avoid hot spots.
The relevant capacity metric is bisection bandwidth: the bandwidth available across an imaginary cut that splits the cluster in half. A spine-leaf design scales bisection bandwidth by adding spines, without changing the leaves.
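The arithmetic can be sketched as follows (the function name is illustrative, and it assumes each leaf has exactly one uplink to every spine and that the spines themselves are non-blocking):

```python
def bisection_bandwidth_gbps(num_leaves: int, num_spines: int, uplink_gbps: int) -> int:
    """Rough bisection bandwidth of a leaf-spine fabric.

    A cut splitting the racks in half severs every uplink of the leaves
    on one side: half the leaves, times one uplink per spine each.
    """
    return (num_leaves // 2) * num_spines * uplink_gbps

# Doubling the spines doubles bisection bandwidth without touching the leaves:
print(bisection_bandwidth_gbps(8, 4, 400))  # 6400
print(bisection_bandwidth_gbps(8, 8, 400))  # 12800
```

This is why the spine layer is the scaling knob: leaves (and their servers) stay as they are while cross-fabric capacity grows.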
Server-to-server (east-west) traffic now dominates the network, not request-and-response (north-south) traffic between the cluster and the outside world. A single inbound user request can fan out to hundreds of internal calls, and the network is built for that pattern.
High-speed interconnects
Standard Ethernet is too slow for the inner loop of HPC and large machine learning workloads. Be familiar with these technologies and what each one does.
- NIC offloads: move selected packet-processing work from the CPU to the NIC. Common examples are checksum offload, TCP segmentation offload, receive-side scaling, and large receive offload. They reduce CPU overhead without replacing the OS TCP/IP stack.
- RDMA (Remote Direct Memory Access): one machine reads or writes another machine's memory without involving the remote CPU, operating system, or TCP/IP stack. Microsecond latency, near-zero CPU cost.
- InfiniBand: a separate networking technology designed for low-latency, high-throughput cluster communication; supports RDMA natively. Common in HPC and AI training.
- RoCE (RDMA over Converged Ethernet): RDMA semantics on Ethernet. Ordinary Ethernet is lossy, and packet loss badly disrupts RDMA latency. RoCE deployments therefore configure the fabric to behave as nearly lossless using Priority Flow Control (PFC), which lets switches pause selected traffic classes instead of dropping packets, often combined with ECN.
- NVLink and NVSwitch: scale-up GPU interconnects with much higher bandwidth and lower latency than PCIe. In many systems they connect GPUs inside one server; in newer rack-scale systems NVLink Switch extends the fabric across multiple boards or nodes inside a tightly coupled GPU rack. They are distinct from the scale-out Ethernet or InfiniBand network that connects servers across racks.
- 400 Gbps and 800 Gbps Ethernet: now common at the spine and ToR layers.
Failure handling
Three problems matter for cluster availability.
Failure detection uses heartbeats at multiple levels (link, IP, application) and consults more than one peer. Tuning the timeout balances quick reaction against false positives.
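A minimal sketch of timeout-based detection (the class and method names are illustrative; real detectors also run at several levels and consult multiple peers before declaring a node dead):

```python
import time

class HeartbeatDetector:
    def __init__(self, timeout_s: float):
        # Tuning this value trades quick reaction against false positives.
        self.timeout_s = timeout_s
        self.last_seen: dict[str, float] = {}

    def heartbeat(self, node: str) -> None:
        # Called whenever a heartbeat arrives from `node`.
        self.last_seen[node] = time.monotonic()

    def suspected_down(self, node: str) -> bool:
        # A node is suspected if it has never reported or has gone quiet
        # for longer than the timeout.
        last = self.last_seen.get(node)
        return last is None or time.monotonic() - last > self.timeout_s
```

A too-short timeout declares slow-but-alive nodes dead and triggers needless failovers; a too-long one leaves clients talking to a dead primary.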
Failover transfers work from a failed primary to a standby. The hardest part is fencing, which forcibly stops a suspected-dead primary before the standby starts serving, to prevent both nodes from writing simultaneously and corrupting state.
Quorum prevents split brain. Recall from the Paxos and Raft lecture that a system using consensus requires a majority of nodes to agree before committing. The same idea applies here: when a network partition splits the cluster, only the side with a majority can make progress. The minority side cannot elect a leader and cannot do harm. Cluster components that need consensus run with an odd number of replicas (usually three or five) to give a clean majority.
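The majority rule is one line of arithmetic (a sketch):

```python
def majority(cluster_size: int) -> int:
    # Smallest number of nodes that constitutes a quorum.
    return cluster_size // 2 + 1

# A 5-node cluster partitioned 3 | 2: only the 3-node side has a quorum.
print(majority(5))  # 3
# An even size buys nothing: 4 replicas need 3 for quorum and tolerate
# one failure, the same as 3 replicas - hence the odd replica counts.
print(majority(4))  # 3
print(majority(3))  # 2
```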
Borg
Borg is Google’s cluster management system, running since the early 2000s and described publicly in 2015. Most of the modern cluster manager vocabulary comes from Borg.
A Borg cell is on the order of ten thousand machines managed as one resource pool by a single replicated control plane. The Borgmaster is replicated across five machines using Paxos and is the cell’s brain: it accepts user submissions and decides where work runs. A borglet runs on every machine in the cell, starts and stops tasks locally, and reports machine status to the Borgmaster. Link shards sit between the Borgmaster and the borglets and fan out updates to keep the master from becoming a network bottleneck.
A user submits a job, which is a set of identical tasks. A task is one process or process group inside a container. An alloc is a reserved slice of a machine’s resources in which one or more related tasks can run together; the reservation outlives any single task, so a restarted task can reuse its local state, and helper tasks can share the alloc with a primary task. The Kubernetes pod is the direct descendant of this idea. An alloc set groups allocs across many machines.
Each task has a priority. Borg organizes priorities into bands corresponding roughly to monitoring, production, batch, and best-effort. Quota is orthogonal: each team has a quota for how much they may consume in each band. Priority decides who wins a contest for resources; quota decides who is allowed to enter the contest. Higher-priority tasks may preempt lower-priority ones when capacity is scarce.
The scheduler runs in two steps. Feasibility checking filters out machines that cannot run the task (insufficient resources, missing devices, policy restrictions). Scoring ranks the surviving machines and picks the best, using a hybrid score that combines leftover capacity, expected actual usage, and the cost of preempting other tasks.
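The two steps can be sketched like this (the field names and the single scoring signal are illustrative; Borg's real score mixes leftover capacity, expected usage, and preemption cost):

```python
def schedule(task: dict, machines: list):
    # Step 1: feasibility - discard machines that cannot run the task.
    feasible = [m for m in machines
                if m["free_cpu"] >= task["cpu"] and m["free_mem"] >= task["mem"]]
    if not feasible:
        return None  # task stays pending until capacity appears
    # Step 2: scoring - rank the survivors and pick the best. Here the
    # score is simply total headroom; a real scorer is a weighted blend.
    return max(feasible, key=lambda m: m["free_cpu"] + m["free_mem"])

machines = [{"name": "a", "free_cpu": 2, "free_mem": 4},
            {"name": "b", "free_cpu": 8, "free_mem": 16}]
print(schedule({"cpu": 4, "mem": 8}, machines)["name"])  # b
```

The split matters for scale: feasibility is a cheap filter that prunes most of the cell, so the expensive scoring runs on a short list.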
Tasks are isolated using Linux cgroups (control groups), which apply CPU shares, CPU quotas, memory limits, disk I/O limits, and network bandwidth limits per task. Filesystem isolation comes from chroot and namespaces; network isolation comes from a private network namespace per task. CPU is time-shared, so a task that exceeds its share runs slower. Memory is space-shared, so a task that exceeds its memory limit triggers the kernel’s out-of-memory killer, which terminates the task.
Resource reclamation addresses the gap between requested and actually used capacity. Users over-request for safety; Borg measures actual usage and offers the slack to lower-priority tasks willing to be preempted if the original task’s demand rises. This raises utilization without harming high-priority workloads.
The Borgmaster’s state is persisted in a Paxos-replicated store across the five replicas. Borglet state is not replicated because it is reconstructable: a restarted borglet rescans its machine and re-reports running tasks.
The lessons that carry forward are: declarative configuration scales better than imperative control (describe desired state, let the system reconcile), mixing latency-sensitive and batch workloads on shared machines raises utilization, and a strongly central control plane with replicated state scales further than the textbook intuition suggests.
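The declarative model boils down to a reconciliation loop: observe actual state, compare it with desired state, and act to close the gap. A sketch (names and action tuples are illustrative):

```python
def reconcile(desired_replicas: int, running: list):
    """One pass of a reconciliation loop for a replicated workload."""
    if len(running) < desired_replicas:
        return [("start", desired_replicas - len(running))]
    if len(running) > desired_replicas:
        return [("stop", t) for t in running[desired_replicas:]]
    return []  # actual state matches desired state: nothing to do

# The system reruns this whenever observed state changes, so a crashed
# replica is replaced without any operator issuing commands.
print(reconcile(3, ["task-1"]))  # [("start", 2)]
```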
Kubernetes
Kubernetes is the open-source system most directly influenced by Borg, released in 2014 by people who had worked on Borg and Omega. It carries Borg’s declarative model out of Google and onto any infrastructure.
The main workload abstractions are the pod (one or more containers sharing a network namespace, IP, and storage; the descendant of Borg’s alloc), the deployment (a desired number of identical pods, with rolling update support), and the service (a stable virtual IP and DNS name routed to a dynamic set of pods, with the EndpointSlice controller keeping the backing addresses current).
Service discovery is the bridge between the workload abstractions and clients. A pod should not contain hard-coded backend IP addresses, because pods come and go. Kubernetes provides stable service names through DNS (such as orders.default.svc.cluster.local), and the cluster keeps the name-to-endpoint mapping current. Applications address logical services, not individual machines.
The control plane has a small number of components.
- API server: the only component that reads or writes persistent cluster state; every other component talks to it.
- etcd: a Raft-replicated key-value store that holds all cluster state. Plays the role inside Kubernetes that Chubby plays inside Google.
- Scheduler: assigns pending pods to nodes, using feasibility filtering and scoring.
- Controller manager: runs the controllers that reconcile desired state with actual state for deployments, nodes, endpoints, and other resources.
- Cloud controller manager: bridges to a specific cloud provider for load balancers, block storage, and node metadata.
Each worker node runs three components.
- Kubelet: per-node agent; watches the API server for assigned pods, asks the runtime to start them, reports status.
- Container runtime: actually runs containers (containerd or CRI-O), exposing the Container Runtime Interface (CRI) to the kubelet.
- Kube-proxy: programs node networking, commonly through iptables or IPVS, so that requests to a service's virtual IP reach one of the backing pods with no user-space proxying on the data path.
Machine learning clusters
AI training is one large distributed computation: thousands of GPUs repeatedly compute partial results, exchange data, and synchronize before the next step. The dominant communication pattern is all-reduce: every GPU contributes a partial gradient at the end of every step and waits for the global sum before continuing. Four implications follow. The network must be fast and engineered to avoid packet loss on the training path (RDMA over InfiniBand or RoCE, 400 or 800 Gbps links). GPUs in the same server use NVLink and NVSwitch for much higher intra-server bandwidth than the data center network can provide. Jobs are gang scheduled: either every GPU the job needs is available at once, or the job does not start. Long training runs survive failures via periodic checkpoints rather than per-task restart.
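The all-reduce pattern, stripped of the network, is an element-wise sum that every worker receives (a sketch; real implementations use bandwidth-optimal algorithms such as ring all-reduce):

```python
def all_reduce(partials: list) -> list:
    """Every worker contributes a partial gradient; every worker
    receives the same global sum. This is the synchronization barrier:
    no worker proceeds until the slowest contribution has arrived."""
    total = [sum(vals) for vals in zip(*partials)]
    return [total[:] for _ in partials]

# Two workers, two-element gradients:
print(all_reduce([[1.0, 2.0], [3.0, 4.0]]))  # [[4.0, 6.0], [4.0, 6.0]]
```

Because the step completes at the pace of the slowest participant, one lossy link or one straggling GPU slows the whole job, which is why the fabric is engineered to be lossless and the job is gang scheduled.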
Load balancing
Once a service runs on many backends, requests need to be spread across them. Four things to remember.
Layer 4 balancers operate on TCP or UDP connections and see only IP addresses and ports. They are fast, protocol-agnostic, and used at the edge. Layer 7 balancers parse HTTP and route on path, header, or cookie. To route HTTPS traffic they must perform TLS termination: decrypt the request at the balancer (which holds the service certificate and key), make the routing decision, and open a fresh connection to the backend. The common pattern is L4 in front of an L7 ingress.
The standard algorithms are round robin (each backend in turn), least connections (the backend with the fewest active connections), power of two choices (pick two at random, send to the less loaded), and consistent hashing (recall from the CDN lecture; hash backends and requests onto a ring). Power of two choices gives near-optimal load with very little state.
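Power of two choices fits in a few lines (a sketch; `loads` holds each backend's active connection count):

```python
import random

def pick_backend(loads: list) -> int:
    # Sample two distinct backends at random and send the request to
    # the less loaded of the pair.
    a, b = random.sample(range(len(loads)), 2)
    return a if loads[a] <= loads[b] else b

# With backends at loads [0, 10], the only possible pair is (0, 1),
# so the idle backend always wins:
print(pick_backend([0, 10]))  # 0
```

The appeal is the state requirement: round robin needs a shared counter and least connections needs a view of every backend, while this needs only two load lookups per request.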
At larger scale, requests are steered across data centers. Anycast and GeoDNS (both covered in the CDN lecture) are the standard tools. Anycast advertises the same IP from many data centers and lets BGP route each client to the closest. GeoDNS and DNS-based load balancing return different IP addresses to clients in different regions, sending each to the closest cluster.
Session affinity routes a client’s requests to the same backend for the lifetime of a session, usually by hashing the client’s IP or by setting a cookie. Affinity simplifies in-memory session state but complicates failover. The modern alternative is to keep the service stateless and put session state in a shared store such as Redis.
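IP-hash affinity in miniature (a sketch; note that changing the backend list reshuffles clients, one reason failover is awkward and a shared session store is preferred):

```python
import hashlib

def affinity_backend(client_ip: str, backends: list) -> str:
    # Same client IP -> same backend, for as long as the backend list
    # is stable.
    digest = hashlib.sha256(client_ip.encode()).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]

backends = ["b1", "b2", "b3"]
# Repeated calls for the same client always land on the same backend:
print(affinity_backend("203.0.113.9", backends)
      == affinity_backend("203.0.113.9", backends))  # True
```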
What you don’t need to study
You do not need to memorize:
- The author lists of the Borg or Google cluster architecture paper, or the year Beowulf was built and by whom.
- Specific examples used to illustrate cluster categories, such as HDFS, Ceph, or S3 as storage clusters.
- The exact size of a Borg cell, the number of replicas used by the Borgmaster or by etcd, or the rough number of machines per rack.
- Borg's internal subdivisions beyond jobs, tasks, allocs, and pods; link shards and alloc sets are background.
- The exact list of signals used by Borg's scheduler when scoring candidate machines.
- The specific names of NIC offloads (checksum offload, TCP segmentation offload, receive-side scaling, large receive offload).
- The values of Borg's priority bands; the band names (monitoring, production, batch, best-effort) and what preemption does are enough.
- Any specific link speeds quoted for modern data center networks (400 Gbps, 800 Gbps).
- The exact mechanisms used for filesystem and network isolation (chroot, Linux namespaces); knowing that isolation exists and that cgroups enforce CPU and memory limits is enough.
- Whether kube-proxy uses iptables or IPVS, or whether the container runtime is containerd or CRI-O.
- Specific Kubernetes object names beyond pod, deployment, and service (such as EndpointSlice), or the exact form of cluster DNS names.
- The acronyms PFC and ECN beyond knowing that RoCE depends on a near-lossless fabric.
- The names of Borg's contemporaries (Mesos, YARN, Nomad, Omega) or the analogy between etcd and Chubby; these are context.
- Any mathematical bound for power-of-two-choices load balancing; the algorithm and the fact that it performs close to picking the best of all backends are enough.
- The internals of Raft or Paxos beyond the consensus lecture.
Focus your study on why each component exists, what problem it solves, and how it interacts with the others.