Last week, we looked at network-attached storage: NFS, AFS, and Coda. Those systems are fundamentally about making a file server’s storage accessible over a network. They work well for workloads measured in gigabytes and users in the thousands, but they share a structural limit: there is one server (or a small cluster) that owns the data. Everything flows through it.
This week asks a different question. What happens when the data is so large, or the read load so intense, that no single server can handle it? The answer is to stop thinking about a “file server” and start thinking about a storage system made of thousands of cooperating nodes, where data is spread across all of them.
Part 1: The Google File System
The Problem GFS Was Trying to Solve
By the late 1990s, Google was running web crawlers, building inverted indexes, and performing large-scale log analysis. The data volumes involved were in the hundreds of terabytes and growing. The workload had a specific character: files were enormous (often multi-gigabyte), written once and read many times, reads were mostly sequential rather than random, and many clients simultaneously appended to shared files.
None of the existing file systems handled this well. NFS was designed for small files and interactive use. AFS optimized for whole-file caching but was not built for 100-node parallel jobs. Local file systems on individual servers could not hold the data at all. Google needed something new, and in 2003, a team of Google engineers published the Google File System paper describing what they built.
The GFS design makes a set of deliberate bets. Commodity hardware will fail constantly; fault tolerance must be automatic and built in. Files will be large; optimizing for small files is a wasted effort. Appending to existing files is common; random overwrites are rare. Throughput matters more than latency. With those bets in place, a clean architecture follows.
GFS Architecture
A GFS deployment consists of a single master and many chunkservers, with clients that talk to both.
Files are divided into fixed-size chunks of 64 MB each. Each chunk is identified by a globally unique 64-bit handle assigned by the master at creation time. Chunks are stored as ordinary Linux files on chunkserver disks. Each chunk is replicated on multiple chunkservers; the default replication factor is three.
The master maintains all file system metadata. This includes the namespace (the directory tree), the mapping from file names to the list of chunks that compose each file, and the locations of each chunk replica. It also manages access control information and chunk lease assignments.
Chunk locations are a special case. The master does not persist chunk locations to disk. Instead, it polls all chunkservers at startup and whenever a new chunkserver joins the cluster. Chunkservers report which chunks they hold. This is easier than trying to keep the master’s view consistent with chunkserver reality across crashes and restarts.
Everything else the master needs to survive a crash is written to an operation log stored on disk and replicated on remote machines. The log is the authoritative record of the file system state. When the master restarts, it replays the log to reconstruct its state. To keep log replay fast, the master periodically checkpoints its state.
Metadata in Memory
The master keeps all metadata in memory. This is what makes metadata operations fast. A typical GFS deployment in 2003 used about 64 bytes of metadata per chunk; a file system with a million chunks needed only 64 MB of master memory. This scales surprisingly well for large files because each 64 MB chunk covers a lot of data.
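The arithmetic above is worth making concrete. A minimal sketch (the 64-byte figure and 64 MB chunk size come from the text; the function name is ours):

```python
CHUNK_SIZE = 64 * 2**20          # 64 MB per chunk
META_PER_CHUNK = 64              # ~64 bytes of master metadata per chunk

def master_metadata_bytes(total_data_bytes):
    # Metadata is tracked per logical chunk, not per replica, so the
    # replication factor does not multiply the chunk count here.
    chunks = total_data_bytes // CHUNK_SIZE
    return chunks * META_PER_CHUNK

# 64 TB of file data -> about a million chunks -> 64 MB of master RAM
print(master_metadata_bytes(64 * 2**40))   # 67108864 bytes = 64 MB
```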
Because all metadata lives in RAM on a single machine and every metadata operation goes through it, the master is a potential bottleneck. The GFS designers accepted this trade-off. The master is never involved in data transfers: once a client knows which chunkserver holds a given chunk, it reads and writes directly to that chunkserver. The master handles only metadata queries and lease management, and because the metadata is in RAM, those operations are fast enough that the single master is rarely the limiting factor in practice.
Using Chubby for Master Election
GFS relies on Chubby, Google’s distributed lock service, to manage master availability. The master holds a Chubby lock that serves as its claim to be the active master. An external monitoring system watches the master; if it stops responding, the monitor detects that the Chubby lock has expired and starts a new master process, which acquires the lock before serving any requests. Clients and chunkservers can always find the current master by looking up a well-known name in Chubby. This gives GFS automatic master failover without requiring a custom leader election protocol.
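The lock-based failover scheme can be sketched with a toy in-memory lock service standing in for Chubby. Every name here is hypothetical, not Chubby's actual API; the point is only that holding an expiring lease is what makes a master the active master:

```python
import time

class ToyLockService:
    """Stand-in for Chubby: one named lock with an expiring lease."""
    def __init__(self):
        self.holder = None                    # (name, lease expiry time)

    def try_acquire(self, name, ttl=10.0):
        now = time.monotonic()
        if self.holder is None or self.holder[1] <= now:
            self.holder = (name, now + ttl)   # lock free or lease expired
            return True
        return self.holder[0] == name         # holder may renew its lease

    def current_holder(self):
        if self.holder and self.holder[1] > time.monotonic():
            return self.holder[0]
        return None                           # lease expired: no active master

class Master:
    def __init__(self, name, locks):
        self.name, self.locks = name, locks

    def try_become_active(self):
        # Serve requests only while holding the master lock.
        return self.locks.try_acquire(self.name)

locks = ToyLockService()
m1, m2 = Master("master-1", locks), Master("master-2", locks)
assert m1.try_become_active()        # m1 wins the lock
assert not m2.try_become_active()    # m2 must wait for the lease to lapse
```

Clients would resolve `locks.current_holder()` (the well-known name in Chubby) to find whichever master currently holds the lease.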
Reading Data
When a client wants to read a file, it sends the master the file name and the byte offset. The master translates the offset into a chunk index, looks up the chunk handle and the replica locations, and returns that information to the client. The client caches this mapping for a short time and then contacts one of the chunkservers directly to retrieve the data. Subsequent reads within the same chunk go directly to the chunkserver without consulting the master again.
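The read path can be sketched as follows. The toy master, chunkserver, and client classes are illustrative stand-ins, and the sketch ignores reads that span a chunk boundary:

```python
CHUNK_SIZE = 64 * 2**20                       # 64 MB

class ToyChunkserver:
    def __init__(self):
        self.chunks = {}                      # chunk handle -> bytes
    def read_chunk(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]

class ToyMaster:
    def __init__(self):
        self.files = {}                       # path -> [(handle, [replicas])]
    def lookup(self, path, chunk_index):
        return self.files[path][chunk_index]

class Client:
    def __init__(self, master):
        self.master = master
        self.cache = {}                       # (path, index) -> lookup result
    def read(self, path, offset, length):
        index = offset // CHUNK_SIZE          # which chunk holds this offset
        key = (path, index)
        if key not in self.cache:             # ask the master only on a miss
            self.cache[key] = self.master.lookup(path, index)
        handle, replicas = self.cache[key]
        # Data flows directly from a chunkserver, never through the master.
        return replicas[0].read_chunk(handle, offset % CHUNK_SIZE, length)

# Wire up one file with one chunk on one chunkserver.
cs = ToyChunkserver()
cs.chunks["h1"] = b"hello, gfs!" + b"\x00" * 100
m = ToyMaster()
m.files["/logs/a"] = [("h1", [cs])]
c = Client(m)
print(c.read("/logs/a", 0, 11))               # b'hello, gfs!'
```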
Writing Data: Two Phases
Writes in GFS are more complex than reads because they must reach all replicas in a consistent order. GFS separates two concerns: moving the data to the replicas, and deciding the order in which mutations are applied.
Before any write can proceed, the master grants a lease to one of the replicas for that chunk. The leaseholder becomes the primary for the duration of the lease, typically 60 seconds and renewable as long as the chunk is being mutated.
Phase 1: Data transfer. The client pushes the data to the nearest replica (by network topology). That replica forwards the data to the next chunkserver in the chain, and that one forwards to the next, until all replicas have buffered the data. This pipelining is deliberate: each network link carries the data exactly once. If the client sent the data to all replicas directly, it would need to push the same bytes over three separate connections simultaneously, potentially saturating its own uplink. With pipelining, the client saturates one link, the first chunkserver saturates one link, and so on; the total time scales with the number of bytes plus a small per-hop latency, not with the replication factor. The primary has not been asked to do anything yet; this phase is purely data movement.
Phase 2: Write request. Once all replicas acknowledge that they have received and buffered the data, the client sends a write request to the primary. The primary assigns a serial number to the mutation and applies it to its local state. It then forwards the write request with the serial number to all secondary replicas. The secondaries apply the mutation in the same serial order. Each secondary acknowledges success to the primary, and the primary acknowledges success to the client.
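The two phases can be sketched in miniature. The classes below are illustrative: `push` models the phase-1 pipeline (each node buffers the bytes and forwards them once down the chain), and `write` models the phase-2 ordering imposed by the primary:

```python
class Replica:
    def __init__(self):
        self.buffered = {}     # data id -> bytes      (phase 1)
        self.applied = []      # mutations in serial order (phase 2)

    def push(self, data_id, data, chain):
        # Phase 1: buffer the data, then forward it to the next replica
        # so each network link carries the bytes exactly once.
        self.buffered[data_id] = data
        if chain:
            chain[0].push(data_id, data, chain[1:])

    def apply(self, serial, data_id):
        self.applied.append((serial, self.buffered.pop(data_id)))

class Primary(Replica):
    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries
        self.next_serial = 0

    def write(self, data_id):
        # Phase 2: assign a serial number, apply locally, then tell the
        # secondaries to apply the same mutation in the same order.
        serial = self.next_serial
        self.next_serial += 1
        self.apply(serial, data_id)
        for s in self.secondaries:
            s.apply(serial, data_id)
        return serial

s1, s2 = Replica(), Replica()
p = Primary([s1, s2])
p.push("d1", b"record-A", [s1, s2])   # phase 1: pipeline the bytes
p.write("d1")                         # phase 2: order and apply
assert p.applied == s1.applied == s2.applied
```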
If any replica fails to acknowledge, the primary reports an error to the client, and the client retries the mutation. GFS guarantees that a successful mutation leaves the written region consistent: all replicas hold the same bytes for that region. It does not guarantee that entire chunks are byte-for-byte identical, because failed and retried mutations can leave other regions of the chunk inconsistent or duplicated.
Atomic Record Append
GFS includes an operation not found in most file systems: record append. A client specifies data and asks GFS to append it atomically to a file, with GFS choosing the offset. The semantics guarantee that the data will appear at least once as a contiguous region in the file, even if multiple clients are appending concurrently.
This is extremely useful for producer-consumer workflows. Many worker processes can write results to a shared output file without coordinating with each other. Record append makes this safe without explicit locking.
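The "at least once" guarantee has a practical consequence: a retried append can leave a duplicate, so readers typically filter on an application-level record id. A minimal simulation (the duplicate is injected by hand, standing in for a lost acknowledgment that forces a client retry):

```python
class Log:
    def __init__(self):
        self.records = []
    def append(self, rec):
        self.records.append(rec)

def append_with_retry(log, rec):
    # The first attempt may succeed on the server while its
    # acknowledgment is lost; the client then retries, so the
    # record lands at least once, possibly twice.
    log.append(rec)          # attempt 1: succeeds, ack lost
    log.append(rec)          # attempt 2: client retry

log = Log()
append_with_retry(log, {"id": "r1", "data": b"x"})
assert len(log.records) == 2          # at-least-once: duplicate present

# Readers cope by filtering on record ids they have already seen.
seen, unique = set(), []
for rec in log.records:
    if rec["id"] not in seen:
        seen.add(rec["id"])
        unique.append(rec)
assert len(unique) == 1
```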
Fault Tolerance
GFS assumes hardware failures are the norm, not the exception. Several mechanisms work together to handle them.
Chunkservers send heartbeats to the master periodically. If the master stops hearing from a chunkserver, it marks all chunks on that server as under-replicated and schedules re-replication on surviving chunkservers. This happens automatically.
Each chunk replica stores a checksum covering 64 KB blocks of data. When a chunkserver reads a chunk to serve a client request or during a background scan, it verifies the checksum. A mismatch indicates corruption; the chunkserver reports the error to the master and requests a fresh copy from another replica. Checksums are stored persistently, separate from the chunk data, so they survive restarts.
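The per-block verification can be sketched as follows, with `zlib.crc32` standing in for the chunkserver's checksum function (the text does not specify the algorithm). Every 64 KB block a read touches is verified before any bytes are returned:

```python
import zlib

BLOCK = 64 * 1024     # checksum granularity: 64 KB blocks

def checksums(chunk_data):
    # One checksum per 64 KB block of the chunk.
    return [zlib.crc32(chunk_data[i:i + BLOCK])
            for i in range(0, len(chunk_data), BLOCK)]

def verified_read(chunk_data, stored_sums, offset, length):
    # Verify every block the read range overlaps before returning data.
    first, last = offset // BLOCK, (offset + length - 1) // BLOCK
    for b in range(first, last + 1):
        if zlib.crc32(chunk_data[b * BLOCK:(b + 1) * BLOCK]) != stored_sums[b]:
            # In GFS the chunkserver would report this to the master
            # and request a fresh replica; here we just raise.
            raise IOError(f"corrupt block {b}")
    return chunk_data[offset:offset + length]

data = bytes(range(256)) * 1024            # 256 KB chunk -> 4 blocks
sums = checksums(data)
assert verified_read(data, sums, 70_000, 10) == data[70_000:70_010]

corrupted = data[:100] + b"\xff" + data[101:]   # flip a byte in block 0
raised = False
try:
    verified_read(corrupted, sums, 0, 10)
except IOError:
    raised = True
assert raised                              # corruption is detected
```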
The master itself is replicated. Its operation log is written to multiple machines, so a replacement master can recover quickly by replaying the log from a recent checkpoint.
What GFS Does Not Provide
Understanding the trade-offs is as important as understanding the features. GFS does not expose a standard file system interface. Applications do not access GFS through the operating system’s Virtual File System (VFS) layer. There is no GFS file system driver that makes a GFS volume appear as a directory in the local file tree. Instead, GFS provides a library API: applications are written specifically to call GFS functions for open, read, write, and append. This means that existing programs cannot use GFS without modification, but it also means that the API can be designed around the actual workload rather than constrained by POSIX semantics.
Beyond the API limitation, GFS does not provide mmap, byte-range locking, or support for multiple writers at the same byte offset. When mutations fail or concurrent writers race at overlapping offsets, replicas are not guaranteed to hold identical, meaningful data; the system relies on application-level logic (checksums, unique record identifiers, filtering of duplicate records) to cope with padding, duplicates, and inconsistent regions. The system was built for a specific workload, and it solves that workload very well.
Part 2: HDFS
GFS for the Open-Source World
The Hadoop Distributed File System (HDFS) was built at Yahoo in 2006 to support the Hadoop MapReduce implementation. Doug Cutting and the Hadoop team needed a file system with the same characteristics as GFS: large files, sequential access, fault tolerance through replication. The GFS paper was public; they modeled HDFS on it.
The mapping from GFS to HDFS is nearly direct:
| GFS | HDFS |
|---|---|
| Master | NameNode |
| Chunkserver | DataNode |
| Chunk (64 MB) | Block (128 MB default) |
| Operation log + checkpoint | Edit log + FsImage |
The architecture is the same. One NameNode stores all namespace metadata in memory. DataNodes store blocks on local disk. Clients contact the NameNode for metadata and then read and write directly to DataNodes. Replication defaults to three. Heartbeats and block reports flow from DataNodes to the NameNode continuously.
Notable Differences from GFS
HDFS uses a default block size of 128 MB, which suits the very large files common in MapReduce batch jobs. Larger blocks reduce the number of metadata entries the NameNode must track.
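The metadata saving is easy to quantify. A small sketch (the accounting is simplified: it counts only block-replica entries, while a real NameNode also tracks file and directory objects):

```python
PB = 2**50

def namenode_entries(total_bytes, block_size, replication=3):
    # The NameNode tracks each block plus each of its replicas.
    blocks = -(-total_bytes // block_size)     # ceiling division
    return blocks * replication

small = namenode_entries(PB, 64 * 2**20)       # 64 MB blocks
large = namenode_entries(PB, 128 * 2**20)      # 128 MB blocks
assert small == 2 * large                      # doubling the block size halves entries
print(large)                                   # 25165824 block-replica entries for 1 PB
```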
HDFS is written in Java, which made it accessible to a much wider developer community but introduced JVM overhead and garbage collection pauses that GFS’s C++ implementation avoided.
The original HDFS had a hard single point of failure: one NameNode with no automatic failover. If it crashed, the file system was unavailable until the NameNode was restarted manually. This was a known limitation. Subsequent versions added NameNode High Availability, which uses ZooKeeper for automatic failover, and HDFS Federation, which supports multiple independent NameNodes managing different portions of the namespace.
Beyond GFS
GFS served Google well for about a decade, but the workload eventually evolved. Files got smaller and more numerous, latency became more important, and the single-master model for metadata became a bottleneck at petabyte scale with billions of files.
Google replaced GFS with Colossus around 2010. Colossus replaces the single master with a distributed metadata service: the metadata itself is sharded and replicated across many nodes. The design is not published in detail, but the direction is clear: the single-master model has a ceiling, and production systems at Google's scale hit it.
HDFS followed a similar path with Federation. Both systems converged on the same insight: metadata at scale is a distributed systems problem in its own right, requiring the same techniques used for data – replication, partitioning, and coordination.
Part 3: Dropbox
A Consumer-Facing Example of the Same Principles
GFS and HDFS were built for internal batch processing workloads. Dropbox, launched in 2008, applies the same core idea – separate data from metadata, store data on scalable block storage – to a consumer file synchronization service. The architecture is less exotic, but the design decisions are instructive, particularly because Dropbox had to scale from a single server in its first version to a system handling hundreds of millions of users.
What Dropbox Does
Dropbox synchronizes a designated directory on a user’s machine with cloud storage. Changes on any device propagate to the server and then to all other devices the user has connected. The service is transparent to the user; programs running on the machine read and write local files normally and the Dropbox client handles synchronization in the background.
One important characteristic distinguishes Dropbox from read-heavy services like social media: the read-to-write ratio is close to 1. Because the primary use case is synchronization, data stored on Dropbox servers is rarely read except to push changes to another device. Relative to a typical web service, Dropbox therefore handles an unusually high proportion of uploads.
Block-Level Deduplication
Rather than treating a file as a single object to upload when it changes, the Dropbox client chunks each file into blocks of fixed size. Each block is identified by the SHA-256 hash of its content. Before uploading anything, the client sends the server the list of block hashes for the modified file. The server responds with which hashes it already has. The client uploads only the missing blocks.
This has two major benefits. First, if a 2 GB file changes in one place, only the affected block needs to be uploaded, not the entire file. Second, if two users store identical files, the blocks are deduplicated on the server; only one copy of each block is stored, regardless of how many users have it. Because blocks are identified by content hash, this deduplication is automatic and requires no coordination between users.
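The hash-exchange protocol can be sketched as follows. Class and function names are ours, and the demo uses tiny 64-byte blocks in place of Dropbox's multi-megabyte ones (reportedly 4 MB):

```python
import hashlib

def block_hashes(data, block):
    # The client's view of a file: an ordered list of content hashes.
    return [hashlib.sha256(data[i:i + block]).hexdigest()
            for i in range(0, len(data), block)]

class BlockServer:
    def __init__(self):
        self.store = {}                       # hash -> block bytes
    def missing(self, hashes):
        # Step 1: client sends hashes; server replies with the unknown ones.
        return [h for h in hashes if h not in self.store]
    def upload(self, blocks):
        # Step 2: client uploads only the blocks the server asked for.
        for b in blocks:
            self.store[hashlib.sha256(b).hexdigest()] = b

def sync_up(server, data, block):
    hashes = block_hashes(data, block)
    need = set(server.missing(hashes))
    server.upload([data[i:i + block] for i in range(0, len(data), block)
                   if hashlib.sha256(data[i:i + block]).hexdigest() in need])
    return hashes          # the file's metadata entry: its block list

srv = BlockServer()
v1 = b"A" * 64 + b"B" * 64 + b"C" * 64
sync_up(srv, v1, 64)
assert len(srv.store) == 3                    # three unique blocks stored

v2 = b"A" * 64 + b"X" * 64 + b"C" * 64        # one block of the file changed
assert srv.missing(block_hashes(v2, 64)) == [
    hashlib.sha256(b"X" * 64).hexdigest()]    # only that block needs uploading
```

Deduplication falls out for free: a second user syncing `v1` would find every hash already present and upload nothing.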
Separating Data from Metadata
Like GFS, Dropbox separates the data plane from the control plane. Block data (the actual file content) is stored on Amazon S3, Amazon's cloud object storage service (and later on Dropbox's own storage infrastructure). Metadata (file names, directory structure, the list of blocks that compose each file, and version history) is stored in Dropbox's own database servers.
When a client wants to download a changed file, it contacts the metadata service to get the list of blocks that make up the current version. It then fetches any blocks it does not already have from the block store. The metadata server is never in the data path for block transfers.
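The download side mirrors the upload side: the client compares the metadata service's block list against the blocks it already holds and fetches only the gaps. A small self-contained sketch (all names are illustrative):

```python
import hashlib

def assemble(blocklist, local, remote_fetch):
    """blocklist: ordered block hashes from the metadata service.
       local: blocks already cached on this device (hash -> bytes).
       remote_fetch: callable that fetches one block from the block store."""
    parts = []
    for h in blocklist:
        if h not in local:
            local[h] = remote_fetch(h)   # only missing blocks cross the network
        parts.append(local[h])
    return b"".join(parts)

# A toy block store and a file made of three blocks.
store = {hashlib.sha256(b).hexdigest(): b for b in (b"aa", b"bb", b"cc")}
blocklist = [hashlib.sha256(b).hexdigest() for b in (b"aa", b"bb", b"cc")]

local = {blocklist[0]: b"aa"}            # one block already present locally
fetched = []
def fetch(h):
    fetched.append(h)
    return store[h]

assert assemble(blocklist, local, fetch) == b"aabbcc"
assert len(fetched) == 2                 # only the two missing blocks fetched
```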
The Notification Problem
Early Dropbox clients polled the server periodically to check for changes. With tens of thousands of clients each polling every few seconds, the server spent most of its resources answering “anything new?” queries with the answer “no.” The solution was a notification server. Clients establish a persistent TCP connection to the notification server. When a change occurs, the notification server pushes a message to the affected clients telling them to sync. The clients then contact the metadata server to find out what changed.
Clients may be behind firewalls or NAT that prevent the notification server from initiating a connection to them. The client-initiated persistent TCP connection sidesteps this: the client opens the connection, and the server uses it to send notifications whenever they arrive.
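The push model can be sketched in-process, with a `queue.Queue` standing in for the client-initiated persistent TCP connection (the real protocol runs over sockets, of course):

```python
import queue
import threading

class NotificationServer:
    def __init__(self):
        self.connections = {}          # client id -> its persistent channel

    def connect(self, client_id):
        # The client opens the connection (so NAT and firewalls are no
        # obstacle); the server keeps it and pushes into it later.
        ch = queue.Queue()
        self.connections[client_id] = ch
        return ch

    def notify(self, client_id, message):
        self.connections[client_id].put(message)

server = NotificationServer()
channel = server.connect("laptop-1")

results = []
def client_loop():
    # Instead of polling "anything new?", the client blocks until pushed.
    results.append(channel.get(timeout=5))

t = threading.Thread(target=client_loop)
t.start()
server.notify("laptop-1", "sync")      # a change occurred on another device
t.join()
assert results == ["sync"]             # client wakes and contacts metadata server
```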
As the user base grew, a single notification server could not hold enough connections. Each server could manage roughly one million concurrent TCP connections, so Dropbox introduced a two-level hierarchy of notification servers to handle hundreds of millions of clients.
Scaling the Architecture
Dropbox’s scaling story follows a familiar pattern. It started as a single server with a MySQL database. As data volume grew, block storage moved to S3. As query volume grew, the single server split into specialized services: a metadata server, a block server, and a notification server. As user volume continued to grow, each service was replicated and load-balanced. The metadata database added read replicas and caching layers.
The central insight at each step was the same one GFS encoded from the beginning: keep the metadata and data planes separate, and ensure that the metadata service is never in the data path for bulk transfers.
References
- S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google File System," Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP), 2003, pp. 29-43.
- K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop Distributed File System," 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pp. 1-10.
- HDFS Architecture, Apache Hadoop Project documentation.