Last week, we looked at network-attached storage: NFS, AFS, and Coda. Those systems are fundamentally about making a file server’s storage accessible over a network. They work well for workloads measured in gigabytes and users in the thousands, but they share a structural limit: there is one server (or a small cluster) that owns the data. Everything flows through it.
This week asks a different question. What happens when the data is so large, or the read load so intense, that no single server can handle it? The answer is to stop thinking about a “file server” and start thinking about a storage system made of thousands of cooperating nodes, where data is spread across all of them.
Part 1: The Google File System
The Problem GFS Was Trying to Solve
By the late 1990s, Google was running web crawlers, building inverted indexes, and performing large-scale log analysis. The data volumes involved were in the hundreds of terabytes and growing. The workload had a specific character: files were enormous (often multi-gigabyte), written once and read many times, reads were mostly sequential rather than random, and many clients simultaneously appended to shared files.
None of the existing file systems handled this well. NFS was designed for small files and interactive use. AFS optimized for whole-file caching but was not built for 100-node parallel jobs. Local file systems on individual servers could not hold the data at all. Google needed something new, and in 2003, a team of Google engineers published the Google File System paper describing what they built.
The GFS design makes a set of deliberate bets. Commodity hardware will fail constantly; fault tolerance must be automatic and built in. Files will be large; optimizing for small files is a wasted effort. Appending to existing files is common; random overwrites are rare. Throughput matters more than latency. With those bets in place, a clean architecture follows.
GFS Architecture
A GFS deployment consists of a single master and many chunkservers, with clients that talk to both.
Files are divided into fixed-size chunks of 64 MB each. Each chunk is identified by a globally unique 64-bit handle assigned by the master at creation time. Chunks are stored as ordinary Linux files on chunkserver disks. Each chunk is replicated on multiple chunkservers; the default replication factor is three.
The master maintains all file system metadata. This includes the namespace (the directory tree), the mapping from file names to the list of chunks that compose each file, and the locations of each chunk replica. It also manages access control information and chunk lease assignments.
Chunk locations are a special case. The master does not persist chunk locations to disk. Instead, it polls all chunkservers at startup and whenever a new chunkserver joins the cluster. Chunkservers report which chunks they hold. This is easier than trying to keep the master’s view consistent with chunkserver reality across crashes and restarts.
Everything else the master needs to survive a crash is written to an operation log stored on disk and replicated on remote machines. The log is the authoritative record of the file system state. When the master restarts, it replays the log to reconstruct its state. To keep log replay fast, the master periodically checkpoints its state.
Metadata in Memory
The master keeps all metadata in memory. This is what makes metadata operations fast. A typical GFS deployment in 2003 used about 64 bytes of metadata per chunk; a file system with a million chunks needed only 64 MB of master memory. This scales surprisingly well for large files because each 64 MB chunk covers a lot of data.
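The arithmetic above is worth making concrete. A minimal sketch (the 64-byte figure and 64 MB chunk size come from the text; the function name is ours):

```python
CHUNK_SIZE = 64 * 2**20          # 64 MB per chunk
META_PER_CHUNK = 64              # ~64 bytes of master metadata per chunk

def master_metadata_bytes(total_data_bytes):
    # Metadata is tracked per logical chunk, not per replica, so the
    # replication factor does not multiply the chunk count here.
    chunks = total_data_bytes // CHUNK_SIZE
    return chunks * META_PER_CHUNK

# 64 TB of file data -> about a million chunks -> 64 MB of master RAM
print(master_metadata_bytes(64 * 2**40))   # 67108864 bytes = 64 MB
```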
Because all metadata lives in RAM on a single machine and every metadata operation goes through it, the master is a potential bottleneck. The GFS designers accepted this trade-off. The master is never involved in data transfers: once a client knows which chunkserver holds a given chunk, it reads and writes directly to that chunkserver. The master handles only metadata queries and lease management, and because the metadata is in RAM, those operations are fast enough that the single master is rarely the limiting factor in practice.
Using Chubby for Master Election
GFS relies on Chubby, Google’s distributed lock service, to manage master availability. The master holds a Chubby lock that serves as its claim to be the active master. An external monitoring system watches the master; if it stops responding, the monitor detects that the Chubby lock has expired and starts a new master process, which acquires the lock before serving any requests. Clients and chunkservers can always find the current master by looking up a well-known name in Chubby. This gives GFS automatic master failover without requiring a custom leader election protocol.
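The lock-based failover scheme can be sketched with a toy in-memory lock service standing in for Chubby. Every name here is hypothetical, not Chubby's actual API; the point is only that holding an expiring lease is what makes a master the active master:

```python
import time

class ToyLockService:
    """Stand-in for Chubby: one named lock with an expiring lease."""
    def __init__(self):
        self.holder = None                    # (name, lease expiry time)

    def try_acquire(self, name, ttl=10.0):
        now = time.monotonic()
        if self.holder is None or self.holder[1] <= now:
            self.holder = (name, now + ttl)   # lock free or lease expired
            return True
        return self.holder[0] == name         # holder may renew its lease

    def current_holder(self):
        if self.holder and self.holder[1] > time.monotonic():
            return self.holder[0]
        return None                           # lease expired: no active master

class Master:
    def __init__(self, name, locks):
        self.name, self.locks = name, locks

    def try_become_active(self):
        # Serve requests only while holding the master lock.
        return self.locks.try_acquire(self.name)

locks = ToyLockService()
m1, m2 = Master("master-1", locks), Master("master-2", locks)
assert m1.try_become_active()        # m1 wins the lock
assert not m2.try_become_active()    # m2 must wait for the lease to lapse
```

Clients would resolve `locks.current_holder()` (the well-known name in Chubby) to find whichever master currently holds the lease.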
Reading Data
When a client wants to read a file, it sends the master the file name and the byte offset. The master translates the offset into a chunk index, looks up the chunk handle and the replica locations, and returns that information to the client. The client caches this mapping for a short time and then contacts one of the chunkservers directly to retrieve the data. Subsequent reads within the same chunk go directly to the chunkserver without consulting the master again.
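The read path can be sketched as follows. The toy master, chunkserver, and client classes are illustrative stand-ins, and the sketch ignores reads that span a chunk boundary:

```python
CHUNK_SIZE = 64 * 2**20                       # 64 MB

class ToyChunkserver:
    def __init__(self):
        self.chunks = {}                      # chunk handle -> bytes
    def read_chunk(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]

class ToyMaster:
    def __init__(self):
        self.files = {}                       # path -> [(handle, [replicas])]
    def lookup(self, path, chunk_index):
        return self.files[path][chunk_index]

class Client:
    def __init__(self, master):
        self.master = master
        self.cache = {}                       # (path, index) -> lookup result
    def read(self, path, offset, length):
        index = offset // CHUNK_SIZE          # which chunk holds this offset
        key = (path, index)
        if key not in self.cache:             # ask the master only on a miss
            self.cache[key] = self.master.lookup(path, index)
        handle, replicas = self.cache[key]
        # Data flows directly from a chunkserver, never through the master.
        return replicas[0].read_chunk(handle, offset % CHUNK_SIZE, length)

# Wire up one file with one chunk on one chunkserver.
cs = ToyChunkserver()
cs.chunks["h1"] = b"hello, gfs!" + b"\x00" * 100
m = ToyMaster()
m.files["/logs/a"] = [("h1", [cs])]
c = Client(m)
print(c.read("/logs/a", 0, 11))               # b'hello, gfs!'
```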
Writing Data: Two Phases
Writes in GFS are more complex than reads because they must reach all replicas in a consistent order. GFS separates two concerns: moving the data to the replicas, and deciding the order in which mutations are applied.
Before any write can proceed, the master grants a lease to one of the replicas for that chunk. The leaseholder becomes the primary for the duration of the lease, typically 60 seconds and renewable as long as the chunk is being mutated.
Phase 1: Data transfer. The client pushes the data to the nearest replica (by network topology). That replica forwards the data to the next chunkserver in the chain, and that one forwards to the next, until all replicas have buffered the data. This pipelining is deliberate: each network link carries the data exactly once. If the client sent the data to all replicas directly, it would need to push the same bytes over three separate connections simultaneously, potentially saturating its own uplink. With pipelining, the client saturates one link, the first chunkserver saturates one link, and so on; the total time scales with the number of bytes plus a small per-hop latency, not with the replication factor. The primary has not been asked to do anything yet; this phase is purely data movement.
Phase 2: Write request. Once all replicas acknowledge that they have received and buffered the data, the client sends a write request to the primary. The primary assigns a serial number to the mutation and applies it to its local state. It then forwards the write request with the serial number to all secondary replicas. The secondaries apply the mutation in the same serial order. Each secondary acknowledges success to the primary, and the primary acknowledges success to the client.
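The two phases can be sketched in miniature. The classes below are illustrative: `push` models the phase-1 pipeline (each node buffers the bytes and forwards them once down the chain), and `write` models the phase-2 ordering imposed by the primary:

```python
class Replica:
    def __init__(self):
        self.buffered = {}     # data id -> bytes      (phase 1)
        self.applied = []      # mutations in serial order (phase 2)

    def push(self, data_id, data, chain):
        # Phase 1: buffer the data, then forward it to the next replica
        # so each network link carries the bytes exactly once.
        self.buffered[data_id] = data
        if chain:
            chain[0].push(data_id, data, chain[1:])

    def apply(self, serial, data_id):
        self.applied.append((serial, self.buffered.pop(data_id)))

class Primary(Replica):
    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries
        self.next_serial = 0

    def write(self, data_id):
        # Phase 2: assign a serial number, apply locally, then tell the
        # secondaries to apply the same mutation in the same order.
        serial = self.next_serial
        self.next_serial += 1
        self.apply(serial, data_id)
        for s in self.secondaries:
            s.apply(serial, data_id)
        return serial

s1, s2 = Replica(), Replica()
p = Primary([s1, s2])
p.push("d1", b"record-A", [s1, s2])   # phase 1: pipeline the bytes
p.write("d1")                         # phase 2: order and apply
assert p.applied == s1.applied == s2.applied
```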
If any replica fails to acknowledge, the primary reports an error to the client, and the client retries the mutation. GFS guarantees that a successful mutation leaves the written region consistent: all replicas hold the same bytes for that region. It does not guarantee that entire chunks are byte-for-byte identical, because failed and retried mutations can leave other regions of the chunk inconsistent or duplicated.
Atomic Record Append
GFS includes an operation not found in most file systems: record append. A client specifies data and asks GFS to append it atomically to a file, with GFS choosing the offset. The semantics guarantee that the data will appear at least once as a contiguous region in the file, even if multiple clients are appending concurrently.
This is extremely useful for producer-consumer workflows. Many worker processes can write results to a shared output file without coordinating with each other. Record append makes this safe without explicit locking.
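The "at least once" guarantee has a practical consequence: a retried append can leave a duplicate, so readers typically filter on an application-level record id. A minimal simulation (the duplicate is injected by hand, standing in for a lost acknowledgment that forces a client retry):

```python
class Log:
    def __init__(self):
        self.records = []
    def append(self, rec):
        self.records.append(rec)

def append_with_retry(log, rec):
    # The first attempt may succeed on the server while its
    # acknowledgment is lost; the client then retries, so the
    # record lands at least once, possibly twice.
    log.append(rec)          # attempt 1: succeeds, ack lost
    log.append(rec)          # attempt 2: client retry

log = Log()
append_with_retry(log, {"id": "r1", "data": b"x"})
assert len(log.records) == 2          # at-least-once: duplicate present

# Readers cope by filtering on record ids they have already seen.
seen, unique = set(), []
for rec in log.records:
    if rec["id"] not in seen:
        seen.add(rec["id"])
        unique.append(rec)
assert len(unique) == 1
```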
Fault Tolerance
GFS assumes hardware failures are the norm, not the exception. Several mechanisms work together to handle them.
Chunkservers send heartbeats to the master periodically. If the master stops hearing from a chunkserver, it marks all chunks on that server as under-replicated and schedules re-replication on surviving chunkservers. This happens automatically.
Each chunk replica stores a checksum covering 64 KB blocks of data. When a chunkserver reads a chunk to serve a client request or during a background scan, it verifies the checksum. A mismatch indicates corruption; the chunkserver reports the error to the master and requests a fresh copy from another replica. Checksums are stored persistently, separate from the chunk data, so they survive restarts.
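The per-block verification can be sketched as follows, with `zlib.crc32` standing in for the chunkserver's checksum function (the text does not specify the algorithm). Every 64 KB block a read touches is verified before any bytes are returned:

```python
import zlib

BLOCK = 64 * 1024     # checksum granularity: 64 KB blocks

def checksums(chunk_data):
    # One checksum per 64 KB block of the chunk.
    return [zlib.crc32(chunk_data[i:i + BLOCK])
            for i in range(0, len(chunk_data), BLOCK)]

def verified_read(chunk_data, stored_sums, offset, length):
    # Verify every block the read range overlaps before returning data.
    first, last = offset // BLOCK, (offset + length - 1) // BLOCK
    for b in range(first, last + 1):
        if zlib.crc32(chunk_data[b * BLOCK:(b + 1) * BLOCK]) != stored_sums[b]:
            # In GFS the chunkserver would report this to the master
            # and request a fresh replica; here we just raise.
            raise IOError(f"corrupt block {b}")
    return chunk_data[offset:offset + length]

data = bytes(range(256)) * 1024            # 256 KB chunk -> 4 blocks
sums = checksums(data)
assert verified_read(data, sums, 70_000, 10) == data[70_000:70_010]

corrupted = data[:100] + b"\xff" + data[101:]   # flip a byte in block 0
raised = False
try:
    verified_read(corrupted, sums, 0, 10)
except IOError:
    raised = True
assert raised                              # corruption is detected
```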
The master itself is replicated. Its operation log is written to multiple machines, so a replacement master can recover quickly by replaying the log from a recent checkpoint.
What GFS Does Not Provide
Understanding the trade-offs is as important as understanding the features. GFS does not expose a standard file system interface. Applications do not access GFS through the operating system’s Virtual File System (VFS) layer. There is no GFS file system driver that makes a GFS volume appear as a directory in the local file tree. Instead, GFS provides a library API: applications are written specifically to call GFS functions for open, read, write, and append. This means that existing programs cannot use GFS without modification, but it also means that the API can be designed around the actual workload rather than constrained by POSIX semantics.
Beyond the API limitation, GFS does not provide mmap, byte-range locking, or support for multiple writers at the same byte offset. When mutations fail or concurrent writers race at overlapping offsets, replicas are not guaranteed to hold identical, meaningful data; the system relies on application-level logic (checksums, unique record identifiers, filtering of duplicate records) to cope with padding, duplicates, and inconsistent regions. The system was built for a specific workload, and it solves that workload very well.
Part 2: HDFS
GFS for the Open-Source World
The Hadoop Distributed File System (HDFS) was built at Yahoo in 2006 to support the Hadoop MapReduce implementation. Doug Cutting and the Hadoop team needed a file system with the same characteristics as GFS: large files, sequential access, fault tolerance through replication. The GFS paper was public; they modeled HDFS on it.
The mapping from GFS to HDFS is nearly direct:
| GFS | HDFS |
|---|---|
| Master | NameNode |
| Chunkserver | DataNode |
| Chunk (64 MB) | Block (128 MB default) |
| Operation log + checkpoint | Edit log + FsImage |
The architecture is the same. One NameNode stores all namespace metadata in memory. DataNodes store blocks on local disk. Clients contact the NameNode for metadata and then read and write directly to DataNodes. Replication defaults to three. Heartbeats and block reports flow from DataNodes to the NameNode continuously.
Notable Differences from GFS
HDFS uses a default block size of 128 MB, which suits the very large files common in MapReduce batch jobs. Larger blocks reduce the number of metadata entries the NameNode must track.
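The metadata saving is easy to quantify. A small sketch (the accounting is simplified: it counts only block-replica entries, while a real NameNode also tracks file and directory objects):

```python
PB = 2**50

def namenode_entries(total_bytes, block_size, replication=3):
    # The NameNode tracks each block plus each of its replicas.
    blocks = -(-total_bytes // block_size)     # ceiling division
    return blocks * replication

small = namenode_entries(PB, 64 * 2**20)       # 64 MB blocks
large = namenode_entries(PB, 128 * 2**20)      # 128 MB blocks
assert small == 2 * large                      # doubling the block size halves entries
print(large)                                   # 25165824 block-replica entries for 1 PB
```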
HDFS is written in Java, which made it accessible to a much wider developer community but introduced JVM overhead and garbage collection pauses that GFS’s C++ implementation avoided.
The original HDFS had a hard single point of failure: one NameNode with no automatic failover. If it crashed, the file system was unavailable until the NameNode was restarted manually. This was a known limitation. Subsequent versions added NameNode High Availability, which uses ZooKeeper for automatic failover, and HDFS Federation, which supports multiple independent NameNodes managing different portions of the namespace.
Beyond GFS
GFS served Google well for about a decade, but the workload eventually evolved. Files got smaller and more numerous, latency became more important, and the single-master model for metadata became a bottleneck at petabyte scale with billions of files.
Google replaced GFS with Colossus around 2010. Colossus replaces the single master with a distributed metadata service: the metadata itself is sharded and replicated across many nodes. The design is not published in detail, but the direction is clear: the single-master model has a ceiling, and production systems at Google's scale hit it.
HDFS followed a similar path with Federation. Both systems converged on the same insight: metadata at scale is a distributed systems problem in its own right, requiring the same techniques used for data – replication, partitioning, and coordination.
Part 3: Dropbox
A Consumer-Facing Example of the Same Principles
GFS and HDFS were built for internal batch processing workloads. Dropbox, launched in 2008, applies the same core idea – separate data from metadata, store data on scalable block storage – to a consumer file synchronization service. The architecture is less exotic, but the design decisions are instructive, particularly because Dropbox had to scale from a single server in its first version to a system handling hundreds of millions of users.
What Dropbox Does
Dropbox synchronizes a designated directory on a user’s machine with cloud storage. Changes on any device propagate to the server and then to all other devices the user has connected. The service is transparent to the user; programs running on the machine read and write local files normally and the Dropbox client handles synchronization in the background.
One important characteristic distinguishes Dropbox from read-heavy services like social media: the read-to-write ratio is close to 1. Because the primary use case is synchronization, data stored on Dropbox servers is rarely read except to push changes to another device. Relative to a typical web service, Dropbox therefore handles an unusually high proportion of uploads.
Block-Level Deduplication
Rather than treating a file as a single object to upload when it changes, the Dropbox client chunks each file into blocks of fixed size. Each block is identified by the SHA-256 hash of its content. Before uploading anything, the client sends the server the list of block hashes for the modified file. The server responds with which hashes it already has. The client uploads only the missing blocks.
This has two major benefits. First, if a 2 GB file changes in one place, only the affected block needs to be uploaded, not the entire file. Second, if two users store identical files, the blocks are deduplicated on the server; only one copy of each block is stored, regardless of how many users have it. Because blocks are identified by content hash, this deduplication is automatic and requires no coordination between users.
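The hash-exchange protocol can be sketched as follows. Class and function names are ours, and the demo uses tiny 64-byte blocks in place of Dropbox's multi-megabyte ones (reportedly 4 MB):

```python
import hashlib

def block_hashes(data, block):
    # The client's view of a file: an ordered list of content hashes.
    return [hashlib.sha256(data[i:i + block]).hexdigest()
            for i in range(0, len(data), block)]

class BlockServer:
    def __init__(self):
        self.store = {}                       # hash -> block bytes
    def missing(self, hashes):
        # Step 1: client sends hashes; server replies with the unknown ones.
        return [h for h in hashes if h not in self.store]
    def upload(self, blocks):
        # Step 2: client uploads only the blocks the server asked for.
        for b in blocks:
            self.store[hashlib.sha256(b).hexdigest()] = b

def sync_up(server, data, block):
    hashes = block_hashes(data, block)
    need = set(server.missing(hashes))
    server.upload([data[i:i + block] for i in range(0, len(data), block)
                   if hashlib.sha256(data[i:i + block]).hexdigest() in need])
    return hashes          # the file's metadata entry: its block list

srv = BlockServer()
v1 = b"A" * 64 + b"B" * 64 + b"C" * 64
sync_up(srv, v1, 64)
assert len(srv.store) == 3                    # three unique blocks stored

v2 = b"A" * 64 + b"X" * 64 + b"C" * 64        # one block of the file changed
assert srv.missing(block_hashes(v2, 64)) == [
    hashlib.sha256(b"X" * 64).hexdigest()]    # only that block needs uploading
```

Deduplication falls out for free: a second user syncing `v1` would find every hash already present and upload nothing.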
Separating Data from Metadata
Like GFS, Dropbox separates the data plane from the control plane. Block data (the actual file content) is stored on Amazon S3, Amazon's cloud object storage service (and later on Dropbox's own storage infrastructure). Metadata (file names, directory structure, the list of blocks that compose each file, and version history) is stored in Dropbox's own database servers.
When a client wants to download a changed file, it contacts the metadata service to get the list of blocks that make up the current version. It then fetches any blocks it does not already have from the block store. The metadata server is never in the data path for block transfers.
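The download side mirrors the upload side: the client compares the metadata service's block list against the blocks it already holds and fetches only the gaps. A small self-contained sketch (all names are illustrative):

```python
import hashlib

def assemble(blocklist, local, remote_fetch):
    """blocklist: ordered block hashes from the metadata service.
       local: blocks already cached on this device (hash -> bytes).
       remote_fetch: callable that fetches one block from the block store."""
    parts = []
    for h in blocklist:
        if h not in local:
            local[h] = remote_fetch(h)   # only missing blocks cross the network
        parts.append(local[h])
    return b"".join(parts)

# A toy block store and a file made of three blocks.
store = {hashlib.sha256(b).hexdigest(): b for b in (b"aa", b"bb", b"cc")}
blocklist = [hashlib.sha256(b).hexdigest() for b in (b"aa", b"bb", b"cc")]

local = {blocklist[0]: b"aa"}            # one block already present locally
fetched = []
def fetch(h):
    fetched.append(h)
    return store[h]

assert assemble(blocklist, local, fetch) == b"aabbcc"
assert len(fetched) == 2                 # only the two missing blocks fetched
```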
The Notification Problem
Early Dropbox clients polled the server periodically to check for changes. With tens of thousands of clients each polling every few seconds, the server spent most of its resources answering “anything new?” queries with the answer “no.” The solution was a notification server. Clients establish a persistent TCP connection to the notification server. When a change occurs, the notification server pushes a message to the affected clients telling them to sync. The clients then contact the metadata server to find out what changed.
Clients may be behind firewalls or NAT that prevent the notification server from initiating a connection to them. The client-initiated persistent TCP connection sidesteps this: the client opens the connection, and the server uses it to send notifications whenever they arrive.
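The push model can be sketched in-process, with a `queue.Queue` standing in for the client-initiated persistent TCP connection (the real protocol runs over sockets, of course):

```python
import queue
import threading

class NotificationServer:
    def __init__(self):
        self.connections = {}          # client id -> its persistent channel

    def connect(self, client_id):
        # The client opens the connection (so NAT and firewalls are no
        # obstacle); the server keeps it and pushes into it later.
        ch = queue.Queue()
        self.connections[client_id] = ch
        return ch

    def notify(self, client_id, message):
        self.connections[client_id].put(message)

server = NotificationServer()
channel = server.connect("laptop-1")

results = []
def client_loop():
    # Instead of polling "anything new?", the client blocks until pushed.
    results.append(channel.get(timeout=5))

t = threading.Thread(target=client_loop)
t.start()
server.notify("laptop-1", "sync")      # a change occurred on another device
t.join()
assert results == ["sync"]             # client wakes and contacts metadata server
```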
As the user base grew, a single notification server could not hold enough connections. Each server could manage roughly one million concurrent TCP connections, so Dropbox introduced a two-level hierarchy of notification servers to handle hundreds of millions of clients.
Scaling the Architecture
Dropbox’s scaling story follows a familiar pattern. It started as a single server with a MySQL database. As data volume grew, block storage moved to S3. As query volume grew, the single server split into specialized services: a metadata server, a block server, and a notification server. As user volume continued to grow, each service was replicated and load-balanced. The metadata database added read replicas and caching layers.
The central insight at each step was the same one GFS encoded from the beginning: keep the metadata and data planes separate, and ensure that the metadata service is never in the data path for bulk transfers.
References
- S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google File System," Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP), 2003, pp. 29-43.
- K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop Distributed File System," 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pp. 1-10.
- HDFS Architecture, Apache Hadoop Project documentation.