pk.org: CS 417/Lecture Notes

Distributed Databases

NoSQL, NewSQL, and the Architecture of Modern Data Stores

Paul Krzyzanowski – 2026-03-28

Why Relational Databases Weren’t Enough

The relational database model served the industry well for decades. Structured schemas, SQL queries, ACID transactions, and strong consistency gave developers a predictable, reliable foundation. But by the early 2000s, companies like Google, Amazon, and Facebook were running into hard limits.

The problem wasn’t the relational model itself. It was the assumption baked into nearly every relational database: that the system ran on a single machine, or at most a few tightly coupled ones.

Scaling vertically (buying a bigger server) eventually hits a wall. Scaling horizontally (adding more servers) is where relational databases struggle. Breaking up the data of a relational database across hundreds of nodes while maintaining ACID semantics, consistent views, and flexible schemas is hard, and the coordination overhead (locking, two-phase commit, deadlock detection) can introduce substantial latency.

At the same time, the data being collected looked nothing like the clean, structured rows that relational databases were designed for. Google was crawling billions of web pages. Amazon was storing product catalogs with wildly different attributes per item. Facebook was tracking social graphs with billions of edges. None of this fit neatly into fixed-schema tables, whose structure had to be defined when the database was first created.

The result was a wave of new storage systems, loosely grouped under the label NoSQL (originally “Not Only SQL”), each designed around a specific set of trade-offs. Some are optimized for read-heavy workloads, others for write throughput, others for geographic distribution, and others for flexible schemas. None of these databases tried to do everything.

The term NoSQL is used broadly and can mean several related things. In the most literal sense, it refers to systems that do not use SQL, the Structured Query Language, to interact with the database. More generally, it refers to databases that do not follow the traditional relational model of tables with fixed schemas and explicit relationships between tables.

NoSQL systems also often relax some of the strong guarantees commonly associated with relational databases, such as full ACID transactions, to improve scalability, availability, or performance, although some systems still provide a subset of these guarantees.

NoSQL Database Categories

NoSQL systems are not all built around the same data model or the same design goals. Some are optimized for simple key-based access, some for massive scalable storage, some for flexible document structures, and some for traversing relationships.

We already looked at key-value stores when we covered Amazon Dynamo and distributed hash tables more generally. In a key-value store, the system maps a key to an opaque value. This model is simple and often highly scalable, but it usually gives the application only limited query capability beyond looking up data by key. Dynamo is the main example we studied in that category.

In this part of the course, we will focus on column-family stores, also called wide-column stores. These systems organize data into rows and columns, but unlike relational databases, different rows do not need to have exactly the same columns. They are designed for large-scale distributed storage and high write throughput. Bigtable, HBase, and Cassandra fall into this category. Cassandra’s own documentation describes it as using a partitioned wide-column storage model.

We will also look at Google Spanner, which does not fit neatly into the older NoSQL categories. Spanner keeps a relational model and SQL interface, but extends them to a globally distributed setting. It is better thought of as a horizontally scalable distributed SQL system, often placed in the NewSQL category rather than grouped with the classic NoSQL designs.

Another category is the document store. A document database stores self-describing records, typically in a JSON-like format, with fields and nested structures inside each document. This gives applications more schema flexibility than a relational table, and usually supports richer queries over document fields than a basic key-value store. MongoDB and CouchDB are common examples. MongoDB stores data as BSON documents composed of field-value pairs, and CouchDB uses JSON documents with an HTTP API.

A final major category is the graph database. Graph databases are designed for data where relationships are central. Instead of organizing data primarily as tables or documents, they store entities as nodes and the connections among them as relationships, with properties attached to both. This makes them useful for workloads such as social networks, recommendation engines, fraud detection, and knowledge graphs, where traversing connections efficiently matters. Neo4j is a well-known example.


Bigtable

Google’s web crawl infrastructure in the early 2000s needed to store billions of web pages, along with their metadata, link structure, and multiple historical versions. A traditional relational database was not going to work: the data was too sparse (most pages don’t have most attributes), too large, and needed to be distributed across thousands of machines.

The result was Bigtable, described in a 2006 OSDI paper.

The paper lists the design goals as wide applicability, scalability, high performance, and high availability.

Bigtable is not a relational database. It does not offer SQL, joins, foreign keys, or general multi-row transactions. Instead, it provides a distributed storage system for structured data with a simple data model designed to scale, perform well, and serve many different Google workloads.

The paper also emphasizes that this model gives applications dynamic control over how data is organized. Bigtable was used by many internal Google services, including Google Earth, Google Finance, web indexing, and personalized search.

The Data Model

Bigtable’s data model is easiest to understand if you start from the lookup rule:

\[(\text{row key}, \text{column key}, \text{timestamp}) \rightarrow \text{value}\]

A value in Bigtable is identified by three things: a row key, a column key, and a timestamp.

Google describes Bigtable as a “sparse, distributed, persistent, multidimensional sorted map.” That definition is precise, but it is dense. Here is what each part means.

A map means that Bigtable acts like a very large key-value structure. Given a row key, a column key, and a timestamp, the system returns the value stored at that location. The value itself is just a sequence of bytes. Bigtable does not try to interpret it for you.

Multidimensional means the lookup key has more than one part. In an ordinary key-value store, you look up a value with a single key. In Bigtable, the full lookup uses three dimensions: row key, column key, and timestamp. The timestamp dimension is what lets you access older versions of data.

Sparse means that most possible cells in the table do not exist. A row may have millions of possible columns in principle, but only a small number may actually contain values. Bigtable stores only the cells that are present, which makes it efficient for data sets where different rows have different attributes.

Sorted means rows are stored in lexicographic order by row key. This is important because it makes scans over row ranges efficient, and it means that the choice of row key affects locality and performance. Rows that are close in key order are stored close together.

Distributed means the table is split across many machines. Bigtable partitions tables into row ranges called tablets, and these tablets are distributed across servers so the system can scale out.

Persistent means the data is stored durably, not just kept in memory for the moment. Bigtable is a storage system, so data remains available across process restarts and machine failures.

You can think of Bigtable as a huge sorted table where each row is identified by a row key. Its columns are organized into column families, which are just groups of related columns. Within a family, columns can be added flexibly, and each cell can store several timestamped versions. That makes Bigtable very different from a relational database, but well-suited for large-scale structured data at Google.
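The "sparse, multidimensional sorted map" description can be made concrete with a toy in-memory model. This is a sketch only (class and method names are invented); real Bigtable stores uninterpreted byte strings on disk across many machines:

```python
class ToyBigtable:
    """Toy model of Bigtable's data model: a sparse map from
    (row key, column key, timestamp) to a value. Only cells that
    are actually written exist, and row keys stay in sorted order."""

    def __init__(self):
        # {row_key: {column_key: [(timestamp, value), ...]}}, newest version first
        self.rows = {}

    def put(self, row, column, timestamp, value):
        versions = self.rows.setdefault(row, {}).setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(reverse=True)          # keep newest version first

    def get(self, row, column, at=None):
        """Return the newest value, or the newest version at or before `at`."""
        for ts, value in self.rows.get(row, {}).get(column, []):
            if at is None or ts <= at:
                return value
        return None                          # sparse: missing cells simply don't exist

    def scan(self, start_row, end_row):
        """Rows are sorted, so a range scan is just an ordered slice."""
        for row in sorted(self.rows):
            if start_row <= row < end_row:
                yield row
```

A lookup such as `get("com.cnn.www", "contents:", at=t)` corresponds directly to the \((\text{row key}, \text{column key}, \text{timestamp}) \rightarrow \text{value}\) rule above.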

Rows

Each row in a Bigtable is identified by a row key. Row keys are arbitrary strings, up to 64 KB in size, and rows are stored in lexicographic order by row key. That ordering matters because nearby row keys are stored together, which affects locality and scan performance.

A good row key is one that matches the way the application will access the data. Since rows are stored in sorted order, the row key is not just an identifier. It also affects how data is clustered on storage servers.

For example, suppose you want to store web pages by domain name. If you used ordinary domain names as row keys, www.cs.rutgers.edu and www.rutgers.edu would not be grouped by their shared domain structure in any useful way. Bigtable’s paper gives the example of reversing domain names first, so these become edu.rutgers.cs.www and edu.rutgers.www. Now they share the prefix edu.rutgers, so they will be stored near each other in sorted order. That makes it easier to keep related pages together and to scan over pages from the same organization or domain hierarchy.
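The key-reversal trick is simple enough to show directly. A minimal sketch (the function name is invented):

```python
def reversed_domain_key(hostname):
    """Reverse the dot-separated components of a hostname so that
    pages from the same domain share a common row-key prefix."""
    return ".".join(reversed(hostname.split(".")))

hosts = ["www.cs.rutgers.edu", "www.cnn.com", "www.rutgers.edu"]
row_keys = sorted(reversed_domain_key(h) for h in hosts)
# The two rutgers.edu hosts now sort adjacently under the prefix "edu.rutgers",
# so they end up stored near each other in Bigtable's lexicographic row order.
```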

Column Families and Column Keys

Columns in Bigtable are organized into column families. A column family is a named group of related columns, and families must be defined in advance when the table is created.

A full column name has the form family:qualifier. The part before the colon is the column family. The part after the colon is the qualifier, which identifies a specific column within that family.

For example, suppose a row stores information about the web page www.cnn.com. If we use reversed domain names as row keys, that page might be stored in the row com.cnn.www.

One column family might be contents, which stores the page itself. Another might be anchor, which stores information about links from other pages that point to this page.

Now suppose a web crawler scans www.cnnsi.com and finds a link to www.cnn.com. The crawler can add a column to the row com.cnn.www in the anchor family. The qualifier identifies the page where the link was found, so the new column might be anchor:cnnsi.com. The value stored in that cell is the anchor text, the text a user sees and can click on for the link, such as "CNN".

If another page, such as my.look.ca, also links to www.cnn.com, Bigtable can store another column in the same row, such as anchor:my.look.ca, with that link’s visible text as the value.

The important idea is that the family is fixed, but the qualifiers within that family are not. A program may know that it wants to examine all the incoming links recorded in the anchor family for a given row, but it may not know ahead of time which source pages are present. It can iterate through the columns in that family, discover their qualifiers, and then read their values.

Timestamps and Versioning

A Bigtable cell can store multiple versions of a value. These versions all belong to the same row key and column key, and are distinguished by a 64-bit integer timestamp. Larger timestamps identify newer versions.

Timestamps can be assigned automatically by Bigtable, using time in microseconds, or supplied by the application. This lets Bigtable keep several versions of the same data item instead of overwriting the old value immediately.

You can configure how many versions to retain, or set a garbage-collection policy to discard versions older than a given age. For example, the contents family might keep the three most recent versions, allowing the system to retain a short history of how a page has changed.
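The two garbage-collection policies can be sketched as simple filters over a cell's version list. These are illustrative helper functions, not Bigtable's API:

```python
def keep_newest_versions(versions, max_versions=3):
    """Retain only the `max_versions` newest (timestamp, value) pairs,
    modeling a policy like "keep the three most recent versions"."""
    return sorted(versions, key=lambda tv: tv[0], reverse=True)[:max_versions]

def keep_recent_versions(versions, now, max_age):
    """Discard versions older than `max_age`, modeling an age-based policy."""
    return [(ts, v) for ts, v in versions if now - ts <= max_age]
```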

How Bigtable Scales: Tablets

A Bigtable table is not stored as one giant object. It is split into tablets, where each tablet holds a contiguous range of row keys. Since rows are stored in sorted order, a tablet is really just a slice of the table’s sorted row-key space.

A table starts as a single tablet. As it grows, Bigtable splits it into smaller tablets. If a tablet becomes too large, it is split into two tablets, each covering part of the row-key range. This is how Bigtable scales to very large tables: different tablets can be placed on different tablet servers.

Each tablet is served by one tablet server at a time. A separate master server coordinates the system at a high level: it keeps track of tablet assignments, detects tablet-server failures, and reassigns tablets when needed.

Clients do not send reads and writes through the master. Instead, they use Bigtable’s metadata to determine which tablet contains a given row key, and then contact the tablet server responsible for that tablet.

Bigtable scales by partitioning a table by row-key ranges and spreading those ranges across many servers.

Bigtable relies on two Google systems we discussed earlier. It stores its persistent data files and logs in GFS, which provides distributed storage underneath the system. It uses Chubby for coordination, including master election, server discovery, and storing bootstrap location information.

How Bigtable Writes Data

Inside a tablet, Bigtable does not keep rewriting on-disk files for every update. Instead, it uses a write path built around three pieces:

  1. A write-ahead log

  2. A memtable

  3. SSTables

When a write arrives, Bigtable first appends it to a write-ahead log. This is for recovery. If a tablet server crashes, the log can be used to reconstruct recent updates.

The write is then placed into an in-memory memtable, which keeps recent updates in sorted order. This lets Bigtable accept writes quickly without updating disk files in place.

When the memtable becomes large enough, Bigtable writes it to disk as an SSTable. An SSTable is an immutable sorted file. Immutable just means that once it is written, it is not modified in place.

So the basic write path is:

write → write-ahead log → memtable → SSTable

Reads may need to check both the memtable and the SSTables on disk, because the newest data may still be in memory while older data has already been flushed to disk.
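The whole write path can be sketched as a toy tablet store. This is a simplified model with invented names; the real system writes its log and SSTables to GFS:

```python
class ToyTabletStore:
    """Sketch of the Bigtable write path: append to a write-ahead log,
    buffer in an in-memory memtable, and flush to an immutable SSTable
    once the memtable grows large enough."""

    def __init__(self, flush_threshold=2):
        self.log = []          # write-ahead log: the recovery record
        self.memtable = {}     # recent writes, held in memory
        self.sstables = []     # immutable on-disk files, newest last
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.log.append((key, value))    # 1. durability first
        self.memtable[key] = value       # 2. then the in-memory table
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # 3. the memtable becomes an immutable, sorted SSTable
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        # Newest data may still be in memory; older data is on disk.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):   # newest file first
            if key in sstable:
                return sstable[key]
        return None
```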

How Bigtable Reads Data

When Bigtable reads data, it may need to look in more than one place. The newest updates may still be in the memtable, while older data may already have been written to SSTables on disk.

To answer a read, Bigtable checks the memtable and the relevant SSTables, then combines the results to find the most recent version of each requested cell. This is one reason compaction matters: if too many SSTables accumulate, reads may have to examine more files.

Compaction

Over time, a tablet accumulates multiple SSTables on disk. Bigtable periodically merges them into fewer, larger SSTables. This process is called compaction.

Compaction is important for two reasons. First, it reduces the number of files that must be checked during a read. Second, it lets Bigtable clean up older data that should no longer be kept.

You do not need to remember different kinds of compaction. The important point is that compaction is the cleanup process that keeps immutable SSTables from accumulating forever.
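The essence of compaction is merging sorted immutable files so that newer values win and obsolete versions disappear. A minimal sketch (the function name is invented):

```python
def compact(sstables):
    """Merge several immutable SSTables into one. Tables are passed
    oldest first, so later (newer) tables overwrite older values,
    and obsolete versions are dropped from the merged result."""
    merged = {}
    for sstable in sstables:              # oldest first, newest last
        merged.update(sstable)            # newer values win
    return dict(sorted(merged.items()))   # keep the result sorted
```

After compaction, a read needs to consult one file instead of many.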

Example: How a Mapping System Could Use Bigtable

A mapping system is a good way to see how Bigtable’s data model could be used.

Bigtable can store the geographic data that gets assembled into the map tiles you see in your browser. The key insight is that a rendered map tile is not a single stored image, but is assembled from multiple layers of geographic features, each stored separately. Road geometry, building footprints, terrain elevation, satellite imagery, and labels all live in different column families within the same row. When a user’s browser requests a tile, the rendering pipeline reads only the column families it needs for that tile type (a satellite view needs imagery; a road map doesn’t), merges them, and renders the result.

Row keys encode the geographic location and zoom level of each segment, and are designed so that spatially adjacent segments have lexicographically adjacent row keys. This means that rendering a viewport – which requires many adjacent tiles – is a sequential range scan rather than a set of scattered random lookups.

Timestamps let Bigtable retain multiple versions of imagery. Historical satellite photos can be served alongside current data without duplicating the row structure, and a garbage collection policy automatically drops versions older than a configured threshold.

Because each layer lives in a separate column family, reads for one layer do not pull in data from others. At Google’s scale, eliminating unnecessary I/O is the difference between a fast system and an unacceptably slow one.


HBase

HBase is an open-source system modeled after Google’s Bigtable. It provides a similar wide-column data model, but runs on top of Hadoop and HDFS instead of Google’s internal infrastructure.

At a high level, the architecture follows the same ideas we just saw in Bigtable: tables are partitioned by row-key ranges, writes go through a write-ahead log and an in-memory buffer before being flushed to immutable disk files, and background compaction merges those files over time.

The names are a little different. HBase calls its tablets regions, its tablet servers RegionServers, its memtable a memstore, and its SSTables HFiles. You do not need to memorize those differences. If you understand Bigtable, you already understand the main ideas behind HBase.

HBase became part of the Hadoop ecosystem and was often used when organizations wanted Bigtable-like behavior in systems built around HDFS.


Cassandra: Combining Ideas from Bigtable and Dynamo

Cassandra is a distributed wide-column database that combines ideas from Bigtable and Dynamo. Facebook built it for Inbox Search, which had to index messages for hundreds of millions of users while handling a huge volume of writes. The system also had to stay available when machines failed and avoid any single central point of control.

Bigtable contributed the wide-column data model and the storage-engine design based on an in-memory table and immutable disk files. Dynamo contributed the decentralized architecture: partitioning data across nodes with consistent hashing, replicating data across the cluster, and letting the application choose consistency levels for reads and writes.

The result is a system with no master node, no single point of failure, and a consistency model that the application can tune.

Structure of a Cassandra Cluster

A Cassandra deployment consists of a cluster of nodes. There is no master. Any node can receive a client request and act as a coordinator for that operation. The coordinator forwards the request to the appropriate replicas and gathers the responses.

Data is partitioned across the cluster and replicated on multiple nodes. Since there is no single master, the system is designed to keep operating even when some nodes fail.

This is one of the clearest connections to Dynamo. Cassandra uses a Dynamo-like distributed hash table approach for placement and replication rather than Bigtable’s master-coordinated tablet architecture.

Partitions, Rows, and Columns

A Cassandra table stores rows, and each row contains columns.

A column is a named value within a row. For example, a restaurant row might contain columns such as address, city, phone_number, and opening_date.

Rows are grouped into partitions. A partition is the set of rows that share the same partition key. All rows in one partition are stored together on the same set of replica nodes.

So there are three levels to keep in mind: columns are named values, rows are collections of columns, and partitions are groups of rows that share the same partition key.

Partition Keys

The partition key determines where data lives.

Cassandra hashes the partition key, and that hash determines which nodes store the partition. This is very similar to Dynamo. In both systems, consistent hashing is used to spread data across a ring of nodes. Each node is responsible for some range of hash values, and data is placed according to the hash of its key.

The important consequence is that the partition key gives the programmer control over which rows will live together on the same set of machines.

If two rows have different partition keys, they may be placed on completely different nodes. If they have the same partition key, they belong to the same partition and are stored together.
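Dynamo-style placement can be sketched with a small consistent-hashing ring. This is illustrative only (hypothetical token space and function names), not Cassandra's actual partitioner:

```python
import bisect
import hashlib

def token(key):
    """Hash a partition key onto a hypothetical 32-bit token ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

def replicas_for(key, ring, replication_factor=3):
    """Walk clockwise from the key's token and take the next
    `replication_factor` distinct nodes, wrapping around the ring.
    `ring` maps a node's token to its name."""
    tokens = sorted(ring)
    start = bisect.bisect_right(tokens, token(key))
    chosen = []
    for i in range(len(tokens)):
        node = ring[tokens[(start + i) % len(tokens)]]
        if node not in chosen:
            chosen.append(node)
        if len(chosen) == replication_factor:
            break
    return chosen
```

Rows with the same partition key always hash to the same token, and therefore to the same set of replica nodes.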

Clustering Columns

Within a partition, Cassandra stores rows in sorted order according to the clustering columns.

The clustering columns do not determine which machine stores the data. The partition key already did that. The clustering columns determine how rows are ordered inside the partition.

This gives the programmer control over locality within a partition. Rows that are close in clustering-key order will be stored near each other, which makes iteration and range scans efficient.

So the two kinds of keys have different jobs: the partition key determines placement (which nodes store the rows), and the clustering columns determine ordering (how rows are sorted within a partition).

That distinction is central to understanding how Cassandra stores data.

Example: McDonald’s Restaurants

Suppose we are storing information about McDonald’s restaurants.

If the partition key is restaurant_id, then each restaurant will likely end up in a different partition. Because partition keys are hashed, the restaurants will be spread fairly evenly across the cluster. This is good for balancing load, but it does not help if we often want to look at restaurants by country or region.

Now suppose the partition key is country_code.

Then all restaurants in the same country will belong to the same partition and will be stored on the same set of replica nodes. All U.S. restaurants will live together, all Canadian restaurants will live together, and so on. This gives the programmer control over which rows are colocated.

Now suppose the clustering columns are state and county.

Within the partition for one country, Cassandra will store rows sorted first by state and then by county. That means restaurants in the same state will be next to each other, and within a state, restaurants in the same county will also be grouped together.

So if the partition key is US, rows in that partition might be ordered like this: all New Jersey restaurants first (those in Bergen County before those in Middlesex County), then all New York restaurants, and so on, sorted first by state and then by county.

This is the value of clustering columns. They let the application iterate efficiently through rows in a useful order. Cassandra can scan through all U.S. restaurants, all New Jersey restaurants, or all restaurants in Middlesex County without jumping around randomly inside the partition.

In this example, the partition key is country_code and the clustering columns are state and county.

The partition key controls which rows live together on the same set of machines. The clustering columns control how those rows are ordered within the partition.
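The restaurant example can be sketched in a few lines. The grouping models the partition key; the sort models the clustering columns (restaurant IDs and locations are invented):

```python
from collections import defaultdict

restaurants = [
    # (country_code, state, county, restaurant_id) -- illustrative rows
    ("US", "NY", "Albany",    "r3"),
    ("US", "NJ", "Middlesex", "r1"),
    ("CA", "ON", "York",      "r4"),
    ("US", "NJ", "Bergen",    "r2"),
]

# Partition key (country_code) decides which rows live together;
# clustering columns (state, county) decide their order inside a partition.
partitions = defaultdict(list)
for country, state, county, rid in restaurants:
    partitions[country].append((state, county, rid))
for rows in partitions.values():
    rows.sort()   # sorted by (state, county) within the partition
```

Scanning `partitions["US"]` now walks New Jersey before New York, and Bergen before Middlesex, without jumping around.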

Comparison with Bigtable

Cassandra and Bigtable are both wide-column systems, so they share a family resemblance. Both store data in rows and columns, and both are designed to scale across many machines.

The main difference is how they place and order data.

Bigtable stores rows in sorted order by row key across tablets. Row-key order is global within the table, and that makes range scans over row keys efficient.

Cassandra does not keep one global sorted row order across the cluster. Instead, it hashes the partition key to decide placement. That gives even distribution and avoids dependence on one central master, but it also means that global row-key order is lost. Ordering exists only within a partition, where rows are sorted by clustering columns.

You can think of Cassandra as giving up Bigtable’s globally sorted row-key layout in exchange for Dynamo-style partitioning and decentralized operation.

Similarity to Dynamo

Cassandra’s partitioning model is close to Dynamo’s.

Both systems use consistent hashing to distribute data across a ring. Both replicate data across multiple nodes. Both are designed to avoid a single master. Both continue operating when nodes fail, as long as enough replicas remain available.

This is why Cassandra is often described as combining Bigtable’s data model with Dynamo’s distributed systems design.

Writes: Similarity to Bigtable

Cassandra’s write path looks much more like Bigtable than Dynamo.

When a write arrives, Cassandra first appends it to a commit log. This is the recovery record. If a node crashes, the commit log helps reconstruct recent writes.

The write is then applied to an in-memory memtable. The memtable holds recent updates in memory.

When the memtable becomes large enough, Cassandra flushes it to disk as an SSTable. An SSTable is an immutable sorted file. Once written, it is not updated in place.

So the write path is:

write → commit log → memtable → SSTable

That is very close to Bigtable’s write-ahead log, memtable, and SSTable design.

Reads: Similarity to Bigtable

Reads also resemble Bigtable.

A read may need to check the memtable for the newest data and also examine one or more SSTables on disk for older data. Cassandra combines those results to return the latest value for each requested item.

Over time, many SSTables can accumulate. Cassandra therefore performs compaction, which merges SSTables into fewer, larger files. Compaction reduces the number of files that reads must examine and removes obsolete data.

So Bigtable and Cassandra are quite similar in the way they store data on a node, even though they differ in how they distribute data across the cluster.

Replication and Tunable Consistency

Each partition is replicated on multiple nodes. Because data is replicated, Cassandra lets the client choose how many replicas must respond to a read or write.

At a low consistency level, such as ONE, the operation can complete after one replica responds. This improves latency and availability, but a read may return stale data.

At QUORUM, a majority of replicas must respond. This gives stronger consistency, because read and write quorums overlap.

At ALL, every replica must respond. This gives the strongest behavior, but also lowers availability.

This is one of Cassandra’s defining features. The application can choose the consistency level for each operation, depending on whether low latency or stronger consistency matters more.
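The consistency levels reduce to a count of required responses, and the quorum-overlap argument is just arithmetic. A simplified model (function names are invented):

```python
def required_acks(level, replication_factor):
    """Number of replicas that must respond for a given consistency level,
    a simplified model of Cassandra's ONE / QUORUM / ALL."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1   # a majority
    if level == "ALL":
        return replication_factor
    raise ValueError(level)

def reads_overlap_writes(read_level, write_level, replication_factor):
    """Reads are guaranteed to see the latest write when R + W > N:
    any read quorum must intersect any write quorum."""
    r = required_acks(read_level, replication_factor)
    w = required_acks(write_level, replication_factor)
    return r + w > replication_factor
```

With a replication factor of 3, QUORUM reads plus QUORUM writes overlap (2 + 2 > 3), while ONE plus ONE does not, which is why ONE can return stale data.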

What to Remember About Cassandra

Cassandra combines Dynamo’s distributed architecture with Bigtable’s storage engine.

Dynamo’s influence is in the ring, consistent hashing, replication, and the lack of a master.

Bigtable’s influence is in the wide-column model, the memtable, the SSTables, the commit log, and the compaction-based read and write path.

The partition key controls where data lives. The clustering columns control how data is ordered within a partition.


NewSQL: Google Spanner

Systems like Bigtable and Cassandra showed how to scale storage by partitioning data across many machines. Traditional relational databases showed how to provide strong transactional guarantees, but they were not designed to scale easily across data centers.

Google wanted both: horizontal scalability and full transactional semantics.

Spanner is Google’s answer to that problem. It is a distributed relational database that supports ACID transactions across machines and data centers while still scaling by partitioning data.

A useful way to think about Spanner is this: to applications it looks like a relational SQL database with transactions; underneath, it is partitioned and replicated across data centers like the systems we just studied; and it assigns transactions globally meaningful timestamps so that their order lines up with real time.

That last point is important because once data is distributed across the world, “what happened first?” becomes a real systems problem.

What Spanner Combines

Spanner brings together several ideas we already studied: partitioning tables into contiguous key ranges (as in Bigtable), replication managed by Paxos, two-phase commit for transactions that span partitions, two-phase locking for concurrency control, and multiversion concurrency control for snapshot reads.

Each of these solves a different problem.

Later we will see how TrueTime and commit wait make the ordering of transactions line up with real time. Spanner works because all of these pieces fit together.

Physical Structure

Spanner has both a physical structure and a logical structure.

At the physical level, data is stored on spanservers. Spanservers are organized into zones. A zone is a large administrative and failure domain, roughly at the data center level. If one zone has a problem, replicas in other zones can continue serving data.

That geographic structure is important because Spanner is designed to replicate data across zones and across regions. The whole point is not just to survive one machine failure, but to tolerate failures at larger scales while still keeping data consistent.

At the logical level, a Spanner deployment is called a universe. A universe contains databases. A database contains tables. Tables contain rows and columns. Large tables are divided into splits, which are contiguous ranges of keys.

So the logical structure looks like this: universe → databases → tables → splits.

A split is the basic unit of distribution. It is the Spanner analogue of a Bigtable tablet.

Splits and Key Ordering

To understand splits, it helps to connect them directly to Bigtable.

Bigtable stores rows in sorted order by row key. A tablet is a contiguous range of that sorted key space. Spanner uses the same basic idea. Rows are ordered by key, and a split is a contiguous range of that ordering.

So if a table’s rows are ordered by primary key, then a split might contain one interval of those keys, and another split might contain the next interval.

This is important because it preserves locality. Rows with nearby keys are stored together. That makes range scans efficient, and it gives the system a natural way to partition large tables.

A split is therefore not just “some subset of rows.” It is a slice of the table’s sorted key space.

That direct connection to Bigtable is worth keeping in mind: a Spanner split plays the same role as a Bigtable tablet, a contiguous slice of a table's sorted key space.

Replication and Paxos Groups

Each split is replicated across multiple servers. The replicas of a split are managed by a Paxos group.

The most precise way to say this is:

A Paxos group is the set of servers that maintain replicas of a split and use Paxos to agree on updates to that split.

Each split has its own Paxos leader, so write leadership is distributed across the database rather than concentrated in one global leader.

Note that the servers are not themselves replicated. The data is replicated across servers, and those servers together run the Paxos protocol to keep those replicas consistent.

The key structural idea is:

As long as a majority of replicas in a Paxos group are available, that split can continue to process updates.

This is one of the most important ideas in Spanner. Distribution alone is not enough. Once data is partitioned, each partition also has to survive failures and maintain a consistent order of writes. That is what Paxos is doing.

Directories

Spanner also groups related data into directories, which are sets of contiguous keys that share a prefix.

The system can move a directory as a unit. That lets Spanner keep related data together and rebalance load without moving unrelated data.

You do not need to focus on directory internals, but the idea is useful: Spanner tries to preserve locality not just at the row level, but also at the level of groups of related rows.

Transactions Across Splits

If a transaction touches data in only one split, then the Paxos group for that split can handle the transaction locally.

If a transaction touches data in more than one split, then the system needs coordination across multiple Paxos groups. This is where two-phase commit enters the picture.

This is the same basic protocol discussed previously: a coordinator asks every participating group to prepare, each group votes, and the transaction commits only if all participants vote yes.

That leads to an important distinction: Paxos keeps the replicas of a single split consistent, while two-phase commit coordinates a transaction across multiple splits.

Those are different jobs, and Spanner uses both.
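The coordinator's decision rule in two-phase commit is small enough to sketch. This is a toy model with a hypothetical interface, ignoring the failure handling and logging a real implementation needs:

```python
def two_phase_commit(participants):
    """Phase 1: ask each participating Paxos group to prepare.
    Phase 2: commit only if every participant voted yes; otherwise abort.
    `participants` maps a split name to a prepare() callable that
    returns True (yes vote) or False (no vote). Hypothetical interface."""
    votes = {split: prepare() for split, prepare in participants.items()}
    decision = "commit" if all(votes.values()) else "abort"
    return decision, votes
```

In Spanner, each "participant" here is itself a replicated Paxos group, so a single replica failure does not force an abort.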

Read-Write Transactions: Strict 2PL and Wound-Wait

For read-write transactions, Spanner uses strict two-phase locking (strict 2PL), which we covered previously.

That means a transaction acquires a shared lock before reading an item and an exclusive lock before writing it, and holds all of its locks until it commits.

A shared lock allows multiple transactions to read the same item at the same time. That improves concurrency because readers do not have to block access to other readers.

An exclusive lock is needed to modify an item. If one transaction holds an exclusive lock, other transactions cannot read or write that item until the lock is released.

Strict 2PL means that locks are held until commit. That prevents other transactions from seeing partial results and gives serializable behavior for read-write transactions.

The cost is reduced concurrency: while one transaction holds an exclusive lock, every other transaction that needs that item must wait.

That creates the possibility of deadlock. Spanner prevents deadlock with wound-wait.

In wound-wait, each transaction carries a timestamp, and older means higher priority.

If an older transaction requests a lock held by a younger one, the younger transaction is aborted ("wounded") and restarted later.

If a younger transaction requests a lock held by an older one, it simply waits.

This gives older transactions priority and avoids cycles of waiting: waits only ever go from younger transactions to older ones.
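The wound-wait rule can be sketched as a small decision function: an older transaction wounds a younger lock holder, while a younger transaction waits for an older one. This is a hypothetical illustration, not Spanner's actual code; smaller timestamps mean older transactions:

```python
# Hypothetical sketch of the wound-wait rule. Timestamps act as
# priorities: a smaller timestamp means an older transaction.
def wound_wait(requester_ts: int, holder_ts: int) -> str:
    if requester_ts < holder_ts:
        # Older transaction wants a lock held by a younger one:
        # abort ("wound") the younger transaction; it restarts later.
        return "wound holder"
    else:
        # Younger transaction waits for the older one to finish.
        return "wait"

# An older transaction (ts=5) never waits behind a younger one (ts=9):
assert wound_wait(5, 9) == "wound holder"
# A younger transaction (ts=9) waits for an older one (ts=5):
assert wound_wait(9, 5) == "wait"
```

Because waits only flow from younger to older transactions, no cycle of waiting (and hence no deadlock) can form.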

So the read-write transaction story in Spanner ties directly to previously discussed mechanisms: strict two-phase locking provides serializability, wound-wait prevents deadlock, two-phase commit coordinates across splits, and Paxos replicates each split's updates.

Read-Only Transactions, Snapshot Reads, and MVCC

Spanner handles read-only transactions differently.

A snapshot read means reading the database as it existed at a particular timestamp \(t\).

More precisely, for each item, the system returns the most recent committed version whose timestamp is less than or equal to \(t\).

That is a cut through time. The read is not just “old data.” It is a consistent view of the whole database at one point in time.

This is where multiversion concurrency control (MVCC) fits in.

MVCC means that the database keeps multiple committed versions of data rather than immediately overwriting old ones. Each version has a timestamp. That makes snapshot reads possible: when a read asks for the database as of time \(t\), the system can choose the correct version of each item.
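The version-selection step of a snapshot read can be sketched with a small version store. This is a hypothetical illustration, assuming each item maps to a list of (commit timestamp, value) pairs; it is not Spanner's actual storage format:

```python
# Hypothetical sketch of MVCC version selection for a snapshot read at time t.
# versions: item -> list of (commit_timestamp, value) pairs.
def snapshot_read(versions, item, t):
    """Return the most recent committed value with timestamp <= t."""
    candidates = [(ts, v) for ts, v in versions.get(item, []) if ts <= t]
    if not candidates:
        return None  # the item did not exist at time t
    return max(candidates)[1]  # latest qualifying version wins

history = {"balance": [(10, 100), (20, 250), (30, 175)]}
assert snapshot_read(history, "balance", 25) == 250  # version from ts=20
assert snapshot_read(history, "balance", 35) == 175  # version from ts=30
assert snapshot_read(history, "balance", 5) is None  # before first commit
```

Applying this selection rule to every item at the same timestamp \(t\) is what yields a consistent cut through time.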

This is one of the most useful features in Spanner.

Because snapshot reads use committed older versions, they usually do not need to lock current data. That means readers do not block writers, and writers do not block readers.

This is especially important for long-running reads.

Suppose you want to scan a huge table of customer records, search through a global log database, or run a report across a large fraction of the data. In a pure locking system, such a read could interfere badly with ongoing updates. It might hold locks for a long time, or it might be blocked by current writes.

With snapshot reads, the system can read a consistent historical view without locking the current rows. The search or scan sees a coherent database state, and update transactions can continue.

This only works if commit timestamps are meaningful across the whole system.

That point is crucial. A snapshot may span many splits and many data centers. If timestamps are only locally meaningful, then “the state of the database at time \(t\)” would not mean anything coherent across the whole system. Snapshot reads depend on having globally meaningful commit timestamps.

That requirement is one of the main motivations for TrueTime.

External Consistency

Spanner’s most famous guarantee is external consistency.

External consistency says:

If transaction \(T_1\) commits before transaction \(T_2\) begins in real time, then \(T_1\)’s commit timestamp must be less than \(T_2\)’s commit timestamp.

So the order of transactions in the database must respect real time.

This is the transaction-level version of linearizability.

A reminder of linearizability is this: if one operation finishes before another starts in real time, then the system must behave as though the first happened before the second.

Linearizability is usually discussed for single operations on shared objects. External consistency extends that same real-time ordering idea to transactions.

It is also important to emphasize that this is stronger than ordinary serializability.

Serializability says that transactions must be equivalent to some serial order. But that serial order does not have to match wall-clock time.

External consistency requires more: the equivalent serial order must also agree with the real-time order in which transactions commit.

So external consistency is stronger than ordinary serializability. It is often described as strict serializability, or linearizability at the level of transactions.

Example: Bank Transfer and Later Read

Suppose a bank transfer transaction runs in one region. It moves money from checking to savings and then returns “commit successful” to the client.

A moment later, another transaction starts in a different region and reads the account balances.

With external consistency, the second transaction must be ordered after the first one. It cannot receive an earlier timestamp and see the pre-transfer state after the transfer has already committed.

That is the behavior we associate with transaction-level linearizability.

It is also useful to see what could go wrong without this guarantee. In a system that was merely serializable, the transactions could still be placed in some serial order, but that order might not match real time. The second transaction might be serialized before the first, even though in real life it started later. External consistency rules that out.

Why Clocks Are a Problem

At first glance, external consistency sounds like a simple clock problem: stamp each transaction with the current time.

But clocks in distributed systems are not perfectly synchronized. Different machines may disagree about the current time by several milliseconds or more. One machine’s clock may run slightly fast, while another’s runs slightly slow.

That creates a serious problem. A transaction that actually commits later on one server could be assigned an earlier timestamp than a transaction that committed first on another server.

So Spanner cannot rely on naive local clock readings.

TrueTime

Spanner’s solution is TrueTime, a time API that returns a bounded interval instead of one exact timestamp.

\[TT.now() = [ earliest, latest ]\]

The real current time is guaranteed to lie somewhere inside that interval.

That means the system never claims more precision than it actually has: the earliest bound is guaranteed to have already passed, and the latest bound is guaranteed not to have passed yet.

The width of the interval is the clock uncertainty, often written as \(\epsilon\).

This is the critical idea: Spanner does not require perfectly synchronized clocks. It only requires the uncertainty to be bounded.

Google tries to keep this uncertainty small by synchronizing clocks using a GPS receiver at each data center, with an atomic clock as a backup reference. Those sources help keep the local clocks tightly aligned and keep the uncertainty interval narrow enough to be practical.
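The TrueTime interface can be sketched as a function that widens a local clock reading by the uncertainty bound. This is a hypothetical illustration; the `EPSILON` value is an assumption based on the few-millisecond uncertainty described below, not Google's actual implementation:

```python
# Hypothetical sketch of a TrueTime-style API. The real current time is
# guaranteed to lie somewhere inside the returned interval.
import time

EPSILON = 0.004  # assumed clock uncertainty bound (~4 ms)

def tt_now():
    """Return (earliest, latest); real time lies between the two."""
    local = time.time()
    return (local - EPSILON, local + EPSILON)

earliest, latest = tt_now()
assert earliest < latest
# The interval width is the uncertainty, 2 * epsilon:
assert abs((latest - earliest) - 2 * EPSILON) < 1e-5
```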

Commit Wait

TrueTime by itself is not enough. The key extra step is commit wait.

When a read-write transaction is ready to commit, Spanner does the following:

  1. Acquire all locks.

  2. Do the transaction’s work.

  3. Choose a commit timestamp: \(t = TT.now().latest\).

  4. Wait until TT.now().earliest > t.

  5. Commit.

  6. Release locks.

The important step is the wait.

When Spanner chooses \(t = TT.now().latest\), it is choosing a timestamp that is guaranteed not to be earlier than the real current time. But the system still does not know exactly where real time lies inside the interval. So it waits.

Once TT.now().earliest > t, the system knows that real time has definitely advanced past \(t\). Only then is the transaction allowed to become visible.

That is what makes the commit timestamp safe as a real-time ordering point.
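The commit-wait step can be sketched on top of a TrueTime-style interval clock. This is a hypothetical illustration under an assumed uncertainty bound `EPSILON`, not Spanner's actual code:

```python
# Hypothetical sketch of commit wait using a TrueTime-style API.
import time

EPSILON = 0.004  # assumed clock uncertainty bound (~4 ms)

def tt_now():
    """Return (earliest, latest); real time lies between the two."""
    local = time.time()
    return (local - EPSILON, local + EPSILON)

def commit_wait():
    """Choose a commit timestamp, then wait until it is surely past."""
    t = tt_now()[1]            # choose t = TT.now().latest
    while tt_now()[0] <= t:    # wait until TT.now().earliest > t
        time.sleep(0.0005)
    return t

t = commit_wait()
# Once commit_wait returns, t is guaranteed to be in the past:
assert tt_now()[0] > t
```

The loop waits roughly \(2\epsilon\): long enough for the earliest bound to move past the chosen timestamp, at which point the transaction can safely become visible.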

Why Commit Wait Gives External Consistency

Suppose transaction \(T_1\) chooses commit timestamp 100.

At that instant, Spanner does not yet know whether real time has definitely passed 100 everywhere. So it waits.

Once TT.now().earliest > 100, Spanner knows that 100 is now definitely in the past. It can safely expose the transaction.

Now suppose transaction \(T_2\) starts after \(T_1\) commits.

Since \(T_1\) was not made visible until time 100 was definitely in the past, \(T_2\) cannot be assigned timestamp 99 or 100. It must get a later timestamp.

That is the real-time ordering needed for external consistency.

Commit Wait and Replication

Commit wait sounds expensive, but Spanner hides much of the cost by overlapping it with replication.

The sequence is:

  1. Acquire all locks.

  2. Do the transaction’s work.

  3. Choose a commit timestamp: \(t = TT.now().latest\).

  4. In parallel:

     - Start Paxos consensus so the commit is replicated durably.

     - Perform commit wait so real time advances past \(t\).

  5. Once both are finished, commit.

  6. Release locks.

This sequence is important because it ties together two major pieces of the design.

Paxos makes the commit durable and consistent across replicas.

Commit wait makes the chosen timestamp safe with respect to real time.

Because those two activities happen in parallel, the latency cost of commit wait is smaller than it first appears.

The additional delay from commit wait is small. In Google’s environment, thanks to GPS receivers and atomic clocks at each data center, the uncertainty window is typically only a few milliseconds, and the average commit wait time is on the order of 4 ms.

Putting the Pieces of Spanner Together

Spanner is easiest to understand by separating the role of each mechanism: splits partition the data by key range, Paxos replicates each split and orders its writes, two-phase locking and two-phase commit provide serializable read-write transactions, MVCC supports lock-free snapshot reads, and TrueTime with commit wait makes commit timestamps meaningful in real time.

Think of Spanner as Bigtable-style key-range partitioning plus distributed transactions plus globally meaningful time.

Key Points about Spanner

Spanner and Bigtable are not just research or internal systems. Google made Spanner available commercially as Google Cloud Spanner, while Bigtable is available as Google Cloud Bigtable. That makes the contrast between them especially useful: Bigtable exposes the wide-column, key-range partitioned storage model directly, while Spanner builds a globally distributed relational database on top of similar partitioning ideas, adding transactions, SQL, and externally consistent timestamps.


References