GFS and HDFS
- Chunk
- A fixed-size (64 MB) unit of file storage in GFS; the basic unit of replication and placement across chunkservers.
- Chunkserver
- A storage node in GFS that stores chunk replicas as ordinary files on its local disk and serves them directly to clients.
- Master (GFS)
- The single node in a GFS cluster that holds all file system metadata in memory, manages chunk placement, and grants the chunk leases that designate primary replicas.
- NameNode
- The HDFS equivalent of the GFS master; stores the file system namespace and block-to-DataNode mappings in memory.
- DataNode
- The HDFS equivalent of a GFS chunkserver; stores block replicas on local disk and serves them to clients.
- Operation log
- The persistent, replicated record of all metadata changes in GFS; used to reconstruct master state after a crash.
- Checkpoint (GFS)
- A snapshot of the master’s in-memory state written to disk periodically so that log replay after a crash starts from a recent state rather than the beginning.
- Lease (GFS)
- A time-limited grant from the master to one replica designating it as the primary for a chunk; the primary serializes concurrent mutations.
- Primary replica
- The replica holding the current lease for a chunk; responsible for assigning serial numbers to mutations and coordinating secondaries.
- Two-phase write
- GFS’s write protocol in which data transfer (pipelining to all replicas) is separated from the write request (sent to the primary after all replicas acknowledge receipt).
- Record append
- A GFS operation in which GFS chooses the offset and guarantees that data is written atomically at least once; supports concurrent multi-writer append patterns.
- Heartbeat (GFS)
- Periodic message from a chunkserver to the master confirming liveness; used to detect chunkserver failures and trigger re-replication.
- Checksum (GFS)
- Per-block integrity data stored by each chunkserver alongside chunk data; used to detect silent data corruption on read or during background scans.
- Re-replication
- The process by which the GFS master instructs surviving chunkservers to create new copies of under-replicated chunks after a chunkserver failure.
- HDFS Federation
- An HDFS feature that allows multiple independent NameNodes to manage separate portions of the namespace, removing the single-NameNode scalability limit.
- Colossus
- Google’s successor to GFS, which distributes the metadata function across multiple nodes to remove the single-master bottleneck.
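As an illustration of the Chunk and Checksum (GFS) entries above, the following minimal Python sketch maps a file byte offset to a chunk index and checks per-block checksums. The 64 MB chunk size comes from the glossary. The 64 KB checksum block size, the choice of CRC-32, and every function name here are assumptions made for illustration, not GFS's actual code.

```python
import zlib

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, as in the Chunk entry
BLOCK_SIZE = 64 * 1024         # assumed checksum block size (illustrative)


def chunk_location(byte_offset):
    """Return (chunk_index, offset_within_chunk) for a byte offset in a file."""
    return divmod(byte_offset, CHUNK_SIZE)


def block_checksums(data):
    """Compute one CRC-32 per BLOCK_SIZE slice of the given bytes."""
    return [
        zlib.crc32(data[start:start + BLOCK_SIZE])
        for start in range(0, len(data), BLOCK_SIZE)
    ]


def corrupted_blocks(data, stored_checksums):
    """Return indices of blocks whose current checksum differs from the stored one."""
    current = block_checksums(data)
    return [
        index
        for index, (now, stored) in enumerate(zip(current, stored_checksums))
        if now != stored
    ]
```

A reader of this sketch would call `chunk_location` to decide which chunk to ask the master about, and a storage node would call `corrupted_blocks` before returning data, so that a damaged block is reported rather than served.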
Distributed Hash Tables
- Distributed Hash Table (DHT)
- A decentralized system that distributes a key-value store across many nodes such that any node can route a lookup to the node responsible for a given key without a central directory.
- Consistent hashing
- A hashing scheme that maps both keys and nodes to positions in a shared identifier space so that adding or removing a node only affects the keys assigned to it, minimizing data movement.
- Identifier space
- The range of possible hash values (e.g., 0 to 2^160 - 1 for SHA-1) that keys and nodes are mapped into; typically visualized as a ring.
- CAN (Content Addressable Network)
- A DHT that partitions a multi-dimensional Cartesian coordinate space among nodes; keys are hashed to points and routed greedily toward their coordinates.
- Zone (CAN)
- A rectangular region of the coordinate space owned by a single CAN node; all zones together partition the space completely.
- Chord
- A DHT that places nodes and keys on a one-dimensional ring using consistent hashing; keys are assigned to their successor node, and lookups use finger tables for O(log n) routing.
- Successor (Chord)
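- The first node, moving clockwise around the ring, whose identifier is greater than or equal to a given key’s hash (wrapping around to the lowest-numbered node if no such identifier exists); the node responsible for storing that key.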
- Finger table
- A routing table at each Chord node containing pointers to nodes at exponentially increasing distances around the ring, enabling O(log n)-hop lookup.
- Stabilization (Chord)
- The background protocol that keeps Chord’s successor pointers and finger tables consistent as nodes join and leave the ring.
- Virtual node (vnode)
- A logical position on the consistent-hashing ring assigned to a physical server; each server owns many vnodes, improving load balance and distributing the impact of failures.
- Dynamo
- Amazon’s internal key-value store that uses consistent hashing with virtual nodes, N-way replication, quorum reads/writes, eventual consistency, and application-level conflict resolution.
- Preference list
- In Dynamo, the ordered list of N nodes responsible for storing replicas of a given key; determined by walking clockwise from the key’s position on the ring and skipping virtual nodes that belong to a physical server already on the list.
- Quorum (Dynamo)
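- A read/write policy parameterized by R (replicas that must respond to a read) and W (replicas that must acknowledge a write); choosing R + W > N makes every read set overlap every write set in at least one replica, so a read contacts at least one replica that holds the most recent acknowledged write.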
- Eventual consistency
- A consistency model in which all replicas of a value will converge to the same state if no new writes occur, but reads may return stale values in the meantime.
- Vector clock
- A data structure that tracks the causal history of a value by recording a counter per node that has written it; used by Dynamo to detect and present conflicting versions to the application.
- Hinted handoff
- A Dynamo mechanism in which a write intended for an unavailable replica is stored temporarily on a different node with a hint indicating the intended destination; the substitute node forwards the write when the target recovers.
- Gossip protocol
- A peer-to-peer information dissemination protocol in which each node periodically exchanges state with a randomly selected peer; used in Dynamo for membership and failure detection.
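The Consistent hashing, Virtual node, and Preference list entries above can be sketched together in a few lines of Python. This is a minimal illustration under stated assumptions, not Dynamo's or Chord's implementation: the `Ring` class, the `ring_hash` helper, the `server#i` naming of virtual nodes, and the default of 64 virtual nodes per server are all hypothetical choices.

```python
import bisect
import hashlib


def ring_hash(value):
    """Map a string to a position in the 160-bit SHA-1 identifier space."""
    return int.from_bytes(hashlib.sha1(value.encode()).digest(), "big")


class Ring:
    """Consistent-hashing ring in which each server owns several virtual nodes."""

    def __init__(self, servers, vnodes=64):
        # Each (position, server) pair is one virtual node on the ring.
        self._points = sorted(
            (ring_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._positions = [position for position, _ in self._points]

    def preference_list(self, key, n=3):
        """Walk clockwise from the key, collecting up to n distinct servers."""
        # bisect_left finds the first virtual node at or after the key's position.
        start = bisect.bisect_left(self._positions, ring_hash(key))
        chosen = []
        for step in range(len(self._points)):
            # The modulo wraps the walk around the end of the ring.
            server = self._points[(start + step) % len(self._points)][1]
            if server not in chosen:
                chosen.append(server)
                if len(chosen) == n:
                    break
        return chosen

    def owner(self, key):
        """Return the server holding the key's successor virtual node."""
        return self.preference_list(key, n=1)[0]
```

Building a second `Ring` with one extra server and comparing `owner` for the same keys shows the property named in the Consistent hashing entry: the only keys that change owner are the ones that move to the new server.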
DNS
- DNS (Domain Name System)
- A hierarchical, distributed, cacheable database that maps domain names to IP addresses and other resource records, serving as the internet’s naming infrastructure.
- Stub resolver
- A minimal DNS client built into the operating system that forwards queries to a configured recursive resolver and returns the final answer to the application; it does not walk the hierarchy itself.
- Recursive resolver
- A DNS server that resolves names on behalf of clients by iteratively querying root, TLD, and authoritative servers until it obtains the final answer.
- Authoritative name server
- A DNS server that holds the definitive records for a zone and answers queries for names within that zone without consulting other servers.
- Root name server
- A DNS server at the top of the resolution hierarchy that knows the authoritative servers for each top-level domain; there are 13 logical root server addresses, served by hundreds of physical machines worldwide via anycast.
- TLD (Top-Level Domain)
- The rightmost label in a domain name (e.g., .edu, .com, .org); managed by designated TLD operators who maintain authoritative servers for all second-level domains within the TLD.
- Delegation
- The mechanism by which a zone owner assigns authority for a subdomain to a different set of name servers, enabling the DNS hierarchy to be managed in a distributed, decentralized way.
- Zone
- A contiguous portion of the DNS namespace for which a specific set of authoritative name servers is responsible.
- TTL (Time To Live)
- A value in a DNS response that specifies how long a resolver may cache the response before discarding it and querying again; the tunable knob controlling the trade-off between freshness and query load.
- Iterative resolution
- A DNS resolution strategy in which the recursive resolver performs each step itself, receiving referrals from servers rather than having servers chain queries on its behalf.
- Cache poisoning
- An attack in which a malicious party injects forged DNS responses into a resolver’s cache, redirecting clients to attacker-controlled addresses.
- DNSSEC
- DNS Security Extensions; a suite of DNS extensions that use cryptographic signatures to allow resolvers to verify that DNS responses are authentic and unmodified.
- Registry
- An organization that operates the authoritative database for a TLD, maintaining the definitive list of registered domains and their delegated nameservers; Verisign, for example, is the registry for .com and .net.
- Registrar
- An ICANN-accredited company that sells domain name registrations to the public on behalf of a registry; examples include GoDaddy, Namecheap, and Domain.com.
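The TTL (Time To Live) entry above describes how long a resolver may keep a cached response. A minimal Python sketch of that behavior follows. The `TtlCache` class and its method names are hypothetical, and the injectable `clock` parameter is an assumption added so that expiry can be demonstrated without waiting. A real resolver cache also handles record types, negative answers, and size limits, none of which are modeled here.

```python
import time


class TtlCache:
    """Cache that returns an answer only until its TTL has elapsed."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # name -> (expiry_time, value)

    def put(self, name, value, ttl_seconds):
        """Store an answer together with the time at which it expires."""
        self._entries[name] = (self._clock() + ttl_seconds, value)

    def get(self, name):
        """Return the cached answer, or None if it is absent or expired."""
        entry = self._entries.get(name)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired: discard so the caller queries upstream again.
            del self._entries[name]
            return None
        return value
```

A short TTL makes `get` return `None` sooner, which forces more upstream queries but picks up record changes quickly. A long TTL does the opposite, which is the trade-off between freshness and query load that the entry names.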