Data In Motion

Message Queues and Event Streaming

Message broker: A service that sits between producers and consumers, storing messages durably and allowing each side to operate independently at its own pace.
Producer (publisher): A process that writes messages to a message broker without knowing which consumers will receive them.
Consumer (subscriber): A process that reads messages from a message broker without knowing which producers wrote them.
Topic: A named category or stream of messages; producers write to a topic and consumers subscribe to it.
AMQP (Advanced Message Queuing Protocol): The protocol underlying RabbitMQ, which defines exchanges, queues, and binding-based routing.
Exchange: In RabbitMQ, the component that receives messages from producers and routes them to one or more queues according to binding rules.
Binding: In RabbitMQ, a configured link from an exchange to a queue; optionally carries a binding key that the exchange matches against incoming message routing keys to decide whether to route a message to that queue.
Fanout exchange: A RabbitMQ exchange type that broadcasts every message to all queues bound to it, ignoring routing keys.
Direct exchange: A RabbitMQ exchange type that routes a message to queues whose binding key exactly matches the message’s routing key.
Topic exchange: A RabbitMQ exchange type that routes messages based on wildcard pattern matching against routing keys.
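The three exchange types differ only in how they compare a message's routing key with each binding key. The sketch below simulates that decision in plain Python; it does not use a RabbitMQ client, and the function names `topic_matches` and `route` and the queue names are hypothetical. It assumes the AMQP topic wildcards, where `*` matches exactly one dot-separated word and `#` matches zero or more words.

```python
def topic_matches(pattern: str, routing_key: str) -> bool:
    """Topic-style match: '*' is exactly one word, '#' is zero or more words."""
    def match(p, k):
        if not p:
            return not k
        if p[0] == "#":
            # '#' may absorb any number of leading words, including none.
            return any(match(p[1:], k[i:]) for i in range(len(k) + 1))
        if not k:
            return False
        if p[0] == "*" or p[0] == k[0]:
            return match(p[1:], k[1:])
        return False

    return match(pattern.split("."), routing_key.split("."))


def route(exchange_type, bindings, routing_key):
    """Return the queues that receive a message.

    bindings is a list of (queue_name, binding_key) pairs (hypothetical model).
    """
    if exchange_type == "fanout":
        return [q for q, _ in bindings]  # routing key ignored
    if exchange_type == "direct":
        return [q for q, key in bindings if key == routing_key]
    if exchange_type == "topic":
        return [q for q, key in bindings if topic_matches(key, routing_key)]
    raise ValueError(f"unknown exchange type: {exchange_type}")


bindings = [("audit", "#"), ("eu_orders", "orders.eu.*"), ("all_orders", "orders.#")]
print(route("topic", bindings, "orders.eu.created"))
print(route("direct", bindings, "orders.eu.*"))
print(route("fanout", bindings, "anything"))
```

With these bindings, a topic exchange delivers `orders.eu.created` to all three queues, while a direct exchange delivers only when the routing key equals a binding key character for character.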
Message acknowledgment: A signal sent by a consumer to the broker confirming successful message processing; the broker retains unacknowledged messages and redelivers them on consumer failure.
At-most-once delivery: A messaging guarantee in which a message is sent once and not retried; fast, but the message is lost if the consumer is unavailable or a failure occurs.
At-least-once delivery: A messaging guarantee in which the system retries until it receives an acknowledgment; no message is lost, but duplicates are possible. Consumers must be idempotent or deduplicate.
Exactly-once effect: The observable guarantee that each message produces its intended effect exactly once, achieved by combining at-least-once delivery with idempotent or transactional processing at the consumer. Requires cooperation from the source, the broker, and the output destination; not a property of the broker alone.
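One way to picture the exactly-once effect is a consumer that remembers which message IDs it has already applied, so that a redelivered message changes nothing. This is a minimal in-memory sketch; the class `IdempotentConsumer`, the message IDs, and the balance example are hypothetical, and a real consumer would need to record the ID and apply the effect atomically in durable storage.

```python
class IdempotentConsumer:
    """Applies each message's effect once, even if the broker redelivers it."""

    def __init__(self):
        self.seen_ids = set()   # assumption: IDs are unique per logical message
        self.balance = 0

    def handle(self, message_id, amount):
        if message_id in self.seen_ids:
            return False        # duplicate delivery: acknowledge, change nothing
        self.balance += amount  # the intended effect
        self.seen_ids.add(message_id)
        return True


consumer = IdempotentConsumer()
# At-least-once delivery: message "m2" arrives twice after a lost acknowledgment.
deliveries = [("m1", 10), ("m2", 5), ("m2", 5), ("m3", 7)]
applied = [consumer.handle(mid, amt) for mid, amt in deliveries]
print(applied, consumer.balance)
```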
Backpressure: The condition in which a consumer or downstream system cannot keep up with the rate at which a producer generates data; addressed through buffering (absorb bursts in a queue), dropping (discard when the buffer is full), or flow control (signal the producer to slow down).
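The three responses to backpressure can be illustrated with a bounded buffer. The sketch below uses Python's standard `queue.Queue`; the helper name `offer_or_drop` and the buffer size are hypothetical. It shows buffering and dropping directly, and notes in a comment how a blocking `put` would act as a simple form of flow control.

```python
import queue


def offer_or_drop(buf, item):
    """Buffer the item if there is room; otherwise drop it and report False."""
    try:
        buf.put_nowait(item)
        return True
    except queue.Full:
        return False


buf = queue.Queue(maxsize=2)  # absorbs a burst of up to two items
results = [offer_or_drop(buf, n) for n in (1, 2, 3)]
print(results)  # the third item finds the buffer full and is dropped

# Flow control alternative: buf.put(item) with no timeout blocks the producer
# until a consumer calls buf.get(), which slows the producer to the consumer's pace.
```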
Partition: In Kafka, an ordered log within a topic that grows only by appending; once a record is written it is never modified or overwritten. Each partition has a leader replica on one broker and is replicated independently to followers on other brokers.
Offset: In Kafka, a sequential integer that uniquely identifies a record’s position within a partition; consumers track their own offsets to control their position in the log.
Consumer group: In Kafka, a named set of consumers that collectively consume a topic; each partition is assigned to exactly one group member at a time, distributing work across the group.
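The rule that each partition belongs to exactly one group member can be sketched as a simple round-robin assignment. The function `assign_partitions` and the member names are hypothetical, and this is not Kafka's actual assignor logic; it only illustrates the one-owner-per-partition property.

```python
def assign_partitions(partitions, members):
    """Give every partition exactly one owner, spreading them round-robin."""
    ordered_members = sorted(members)
    assignment = {m: [] for m in ordered_members}
    for i, partition in enumerate(sorted(partitions)):
        owner = ordered_members[i % len(ordered_members)]
        assignment[owner].append(partition)
    return assignment


print(assign_partitions(range(6), ["c1", "c2"]))
print(assign_partitions(range(2), ["c1", "c2", "c3"]))
```

The second call shows a consequence of the rule: with more members than partitions, the extra member receives no work.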
Log compaction: A Kafka retention policy in which the broker keeps only the most recent record for each key, discarding older records with the same key during background processing. Produces a log that always contains the latest value per key rather than a complete history.
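The effect of compaction can be sketched as a filter that keeps only the highest-offset record for each key. The function `compact` and the `(offset, key, value)` tuple layout are hypothetical; the sketch ignores how and when the broker actually runs compaction.

```python
def compact(records):
    """Keep only the latest record per key; surviving records keep their offsets."""
    last_offset = {}
    for offset, key, _value in records:
        last_offset[key] = offset  # later records overwrite earlier ones
    return [r for r in records if last_offset[r[1]] == r[0]]


log = [(0, "a", 1), (1, "b", 2), (2, "a", 3), (3, "c", 4), (4, "b", 5)]
compacted = compact(log)
print(compacted)
```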
Event sourcing: An architectural pattern in which all state changes are recorded as an ordered log of events, allowing state to be reconstructed by replaying the log.
Sequential I/O: Disk access that reads or writes a continuous stream of data without seeking; orders of magnitude faster than random I/O and the basis for Kafka’s performance.
Page cache: An operating system mechanism that caches recently accessed disk blocks in RAM; Kafka exploits this to serve reads at memory speeds without a separate in-memory cache.
In-sync replica (ISR): In Kafka, a follower that is sufficiently caught up with the leader; the acks=all setting requires all ISR members to confirm a write before the producer receives an acknowledgment.
Event time: The timestamp at which an event actually occurred, as recorded by the source system; the correct basis for time-based aggregations.
Processing time: The timestamp at which the stream processing system receives and processes an event; easier to implement than event time but incorrect when data arrives late or out of order.
Tumbling window: A fixed-size, non-overlapping time window; each event belongs to exactly one window.
Sliding window: A fixed-size time window that advances by a configurable step smaller than the window size, producing overlapping windows in which some events appear multiple times.
Session window: A time window that groups events separated by less than a configurable inactivity gap; window size is not fixed and reflects natural bursts of user activity.
Watermark: A progress estimate in stream processing: a timestamp before which the system assumes no further events will arrive. The system stops waiting for events older than the watermark and uses it to decide when to close a window and emit results. Derived by subtracting a configured lag from the latest event timestamp seen.
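The window types and the watermark rule can be made concrete with integer timestamps. This is a sketch under stated assumptions: windows are half-open intervals `[start, end)`, all function names are hypothetical, and a window is treated as closed once the watermark reaches its end.

```python
def tumbling_window(ts, size):
    """The single fixed-size window that contains ts."""
    start = ts - ts % size
    return (start, start + size)


def sliding_windows(ts, size, step):
    """Every overlapping window (starting on a multiple of step) that contains ts."""
    windows = []
    start = ts - ts % step
    while start > ts - size:
        windows.append((start, start + size))
        start -= step
    return sorted(windows)


def session_windows(timestamps, gap):
    """Group events whose spacing is less than the inactivity gap."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


def watermark(max_event_time_seen, lag):
    """Latest event timestamp seen minus a configured lag."""
    return max_event_time_seen - lag


def window_closed(window, wm):
    return wm >= window[1]


print(tumbling_window(7, 5))
print(sliding_windows(7, 10, 5))
print(session_windows([1, 2, 3, 10, 11, 30], gap=5))
wm = watermark(15, lag=3)
print(wm, window_closed((0, 10), wm), window_closed((10, 20), wm))
```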
Micro-batch: The execution model used by Spark Structured Streaming, in which events are collected into small batches that are processed as a series of short batch jobs rather than one event at a time.
Unbounded table: The Spark Structured Streaming abstraction that treats an incoming stream as a table that grows indefinitely; users write queries against it using standard Spark APIs.
Checkpoint (streaming): A durable snapshot of a stream processor’s progress (offsets) and accumulated state, saved periodically so the system can recover from a failure by replaying from the last checkpoint rather than starting over. Provides at-least-once semantics; exactly-once additionally requires an idempotent output destination.
Sink: In stream processing, the output destination where results are written; must support idempotent writes or transactional commits to achieve exactly-once semantics end-to-end.
Output mode: In Spark Structured Streaming, the policy for what portion of the result table is written to the sink on each trigger: append (new rows only), complete (full result table), or update (changed rows only).

Content Delivery Networks

Flash crowd: A sudden large surge in demand for a resource, typically caused by a news event or popular content release, that overwhelms the capacity of a single origin server.
Origin server: The content provider’s authoritative server; the source of truth for all content in a CDN.
Edge server: A CDN server located close to end users, typically inside ISPs or at internet exchange points, that serves cached content to reduce latency and origin load.
Parent server: A CDN server in the tier between edge servers and the origin, used as a shared cache for edge servers in a region to reduce repeated fetches from the origin.
Push CDN: A CDN model in which the content provider explicitly uploads content to CDN storage nodes ahead of demand.
Pull CDN: A CDN model in which edge servers fetch content from the origin on the first request and cache it for subsequent requests.
CNAME (Canonical Name): A DNS record type that maps one hostname to another; used by CDN customers to delegate their domain name to the CDN’s DNS infrastructure.
Dynamic DNS: A DNS server that returns different IP addresses for the same hostname based on real-time factors such as user location, server load, and network conditions.
Tiered distribution: The CDN content lookup strategy in which a cache miss at the edge triggers a search through progressively higher cache tiers (regional peers → parent) and reaches the origin only if every tier misses.
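A tiered lookup can be sketched with dictionaries standing in for caches. The function `tiered_fetch`, the tier names, and the fill-on-the-way-back behavior are assumptions made for illustration; real CDNs differ in which tiers they populate and how they pick peers.

```python
def tiered_fetch(url, tiers, origin):
    """Look up url tier by tier, falling back to the origin on a full miss.

    tiers: list of (name, cache_dict) ordered from the edge upward.
    origin: dict acting as the authoritative content store.
    """
    for depth, (name, cache) in enumerate(tiers):
        if url in cache:
            content, served_by = cache[url], name
            break
    else:
        depth = len(tiers)
        content, served_by = origin[url], "origin"
    # Populate every tier below the one that answered (pull-style caching).
    for _name, cache in tiers[:depth]:
        cache[url] = content
    return content, served_by


origin = {"/video.mp4": "bytes"}
edge_a, edge_b, parent = {}, {}, {}

first = tiered_fetch("/video.mp4", [("edge_a", edge_a), ("parent", parent)], origin)
second = tiered_fetch("/video.mp4", [("edge_a", edge_a), ("parent", parent)], origin)
third = tiered_fetch("/video.mp4", [("edge_b", edge_b), ("parent", parent)], origin)
print(first, second, third)
```

The third request comes from a different edge server, which misses locally but is served by the shared parent rather than the origin.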
Cache-Control: An HTTP response header that instructs caches (browsers, proxies, CDN edge servers) how to store and validate a resource; directives include max-age, no-store, no-cache, public, and private.
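How a shared cache such as a CDN edge server might interpret these directives can be sketched as follows. The functions `parse_cache_control` and `shared_cache_ttl` are hypothetical, and the policy is deliberately simplified: it covers only the directives listed above and treats a response with no `max-age` as requiring revalidation.

```python
def parse_cache_control(header):
    """Split a Cache-Control value into a {directive: value-or-None} dict."""
    directives = {}
    for part in header.split(","):
        part = part.strip().lower()
        if not part:
            continue
        name, _, value = part.partition("=")
        directives[name.strip()] = value.strip().strip('"') or None
    return directives


def shared_cache_ttl(header):
    """Seconds a shared cache may reuse the response without revalidating.

    Returns None if a shared cache must not store it, 0 if it may store the
    response but must revalidate before each reuse (simplified policy).
    """
    d = parse_cache_control(header)
    if "no-store" in d or "private" in d:
        return None
    if "no-cache" in d:
        return 0
    max_age = d.get("max-age")
    return int(max_age) if max_age is not None else 0


for value in ("public, max-age=3600", "private, max-age=60", "no-cache", "no-store"):
    print(value, "->", shared_cache_ttl(value))
```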
Edge Side Includes (ESI): A markup language that allows a CDN to assemble a page from independently cached fragments at the edge, enabling partial caching of pages that include some dynamic content.
HTTP Live Streaming (HLS): An Apple-developed protocol that delivers video by breaking it into short segments served as regular HTTP files; each segment can be cached and served by CDN edge servers.
MPEG-DASH (Dynamic Adaptive Streaming over HTTP): An open international standard for adaptive bitrate video streaming, used by Netflix, YouTube, Amazon Prime Video, and most other major platforms on non-Apple devices. Like HLS, it delivers video as short HTTP segments with a manifest file; unlike HLS, it is codec-agnostic and not controlled by a single company.
Adaptive bitrate (ABR): A video delivery technique that encodes content at multiple quality levels; the player automatically selects the appropriate level based on current network conditions, allowing graceful degradation on slow connections.
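A very simple throughput-based selection rule illustrates the idea: pick the highest encoded bitrate that fits within a fraction of the measured bandwidth. The function `pick_bitrate`, the bitrate ladder, and the 0.8 safety factor are hypothetical; real players typically also consider buffer occupancy and other signals.

```python
def pick_bitrate(ladder_kbps, measured_kbps, safety=0.8):
    """Highest rung that fits the safety-scaled throughput, else the lowest rung."""
    budget = measured_kbps * safety
    fitting = [b for b in sorted(ladder_kbps) if b <= budget]
    return fitting[-1] if fitting else min(ladder_kbps)


ladder = [400, 1200, 3000, 6000]  # hypothetical encodings, in kbit/s
for throughput in (300, 4000, 10000):
    print(throughput, "->", pick_bitrate(ladder, throughput))
```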
Overlay network: An application-level network built on top of the public internet, used by CDNs to route traffic between nodes along paths selected by measured performance rather than BGP routing policy.
Anycast: A network addressing scheme in which multiple servers worldwide share a single IP address; BGP routing directs each client’s connection to the nearest server advertising that address, based on routing path length.
BGP (Border Gateway Protocol): The routing protocol that governs how traffic flows between autonomous systems (independently operated networks such as ISPs) on the internet; used both by CDNs for anycast routing and as a source of topology information for DNS-based routing decisions.
Distributed denial-of-service (DDoS): An attack that floods a target with traffic from many sources to exhaust its bandwidth or processing capacity; CDNs mitigate DDoS by absorbing attack traffic across their distributed infrastructure.
TLS termination: Decrypting an HTTPS connection at a CDN edge server before forwarding the request to the origin; reduces handshake latency for users and offloads cryptographic work from the origin.

BitTorrent

Swarm: In BitTorrent, the collection of all peers currently downloading or seeding a particular file.
Piece: In BitTorrent, a fixed-size chunk of the file being distributed; each piece has an associated hash for integrity verification.
Seeder: A BitTorrent peer that has the complete file and is uploading pieces to other peers.
Leecher: A BitTorrent peer that is downloading a file and has not yet acquired all pieces; leechers upload the pieces they have while continuing to download.
Rarest-first: The BitTorrent piece selection strategy in which a peer preferentially downloads pieces that the fewest other peers currently have, ensuring that rare pieces are quickly distributed through the swarm.
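Rarest-first selection can be sketched by counting how many known peers hold each piece and choosing the least common piece that is still needed. The function `rarest_first` and the peer names are hypothetical, and ties are broken by lowest piece index only to keep the example deterministic.

```python
from collections import Counter


def rarest_first(have, peer_pieces):
    """Pick the needed piece held by the fewest peers, or None if nothing is needed.

    have: set of piece indexes this peer already holds.
    peer_pieces: {peer_name: set of piece indexes that peer holds}.
    """
    counts = Counter(p for pieces in peer_pieces.values() for p in pieces)
    candidates = [p for p in counts if p not in have]
    if not candidates:
        return None
    return min(candidates, key=lambda p: (counts[p], p))


peers = {"A": {0, 1, 2}, "B": {1, 2, 3}, "C": {1}}
choice = rarest_first({0}, peers)
print(choice)  # piece 3 is held by only one peer
```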
Tracker: In BitTorrent, a server that maintains lists of peers participating in a swarm and responds to peer discovery queries.

Edge Computing

V8 isolate: A lightweight sandbox used by Cloudflare Workers to execute JavaScript; isolates provide memory isolation between concurrent workers without the overhead of separate processes, and can start in a few milliseconds.
Edge computing: The practice of executing application logic on CDN edge nodes close to users rather than at a centralized origin, reducing round-trip latency for dynamic operations.
