pk.org: CS 417/Lecture Notes

Data In Motion

Terms you should know

Paul Krzyzanowski – 2026-04-14

Message Queues and Event Streaming

Message broker
A service that sits between producers and consumers, storing messages durably and allowing each side to operate independently at its own pace.
Producer (publisher)
A process that writes messages to a message broker without knowing which consumers will receive them.
Consumer (subscriber)
A process that reads messages from a message broker without knowing which producers wrote them.
Topic
A named category or stream of messages; producers write to a topic, and consumers subscribe to the topics they want to read.
AMQP (Advanced Message Queuing Protocol)
The protocol underlying RabbitMQ, which defines exchanges, queues, and binding-based routing.
Exchange
In RabbitMQ, the component that receives messages from producers and routes them to one or more queues according to binding rules.
Binding
In RabbitMQ, a configured link from an exchange to a queue; optionally carries a binding key that the exchange matches against incoming message routing keys to decide whether to route a message to that queue.
Fanout exchange
A RabbitMQ exchange type that broadcasts every message to all queues bound to it, ignoring routing keys.
Direct exchange
A RabbitMQ exchange type that routes a message to queues whose binding key exactly matches the message’s routing key.
Topic exchange
A RabbitMQ exchange type that routes messages based on wildcard pattern matching against routing keys.
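The three exchange types can be compared with a small routing sketch. This is an illustration, not RabbitMQ's implementation: the names `route` and `topic_matches` are hypothetical, and each queue is assumed to have a single binding key. The wildcard rules follow the usual topic-exchange convention: routing keys are dot-separated words, `*` matches exactly one word, and `#` matches zero or more words.

```python
def topic_matches(binding_key: str, routing_key: str) -> bool:
    """Return True if a dot-separated routing key matches a binding pattern."""
    pattern = binding_key.split(".")
    words = routing_key.split(".")

    def match(p: int, w: int) -> bool:
        if p == len(pattern):
            return w == len(words)
        if pattern[p] == "#":
            # '#' may absorb zero words, or one word and then try again.
            return match(p + 1, w) or (w < len(words) and match(p, w + 1))
        if w == len(words):
            return False
        if pattern[p] == "*" or pattern[p] == words[w]:
            return match(p + 1, w + 1)
        return False

    return match(0, 0)


def route(exchange_type: str, bindings: dict, routing_key: str) -> list:
    """bindings maps queue name -> binding key (hypothetical layout).

    Returns the sorted names of the queues that would receive the message.
    """
    if exchange_type == "fanout":
        return sorted(bindings)  # routing key is ignored
    if exchange_type == "direct":
        return sorted(q for q, key in bindings.items() if key == routing_key)
    if exchange_type == "topic":
        return sorted(q for q, key in bindings.items()
                      if topic_matches(key, routing_key))
    raise ValueError(f"unknown exchange type: {exchange_type}")
```

With bindings such as `{"audit": "orders.#", "eu": "orders.eu.*"}`, a topic exchange would deliver a message keyed `orders.eu.created` to both queues, while a direct exchange would deliver it to neither.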
Message acknowledgment
A signal sent by a consumer to the broker confirming successful message processing; the broker retains unacknowledged messages and redelivers them on consumer failure.
At-most-once delivery
A messaging guarantee in which a message is sent once and not retried; fast, but the message is lost if the consumer is unavailable or a failure occurs.
At-least-once delivery
A messaging guarantee in which the system retries until it receives an acknowledgment; no message is lost, but duplicates are possible. Consumers must be idempotent or deduplicate.
Exactly-once effect
The observable guarantee that each message produces its intended effect exactly once, achieved by combining at-least-once delivery with idempotent or transactional processing at the consumer. Requires cooperation from the source, the broker, and the output destination; not a property of the broker alone.
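A minimal sketch of the consumer side of this idea, assuming every message carries a unique `id` field (an assumption; the field names and the `IdempotentConsumer` class are hypothetical). The broker may redeliver a message, but the consumer applies it only once. A real system would also have to commit the state change and the set of seen IDs atomically, which this in-memory example does not attempt.

```python
class IdempotentConsumer:
    """Applies each message's effect at most once, even if it is redelivered."""

    def __init__(self):
        self.seen_ids = set()   # IDs of messages already applied
        self.balance = 0        # the state that messages modify

    def handle(self, message: dict) -> bool:
        """Return True if the message was applied, False if it was a duplicate."""
        if message["id"] in self.seen_ids:
            return False        # duplicate delivery: ignore, but still acknowledge
        self.balance += message["amount"]
        self.seen_ids.add(message["id"])
        return True
```

Delivering `{"id": "m1", "amount": 50}` twice leaves the balance at 50 rather than 100: at-least-once delivery plus deduplication yields the exactly-once effect.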
Backpressure
The condition in which a consumer or downstream system cannot keep up with the rate at which a producer generates data; addressed through buffering (absorb bursts in a queue), dropping (discard when the buffer is full), or flow control (signal the producer to slow down).
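The three responses can be seen in one small bounded buffer. This is a sketch only; `BoundedBuffer` and its `when_full` parameter are hypothetical names, and a real system would block or signal across a network rather than return a string.

```python
from collections import deque


class BoundedBuffer:
    """A fixed-capacity queue between a producer and a slower consumer."""

    def __init__(self, capacity: int, when_full: str):
        if when_full not in ("drop", "signal"):
            raise ValueError("when_full must be 'drop' or 'signal'")
        self.capacity = capacity
        self.when_full = when_full
        self.items = deque()
        self.dropped = 0

    def offer(self, item) -> str:
        """Called by the producer; the result tells it what happened."""
        if len(self.items) < self.capacity:
            self.items.append(item)   # buffering: absorb the burst
            return "accepted"
        if self.when_full == "drop":
            self.dropped += 1         # dropping: discard when full
            return "dropped"
        return "slow_down"            # flow control: producer should pause and retry

    def take(self):
        """Called by the consumer; removes and returns the oldest item."""
        return self.items.popleft()
```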
Partition
In Kafka, an ordered log within a topic that grows only by appending; once a record is written it is never modified or overwritten. Each partition has a leader copy on a single broker and is independently replicated to other brokers.
Offset
In Kafka, a sequential integer that uniquely identifies a record’s position within a partition; consumers track their own offsets to control their position in the log.
Consumer group
In Kafka, a named set of consumers that collectively consume a topic; each partition is assigned to exactly one group member at a time, distributing work across the group.
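The partition, offset, and consumer group definitions fit together as sketched below. This is a toy model, not Kafka's API: `Partition` and `assign_partitions` are hypothetical names, and the round-robin assignment shown here is only one simple strategy.

```python
class Partition:
    """An append-only log; a record's offset is its position in the log."""

    def __init__(self):
        self._records = []

    def append(self, record) -> int:
        """Append a record and return the offset it was assigned."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset: int, max_records: int = 10) -> list:
        """Return up to max_records records starting at the given offset."""
        return self._records[offset:offset + max_records]


def assign_partitions(partitions: list, members: list) -> dict:
    """Give each partition to exactly one group member, round-robin."""
    if not members:
        raise ValueError("a consumer group needs at least one member")
    ordered = sorted(members)
    assignment = {m: [] for m in ordered}
    for i, p in enumerate(sorted(partitions)):
        assignment[ordered[i % len(ordered)]].append(p)
    return assignment
```

A consumer that has processed offsets 0 through 4 of a partition resumes with `read(5)`; because the consumer, not the log, remembers the position, several groups can read the same partition independently.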
Log compaction
A Kafka retention policy in which the broker keeps only the most recent record for each key, discarding older records with the same key during background processing. Produces a log that always contains the latest value per key rather than a complete history.
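The effect of compaction can be shown on a list of (key, value) records. This sketch models only the outcome, keeping the most recent record per key in log order; it does not model how or when Kafka runs compaction in the background. The function name `compact` is hypothetical.

```python
def compact(log: list) -> list:
    """Keep only the last (key, value) record for each key, in log order."""
    last_index = {key: i for i, (key, _) in enumerate(log)}
    return [record for i, record in enumerate(log)
            if last_index[record[0]] == i]
```

For a log of user-profile updates, the compacted log holds one record per user: the latest profile, not the history of edits.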
Event sourcing
An architectural pattern in which all state changes are recorded as an ordered log of events, allowing state to be reconstructed by replaying the log.
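A minimal sketch of reconstructing state by replay, using a hypothetical bank-account event format (the `type`, `account`, and `amount` fields are assumptions made for illustration).

```python
def replay(events: list) -> dict:
    """Rebuild account balances by applying every event in log order."""
    balances = {}
    for event in events:
        account = event["account"]
        if event["type"] == "deposited":
            balances[account] = balances.get(account, 0) + event["amount"]
        elif event["type"] == "withdrawn":
            balances[account] = balances.get(account, 0) - event["amount"]
        else:
            raise ValueError(f"unknown event type: {event['type']}")
    return balances
```

The balances are never stored as the source of truth; they are derived. Replaying a prefix of the log gives the state as of any earlier point.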
Sequential I/O
Disk access that reads or writes a continuous stream of data without seeking; orders of magnitude faster than random I/O and the basis for Kafka’s performance.
Page cache
An operating system mechanism that caches recently accessed disk blocks in RAM; Kafka exploits this to serve reads at memory speeds without a separate in-memory cache.
In-sync replica (ISR)
In Kafka, a follower that is sufficiently caught up with the leader; the acks=all setting requires all ISR members to confirm a write before the producer receives an acknowledgment.
Event time
The timestamp at which an event actually occurred, as recorded by the source system; the correct basis for time-based aggregations.
Processing time
The timestamp at which the stream processing system receives and processes an event; easier to implement than event time but produces incorrect results when data arrives late or out of order.
Tumbling window
A fixed-size, non-overlapping time window; each event belongs to exactly one window.
Sliding window
A fixed-size time window that advances by a configurable step smaller than the window size, producing overlapping windows in which some events appear multiple times.
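Window assignment for tumbling and sliding windows reduces to integer arithmetic. The sketch below assumes integer timestamps and windows aligned to multiples of the size (tumbling) or step (sliding); windows are half-open intervals `[start, end)`. The function names are hypothetical.

```python
def tumbling_window(ts: int, size: int) -> tuple:
    """Return the single [start, end) window that contains ts."""
    start = ts - ts % size
    return (start, start + size)


def sliding_windows(ts: int, size: int, step: int) -> list:
    """Return every [start, end) window that contains ts.

    Window starts are multiples of step; with step < size the windows
    overlap, so one event can belong to several of them.
    """
    windows = []
    start = ts - ts % step            # latest window start that is <= ts
    while start > ts - size:          # window still covers ts
        windows.append((start, start + size))
        start -= step
    return sorted(windows)
```

An event at time 12 with a window size of 10 falls in one tumbling window, (10, 20), but with a step of 5 it falls in two sliding windows, (5, 15) and (10, 20).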
Session window
A time window that groups events separated by less than a configurable inactivity gap; window size is not fixed and reflects natural bursts of user activity.
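Session windows cannot be computed from a single timestamp, since their boundaries depend on neighboring events. A minimal sketch, assuming all timestamps for one user are available at once (a streaming system would instead merge sessions incrementally); `session_windows` is a hypothetical name.

```python
def session_windows(timestamps: list, gap: int) -> list:
    """Group timestamps into sessions; a new session starts whenever the
    time since the previous event is at least the inactivity gap."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)   # still within the current session
        else:
            sessions.append([ts])     # inactivity gap exceeded: new session
    return sessions
```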
Watermark
A progress estimate in stream processing: the system assumes that events with timestamps earlier than the watermark are unlikely to still arrive, so it stops waiting for them and uses the watermark to decide when to close a window and emit results. Derived by subtracting a configured lag from the latest event timestamp seen.
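The derivation in this definition (latest event timestamp seen, minus a configured lag) can be sketched as follows. `WatermarkTracker` is a hypothetical class; the rule that a window closes once the watermark reaches its end is one common convention, assumed here for illustration.

```python
class WatermarkTracker:
    """Tracks the watermark as (max event time seen) - (configured lag)."""

    def __init__(self, lag: int):
        self.lag = lag
        self.max_event_time = None

    def observe(self, event_time: int):
        """Record an event's timestamp and return the current watermark."""
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return self.watermark()

    def watermark(self):
        """Return the watermark, or None before any event has been seen."""
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.lag

    def window_closed(self, window_end: int) -> bool:
        """A window [start, end) is closed once the watermark reaches end."""
        wm = self.watermark()
        return wm is not None and wm >= window_end
```

Note that an out-of-order event never moves the watermark backward: after seeing timestamp 100 with a lag of 5, a later event stamped 98 leaves the watermark at 95.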
Micro-batch
The execution model used by Spark Structured Streaming, in which events are collected into small batches that are processed as a series of short batch jobs rather than one event at a time.
Unbounded table
The Spark Structured Streaming abstraction that treats an incoming stream as a table that grows indefinitely; users write queries against it using standard Spark APIs.
Checkpoint (streaming)
A durable snapshot of a stream processor’s progress (offsets) and accumulated state, saved periodically so the system can recover from a failure by replaying from the last checkpoint rather than starting over. Provides at-least-once semantics; exactly-once additionally requires an idempotent output destination.
Sink
In stream processing, the output destination where results are written; must support idempotent writes or transactional commits to achieve exactly-once semantics end-to-end.
Output mode
In Spark Structured Streaming, the policy for what portion of the result table is written to the sink on each trigger: append (new rows only), complete (full result table), or update (changed rows only).

Content Delivery Networks

Flash crowd
A sudden large surge in demand for a resource, typically caused by a news event or popular content release, that overwhelms the capacity of a single origin server.
Origin server
The content provider’s authoritative server; the source of truth for all content in a CDN.
Edge server
A CDN server located close to end users, typically inside ISPs or at internet exchange points, that serves cached content to reduce latency and origin load.
Parent server
A CDN server in the tier between edge servers and the origin, used as a shared cache for edge servers in a region to reduce repeated fetches from the origin.
Push CDN
A CDN model in which the content provider explicitly uploads content to CDN storage nodes ahead of demand.
Pull CDN
A CDN model in which edge servers fetch content from the origin on the first request and cache it for subsequent requests.
CNAME (Canonical Name)
A DNS record type that maps one hostname to another; used by CDN customers to delegate their domain name to the CDN’s DNS infrastructure.
Dynamic DNS
DNS resolution in which the server returns different IP addresses for the same hostname based on real-time factors such as user location, server load, and network conditions.
Tiered distribution
The CDN content lookup strategy in which a cache miss at the edge triggers a search through progressively higher cache tiers (regional peers → parent) before falling back to the origin.
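The lookup order can be sketched with plain dictionaries standing in for caches. This is an illustration under simplifying assumptions: `fetch` is a hypothetical function, caches never expire or evict, and every tier that missed stores the content on the way back.

```python
def fetch(url: str, tiers: list, origin: dict) -> tuple:
    """Look up url in each cache tier, lowest (edge) first.

    tiers is a list of (name, cache_dict) pairs; origin is the
    authoritative store. Returns (content, name_of_tier_that_served_it).
    """
    missed = []
    for name, cache in tiers:
        if url in cache:
            content, served_by = cache[url], name
            break
        missed.append(cache)
    else:
        # Every tier missed: go to the origin.
        content, served_by = origin[url], "origin"
    for cache in missed:
        cache[url] = content   # populate lower tiers for later requests
    return content, served_by
```

After the first request is served from a parent, the edge holds a copy, so the next request for the same URL never leaves the edge.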
Cache-Control
An HTTP response header that instructs caches (browsers, proxies, CDN edge servers) how to store and validate a resource; directives include max-age, no-store, no-cache, public, and private.
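A simplified sketch of how a cache might read this header. It handles only comma-separated directives with optional `=value` parts and is not a complete parser for the HTTP caching specification; the function names are hypothetical.

```python
def parse_cache_control(header: str) -> dict:
    """Parse 'public, max-age=3600' into {'public': None, 'max-age': '3600'}."""
    directives = {}
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, value = part.partition("=")
        directives[name.strip().lower()] = value.strip().strip('"') or None
    return directives


def shared_cache_may_store(directives: dict) -> bool:
    """A shared cache such as a CDN edge must not store responses marked
    no-store or private."""
    return "no-store" not in directives and "private" not in directives
```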
Edge Side Includes (ESI)
A markup language that allows a CDN to assemble a page from independently cached fragments at the edge, enabling partial caching of pages that include some dynamic content.
HTTP Live Streaming (HLS)
An Apple-developed protocol that delivers video by breaking it into short segments served as regular HTTP files; each segment can be cached and served by CDN edge servers.
MPEG-DASH (Dynamic Adaptive Streaming over HTTP)
An open international standard for adaptive bitrate video streaming, used by Netflix, YouTube, Amazon Prime Video, and most other major platforms on non-Apple devices. Like HLS, it delivers video as short HTTP segments with a manifest file; unlike HLS, it is codec-agnostic and not controlled by a single company.
Adaptive bitrate (ABR)
A video delivery technique that encodes content at multiple quality levels; the player automatically selects the appropriate level based on current network conditions, allowing graceful degradation on slow connections.
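One simple way a player could pick a quality level is to choose the highest bitrate that fits within a fraction of the measured throughput. This is a sketch of a throughput-based rule only; real players typically also consider buffer occupancy. The function name and the 0.8 safety factor are assumptions.

```python
def choose_bitrate(available_kbps: list, measured_kbps: float,
                   safety: float = 0.8) -> int:
    """Pick the highest encoded bitrate not exceeding safety * throughput.

    Falls back to the lowest bitrate when even that exceeds the budget.
    """
    budget = measured_kbps * safety
    candidates = [b for b in available_kbps if b <= budget]
    return max(candidates) if candidates else min(available_kbps)
```

The player re-runs this choice before requesting each segment, which is why quality can change mid-stream without interrupting playback.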
Overlay network
An application-level network built on top of the public internet, used by CDNs to route traffic between nodes along paths selected by measured performance rather than BGP routing policy.
Anycast
A network addressing scheme in which multiple servers worldwide share a single IP address; BGP routing directs each client’s connection to the nearest server advertising that address, based on routing path length.
BGP (Border Gateway Protocol)
The routing protocol that governs how traffic flows between autonomous networks (ISPs) on the internet; used both by CDNs for anycast routing and as a source of topology information for DNS-based routing decisions.
Distributed denial-of-service (DDoS)
An attack that floods a target with traffic from many sources to exhaust its bandwidth or processing capacity; CDNs mitigate DDoS by absorbing attack traffic across their distributed infrastructure.
TLS termination
Decrypting an HTTPS connection at a CDN edge server before forwarding the request to the origin; reduces handshake latency for users and offloads cryptographic work from the origin.

BitTorrent

Swarm
In BitTorrent, the collection of all peers currently downloading or seeding a particular file.
Piece
In BitTorrent, a fixed-size chunk of the file being distributed; each piece has an associated hash for integrity verification.
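Per-piece integrity checking can be sketched with the standard `hashlib` module. SHA-1 is used here because the original BitTorrent protocol uses SHA-1 piece hashes; the helper names are hypothetical, and the tiny piece size in the usage note is for illustration only.

```python
import hashlib


def split_into_pieces(data: bytes, piece_size: int) -> list:
    """Split data into fixed-size pieces; the last piece may be shorter."""
    return [data[i:i + piece_size] for i in range(0, len(data), piece_size)]


def piece_hashes(data: bytes, piece_size: int) -> list:
    """Compute the hash published for each piece."""
    return [hashlib.sha1(p).digest() for p in split_into_pieces(data, piece_size)]


def verify_piece(piece: bytes, expected_hash: bytes) -> bool:
    """Check a downloaded piece against its published hash."""
    return hashlib.sha1(piece).digest() == expected_hash
```

A peer that receives a piece from an untrusted peer calls `verify_piece` before accepting it; a corrupted or forged piece fails the check and is discarded and downloaded again.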
Seeder
A BitTorrent peer that has the complete file and is uploading pieces to other peers.
Leecher
A BitTorrent peer that is downloading a file and has not yet acquired all pieces; leechers upload the pieces they have while continuing to download.
Rarest-first
The BitTorrent piece selection strategy in which a peer preferentially downloads pieces that the fewest other peers currently have, ensuring that rare pieces are quickly distributed through the swarm.
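A minimal sketch of rarest-first selection, assuming the peer knows which pieces each connected peer holds. `rarest_first` is a hypothetical function; ties are broken by lowest piece index here for determinism, whereas a real client would more likely choose at random among equally rare pieces.

```python
from collections import Counter


def rarest_first(needed: set, peer_pieces: dict):
    """Return the needed piece held by the fewest peers, or None if no
    connected peer has any piece we still need.

    peer_pieces maps a peer ID to the set of piece indexes that peer has.
    """
    counts = Counter(
        piece
        for pieces in peer_pieces.values()
        for piece in pieces
        if piece in needed
    )
    if not counts:
        return None
    return min(counts, key=lambda piece: (counts[piece], piece))
```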
Tracker
In BitTorrent, a server that maintains lists of peers participating in a swarm and responds to peer discovery queries.

Edge Computing

V8 isolate
A lightweight sandbox used by Cloudflare Workers to execute JavaScript; isolates provide memory isolation between concurrent workers without the overhead of separate processes, and can be initialized in microseconds.
Edge computing
The practice of executing application logic on CDN edge nodes close to users rather than at a centralized origin, reducing round-trip latency for dynamic operations.
