Exam 3 study guide

The one-hour study guide for exam 3

Paul Krzyzanowski

Latest update: Wed Dec 12 13:40:19 EST 2018

Disclaimer: This study guide attempts to touch upon the most important topics that may be covered on the exam but does not claim to necessarily cover everything that one needs to know for the exam. Finally, don't take the one hour time window in the title literally.

MapReduce

Go here for lecture notes

Goal: Create a massively parallel software framework to make it easy to program classes of problems that use big data that can be parsed into (key, value) pairs. Each unique key is then processed with a list of all values that were found for it.

MapReduce is a framework for parallel computing. Programmers get a simple API and do not have to deal with issues of parallelization, remote execution, data distribution, load balancing, or fault tolerance. The framework makes it easy for one to use thousands of processors to process huge amounts of data (e.g., terabytes and petabytes). It is designed for problems that lend themselves to a map-reduce technique, which we will discuss. From a user’s perspective, there are two basic operations in MapReduce: Map and Reduce.

The Map function reads data and parses it into intermediate (key, value) pairs. When that is complete and all (key, value) sets are generated, the Reduce function is called once for each unique key that was generated by Map and is given that key along with a list of all values that were generated for that key as parameters. The keys are presented in sorted order.
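
As an illustration, a word-count job's user-supplied functions might look roughly like the sketch below. The function signatures and the emit() callback are hypothetical stand-ins for whatever interface a particular MapReduce implementation provides.

    # Hypothetical word-count job for a MapReduce-style framework.
    def map_fn(key, value, emit):
        # key: document name, value: document contents.
        # Parse the input into intermediate (word, 1) pairs.
        for word in value.split():
            emit(word, 1)

    def reduce_fn(key, values, emit):
        # Called once per unique intermediate key with the list of all values
        # generated for that key; here the values are the 1s emitted by map_fn.
        emit(key, sum(values))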

The MapReduce client library creates a master process and many worker processes on a large set of machines. The master serves as a coordinator and is responsible for assigning roles to workers, keeping track of progress, and detecting failed workers. The set of inputs is broken up into chunks called shards. The master assigns map tasks to idle workers and gives each of them a unique shard to work on. The map worker invokes a user’s map function, which reads and parses the data in the shard and emits intermediate (key, value) results.

The intermediate (key, value) data is partitioned based on the key according to a partitioning function that determines which of R reduce workers will work on that key and its associated data. The default function is simply hash(key) mod R that [usually] distributes keys uniformly among R reduce workers but a user can replace it with a custom function. These partitioned results are stored in intermediate files on the map worker, grouped by key. All map workers must use the same partitioning function since the same set of keys must go to the same reduce workers. Map workers work in parallel and there is no need for any one to communicate with another.
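
The default partitioning behavior can be pictured as a tiny function (a sketch; the hash function itself is an implementation detail of the framework):

    # Default partitioner sketch: pick one of R reduce workers for a key.
    def partition(key, R):
        return hash(key) % R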

After the mapping phase completes and all (key, value) data is generated, we need to get all values that share the same key to a single reduce worker. The process of moving the data to reduce workers and aggregating it by keys is called shuffling.

When all map workers inform the master that they are complete, the master dispatches the reduce workers. Each reduce worker contacts all the map worker nodes to get the set of (key, value) data that was targeted for them. The combined data for each key is then merged to create a single sorted list of values for each key and the user reduce function gets called once for each unique key. The user’s reduce function is passed the key and the list of all values associated with that key. The reduce function writes its output to an output file, which the client can read once the MapReduce operation is complete.

The master periodically pings each worker for liveness. If no response is received within a time limit, then the master reschedules and restarts that worker’s task onto another worker.

Bigtable

Go here for lecture notes

Goal: Build a massively scalable, eventually consistent table of data that is indexed by a row key, contains a potentially arbitrary and huge number of columns, and can be distributed among tens of thousands of computers.

Bigtable is a distributed storage system developed at Google that is structured as a large table: one that may be petabytes in size and distributed among tens of thousands of machines. It is designed and used for storing items such as billions of web pages indexed by URL, with many versions per page; hundreds of terabytes of satellite image data; hundreds of millions of users. It is also designed to be able to handle thousands of queries a second.

To the user, an instance of Bigtable appears as a large spreadsheet: a table that is indexed by a row key, column name, and timestamp. Each cell can store multiple timestamped versions of data. If an application does not specify a timestamp, it will get the latest version. Alternatively, it can specify a timestamp and get the latest version that is earlier than or equal to that timestamp. All operations on rows are atomic but only on one row at a time. That is, a client cannot grab a lock for a set of rows in order to make changes.
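
A toy model of the versioned-cell read semantics (not Bigtable's actual API): each cell holds a newest-first list of (timestamp, value) versions, and a read returns either the latest version or the newest version at or before a requested timestamp.

    # Toy model of reading a timestamped Bigtable cell.
    def read_cell(versions, timestamp=None):
        # versions: list of (timestamp, value) pairs, newest first.
        if timestamp is None:
            return versions[0][1]          # latest version
        for ts, value in versions:
            if ts <= timestamp:
                return value               # newest version at or before the timestamp
        return None                        # nothing that old exists

    cell = [(300, "v3"), (200, "v2"), (100, "v1")]
    assert read_cell(cell) == "v3"
    assert read_cell(cell, timestamp=250) == "v2"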

Columns in the table are organized into column families. A column family is a related group of columns (e.g., “outgoing links”). Each column family has a unique name and contains within it a list of named columns. A Bigtable instance will typically have a small number of column families (perhaps a few hundred at most) but each column family may have a huge number (perhaps millions) of columns within it. Bigtable tables are generally sparse. If a cell (a column in a row) is empty, it will not use up any space. Column families are configured as part of the table and are common to all rows. Columns within a column family may be specific to a row and can be added dynamically. An example of columns within a column family is a list of all the websites you visited (i.e., many thousands of columns). The list of columns used between one row and another may be wildly different. Each cell may contain multiple versions of data. Timestamped versioning is configured on a per-column-family basis.

A Bigtable table starts off as a single table. As rows are added, the table is always kept sorted by row key. As the table grows, it may split along row boundaries into sub-tables, called tablets. Tablet servers are responsible for serving data for tablets. If a tablet gets too many queries, Bigtable can balance the workload by splitting the tablet and moving one of the new tablets to a different server.

Bigtable comprises a client library (linked with the user’s code), a master server that coordinates activity, and many tablet servers. Each tablet server is responsible for managing multiple tablets. Depending on the size of the tablets, it may manage from tens to thousands of tablets. Tablet servers can be added or removed dynamically. Tablet data is stored within GFS (Google File System) and tablet servers run on the same machines as GFS chunkservers. A tablet server handles read/write requests for its tablets and splits tablets when they grow large.

New data written to Bigtable is stored on the tablet server in a memory structure called a memtable. Once the size of the memtable gets larger than some threshold value, it is frozen and converted to an SSTable. The SSTable is the file format used to store Bigtable data, managing the mapping of keys to their values.

The master assigns tablets to tablet servers, balances tablet server load, and handles schema changes (table and column family creation). It tries to run each tablet server on the same GFS chunkserver that holds the data for its tablets. The master is also responsible for garbage collection of files in GFS.

In Bigtable, Chubby is used to ensure there is only one active master, store the bootstrap location of Bigtable data, discover tablet servers, store Bigtable schema information, and store access control lists. Chubby stores the location of the root tablet, which is the top level of the three-layer-deep balanced tree. The top two layers of the tree are metadata tablets, which do not contain table data but only information to locate the tablet that is responsible for a particular key.

Bigtable relies on GFS for protecting its data from loss but can be configured for replication to multiple Bigtable clusters in different data centers to ensure availability. Data propagation is asynchronous and results in an eventually consistent model.

Spanner

Go here for lecture notes

Goal: Design a huge-scale worldwide database that provides ACID semantics, lock-free reads, and external consistency.

Introduction

Brewer’s CAP theorem led to the popularity of an eventual consistency model rather than relying on transactional ACID semantics.

In this model, writes propagate through the system so that all replicated copies of data will eventually be consistent. Prior to the completion of all updates, processes may access old versions of the data if they are reading from a replica that has not been updated yet.

With ACID semantics, this would not happen since the transaction would grab a lock on all the replicas prior to updating them. Eventual consistency is the trade-off in designing a system that is both highly available and can survive partitioning.

The presence of network partitions is, in many environments, a rare event. By using an eventual consistency model, we choose to give up on “C” (consistency) in order to gain availability on the slim chance that the network becomes partitioned. Given that partitions are rare in places such as Google data centers, which use redundant networks both within and outside each center, it is not unreasonable to enforce a strong consistency model (which requires mutual exclusion via locks). Partitioning will rarely affect the entire system but only a subset of computers. As such, it is possible to use a majority consensus algorithm, such as Paxos, to continue operations as long as the majority of replicas are functioning. Spanner takes advantage of an environment where partitions are rare and is willing to be partially unavailable in the rare event that the network becomes partitioned.

Experience with eventual consistency has taught us that it places a greater burden on the programmer. With an eventual consistency model, it is now up to the programmer to reconcile the possibility that some data that is being accessed may be stale while other data might be current. Bigtable, for example, was difficult to use in applications that required strong consistency. Custom code had to be written to coordinate locks for any changes that spanned multiple rows.

With Spanner, Google has built a transactional (ACID) database that spans many systems and widely distributed data centers. Spanner provides the user with a large-scale database that comprises multiple tables, with each table containing multiple rows and columns. The model is semi-relational: each table is restricted to having a single primary key. Unlike Bigtable, transactions can span multiple rows. Also unlike Bigtable, which provided eventually consistent replication, Spanner offers synchronous replication, ACID semantics, and lock-free reads.

Data storage

Spanner gives the user the ability to manage multiple tables, each with rows and columns. Internally, the data is stored as a large keyspace that is sharded across multiple servers. Like Bigtable, each shard stores a group of consecutive rows, called a tablet. This sharding is invisible to applications.

Tablets are replicated synchronously using Paxos. One of the replicas is elected to be a leader and runs a transaction manager. Any transactions that span multiple shards use the two-phase commit protocol. Replication is performed within a transaction and all replicas remain locked until replication is complete, ensuring a consistent view of the database.

Applications can specify geographic data placement and the amount of replication:

  • Proximity of data to users (impacts read latency)

  • Proximity of replicas to each other (impacts write latency)

  • Amount of replication (impacts availability and read performance)

Spanner was designed for a global scale and a database will span multiple data centers around the globe. In a data center, spanservers store tablets. A zonemaster periodically rebalances tablets across servers to balance load. The zonemaster does not participate in transactions.

Transactions

Spanner provides transactions with ACID semantics. Transactions are serialized to satisfy the “I” (isolated) property and to create the illusion that one transaction happens after another. Strict two-phase locking is used to accomplish this.

Transactions can be distributed among multiple systems (which may span multiple datacenters). Each transaction is assigned a globally-meaningful timestamp and these timestamps reflect the serialization order of the transactions. Once a transaction has acquired all the locks it needs, it does its work and then picks a commit timestamp. Two-phase locking can reduce overall performance because other transactions may need to wait for locks to be released before they can access resources. Spanner uses separate read and write locks but even these can often block another transaction. A read lock will cause any writers to block and a write lock will cause any other writers or readers to block.

To provide lock-free reads, Spanner implements multiversion concurrency control by keeping old versions of data. A transaction reads data from a snapshot, an earlier point in time, without getting a lock. This is particularly useful for long-running transactions that read many rows of the database. Any other ongoing transactions that modify that data will not affect the data that the long-running transaction reads since those changes will be later versions.

External consistency

Serialization (isolation, the I in ACID) simply requires that transactions behave as if they executed in some serial order. Spanner implements a stronger consistency model, external consistency, which means that the order of transactions reflects their true time order. Specifically, if a transaction T1 commits before a transaction T2 starts, based on physical (also called “wall clock”) time, then the serial ordering of commits should reflect that and T1 must get a smaller timestamp than T2. Spanner does this by using physical timestamps.

It is not possible to get a completely accurate real-world timestamp. Even if we had an accurate time source, there would be synchronization delays that would add a level of uncertainty. The problem is compounded by having the database span data centers around the globe. Spanner tries to minimize the uncertainty in getting a timestamp and make it explicit.

Each datacenter is equipped with one or more highly accurate time servers, called time masters (a combination of GPS receivers and atomic clocks is used). The time master ensures that the uncertainty of a timestamp can be strongly bounded regardless of where a server resides. Each spanserver has access to a TrueTime API that provides a narrow time interval that contains the current time, ranging from TT.now().earliest up to TT.now().latest. The earliest value is guaranteed to be no later than the true time at the moment the function was called, and the latest value is guaranteed to be no earlier than it. These values would typically be only milliseconds apart. Spanner can now use these timestamps to ensure that transactions satisfy the demands of external consistency.

Implementing external consistency

The key to providing external consistency is to ensure that any data committed by the transaction will not be visible until after the transaction’s timestamp. This means that even other systems that have a different clock should still see a wall clock time that is later than the transaction’s timestamp. To do this, Spanner waits out any uncertainty. Before a transaction commits, it acquires a commit timestamp:

t = TT.now().latest

This is the latest possible value of the true time across all servers in the system. It then makes sure that no locks will be released until that time is definitely in the past. This means waiting until the earliest possible current time on any system is greater than the transaction timestamp:

TT.now().earliest > t

This wait is called a commit wait and ensures that any newer transaction that grabs any of the same resources will definitely get a later timestamp.
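
A sketch of the commit-wait logic, assuming a hypothetical TrueTime-style object whose now() method returns an interval with earliest and latest fields:

    import time

    def commit_wait(tt):
        t = tt.now().latest                 # commit timestamp: latest possible true time
        while tt.now().earliest <= t:
            time.sleep(0.001)               # wait out the clock uncertainty
        return t                            # t is now definitely in the past; locks may be released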

Note that there is no issue if multiple transactions have the same timestamp. Timestamps are strictly used for versioning to ensure that we provide a consistent view of the database while supporting lock-free reads. Consider the case of a transaction that needs to read hundreds of millions of rows of data. Many transactions may modify the database after you start your work but the view of the database will be consistent since all data read will be no later than a specified timestamp.

Summary

By making timestamp uncertainty explicit, Spanner could implement a commit wait operation that can wait out the inaccuracy of a timestamp and provide external consistency along with full ACID semantics. Coupled with storing multiple versions of data, Spanner provides lock-free reads.

Spanner’s design was a conscious decision to not sacrifice the strong ACID semantics of a database. Programming without ACID requires a lot of extra thinking about the side effects of eventual consistency and careful programming if one wants to avoid it. Strict two-phase locking and a two-phase commit protocol are implemented within Spanner so programmers do not have to reinvent it.

Bulk Synchronous Parallel & Pregel

Go here for lecture notes

Goal: Create a software framework for fault-tolerant, deadlock-free parallel processing. Then, adapt that framework to create one that makes it easy to operate on graphs at a massive scale.

Bulk Synchronous Parallel (BSP)

Bulk Synchronous Parallel (BSP) is a programming model and computation framework for parallel computing. Computation is divided into a sequence of supersteps. In each superstep, a collection of processes executes concurrently and creates messages that are sent to other processes. The superstep ends when all the computation in the superstep is complete and all messages are sent. A barrier synchronization at the end of the superstep ensures that all messages have been transmitted (but not yet delivered to the processes). The next superstep begins with the delivery of all those messages to the processes, who then execute their superstep and send messages that will be delivered at the start of the next superstep. This process continues until all processors vote to halt.

Note that no process can communicate directly with another process during a superstep: messages sent during a superstep are delivered only at the start of the next superstep. This restriction ensures deadlock-free execution.
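
The overall structure of a BSP computation can be sketched as a simple single-machine loop (real frameworks such as Apache Hama distribute the processes and the barrier across machines):

    # Simplified sketch of the BSP superstep loop.
    # processes: dict of id -> object with step(inbox) -> (wants_to_halt, [(dest, msg), ...])
    def run_bsp(processes):
        inboxes = {pid: [] for pid in processes}
        while True:
            outboxes = {pid: [] for pid in processes}
            all_halted = True
            for pid, proc in processes.items():          # concurrent in a real framework
                halted, outgoing = proc.step(inboxes[pid])
                all_halted = all_halted and halted
                for dest, msg in outgoing:
                    outboxes[dest].append(msg)           # buffered, not yet delivered
            # Barrier: the superstep ends once every process has computed and sent.
            inboxes = outboxes                           # delivered at the start of the next superstep
            if all_halted:
                break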

A popular implementation of the BSP framework is Apache Hama.

Pregel

What’s a graph?

A graph is a set of vertices connected by edges. Edges may be directed from one vertex to another or bidirectional. In computing, a vertex is represented by an object and a directed edge is a link to another object.

Graphs are all around us

They represent computer networks, social groups, roads, disease outbreaks, phone call connections, Twitter followers, Facebook friends, web page links, etc. Some of these graphs have an enormous scale. The world wide web has billions of pages and Facebook has around a billion users.

What’s wrong with MapReduce?

Nothing is wrong with MapReduce. It’s great for many problems. However, many graph traversal problems, when written for MapReduce, end up having to take multiple iterations of MapReduce, with the output of one iteration feeding the input of another. This is not only slow, but it is also inefficient as data has to be written to files at the end of every map and reduce operation and moved between machines. The entire state of the graph needs to be transmitted in each stage, which requires a lot more communication overhead than Pregel, where the vertices and edges remain on the machine that performs the computation. Moreover, MapReduce is not the most direct way of thinking about and logically solving many problems.

Introducing Pregel

Pregel is a software framework created by Google to make it easy to work on huge graphs (e.g., ones with billions of vertices) that span many machines (e.g., tens of thousands). Like MapReduce, the framework relieves the programmer from having to deal with assigning tasks to processors, monitoring jobs, handling failures, and managing communications.

The Pregel framework allows you to write “vertex-centric” code. That is, the same user code, a compute() function, is run concurrently on each vertex of the graph. Each instance of this function keeps track of information, can iterate over outgoing edges (each of which has a value), and can send messages to the vertices connected to those edges or to any other vertices it may know about (e.g., having received a vertex ID via a message).

When a function does not have any more work to do, it votes to halt. This puts the corresponding vertex in an inactive state. When all vertices are in an inactive state, the framework terminates. However, if a vertex’s compute function sends a message to an inactive vertex, that vertex will be made active at the next superstep.
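
For example, a vertex-centric program that propagates the largest value in a graph to every vertex might look roughly like this (the method names mirror the concepts above but are hypothetical, not Pregel's actual API):

    # Hypothetical vertex program: every vertex ends up with the maximum value in the graph.
    class MaxValueVertex:
        def __init__(self, value, out_edges):
            self.value = value
            self.out_edges = out_edges     # IDs of vertices reachable via outgoing edges
            self.halted = False

        def compute(self, superstep, messages, send):
            changed = (superstep == 0)     # every vertex announces itself in the first superstep
            for m in messages:
                if m > self.value:
                    self.value, changed = m, True
            if changed:
                for dest in self.out_edges:
                    send(dest, self.value) # delivered at the start of the next superstep
            else:
                self.halted = True         # vote to halt; reactivated if a message arrives later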

Pregel is an adaptation of the Bulk Synchronous Parallel (BSP) model that is specifically designed with graph processing in mind. Like BSP, Pregel executes in supersteps. Each superstep consists of computation followed by message sending. All messages are synchronized with a barrier, which marks the end of the superstep. At the next superstep, the messages from the previous superstep are delivered to, and available at, the compute function. The downside of using BSP directly for graph processing is that significant additional work would be needed to define, propagate, and maintain the topology (connections) of a graph and map vertices to compute nodes.

Advanced APIs

Since there is overhead in sending messages to vertices, particularly when they are on other machines, Pregel supports the optional use of combiners. A combiner is a user-defined function that is applied to a bunch of messages all targeted to the same vertex. The Combine method processes the values and creates a single input to that vertex. For example, if the goal is to take the sum of a set of values or to choose data from the highest-numbered vertex, the combiner can merge several messages into one before they are transmitted over the network. This does not alter the compute function since it still has to be prepared to receive multiple messages from multiple sources.

To manage global state, such as overall statistics, total edges of a graph, global flags, or minimum or maximum values of a vertex, Pregel allows a user to define an aggregator. An aggregator combines received values into one value and makes that value available to all vertices at the next superstep.
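
Sketches of both ideas, with hypothetical interfaces: a combiner collapses the messages bound for one vertex into a single message, and an aggregator reduces per-vertex contributions to one global value for the next superstep.

    # Hypothetical sum combiner: merge all messages destined for the same vertex.
    def combine(vertex_id, messages):
        return [sum(messages)]        # compute() still receives a list, just a shorter one

    # Hypothetical minimum aggregator: one global value visible to all vertices next superstep.
    def aggregate(contributions):
        return min(contributions)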

Pregel design

Pregel uses a master-slave architecture. Many copies of the program are started on a cluster of machines. One copy becomes the master and is responsible for coordinating activity rather than processing the graph. Others are workers. The master registers itself with a name server (Chubby). Each worker process contacts the name server to find the master. The master assigns one or more sets of vertices to each worker. By default, the assignment is based on the hash of the vertex ID, so neighboring vertices will not necessarily be assigned to the same worker.

The master assigns chunks of input, which is usually resident in GFS or Bigtable, to each worker. Each input item is a set of vertices and its edges. Workers read this input and create either local messages for the vertices they manage or, if the input record is for a remote vertex, send the message to the worker that owns that vertex. All vertices are initially marked as active. The master then asks each worker to perform a superstep. The worker will run concurrent threads, each of which executes a compute function on its vertex. The compute function consumes input, runs its algorithm, and generates zero or more messages to other vertices. Workers send these messages asynchronously but they will not be delivered to their target functions until the next superstep starts. When a worker is done with its processing for one superstep, it informs the master. It also tells the master how many of the vertices it manages will be in the active state in the next superstep. The master waits for all workers to complete before starting the next superstep. This cycle continues until there are no more vertices in the active state.

Fault tolerance

Pregel uses checkpointing for fault tolerance. The master tells each worker to checkpoint itself every N supersteps. Checkpointing means that each vertex has to save its entire state in stable storage. This state will include vertex values, edge values, incoming messages, and possibly any other state that the algorithm needs to track. A master will periodically send ping messages to workers to see if they are alive. If a master does not hear from a worker within a certain window of time, it assumes that the worker has failed. In this case, the master reassigns the set of vertices to other workers and tells all workers to restart their computation from the superstep at the most recent checkpoint.

A popular implementation of Pregel is Apache Giraph, which has been used by Facebook to analyze its social graph.

Spark

Go here for lecture notes

Goal: Create a general-purpose high-performance framework for big-data processing.

MapReduce was a powerful framework for parallel computation but it forced a rigid model onto the programmer. Quite often, a computation had to be implemented as a sequence of MapReduce operations. Every map and every reduce operation has to run to completion and write its results to a file before the next sequence of operations can start. Spark creates a highly-flexible framework that lets programmers define their job as a sequence of transformations and actions on data.

The main client application is called the driver program (or just driver) and is linked with a Spark library. When run, the library creates a SparkContext, which connects to a cluster manager that allocates available worker nodes to run the job. Each worker node runs an executor that contacts the driver for tasks to run. Each executor runs as a process inside a Java Virtual Machine (JVM).

The driver goes through the program, which consists of a sequence of transformations and actions as well as the source data. It creates a directed graph of tasks, identifying how data flows from one transformation to another and ultimately to an action. These tasks are then sent to the executors as jar files for execution. Tasks operate on data and each task is either a transformation or action.

RDDs

Data in Spark is a collection of Resilient Distributed Datasets (RDDs). RDDs can be created in three ways:

  1. They can be data that is a file or set of files in HDFS, key-value data from an Amazon S3 server (similar to Dynamo), HBase data (Hadoop’s version of Bigtable), or the result of a SQL or Cassandra database query.

  2. They can be streaming data obtained using the Spark Streaming connector. As an example, this could be a stream of data from a set of remote sensors.

  3. An RDD can be the output of a transformation function. This is the way one transformation creates data that another transformation will use. To optimize performance, the RDD created by a transformation is cached in memory (overflowing to disk if necessary).

RDDs have several key properties:

  • They are immutable. A task can read an RDD and create a new one but cannot modify an existing RDD.

  • They are typed (structured). An RDD has a structure that a task can parse.

  • They are partitioned. Parts of an RDD may be sent to different servers. The default partitioning function is to send a row of data to the server corresponding to hash(key) mod servercount.

  • They are ordered. An RDD contains elements that can be sorted. For example, a list of key-value pairs can be sorted by key. This property is optional.

Transformations and actions

Spark allows two types of operations on RDDs: transformations and actions. Transformations read an RDD and create a new RDD. Example transformations are map, filter, groupByKey, and reduceByKey. Transformations are evaluated lazily, which means they are computed only when some other task needs the RDD that they generate. At that point, the driver schedules the task for execution.

Actions are operations that evaluate and return a value to the driver program. When an action is requested on an RDD object, the necessary transformations are computed and the result is returned to the driver. Example actions are reduce, grab samples, and write to file.
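
As an example of chaining transformations and ending with an action, here is a word count using the PySpark API (the input path is illustrative; the program would be submitted as a driver to a Spark cluster):

    from pyspark import SparkContext

    sc = SparkContext(appName="wordcount")           # connects to the cluster manager

    # Transformations: each creates a new RDD and is evaluated lazily.
    lines  = sc.textFile("hdfs:///data/input.txt")   # RDD of lines from HDFS
    words  = lines.flatMap(lambda line: line.split())
    pairs  = words.map(lambda word: (word, 1))
    counts = pairs.reduceByKey(lambda a, b: a + b)

    # Action: forces evaluation of the chain and returns the result to the driver.
    for word, count in counts.collect():
        print(word, count)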

Fault tolerance

For each RDD, the driver tracks the sequence of transformations used to create it. That means an RDD knows which RDDs it depends on and which tasks were needed to create each one. If any RDD is lost (e.g., the task that created it died), the driver can ask the task that generated it to recreate it. The driver maintains the entire dependency graph, so this recreation may end up being a chain of transformation tasks going back to the original data.

Content delivery networks

Go here for lecture notes

Goal: Provide a highly-scalable infrastructure for caching and serving content from multiple sources.

A content delivery network (CDN) is a set of servers that are usually placed at various points at the edges of the Internet, at various ISPs, to cache content and distribute it to users.

There are several traditional approaches to making a site more scalable and available:

Proxy servers
Organizations can pass web requests through caching proxies. This can help a small set of users but you’re out of luck if you are not in that organization.
Clustering within a datacenter with a load balancer
Multiple machines within a datacenter can be load-balanced. However, they all fail if the datacenter loses power or internet connectivity.
Multihoming
Machines can be connected with links to multiple networks served by multiple ISPs to guard against ISP failure. However, protocols that drive dynamic routing of IP packets (BGP) are often not quick enough to find new routes, resulting in service disruption.
Mirroring at multiple sites
The data can be served at multiple sites, with each machine’s content synchronized with the others. However, synchronization can be difficult.

All these solutions require additional capital costs. You’re paying for excess capacity and improved availability even if the traffic never comes and the faults never happen.

By serving content that is replicated on a collection of servers, traffic from the main (master) server is reduced. Because some of the caching servers are likely to be closer to the requesting users, network latency is reduced. Because there are multiple servers, traffic is distributed among them. Because all of the servers are unlikely to be down at the same time, availability is increased. Hence, a CDN can provide highly increased performance, scalability, and availability for content.

Akamai

We will focus on one CDN: Akamai. The company evolved from MIT research that was focused on “inventing a better way to deliver Internet content.” A key issue was the flash crowd problem: what if your web site becomes really popular all of a sudden? Chances are, your servers and/or ISP will be saturated and a vast number of people will not be able to access your content. This became known as the slashdot effect.

In late 2016, Akamai ran on over 216,000 servers on over 1,500 networks in over 120 countries. It serves between 15 and 30 percent of all web traffic.

Akamai tries to serve clients from the nearest available servers that are likely to have the requested content. According to the company’s statistics, 85 percent of the world’s Internet users are within a single network hop of an Akamai CDN server.

To access a web site, the user’s computer first looks up a domain name via DNS. A mapping system locates the caching server that can serve the content. Akamai deploys custom dynamic DNS servers and customers who use Akamai’s services register their domains to use those servers. They use the requestor’s address to find the nearest edge server that is likely to hold the cached content for the requested site.

When an Akamai DNS server gets a request to resolve a host name, it chooses the IP address to return based on:

  • domain name being requested

  • server health

  • server load

  • user location

  • network status

  • load balancing

Akamai can also perform load shedding on specific content servers; if servers get too loaded, the DNS server will not respond with those addresses.

The next step in accessing a site is to send the request to the edge server that was provided by the DNS lookup. That edge server may already have the content and be able to serve it directly. Otherwise, it will need to contact the origin server (the server at the company that hosts the content) via its transport system.

To do this efficiently, Akamai manages an overlay network: the collection of its thousands of servers and statistics about their availability and connectivity. Akamai generates its own map of overall IP network topology based on BGP (Border Gateway Protocol) and traceroute data from various points on the network.

Content servers report their load to a monitoring application. The monitoring application publishes load reports to a local Akamai DNS server, which then determines which IP addresses to return when resolving names.

A CDN, serving as a caching overlay, provides three distinct benefits:

  1. Caching: static content can be served from caches, thus reducing the load on origin servers.

  2. Routing: by measuring latency, packet loss, and bandwidth throughout its collection of servers, the CDN can find the best route to an origin server, even if that requires forwarding the request through several of its servers instead of relying on IP routers to make the decision.

  3. Security: because all requests go to the CDN, it absorbs any distributed denial-of-service (DDoS) attacks rather than having them overwhelm the origin server. Moreover, any penetration attacks target the machines in the CDN rather than the origin servers.

Clusters

Go here for lecture notes

Goal: Combine computers together to create high performing and/or highly reliable systems that provide users with a single system image.

Clustering is the aggregation of multiple independent computers to work together and provide a single system that offers increased reliability and/or performance. It is a realization of the single system image that we discussed at the start of the semester. Clusters are generally off-the-shelf computers that are connected to a local area network that allows them to communicate with other computers in the cluster. A cluster may be a collection of tens of thousands (or more) computers (e.g., a Google cluster) or just a backup computer to take over for a failed web server or database.

There are four classes of cluster architectures:

Supercomputing
Also known as high-performance computing, or HPC. The goal of an HPC cluster is to create a computing environment that resembles that of a supercomputer.
High Availability (HA)
The goal of this cluster is to ensure maximum availability by providing redundant systems for failover.
Load Balancing
A load balancing cluster distributes requests among a collection of computers. In doing so, it addresses both scalability and high availability.
Storage
Storage clustering is a way to ensure that systems can all access the same storage. It is also a way to make vast amounts of storage available to a computer without having to put it inside the computer (where it will not be available if the computer fails).

The boundaries between these cluster types are often fuzzy and many clusters will use elements of several of these cluster types. For example, any cluster may employ a storage cluster. High availability is important in supercomputing clusters since the likelihood of any one computer failing increases with an increasing number of computers. There is no such thing as a “standard” cluster.

Cluster components

A cluster needs to keep track of its cluster membership: which machines are members of the cluster. Related to this are the configuration and service management components. The configuration system runs on each node (computer) in the cluster and manages the setup of each machine while the service management component identifies which nodes in the cluster perform which roles (e.g., standby, active, running specific applications).

A quorum is the number of nodes in a cluster that have to be alive for the cluster to function. Typically, a majority is required. This provides a simple way of avoiding split-brain due to network partitioning where one group of computers cannot talk to another and two instances of the cluster may be created. With a majority quorum, a minority of systems will never create their own cluster.

A cluster interconnect is the network that allows computers in a cluster to communicate with each other. In most cases, this is just an Ethernet local area network. For performance, bandwidth and latency are considerations. Communicating outside of a rack incurs longer cable runs and the overhead of an extra switching stage. Communicating outside of a data center incurs even longer latency. For maximum performance, we would like computers that communicate frequently to be close together physically. However, for maximum availability, we would like them to be distant. If a rack switch or an entire data center loses power, it would be good to have a working replica elsewhere. For high performance applications within a local area, a dedicated network is often used as a cluster interconnect. This is known as a System Area Network (SAN). A high-performance SAN will provide low latency, highly reliable, switched communication between computers. By using a SAN, the software overhead of having to run the TCP/IP stack, with its requisite fragmentation, buffer management, timers, acknowledgements, and retransmissions, is largely eliminated. Remote DMA (RDMA) allows data to be copied directly to the memory of another processor. SANs are often used for HPC clusters, with SAN/RDMA communication incorporated into the Message Passing Interface (MPI) library, which is commonly used in high performance computing applications. Examples of SAN interconnects are Infiniband, Myrinet, and 10 Gbps ethernet with Data Center Bridging. They are generally used to connect a relatively small number of computers together.

A heartbeat network is the mechanism that is used to determine whether computers in the cluster are alive or dead. A simple heartbeat network exchanges messages between computers to ensure that they are alive and capable of responding. Since a local area network may go down, one or more secondary networks are often used as dedicated heartbeat networks in order to distinguish failed computers from failed networks. Asynchronous networks, such as IP, make the detection of a failed computer problematic: one is never certain whether a computer failed to send a message or whether the message is delayed beyond a timeout value.

Storage clusters

Storage in a clustered computer system can be provided in a variety of ways. Distributed file systems, such as NFS, SMB, or AFS can be used. These provide file-level remote access operations over a network.

A Storage Area Network (SAN, not to be confused with a System Area Network) is a dedicated network for connecting computers to dedicated disk systems (storage arrays). Common SAN interconnect technologies include iSCSI, which uses the SCSI protocol over the Ethernet, and Fibre Channel. Computers access this remote storage at the block level (read a specific block, write a specific block), just like they would access local storage. With a SAN, however, access to the same storage can be shared among multiple computers. This environment is called shared disk. A distributed lock manager, or DLM, manages mutual exclusion by controlling access to key resources on the shared disk so that, for example, two computers will not try to write to the same disk block at the same time. A clustered file system is a file system that is built on top of a shared disk. Unlike a distributed file system (NFS, SMB, et al.), which uses remote access at a file level, each computer’s operating system implements a full file system and makes requests at the block level. Examples of such file systems include the Oracle Cluster File System for Linux (OCFS2), Red Hat’s Global File System (GFS2), and Microsoft’s Cluster Shared Volumes (CSV). The DLM is used to ensure that critical shared file system data structures, such as bitmaps of free blocks, inode structures, and file lock tables, are accessed exclusively and caching is coherent. It operates at the level of the implementation of a file system rather than high-level file systems services as in distributed file systems. As such, it differs from something like the NFS lock daemon, which kept track of file locks requested by applications rather than block-level locks needed to keep a file system coherent.

A shared nothing cluster architecture is one where each system is independent and there is no single point of contention in the system, such as competing for access to a shared disk. Because there is no contention, there is no need for a DLM. In this environment, any data that is resident on a system’s disk can only be obtained by sending a request to the computer that owns the disk. If the computer dies, the data is generally unavailable but may be replicated on other nodes. An alternative design that uses a SAN can allow disk access to be switched to another computer but ensure that only one computer accesses the file system at any time.

To make disks themselves highly available, RAID (redundant array of independent disks) is often employed. RAID 1 is disk mirroring: anything that is written to one disk gets written to a secondary disk. If one fails, you still have the other. RAID 5 and RAID 6 stripe the data across several disks and add error correcting codes (e.g., parity) so that the data on a failed disk can be reconstructed from the remaining disks.

High-Performance Computing (HPC)

High-performance computing (HPC) clusters are generally custom efforts but there are a number of components that are common across many implementations. HPC clusters are designed for traditional supercomputing applications that focus on a large amount of computation on large data sets. These applications are designed to be partitioned into multiple communicating processes. The Message Passing Interface (MPI) is a popular programming interface for sending and receiving messages that handles point-to-point and group communication and provides support for barrier-based synchronization. It is sometimes used together with the Parallel Virtual Machine (PVM), a layer of software that provides an interface for creating tasks, managing global task IDs, and managing groups of tasks on arbitrary collections of processors. PVM is in many ways similar to MPI but designed to be more dynamic and support heterogeneous environments. However, its performance was not up to the levels of MPI and its popularity is waning. Beowulf and Rocks Cluster are examples of HPC clusters based on Linux. Microsoft offers high performance clustering via the Microsoft HPC Pack. There are many other HPC systems as well. The common thread among them all is that they provide a front-end server for scheduling jobs and monitoring processes and offer an MPI library for programming.

Batch Processing: Single-Queue Work Distribution

Single queue work distribution is a form of high performance computing that does not rely on communication between computing nodes. Where traditional HPC applications usually involve large-scale array processing and a high level of cooperation among processing elements, the work distribution approach is used for applications such as render farms for computer animation, where a central coordinator (dispatcher) sends job requests to a collection of computers. When a system completes a job (e.g., “render frame #4,178”), the dispatcher will send it the next job (e.g., “now render frame #12,724”). The dispatcher will have the ability to list jobs, delete jobs, dispatch jobs, and get notified when a job is complete. The worker nodes have no need to communicate with each other.

Load Balancing

Web-services load balancing is a conceptually simple but very widely used technique for distributing the load of many network requests among a collection of computers, each of which is capable of processing the request. Load balancing serves three important functions:

  1. Load balancing. It enables scalability by distributing requests among multiple computers.

  2. High availability (failover). If a computer is dead, the requests will be distributed among the remaining live computers.

  3. Planned outage management. If a computer needs to be taken out of service temporarily (for example, to upgrade software or replace hardware), requests will be distributed among the remaining live computers.

The simplest form of load balancing is to have all requests go to a single computer that then returns an HTTP REDIRECT error. This is part of the HTTP protocol and will lead the client to re-issue the request to the computer specified by the REDIRECT error.
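
A minimal sketch of redirect-based balancing using Python's standard http.server module; the back-end host names are made up for the example.

    import itertools
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Hypothetical pool of back-end servers, handed out round-robin.
    backends = itertools.cycle(["http://server1.example.com", "http://server2.example.com"])

    class RedirectBalancer(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(302)         # HTTP redirect: client re-issues the request
            self.send_header("Location", next(backends) + self.path)
            self.end_headers()

    HTTPServer(("", 8080), RedirectBalancer).serve_forever()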

Another approach, and the most popular one, is to use a load-balancing router to map incoming requests to one of several back-end computers.

For load balancing across data centers, DNS-based load balancing may be used where a DNS query returns IP addresses of machines at different data centers for domain name queries.

High Availability

High-availability clusters strive to provide a high level of system uptime by taking into account the fact that computers may fail. When this happens, applications running on those computers will resume on other computers that are still running. This is called failover.

Low-level software to support high-availability clustering includes facilities to access shared disks and support for IP address takeover, which enables a computer to listen on multiple IP addresses so that IP packets that were sent to a failed machine can reach the backup system instead.

Mid-layer software includes distributed elections to pick a coordinator, propagation of status information, and figuring out which systems and applications are alive. Higher-layer software includes the ability to restart applications, let a user assign applications to computers, and let a user see what’s going on in the system as a whole.

An active/passive configuration is one where one or more backup (passive) systems are waiting to step in for a system that died. An active/active configuration allows multiple systems to handle requests. Requests may be load balanced across all active systems and no failover is needed; the dead system is simply not sent any requests.

Failover can be implemented in several ways:

Cold failover
This is an application restart — the application is started afresh from the beginning. An example is starting up a web server on a backup computer because the primary web server died. There is no state transfer.
Warm failover
Here, the application is checkpointed periodically. It can then be restarted from the last checkpoint. Many cluster libraries provide the ability for a process to checkpoint itself (save its memory image). Pregel is an example of a software framework that relies on periodic checkpointing so that a graph computation does not have to restart from the beginning.
Hot failover
Here, a replica application is always kept synchronized with the active application on another computer. An example of this is a replicated state machine. Chubby servers, for example, implement hot failover: if the Chubby master fails, any other machine in the Chubby cluster can step in.

Cascading failover refers to the ability of an application to fail over even after it already has failed over in the past. Multi-directional failover refers to the ability to restart applications from a failed system on multiple available systems instead of a specific computer that is designated for use as a standby system.

An annoying malfunction is a Byzantine failure. In this case, the failed process or computer continues to communicate but communicates with faulty data. Related to this is the problem of fail-restart behavior, where a process may restart but not realize that the data it is working with is obsolete (e.g., a transaction coordinator might restart and not realize that the transaction has already been aborted). Fencing is the use of various techniques to isolate a node from the rest of the cluster. Power fencing shuts off power to the node to ensure that the node does not restart. SAN fencing disables the node from accessing shared storage, avoiding possible file system corruption. Other fencing techniques may block network messages from the node or remove processes from a replication group (as done in virtual synchrony).

Security: Cryptographic Communication

Go here for lecture notes

Goal: Use symmetric cryptography, public key cryptography, random numbers, and hash functions to enable the exchange of encryption keys, provide secure communication, and ensure message integrity.

Security encompasses many things, including authenticating users, hiding data contents while they are at rest (e.g., stored in files) or in motion (moving over a network), protecting data from destruction, preventing attackers from masquerading as another server or user or from providing false data (e.g., the wrong IP address for a DNS query), and securing the physical premises. We will focus only on a few topics here.

Security in distributed systems introduces two specific concerns that centralized systems do not have. The first is the use of a network where contents may be seen by other, possibly malicious, parties. The second is the use of servers. Because clients interact with services (applications) running on a server, the application rather than the operating system is responsible for authenticating the client and controlling access to services. Moreover, physical access to the system and the security controls configured for the operating system may be unknown to the client.

Cryptography on its own is not a panacea. Its use will not ensure that a system is secure. However, it is an important component in building secure distributed systems. It is used to implement mechanisms for:

  • Confidentiality: ensuring that others cannot read the contents of a message (or file).
  • Authentication: proving the origin of a message (or the entity sending that message).
  • Integrity: validating that the message has not been modified, either maliciously or accidentally.
  • Nonrepudiation: ensuring that senders cannot falsely deny that they authored a message.

Confidentiality

Confidentiality is the classic application of cryptography: obscuring the contents of a message so that it looks like a random bunch of bits that can only be decoded by someone with the proper knowledge. To do this, we start with our plaintext message, P (the term plaintext refers to unencrypted content; it does not need to be text). By applying an encryption algorithm, or cipher, to it, we convert the plaintext to ciphertext, C=E(P). We can decrypt this message to recover the plaintext: P=D(C).

There are two widely-used classes of ciphers: symmetric and public key. A symmetric cipher uses the same key to decrypt a message as was used to encrypt it: C=EK(P) and P=DK(C). A public key cipher uses two related keys, known as a key pair. Knowing one key does not enable you to derive the other key of the pair. One of these keys is called the private key because it is never shared. The other key is called the public key because it is freely shared. Examples of popular symmetric encryption algorithms include AES (Advanced Encryption Standard), DES (Data Encryption Standard), and Blowfish.

A message encrypted with one of the keys can only be decrypted with the other key of the pair. For a pair of keys (a, A), where a is the private key and A is the public key, if C=Ea(P) then P=DA(C). A message encrypted with your private key can be decrypted only with your public key. Anyone can do the decryption but you are the only one who could perform the encryption. This becomes the basis for authentication. Conversely, if C=EA(P) then P=Da(C). A message encrypted with your public key can be decrypted only with your private key. Anyone can now perform the encryption but you are the only one who can decrypt the message. This becomes the basis for confidentiality and secure communication. Examples of popular public key encryption algorithms include RSA and Elliptic Curve Cryptography (ECC).

Key distribution

Symmetric encryption algorithms are the dominant means of encrypting files and large streams of data. They are much faster than public key algorithms and key generation is far easier: a key is just a random set of bits rather than two huge numbers with specific properties. The biggest problem with symmetric cryptography is key distribution: ensuring that both sides get the key in a way that no eavesdropper can observe it. For instance, you cannot just send it as a network message containing the key. There are several techniques to handle the problem of key distribution.

  1. Manually, by using pre-shared keys (PSK). This requires setting up the keys ahead of time, such as registering a password or PIN among all the parties that will communicate.

  2. Using a trusted third party. A trusted third party is a trusted server that knows everyone’s keys. If two systems, let’s call them Alice and Bob, want to communicate, the third party will help both of them get the key they need. To communicate, they will use a session key. This is the name for a throw-away key that will be used for one communication session. One way of sharing it is for Alice to create it (it’s just a random number), encrypt it with her secret key, and send it to the trusted third party, which we will name Trent. Since Trent has all the keys, he can decrypt the message with Alice’s secret key to extract the session key. He can then encrypt it for Bob using Bob’s secret key and send it to Bob.

    Alternatively, Alice can ask Trent to create a session key that both Alice and Bob will share. Trent will generate the key, encrypt it for Alice, and send it to Alice. He will take the same key, encrypt it for Bob, and send it to Bob. Now they both share a key.

  3. Using the Diffie-Hellman key exchange algorithm. Diffie-Hellman is a key distribution algorithm, not an encryption algorithm. It relies on a one-way function: a mathematical function for which there is no known practical way to compute the inverse. Unfortunately, the algorithm uses the terms public key and private key even though it is not an encryption algorithm; the keys are just numbers, one of which is kept private. Two parties can apply the one-way function to generate a common key that is derived from one public key and one private key. Alice generates a common key that is f(Alice_private_key, Bob_public_key) and Bob generates a common key that is f(Bob_private_key, Alice_public_key). The magic of the mathematics in Diffie-Hellman is that both common keys will be identical and can be used as a symmetric encryption key.

  4. Avoid symmetric cryptography and just use public key cryptography. Life becomes easy. If Alice wants to send a message to Bob, she simply encrypts it with Bob’s public key. Only Bob will be able to decrypt it using his private key. If Bob wants to send a message to Alice, he will encrypt it with her public key. Only she will be able to decrypt it using her private key. We have two concerns with this. First, we need to ensure that everyone has trusted copies of public keys (Alice needs to be sure she has Bob’s public key rather than that of an imposter saying he’s Bob). Second, we have to be mindful of the fact that public key cryptography and key generation are both far slower than symmetric cryptography.

  5. To get the best of both worlds, we often rely on hybrid cryptography: we use public key cryptography only to encrypt a symmetric session key (a small amount of data). From then on, we use symmetric cryptography and encrypt data with that session key.

Message Integrity & Non-repudiation

Without cryptography, we often resort to various error detecting functions to detect data corruption. For example, Ethernet uses CRCs (cyclic redundancy checks) and IP headers use 16-bit checksums. A cryptographic hash function is similar in purpose: a function that takes an arbitrary amount of data and generates a fixed-length set of bits. The hash is far longer than typical checksums – typically 256 or 512 bits – and hence is highly unlikely to result in collisions, where multiple messages result in the same hash value. A cryptographic hash of a message, H=hash(M), should have the following properties:

  • It is a one-way function. Given a hash value H, it should be difficult to figure out what the input was. (N.B.: in cryptography, the term “difficult” means that there are no known shortcuts other than trying every possible input).
  • It is collision resistant. Given a message M, it should be difficult to find another message M’ such that hash(M’) = hash(M).
  • It is efficient. We want to be able to generate these easily to validate file contents and messages.
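
A quick illustration of these properties using SHA-256 from Python's standard hashlib: the output is always 256 bits, and changing even one character of the input produces a completely unrelated hash.

    import hashlib

    print(hashlib.sha256(b"Pay Bob $10").hexdigest())
    print(hashlib.sha256(b"Pay Bob $11").hexdigest())   # tiny change, unrelated hash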

Just augmenting a message with its hash is not sufficient to ensure message integrity. It will allow us to detect accidental modification, but an attacker who modifies a message can easily recompute the hash and attach the new value. To prevent the hash from being modified, we can encrypt it.

A message authentication code (MAC) is a hash that is encrypted with a symmetric key. If Alice and Bob share a key, K, Alice can send Bob a message along with a hash encrypted with K. Bob can hash the message and compare it with the decrypted hash. If an intruder modifies the message, she will not be able to create a new encrypted hash.
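
In practice, MACs are usually computed with a keyed-hash construction such as HMAC, which mixes the shared key into the hash rather than literally encrypting the hash; the security argument is the same: without the key, an attacker cannot produce a valid tag. A minimal sketch using Python's standard hmac module:

    import hashlib
    import hmac

    key = b"key shared by Alice and Bob"
    message = b"Transfer $100 to account 42"

    # Alice computes the tag and sends (message, tag) to Bob.
    tag = hmac.new(key, message, hashlib.sha256).digest()

    # Bob recomputes the tag over the message he received and compares.
    expected = hmac.new(key, message, hashlib.sha256).digest()
    print(hmac.compare_digest(tag, expected))   # True only if the message is unmodified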

A digital signature is similar to a MAC but the hash is encrypted with the sender’s private key. Here, Alice will send Bob a message along with a hash that is encrypted with her private key. Bob can validate the message and signature by decrypting the signature using Alice’s public key and comparing the result with a hash of the message. A digital signature provides non-repudiation: Alice cannot claim that she did not create the message since only Alice knows her private key. Nobody but Alice could have created the signature.
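
A signature sketch, assuming the third-party pyca/cryptography package is available; verify() raises an exception if the signature or message has been tampered with.

    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import padding, rsa

    alice_private = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    message = b"I, Alice, approve this purchase."

    # Alice signs a hash of the message with her private key.
    signature = alice_private.sign(
        message,
        padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH),
        hashes.SHA256(),
    )

    # Bob verifies using Alice's public key.
    alice_private.public_key().verify(
        signature,
        message,
        padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH),
        hashes.SHA256(),
    )
    print("signature verified: only Alice's private key could have produced it")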

We frequently combine encryption (to keep a message confidential) with message integrity to ensure that a recipient can detect if a message has been modified. One way of doing this is (a code sketch follows the steps):

  1. Alice creates a session key, S (a random number).
  2. She encrypts the session key with Bob’s public key, B: EB(S) and sends it to Bob.
  3. Bob decrypts the session key using his private key, b: S = Db(EB(S)). He now knows the session key.
  4. Alice encrypts the message with the session key: ES(M).
  5. Alice creates a signature by encrypting the hash of the message with her private key, a: Ea(H(M)).
  6. She sends both the encrypted message and signature to Bob.
  7. Bob decrypts the message using the session key: M = DS(ES(M)).
  8. Bob decrypts the signature using Alice’s public key, A: DA(Ea(H(M)))
  9. If the decrypted hash matches the hash of the message, H(M), Bob is convinced that the message has not been modified since Alice sent it.
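
A sketch of this flow, again assuming the pyca/cryptography package. Fernet stands in for symmetric encryption with the session key S, RSA-OAEP for encryption with Bob's public key, and RSA-PSS for Alice's signature; the step numbers refer to the list above.

    from cryptography.fernet import Fernet
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import padding, rsa

    alice_private = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    bob_private = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    oaep = padding.OAEP(mgf=padding.MGF1(hashes.SHA256()), algorithm=hashes.SHA256(), label=None)
    pss = padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH)

    # Steps 1-2: Alice creates a session key and encrypts it with Bob's public key.
    session_key = Fernet.generate_key()
    wrapped_key = bob_private.public_key().encrypt(session_key, oaep)

    # Step 3: Bob recovers the session key with his private key.
    session_key_at_bob = bob_private.decrypt(wrapped_key, oaep)

    # Steps 4-6: Alice encrypts the message with the session key and signs its hash.
    message = b"Meet me at noon"
    ciphertext = Fernet(session_key).encrypt(message)
    signature = alice_private.sign(message, pss, hashes.SHA256())

    # Steps 7-9: Bob decrypts with the session key and checks Alice's signature.
    plaintext = Fernet(session_key_at_bob).decrypt(ciphertext)
    alice_private.public_key().verify(signature, plaintext, pss, hashes.SHA256())
    print(plaintext)   # the message arrived confidentially and unmodified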

Authentication

Go here for lecture notes

Goal: Create protocols for authenticating users, establishing secure communication sessions, authorizing services, and passing identities.

Authentication exists to establish and verify the identity of a user (or a service, process, or server). Once authentication is complete, a process can decide whether to allow access to the service or its resources and what type of access is permitted. This is called authorization.

The three factors of authentication are: something you have (such as a key or a card), something you know (such as a password or PIN), and something you are (biometrics).

Each of these factors has pitfalls and can be stolen. Someone can take your access card or see your password. Biometrics are more problematic: measurements are imprecise, and if someone manages to steal and recreate your biometric data, you can never use that biometric again.

Combining these factors into a multi-factor authentication scheme can increase security against the chance that any one of the factors is compromised. Multi-factor authentication must use two or more of these factors. Using two passwords, for example, is not sufficient.

Password Authentication Protocol (PAP)

The best-known authentication method is the use of reusable passwords. This is known as the password authentication protocol, or PAP. The system asks you to identify yourself (your login name) and then enter a password. If the password matches the one associated with the login name on the system, then you’re authenticated.

One problem with the protocol is that if someone gets hold of the password file on the system, then they have all the passwords. The common way to thwart this is to store hashes of passwords instead of the passwords themselves. This takes advantage of the one-way property of the hash. To authenticate a user, check if hash(password) = stored_hashed_password. If an intruder gets hold of the password file, they’re still stuck since they won’t be able to reconstruct the original password from the hash. They’ll have to resort to an exhaustive search or a dictionary attack to search for a password that hashes to the value in the file. An exhaustive search may take a prohibitively long time.

A dictionary attack is an optimization of the search that tests common passwords, including dictionary words and common letter-number substitutions. An intruder does not need to perform a search for each password to find a matching hash. Instead, the results of an exhaustive or dictionary search can be stored and searched quickly to find a corresponding hash in a password file. These are called precomputed hashes. To guard against this, a password is concatenated with a bunch of extra random characters, called salt. These characters make the password substantially longer and a table of precomputed hashes insanely huge and hence not practical to use. The salt is not a secret – it is stored in plaintext in the password file in order to validate a user’s password. Its only functions are to make precomputed hashes impractical and to ensure that even identical passwords do not generate the same hashed results.
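
A sketch of storing and checking salted password hashes with Python's standard library; PBKDF2 is used here as the (deliberately slow) hash function:

    import hashlib
    import hmac
    import os

    def hash_password(password, salt=None):
        """Return (salt, hash) suitable for storing in a password file."""
        if salt is None:
            salt = os.urandom(16)            # random salt, stored in plaintext
        digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
        return salt, digest

    def check_password(password, salt, stored_digest):
        _, digest = hash_password(password, salt)
        return hmac.compare_digest(digest, stored_digest)

    salt, stored = hash_password("hunter2")
    print(check_password("hunter2", salt, stored))   # True
    print(check_password("letmein", salt, stored))   # False: wrong password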

The other problem with reusable passwords is that if a network is insecure, an eavesdropper may sniff the password from the network. A potential intruder may also simply observe the user typing a password. To thwart this, we can turn to one-time passwords. If someone sees you type your credentials or read them from the network stream, it won’t matter because that information will be useless for future logins.

CHAP Authentication

The Challenge-Handshake Authentication Protocol (CHAP) is an authentication protocol that allows a server to authenticate a user without sending a password over the network.

Both the client and server share a secret (such as a password). A server creates a random bunch of bits (called a nonce) and sends it to the client (user) that wants to authenticate. This is the challenge.

The client identifies itself and sends a response that is the hash of the shared secret combined with the challenge. The server has the same data and can generate its own hash of the same challenge and secret. If the hash matches the one received from the client, the server is convinced that the client knows the shared secret and is therefore legitimate.

An intruder who sees this hash cannot extract the original data. An intruder who sees the challenge cannot create a suitable hashed response without knowing the secret. Note that this technique requires passwords to be accessible at the server, and the security rests on the password file remaining secure.
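
A sketch of the CHAP idea with Python's standard library (real CHAP hashes an identifier, the secret, and the challenge with MD5; the structure is the same):

    import hashlib
    import secrets

    shared_secret = b"correct horse battery staple"   # known to both client and server

    # Server: create a random challenge (nonce) and send it to the client.
    challenge = secrets.token_bytes(16)

    # Client: respond with a hash of the secret combined with the challenge.
    response = hashlib.sha256(shared_secret + challenge).hexdigest()

    # Server: compute the same hash from its copy of the secret and compare.
    expected = hashlib.sha256(shared_secret + challenge).hexdigest()
    print(response == expected)   # True means the client knows the secret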

Time-based: TOTP

With the Time-based One Time Password (TOTP) protocol, both sides share a secret key. To authenticate, a user runs the TOTP function to create a one-time password, which is a hash of the shared secret key and the current time of day. The service, which also knows the secret key and the time, can generate the same hash and thus validate the value presented by the user.
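
A sketch of how a TOTP value is computed (following the HOTP/TOTP construction from RFCs 4226 and 6238; the shared secret here is a made-up example):

    import hashlib
    import hmac
    import struct
    import time

    def totp(secret, interval=30, digits=6):
        counter = int(time.time()) // interval                 # current 30-second time step
        msg = struct.pack(">Q", counter)                       # 8-byte big-endian counter
        digest = hmac.new(secret, msg, hashlib.sha1).digest()  # keyed hash of the time step
        offset = digest[-1] & 0x0F                             # dynamic truncation
        code = int.from_bytes(digest[offset:offset + 4], "big") & 0x7FFFFFFF
        return str(code % 10 ** digits).zfill(digits)

    secret = b"12345678901234567890"   # hypothetical shared secret
    print(totp(secret))                # both user and service compute the same value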

TOTP is often used as a second factor (proof that you have some device with the secret configured in it) in addition to a password. The protocol is widely supported by companies such as Amazon, Dropbox, WordPress, Microsoft, and Google.

Public key authentication

Public key authentication relies on the use of nonces. A nonce is simply a randomly-generated bunch of bits and is sent to the other party as a challenge for them to prove that they are capable of encrypting something with a specific key that they possess. The use of a nonce is central to public key authentication.

If Alice wants to authenticate Bob, she needs to have Bob prove that he possesses his private key (private keys are never shared). To do this, Alice generates a nonce (a random bunch of bits) and sends it to Bob, asking him to encrypt it with his private key. If she can decrypt Bob’s response using Bob’s public key and sees the same nonce, she will be convinced that she is talking to Bob because nobody else will have Bob’s private key. Mutual authentication requires that each party authenticate itself to the other: Bob will also have to generate a nonce and ask Alice to encrypt it with her private key.
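
A sketch of the challenge step, assuming the pyca/cryptography package. Real libraries expose this as signing the nonce rather than "encrypting it with the private key," but it proves the same thing: possession of the private key.

    import secrets

    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import padding, rsa

    bob_private = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    pss = padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH)

    # Alice: send Bob a random nonce as a challenge.
    nonce = secrets.token_bytes(32)

    # Bob: respond by signing the nonce with his private key.
    proof = bob_private.sign(nonce, pss, hashes.SHA256())

    # Alice: check the response with Bob's public key (raises an exception on failure).
    bob_private.public_key().verify(proof, nonce, pss, hashes.SHA256())
    print("Bob demonstrated possession of his private key")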

Identity binding: digital certificates

While public keys provide a mechanism for asserting integrity via digital signatures, they are themselves anonymous. We’ve discussed a scenario where Alice uses Bob’s public key but never explained how she can be confident that the key really belongs to Bob and was not planted by an adversary. Some form of identity binding of the public key must be implemented for Alice to know that she really has Bob’s public key rather than someone else’s.

Digital certificates provide a way to do this. A certificate is a data structure that contains user information (called a distinguished name) and the user’s public key. To ensure that nobody changes any of this data, this data structure also contains a signature of the certification authority. The signature is created by taking a hash of the rest of the data in the structure and encrypting it with the private key of the certification authority. The certification authority (CA) is an organization that is responsible for setting policies for how it validates the identity of the person who presents the public key for encapsulation in a certificate.

To validate a certificate, you hash all the certificate data except for the signature. Then you decrypt the signature using the public key of the issuer. If the two values match, then you know that the certificate data has not been modified since it was signed. The challenge now is how to get the public key of the issuer. Public keys are stored in certificates, so the issuer also has a certificate containing its public key. The certificates for many of the CAs are preloaded into operating systems or, in some cases, browsers.
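
A sketch of checking one link in a chain, namely verifying that a certificate was signed by its issuer, assuming the pyca/cryptography package, an RSA-signed certificate, and two hypothetical PEM files (cert.pem and issuer.pem):

    from cryptography import x509
    from cryptography.hazmat.primitives.asymmetric import padding

    with open("cert.pem", "rb") as f:
        cert = x509.load_pem_x509_certificate(f.read())
    with open("issuer.pem", "rb") as f:
        issuer = x509.load_pem_x509_certificate(f.read())

    # Check the issuer's signature over the signed portion of the certificate.
    issuer.public_key().verify(
        cert.signature,
        cert.tbs_certificate_bytes,        # everything in the certificate except the signature
        padding.PKCS1v15(),                # assumes an RSA (PKCS#1 v1.5) signature
        cert.signature_hash_algorithm,
    )
    print("certificate signed by:", issuer.subject.rfc4514_string())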

Transport Layer Security (Secure Sockets Layer)

Secure Sockets Layer (SSL), now standardized as Transport Layer Security (TLS), the newer and more accurate name for the protocol, is a layer of software designed to provide authentication and secure communication over the abstraction of a sockets interface. It was designed to make it easy to add a secure transport onto insecure TCP socket-based protocols (e.g., HTTP and FTP). SSL uses a hybrid cryptosystem and relies on public keys for authentication. If both the sender and receiver have digital certificates, SSL can have each party validate them and use nonce-based public key authentication to validate that each party has the corresponding private key. In most cases, only the service will provide a certificate. If the server does not have a certificate, SSL will then use a public key simply to allow a symmetric session key to be passed securely from client to server. In this case, no identity binding is performed but we still have an encrypted connection.

While SSL supports multiple algorithms for authentication and message encryption, the common form is one where the client generates a session key and encrypts it with the server’s public key. This ensures that only the server will be able to decode the message and get the session key. After that, communication takes place using a symmetric algorithm and the client-generated session key.
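
A sketch of a TLS client using Python's standard ssl module; the default context checks the server's certificate chain against the CA certificates preloaded on the system and verifies the hostname.

    import socket
    import ssl

    context = ssl.create_default_context()       # loads the system's trusted CA certificates

    with socket.create_connection(("example.com", 443)) as sock:
        with context.wrap_socket(sock, server_hostname="example.com") as tls:
            print(tls.version())                  # negotiated protocol, e.g. "TLSv1.3"
            print(tls.getpeercert()["subject"])   # identity bound by the server's certificate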

Service Authorization via OAuth

Go here for lecture notes

Goal: Enable users to provide limited permissions for one service to access another service.

Suppose that you want an app (or some network service) to access your Google calendar. One way to do this is to provide the app with the ID and password of your Google account. Unfortunately, this gives the app full control of the entire account and is something you may not feel comfortable granting.

OAuth is designed to allow users to control what data or services one service can access from another service.

For example, if you want a photo printing service to access photos from your flickr account, you don’t want to provide it with your flickr username and password for unrestricted access. Instead, OAuth allows you to specify what access you allow the photo printing service to have to flickr and for how long. Token credentials (a bunch of bits as far as your app or website is concerned) are used in place of the resource owner’s username and password for gaining access to the service. These token credentials include a token identifier and a secret.

There are three entities involved in OAuth:

  1. User: the user and the user’s browser.

  2. Client (application): this is the service that the user is accessing. For example, the moo.com photo printing service.

  3. Authorization Server, acting as the Service Provider (server): this is the service that the consumer needs to access. For example, flickr.com to get the photos.

We’ll use the moo.com photo printing and the flickr.com photo serving service as an example.

  1. Alice wants to order some prints and logs into moo.com. Moo knows about flickr and allows her to select flickr as the source of her photos. When Moo built its service, its developers registered the service with flickr and obtained OAuth client credentials (a client ID and secret).

  2. When Alice selects flickr, moo.com needs to contact flickr.com for an authorization code, which is a temporary credential. The application, moo.com, creates a request that contains a scope: a list of requested services that it needs from flickr (for example, get photos from your account). The application then redirects Alice to the flickr OAuth page (that could be a separate server at flickr) with a directive to redirect her back to Moo when she’s done. The request contains the app’s client ID and the scope; the app’s secret is not sent through the browser but is used later when exchanging the authorization code for a token.

  3. At the flickr OAuth page, she authenticates (using login/password or perhaps via OpenID, which may cause another level of redirection) and is presented with a description of what moo.com is requesting to do (e.g., access to download photos for the next 10 minutes). She can approve or reject the request.

  4. When Alice approves the request, she is redirected back to moo.com. The redirect contains the authorization code.

  5. Moo now contacts flickr.com directly and exchanges the authorization code for an access token. Authorization codes are used to obtain a user’s approval. Access tokens are used to access resources on the provider (server); that is, to call APIs. Moo then sends API requests containing the access token to flickr to download the requested photos (a code sketch follows).
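
A sketch of step 5, assuming the third-party requests package; the endpoint URLs, client credentials, and authorization code below are hypothetical placeholders.

    import requests

    MOO_CLIENT_ID = "moo-client-id"                  # issued when Moo registered with flickr
    MOO_CLIENT_SECRET = "moo-client-secret"
    authorization_code = "code-from-the-redirect"    # received in step 4

    # Exchange the authorization code for an access token (server-to-server call).
    token_response = requests.post(
        "https://flickr.example/oauth/token",
        data={
            "grant_type": "authorization_code",
            "code": authorization_code,
            "client_id": MOO_CLIENT_ID,
            "client_secret": MOO_CLIENT_SECRET,
            "redirect_uri": "https://moo.example/oauth/callback",
        },
    )
    access_token = token_response.json()["access_token"]

    # Call the provider's API with the access token instead of Alice's password.
    photos = requests.get(
        "https://api.flickr.example/photos",
        headers={"Authorization": f"Bearer {access_token}"},
    )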

Distributed Authentication via OpenID Connect

Go here for lecture notes

Goal: Allow services to use a third-party user authentication service to authenticate a user.

OpenID Connect was created to relieve users of the burden of managing multiple identities (logins and passwords) at many different services (web sites).

It is not an authentication protocol. Rather, it’s a technique for a service to redirect authentication to a specific OpenID Connect authentication server and have that server be responsible for authentication.

OpenID Connect is an identity layer on top of the OAuth 2.0 protocol. It uses the basic protocol of OAuth and the same communication mechanisms: HTTPS with JSON messages.

There are three entities involved:

  1. Client: this is the application that the user is accessing. It is often a web site but may be a dedicated app. (In OAuth, we think of this as the consumer).

  2. User: this is the user’s browser. In the case of a dedicated app, the app will take this role.

  3. Authorization Server, acting as the Identity Provider. This is the server that is responsible for managing your ID and authenticating you. It is called the Authorization Server because OpenID Connect is built on top of OAuth 2.0, a service authorization framework. In OAuth, we think of this as a Service Provider rather than an Identity Provider.

OpenID Connect allows the client to answer the question, what is the identity of the person using this browser or app?

The protocol contacts an authorization server to get an access token and then uses that access token at the client.

Here’s the basic flow:

  1. Before OpenID Connect is used, a client may pre-register its service with an Authorization Server. This can allow a server administrator to control restrictions on policies for accessing user information. It also allows a client and server to agree upon a shared secret that will be used to sign future messages. If this is not done, public key cryptography can be used.

  2. The user is presented with a request to log in. The client may choose to support the use of multiple identity providers. In this case, the user identifies the identity provider in the user name, for example, joe@example.com. The protocol then knows that the authorization server is at example.com. Alternatively, the client may force the use of a specific provider (e.g., Google). The client redirects the user to the authorization server. This is done via an HTTP REDIRECT message.

  3. The redirect results in the client sending an Authentication Request to the authorization server. This is an OAuth message with a scope (access request) that requests "openid". The scope can also include other identifying data, such as user profile, email, postal address, and phone number.

  4. The authorization server authenticates the user using whatever protocol it wants to use.

  5. After authenticating the user, the authorization server requests the user’s permission for all requests listed in the scope. For example, SomeApp requests permission for: signing you in, your profile, your email address, your address, your phone number. This is the same as any OAuth request for services.

  6. If the user approves, the authorization server sends a redirect to switch the user back to the client. This redirect message contains an authorization code created by the authorization server that is now given to the client. The authorization code looks like a large bunch of random text and is of no use to the user.

  7. The client now sends a token request directly to the authorization server. The request includes the authorization code that it received. All traffic is encrypted via HTTPS to guard against eavesdroppers. If the client pre-registered, the request will contain a pre-established secret so that the authorization server can validate the client.

  8. The server returns an access token and an ID token to the client. Note that the user is not involved in this flow.

The ID token asserts the user’s identity. It is a JSON object that:

  • identifies the Identity Provider (authorization server)
  • has an issue date and an expiration date
  • may contain additional details about the user or service
  • is digitally signed by the identity provider using either the provider’s private key or a shared secret that was set up during registration.

By getting this ID token, the client knows that, as far as the authorization server is concerned, the user has successfully authenticated. At this point, the client does not need to contact the authorization server anymore.
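
As an illustration, the decoded body of an ID token is a small set of JSON claims along the following lines (the values here are hypothetical); the whole object is signed and base64-encoded as a JWT:

    id_token_claims = {
        "iss": "https://accounts.example.com",   # the identity provider that issued the token
        "sub": "248289761001",                   # a stable identifier for the user
        "aud": "moo-client-id",                  # the client the token was issued to
        "iat": 1544640000,                       # issued-at time (Unix timestamp)
        "exp": 1544643600,                       # expiration time
        "email": "alice@example.com",            # optional claims requested via the scope
    }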

The access token is the same access token that OAuth provides clients so that they can request services. It has an expiration date associated with it and may be passed to the authorization server to request access to detailed information about the user or get other protected resources. What the client can request is limited by the scope of the Authentication Request in step 3.

OpenID and OAuth were separate protocols until 2014. They were merged such that OpenID Connect is a special mode of OAuth that requests authentication and provides the client with an ID token. However, the purposes of OAuth and OpenID are fundamentally different. OpenID is designed to allow a third party to manage user authentication; that is, it provides single sign-on, allowing a user to use the same login and a third-party (OpenID) authentication service for many web sites. OAuth, on the other hand, is designed for service authorization. It is up to the remote service that you are accessing to decide if and how to authenticate you before allowing you to authorize access to that service.