Distributed Lookup

Domain Name System (DNS)

Paul Krzyzanowski

October 24, 2023

Goal: Design a system to look up domain names that can scale to the planet-wide internet and handle queries on billions of objects.

The Internet Domain Name System (DNS) is the naming system for nodes on the Internet. It associates human-friendly names with numeric IP addresses and other information about those nodes.

Introduction

The Internet Domain Name System, DNS, is the distributed system that enables the lookup of hundreds of millions of domain names. It’s an application-specific implementation, not a generic object store, but it is a collection of software that is used every time we access a web page, send email, or send a packet to any system on the Internet.

How are IP addresses assigned?

Before we get to Internet domain names, let’s touch on IP addresses. The Internet employs a hierarchical system for assigning IP addresses.

A global non-profit organization called ICANN, or the Internet Corporation for Assigned Names and Numbers, is responsible for managing IP addresses, autonomous system numbers that are used for routing, and the domain name system.

The Internet Assigned Numbers Authority, the IANA, is a department within ICANN that is responsible for assigning IP addresses and managing top-level domains. The IANA allocates chunks of the IP address space to five organizations called Regional Internet Registries (RIR). These cover large geographic areas.

For instance, ARIN, the American Registry for Internet Numbers, covers the U.S. and Canada. The full list of RIRs can be found at nro.net and comprises:

  1. AFRINIC: African Network Information Centre
  2. APNIC: Asia-Pacific Network Information Centre
  3. ARIN: American Registry for Internet Numbers (U.S., Canada, Caribbean and North Atlantic islands)
  4. LACNIC: Latin American and Caribbean Internet Addresses Registry
  5. RIPE NCC: Réseaux IP Européens Network Coordination Centre (Europe, the Middle East, and parts of Central Asia)

These Regional Internet Registries, in turn, assign ranges of IP addresses to ISPs and other autonomous systems. An autonomous system (AS) is the term for a collection of IP networks and routers that are under the control of a single organization that presents a common routing policy to the Internet. Each AS is identified by a unique number. Like IP addresses, the top-level range of AS numbers is controlled by the IANA, and then the individual RIRs assign them to the network operators within their jurisdictions. These network operators can then assign smaller ranges or individual addresses to smaller ISPs or to their customers. For example, Rutgers is an Autonomous System (AS46) and owns the range of IP addresses 128.6.0.0 – 128.6.255.255 (128.6.0.0/16) and 165.230.0.0 – 165.230.255.255 (165.230.0.0/16) as well as a few other smaller ranges.
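To make the range notation concrete, here is a minimal sketch that uses Python's standard ipaddress module to expand a /16 prefix into its first and last address. The helper name describe_block is made up for this illustration; the two prefixes are the Rutgers ranges mentioned above.

```python
import ipaddress

def describe_block(cidr):
    """Return (first address, last address, number of addresses) for a CIDR block."""
    net = ipaddress.ip_network(cidr)
    return str(net.network_address), str(net.broadcast_address), net.num_addresses

for block in ("128.6.0.0/16", "165.230.0.0/16"):
    first, last, count = describe_block(block)
    print(f"{block}: {first} - {last} ({count} addresses)")

# Membership test: does an address fall inside a block?
inside = ipaddress.ip_address("128.6.46.88") in ipaddress.ip_network("128.6.0.0/16")
print(inside)
```

A /16 prefix fixes the first 16 bits of the address, leaving 16 bits (65,536 addresses) for the organization to assign.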

Organizations may get permanent or temporary addresses assigned to them. With a permanent assignment, you essentially get the IP address forever. With a temporary one, you need to request an available address and get one that you must renew periodically.

How are machine names assigned?

When the Internet was young, in the days when it was the ARPANET, all computer names and their corresponding addresses were managed by one person – Jon Postel at the Stanford Research Institute’s Network Information Center (SRI-NIC).

Computer names formed a flat namespace: each name had to be unique and there was no concept of domains or any form of hierarchy. Machines had names such as UCBVAX for a certain Vax computer at UC Berkeley or DECWRL for a computer at Digital Equipment Corporation’s Western Research Lab.

If you had a system on the Internet, you would periodically download the latest copy of the hosts.txt file from SRI-NIC via FTP. It was a text file that contained the names of all the computers on the ARPANET and their corresponding IP addresses. By searching this file, programs could look up the address corresponding to a specific machine name.
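The lookup itself was just a search of a flat table. The sketch below illustrates the idea with a simplified, hypothetical file layout (one "address name" pair per line); it is not the actual historical file format, and the addresses are placeholders from a range reserved for documentation.

```python
# Simplified, hypothetical layout: this is NOT the real historical hosts.txt format.
SAMPLE_HOSTS = """\
# address      name
192.0.2.10     UCBVAX
192.0.2.20     DECWRL
"""

def parse_hosts(text):
    """Build a name -> address table from the simplified file layout."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        address, name = line.split()
        table[name.upper()] = address          # this sketch ignores letter case
    return table

def lookup(table, name):
    """Return the address for a host name, or None if it is not listed."""
    return table.get(name.upper())

hosts = parse_hosts(SAMPLE_HOSTS)
print(lookup(hosts, "ucbvax"))
```

Every host needs its own complete copy of the table, which is exactly what stops working as the network grows.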

This worked well when there weren’t a lot of hosts on the Internet. Until around 1990, the Internet was accessible only to companies and universities working on Department of Defense projects. As the number of hosts grew, the system didn’t scale. Asking people to download a file containing all the hosts on the Internet no longer worked: the file would get huge and the information within it would change too frequently.

Domain name hierarchy

Coming up with names for computers also became an issue. It is challenging to create and manage meaningful unique names on a large scale (e.g., try picking an unused but meaningful handle for any popular social networking service). Hierarchical naming systems are commonly used to create names that provide uniqueness and facilitate management. A name that is made up of a list of components is called a compound name. We see this in names such as pathnames (/home/paul/src/qsync/main.c) and in Internet domain names (www.cs.rutgers.edu).

The growth of hosts on the Internet led to the creation of a hierarchical namespace of domain names. A domain is just an administrative grouping to manage names. A domain name is a set of textual names separated by dots and organized right to left, with the top of the hierarchy being the rightmost name. In the domain name www.cs.rutgers.edu, www is a machine under cs, which is under rutgers, which is under the edu domain.
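As a small illustration of the right-to-left ordering, this sketch splits a domain name into its labels and lists the enclosing domains from the top of the hierarchy down. The function name hierarchy is made up for the example.

```python
def hierarchy(domain):
    """List the enclosing domains of a name, starting at the top-level domain."""
    labels = domain.rstrip(".").split(".")
    # Build progressively longer suffixes: edu, rutgers.edu, cs.rutgers.edu, ...
    return [".".join(labels[i:]) for i in range(len(labels) - 1, -1, -1)]

print(hierarchy("www.cs.rutgers.edu"))
# ['edu', 'rutgers.edu', 'cs.rutgers.edu', 'www.cs.rutgers.edu']
```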

Internet domain names form an arbitrarily deep tree-structured hierarchy that allows us to partition the management of computer names. For instance, rutgers is assigned a name under edu, which is a top-level domain reserved for education institutions. This doesn’t conflict with other places where rutgers might be used, such as rutgers.com, rutgers.net, or rutgers.party, each of which can belong to completely different organizations.

Rutgers can then create sub-domains within its rutgers.edu namespace to allow different groups to choose whatever they want under that part of the name.

Top-level domains

At the top of the hierarchy, under the root, we have top-level domains. There are three categories of top-level domains:

  1. Country code domains contain two-letter country code names, such as de for Germany, es for Spain, or uk for the United Kingdom.

  2. Internationalized domain name (IDN) country code top-level domains are country code domains that are displayed in the country’s native script. For example, .中国 for China, .ελ for Greece, and .پاکستان for Pakistan.

  3. Finally, generic top-level domains include traditional ones like .com, .edu, and .org and all the newer ones like .party, .audio, and so on. These domains also include names in different languages.

Currently, there are 1,589 top-level domains. The Internet Assigned Numbers Authority (IANA) delegates the management of various domains to different organizations. Each top-level domain has an administrator who is in charge of it. The IANA itself only keeps track of the root servers. These root servers tell you who to contact for information about top-level domains.

Shared registration

Domain name allocation and management is done through a system of shared registration. The domain name registry is the master database of all domain names that are registered under a top-level domain.

The domain name registry operator is the company that is in charge of this database. These operators run a NIC – a network information center – that tracks information about specific domains. The list of registry operators can be found at icann.org.

Then there’s the domain name registrar. This is the company that you use to register a domain name. There can be many registrars for each top-level domain and each registrar can handle registrations for multiple top-level domains. The registrars consult and update the master database that’s managed at the Registry Operator’s NIC. The database of domain name registrars can be found at iana.org.

Currently, 2,661 registrars provide registration services for various domains. Of these, 1,202 are registrars for DropCatch.com, which is a collector of expiring domains. DropCatch has so many registrars because the domain name registries allow each registrar to contact them only at a limited frequency. Having many registrars allows DropCatch to check registries essentially constantly to pick up domain names that have just expired.

The registrar you choose becomes the designated registrar for your domain. It’s the company you go through for any changes since you cannot contact the registry directly. The registry operator keeps the central registry database for the top-level domain. Only the designated registrar, the company you registered your domain name with, can make changes for that domain name unless you invoke a domain transfer to another registrar.

For example, the company Namecheap is the designated registrar for the domain poopybrain.com and Verisign is the registry operator for the .com top-level domain. This means that Namecheap sends information about poopybrain.com to Verisign.

Mapping names to addresses

The problem that we need to solve now is that we have two completely different things: IP addresses and domain names. They are assigned separately and are generally unrelated to each other.

We need a way to be able to resolve human-friendly domain names into IP addresses that software can use to send and receive data.

Original solution

The original solution, as we saw, was to download the file containing the list of all computer names on the Internet along with their corresponding addresses onto your own system. Then, local software on your system can search for a name and find the address.

This was the system in place throughout the 1970s and 1980s. The file would be downloaded via FTP from the Network Information Center (NIC) at the Stanford Research Institute (SRI).

Of course, this solution did not scale to millions of hosts on the Internet. Not only would the file get big but there’s also a lot of churn in the data. Hosts are constantly being added and deleted and many addresses are frequently changing.

The Domain Name System

The Domain Name System (DNS) was designed to serve as a planet-wide distributed database that stores information about domain names and enables hosts on the Internet to query them. It’s built as a hierarchy of name servers. A name server runs a service where you give it a name and it gives you information about the name.

DNS is an application-layer protocol. It’s not needed in the Internet protocol stack. IP (sockets, routers, TCP, UDP) strictly works with IP addresses. DNS is built for humans. Computers at the edge of the network resolve names into addresses and, after that, the network only uses addresses.

No relationship between names and addresses

It’s useful to underscore that no relationship exists between names and addresses. You can define any name to point to any address or as an alias to any other system. For instance, the domain cs.poopybrain.com is an alias for cs.rutgers.edu. It can also be configured to resolve to the IP address for cs.rutgers.edu or any other system on the planet. That mapping is up to the owner of poopybrain.com, not rutgers.edu, which owns the destination address.

DNS provides…

DNS servers answer queries for various types of information about domain names. Some of the data they provide includes:

Addresses
Perhaps most importantly, they give us an IP address that corresponds to a name.
Aliases
They can also provide aliases. These are called canonical name records, where you specify that one name really refers to another name.
Name servers
They identify name servers. These are other DNS servers that tell you where to go for more information about that domain.
Mail servers
They give you the names of mail servers for that domain.
Text data
They can provide arbitrary other data in text records.

DNS servers enable load distribution because you can have lots of name servers that can handle queries for the same domain. DNS servers cache previous lookups to return responses faster the next time someone looks up the same domain name.

They can also provide a list of IP addresses for a given domain name. This allows the client to contact any one of several IP addresses to find available servers or to do load balancing. Some DNS servers shuffle that list of IP addresses for successive queries so that different clients will likely choose different addresses even if they use a simple approach such as choosing the first address.
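The effect of reordering the address list can be sketched in a few lines. This example uses a simple rotation rather than a random shuffle so that the result is predictable; the class name RotatingAnswer and the addresses (taken from a range reserved for documentation) are assumptions made for the illustration, not how any particular DNS server implements it.

```python
from collections import deque

class RotatingAnswer:
    """Hands out the same address list, rotated by one position per query."""

    def __init__(self, addresses):
        self._addresses = deque(addresses)

    def answer(self):
        current = list(self._addresses)
        self._addresses.rotate(-1)   # the next query starts with the next address
        return current

pool = RotatingAnswer(["192.0.2.1", "192.0.2.2", "192.0.2.3"])

# Four clients that each naively pick the first address in the list they receive.
first_choices = [pool.answer()[0] for _ in range(4)]
print(first_choices)
```

Even though every client uses the "take the first address" strategy, successive clients end up on different servers.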

DNS is distributed & hierarchical

DNS is structured to mirror the hierarchy of domain names. The root of the hierarchy knows about the DNS servers that are responsible for top-level domains.

Each top-level DNS server knows about the DNS servers for each domain immediately beneath it: the edu DNS servers will know about the DNS servers for rutgers.edu, columbia.edu, nyu.edu, and so on.

Descending deeper into the hierarchy, DNS servers are responsible for names within individual organizations.

Authoritative servers

DNS has a concept of zones and authoritative servers. A zone is just a group of machines under a node in the domain tree that’s managed by one entity. For instance, rutgers.edu is a zone.

An authoritative name server is the DNS server that is configured for that zone rather than some other DNS server that might have cached information about that zone.
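One way to picture zones is as a question: which is the most specific zone that covers a given name? The sketch below answers it by trying progressively shorter suffixes of the name. The set of zones is a hypothetical example, and the function name find_zone is made up for this illustration.

```python
def find_zone(name, zones):
    """Return the most specific zone in `zones` that contains `name`, or None."""
    labels = name.rstrip(".").lower().split(".")
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])   # longest suffix first
        if candidate in zones:
            return candidate
    return None

ZONES = {"edu", "rutgers.edu"}   # hypothetical set of known zones

print(find_zone("www.cs.rutgers.edu", ZONES))   # rutgers.edu
print(find_zone("www.nyu.edu", ZONES))          # edu
print(find_zone("www.example.com", ZONES))      # None
```

The authoritative name servers for the zone that is found are the ones configured to answer for that name.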

Finding your way…

Suppose you want to contact a system at Rutgers. You need its address. That’s handled by a DNS server that Rutgers administers. How do we find it?

The domain registry helps us here. When you register a domain with a domain registrar, you provide it with the addresses of DNS servers that can answer queries about the domain. The domain registrar stores this information at the domain registry.

Root name servers

We know that the information about some computer in Rutgers is sitting in a DNS server that Rutgers administers. That doesn’t help us if we don’t know how to get to that DNS server. To find the server we need, we can start at the root of the DNS hierarchy.

Root name servers can tell you the addresses of DNS servers responsible for all the top-level domains. By asking any root DNS server about the computer at Rutgers, it will provide the addresses for DNS servers that are responsible for the edu domain.

There are 13 root name servers. If you’re setting up your own DNS server, you can download the list from internic.net at: https://www.internic.net/domain/named.root

The root servers have names like A.ROOT-SERVERS.NET, B.ROOT-SERVERS.NET, and so on. In actuality, there are more than 13 physical servers: each server is a load-balanced set of computers. The file identifies the name of each server and its corresponding IPv4 and IPv6 addresses. It looks like this:

.                        3600000      NS    A.ROOT-SERVERS.NET.
A.ROOT-SERVERS.NET.      3600000      A     198.41.0.4
A.ROOT-SERVERS.NET.      3600000      AAAA  2001:503:ba3e::2:30
; 
.                        3600000      NS    B.ROOT-SERVERS.NET.
B.ROOT-SERVERS.NET.      3600000      A     199.9.14.201
B.ROOT-SERVERS.NET.      3600000      AAAA  2001:500:200::b
; 
.                        3600000      NS    C.ROOT-SERVERS.NET.
C.ROOT-SERVERS.NET.      3600000      A     192.33.4.12
C.ROOT-SERVERS.NET.      3600000      AAAA  2001:500:2::c
;
.                        3600000      NS    D.ROOT-SERVERS.NET.
D.ROOT-SERVERS.NET.      3600000      A     199.7.91.13
D.ROOT-SERVERS.NET.      3600000      AAAA  2001:500:2d::d
...
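To make the layout of this file concrete, here is a minimal sketch that reads an excerpt of it into a table of server names and addresses. It assumes each record has exactly four whitespace-separated fields (name, time-to-live, record type, value) as in the excerpt above and treats everything after a semicolon as a comment. The function name parse_hints is made up for the example.

```python
ROOT_HINTS = """\
.                        3600000      NS    A.ROOT-SERVERS.NET.
A.ROOT-SERVERS.NET.      3600000      A     198.41.0.4
A.ROOT-SERVERS.NET.      3600000      AAAA  2001:503:ba3e::2:30
;
.                        3600000      NS    B.ROOT-SERVERS.NET.
B.ROOT-SERVERS.NET.      3600000      A     199.9.14.201
B.ROOT-SERVERS.NET.      3600000      AAAA  2001:500:200::b
"""

def parse_hints(text):
    """Return {server name: {"A": ipv4, "AAAA": ipv6}} from the excerpt's layout."""
    servers = {}
    for line in text.splitlines():
        line = line.split(";", 1)[0].strip()   # strip comments
        if not line:
            continue
        owner, ttl, rtype, value = line.split()
        if rtype == "NS":                      # "." NS <server>: names a root server
            servers.setdefault(value, {})
        elif rtype in ("A", "AAAA"):           # address records for that server
            servers.setdefault(owner, {})[rtype] = value
    return servers

roots = parse_hints(ROOT_HINTS)
for name, addrs in roots.items():
    print(name, addrs)
```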

DNS Query types

There are two ways queries are done via DNS: iterative resolution and recursive resolution.

Iterative resolution

With iterative resolution, a DNS server returns either an answer or a referral to another DNS server.

A referral is a message that tells you about a DNS server at a lower level in the domain hierarchy. The DNS client must process these referrals by submitting queries to those servers. For example, suppose you want to look up www.rutgers.edu.

Your DNS client will contact any one of the root name servers for www.rutgers.edu. The root server won’t know about rutgers.edu but it will return a referral to name servers for the edu domain: a list of names of DNS servers that handle edu and their corresponding addresses:

a.edu-servers.net.	172800	IN	A	192.5.6.30
a.edu-servers.net.	172800	IN	AAAA	2001:503:a83e::2:30
b.edu-servers.net.	172800	IN	A	192.33.14.30
b.edu-servers.net.	172800	IN	AAAA	2001:503:231d::2:30
c.edu-servers.net.	172800	IN	A	192.26.92.30
c.edu-servers.net.	172800	IN	AAAA	2001:503:83eb::30
d.edu-servers.net.	172800	IN	A	192.31.80.30
d.edu-servers.net.	172800	IN	AAAA	2001:500:856e::30
e.edu-servers.net.	172800	IN	A	192.12.94.30
e.edu-servers.net.	172800	IN	AAAA	2001:502:1ca1::30
f.edu-servers.net.	172800	IN	A	192.35.51.30
f.edu-servers.net.	172800	IN	AAAA	2001:503:d414::30
...

You’ll then contact one of these edu name servers for www.rutgers.edu.

It also doesn’t know about the full domain name but will return a referral to the name servers for rutgers.edu:

dns2.rutgers.edu.	172800	IN	A	130.156.133.30
dns2.rutgers.edu.	172800	IN	AAAA	2607:f3b0:133:1::2
ns1.rutgers.edu.	172800	IN	A	165.230.252.14
ns1.rutgers.edu.	172800	IN	AAAA	2620:0:d60:2::2
ru-ufl.rutgers.edu.	172800	IN	A	128.227.128.162
...

When you contact the name server at rutgers.edu, it will return a definitive answer for the domain name:

www.rutgers.edu.	3600	IN	A	128.6.46.88

This goes on behind the scenes: your program just calls a library to look up a domain name (which may be incorporated into the sockets interface in languages such as Python or Java). Note that a zone, and hence a name server, does not have to be responsible for just a single level of the domain hierarchy. For instance, if we were looking up www.cs.rutgers.edu, the name server for rutgers.edu would give us an answer of:

www.cs.rutgers.edu.	3600	IN	CNAME	dev6.cs.rutgers.edu.
dev6.cs.rutgers.edu.	3600	IN	A	128.6.48.178

This tells us that www.cs.rutgers.edu is an alias (CNAME = canonical name) for a node called dev6.cs.rutgers.edu and that the IP address for dev6.cs.rutgers.edu (A = address record) is 128.6.48.178. There could be, but isn’t, a referral to a name server that is responsible for the cs.rutgers.edu domain. This is simply an administrative decision and the way Rutgers decided to manage their name space.
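Following an alias to its address can be sketched as a short loop over a table of records. The two records are the ones shown above; the function name resolve_address and the hop limit are assumptions made for this illustration.

```python
RECORDS = {
    ("www.cs.rutgers.edu.", "CNAME"): "dev6.cs.rutgers.edu.",
    ("dev6.cs.rutgers.edu.", "A"): "128.6.48.178",
}

def resolve_address(name, records, max_hops=8):
    """Follow CNAME aliases until an A record is found; return None on failure."""
    for _ in range(max_hops):
        if (name, "A") in records:
            return records[(name, "A")]
        alias = records.get((name, "CNAME"))
        if alias is None:
            return None          # neither an address nor an alias
        name = alias             # continue with the canonical name
    return None                  # too many aliases: give up (guards against loops)

print(resolve_address("www.cs.rutgers.edu.", RECORDS))
```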

The advantage of iterative resolution is that each component is stateless. It either has an answer, provides a referral, or it fails the query.
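The walk from the root down to the answer can be simulated without any network access. In this sketch, each "server" is just an entry in a dictionary that holds either answers or referrals; the server labels (root, edu-server, rutgers-server) and function names are hypothetical, and real DNS servers exchange protocol messages rather than Python tuples.

```python
# Simulated servers: each one knows some answers and some referrals.
SERVERS = {
    "root": {"answers": {}, "referrals": {"edu": "edu-server"}},
    "edu-server": {"answers": {}, "referrals": {"rutgers.edu": "rutgers-server"}},
    "rutgers-server": {"answers": {"www.rutgers.edu": "128.6.46.88"}, "referrals": {}},
}

def query(server, name):
    """One stateless query: an answer, a referral, or a failure."""
    data = SERVERS[server]
    if name in data["answers"]:
        return "answer", data["answers"][name]
    for suffix, next_server in data["referrals"].items():
        if name == suffix or name.endswith("." + suffix):
            return "referral", next_server
    return "fail", None

def resolve_iteratively(name, start="root", max_steps=10):
    """The client follows referrals itself; returns (address, servers contacted)."""
    server, path = start, []
    for _ in range(max_steps):
        path.append(server)
        kind, value = query(server, name)
        if kind == "answer":
            return value, path
        if kind == "fail":
            return None, path
        server = value           # referral: ask the next server down the tree
    return None, path

print(resolve_iteratively("www.rutgers.edu"))
```

Note that query keeps no memory between calls: all the bookkeeping lives in the client's loop.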

Recursive resolution

Recursive name resolution isn’t a great name because we’re not really using recursion. Recursive resolution means that a name server is willing to take on the responsibility of fully resolving the name so the client doesn’t have to deal with referrals. Basically, it does a sequence of iterative resolutions until it finds a name server that gives it the answer or it gives up if it’s unable to find one.

The DNS server never sends back referrals to the client that made the request. Instead, it will query all the needed DNS servers to find the domain name, handle the referrals itself, and then return either the answer or a failure to the client that made the query.

The good part about recursive resolution is that the client doesn’t need to deal with referrals and DNS servers can cache all the intermediate results they discovered to make query resolution quicker in the future.

While recursive resolution makes life easier for the process that is making the request, the disadvantage of this approach is that the name server has more work to do. It may have to issue multiple queries and process responses to resolve the domain name, maintaining the context of the query until the response is sent.
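The trade-off can be seen in a small simulation: the resolver below does all the referral-chasing on the client's behalf and remembers the result. It is only a sketch under simplifying assumptions. The simulated servers and their labels are hypothetical, only final answers are cached (not intermediate referrals), and cached entries never expire.

```python
# Simulated servers: each is a function returning (kind, value) for a name.
SERVERS = {
    "root": lambda n: ("referral", "edu-server") if n.endswith(".edu") else ("fail", None),
    "edu-server": lambda n: ("referral", "rutgers-server") if n.endswith(".rutgers.edu") else ("fail", None),
    "rutgers-server": lambda n: ("answer", "128.6.46.88") if n == "www.rutgers.edu" else ("fail", None),
}

class RecursiveResolver:
    """Resolves names for clients, handling referrals itself and caching answers."""

    def __init__(self, servers, root="root"):
        self.servers = servers
        self.root = root
        self.cache = {}              # name -> address (no expiry in this sketch)
        self.upstream_queries = 0    # how much work the resolver had to do

    def resolve(self, name):
        if name in self.cache:
            return self.cache[name]
        server = self.root
        for _ in range(10):          # bound the number of referrals followed
            self.upstream_queries += 1
            kind, value = self.servers[server](name)
            if kind == "answer":
                self.cache[name] = value
                return value
            if kind != "referral":
                return None          # the client only ever sees an answer or a failure
            server = value
        return None

resolver = RecursiveResolver(SERVERS)
print(resolver.resolve("www.rutgers.edu"), resolver.upstream_queries)  # three queries
print(resolver.resolve("www.rutgers.edu"), resolver.upstream_queries)  # served from cache
```

The client makes a single request either way; the extra queries and the per-request state are the resolver's burden.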

Top-level DNS servers handle only iterative queries. They want to remain stateless, handle simple local lookups, and be able to support a heavy query volume with minimal effort.

Resolvers in action

Most computers run a service called a DNS stub resolver. This is a mini DNS server that stores and checks cached lookups so that the computer does not have to waste time contacting a remote service each time it needs to find the address of google.com or any other frequently accessed domain. Prior to issuing a remote query, the stub resolver also checks a local hosts file (/etc/hosts on Unix-like systems; a file named hosts under C:\Windows\System32\drivers\etc on Windows) to see if there are any pre-configured name-to-address mappings.

If an answer cannot be found in the cache or in the hosts file, the stub resolver then contacts a DNS server, often one provided by the ISP or a public DNS server such as Cloudflare (1.1.1.1), Google Public DNS (8.8.8.8), Quad9 (9.9.9.9), OpenDNS (208.67.222.222), or one of several other free DNS services.
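The order of these checks can be sketched as a single function. This is an illustration, not how any particular operating system implements its resolver: the function name stub_lookup, the order in which the cache and hosts file are consulted, and the stand-in fake_dns_server function are all assumptions made for the example.

```python
def stub_lookup(name, cache, hosts, upstream):
    """Return (address, source): check the cache, then the hosts file, then a DNS server."""
    name = name.lower()
    if name in cache:
        return cache[name], "cache"
    if name in hosts:
        return hosts[name], "hosts file"
    address = upstream(name)         # e.g., the ISP's or a public DNS server
    if address is not None:
        cache[name] = address        # remember the answer for next time
    return address, "remote DNS server"

def fake_dns_server(name):
    """Stand-in for a remote DNS server so the sketch runs without a network."""
    return {"www.rutgers.edu": "128.6.46.88"}.get(name)

cache = {}
hosts = {"localhost": "127.0.0.1"}
print(stub_lookup("localhost", cache, hosts, fake_dns_server))
print(stub_lookup("www.rutgers.edu", cache, hosts, fake_dns_server))
print(stub_lookup("www.rutgers.edu", cache, hosts, fake_dns_server))
```

In a real Python program, none of this is written by hand: a call such as socket.getaddrinfo from the standard library asks the system's resolver to do the lookup.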

To summarize, DNS is a special-purpose system but a great example of a distributed software system that runs on millions of systems throughout the world and is used constantly by everyone who accesses any Internet services.

Last modified October 29, 2023.