pk.org: Computer Security/Lecture Notes

Public Key Cryptography and Integrity

Part 2 - Hash functions

Paul Krzyzanowski – 2025-09-19

Part 1: Public Key Cryptography
Part 2: Hash functions
Part 3: Integrity Mechanisms
Part 4: Diffie-Hellman
Part 5: Putting It All Together
Part 6: Quantum Attacks and Post-Quantum Cryptography


Hash Functions

Introduction

Public key algorithms gave us the ability to encrypt data and to sign values with a private key. However, they are inefficient if applied directly to large messages. Encrypting or signing a multi-megabyte file with RSA or ECC would be impractical. We need a way to reduce data to a compact form without losing the ability to detect changes. This is where hash functions come in.

A cryptographic hash function takes an input of arbitrary length and produces a fixed-size output, often called a digest. For example, SHA-256 maps any message into a 256-bit value. Even if the input is a book or a video file, the output will still be 256 bits. The hash acts as a fingerprint of the data.
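This fixed-size property is easy to see directly. A minimal sketch using Python's standard `hashlib` (the input strings are arbitrary examples):

```python
import hashlib

# Inputs of very different lengths all map to a 256-bit (32-byte) digest.
short = hashlib.sha256(b"hi").hexdigest()
long = hashlib.sha256(b"x" * 10_000_000).hexdigest()

print(len(short), len(long))  # both 64 hex characters = 256 bits
```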

Hashes are not new in computing. Operating systems and file systems use checksums and cyclic redundancy checks (CRCs) to detect errors in stored or transmitted data. Cryptographic hashes, however, require much stronger properties than simple error-detecting codes.

Properties of Cryptographic Hash Functions

A cryptographic hash function maps input data of any length into a fixed-size digest. For a function to be secure, it must provide several properties:

Pre-image resistance: given a digest \(h\), it must be infeasible to find any message \(m\) such that \(H(m) = h\). The function is one-way.

Second pre-image resistance: given a message \(m_1\), it must be infeasible to find a different message \(m_2\) such that \(H(m_1) = H(m_2)\).

Collision resistance: it must be infeasible to find any two distinct messages \(m_1\) and \(m_2\) such that \(H(m_1) = H(m_2)\).

Efficiency and determinism: the function must be fast to compute, and the same input must always produce the same digest.

These properties distinguish a cryptographic hash function from ordinary checksums. CRCs, for example, are excellent at detecting accidental bit errors but are linear and easy to manipulate. Cryptographic hashes must resist deliberate attack.

Collisions and Probability

The pigeonhole principle is a basic idea from mathematics: if you have more items than containers, at least one container must hold more than one item. For example, if you try to put 11 socks into 10 drawers, one drawer will have at least two socks.

In the context of hash functions, there are far more possible inputs than outputs. This means collisions (two inputs mapping to the same hash value) are inevitable. However, with a well-designed function such as SHA-256, the number of possible outputs is so enormous that collisions are astronomically unlikely to occur by chance.

The birthday paradox is a probability result that surprises many people. In a room of just 23 people, the chance that two share a birthday is already over 50%, even though there are 365 possible birthdays. The paradox arises because we are comparing many pairs at once.
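The 23-person figure can be checked directly by multiplying the probabilities that each successive person avoids all earlier birthdays. A quick sketch:

```python
def birthday_collision_probability(people: int, days: int = 365) -> float:
    # P(at least one shared birthday) = 1 - P(all birthdays distinct)
    p_distinct = 1.0
    for i in range(people):
        p_distinct *= (days - i) / days
    return 1.0 - p_distinct

print(birthday_collision_probability(23))  # ≈ 0.507, just over 50%
```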

For hash functions, this means that collisions can be found in roughly \(2^{n/2}\) attempts for an \(n\)-bit hash, not \(2^n\) as you might first think. For SHA-256, this still requires about \(2^{128}\) operations, which is infeasible with any foreseeable technology.
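The birthday bound can be watched in action by truncating SHA-256 to a small output. With 24 bits, \(2^{n/2} = 2^{12}\), so a collision should appear after only a few thousand attempts. A sketch (the truncation width is chosen purely for the demo; real SHA-256 is never truncated this way):

```python
import hashlib

def trunc_hash(data: bytes, nbytes: int = 3) -> bytes:
    # Truncate SHA-256 to 3 bytes (24 bits) so collisions are cheap to find.
    return hashlib.sha256(data).digest()[:nbytes]

seen = {}  # digest -> message that produced it
i = 0
while True:
    msg = i.to_bytes(8, "big")
    d = trunc_hash(msg)
    if d in seen:
        a, b = seen[d], msg  # two distinct inputs, same 24-bit hash
        break
    seen[d] = msg
    i += 1

print(f"collision after {i + 1} hashes")  # on the order of 2^12, not 2^24
```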

These ideas explain why collisions are guaranteed in theory but still practically unachievable in secure hash functions. They also highlight the importance of choosing functions with large output sizes.

Examples

Several hash functions have been widely used over time:

MD5: produces a 128-bit digest. Collisions can now be found in seconds, so it is unsuitable for any security purpose.

SHA-1: produces a 160-bit digest. A practical collision was demonstrated in 2017 (the SHAttered attack), and it has been deprecated for signatures and certificates.

SHA-2: a NIST-standardized family that includes SHA-256 and SHA-512. It remains the most widely deployed secure hash family.

SHA-3: standardized by NIST in 2015 and based on the Keccak algorithm. Its internal structure (a sponge construction) differs from SHA-2, making it a useful alternative.


Uses of Hash Functions

Integrity Checks

Hashes are used as fingerprints to detect accidental corruption. Software vendors often publish hash values alongside downloads. Users can recompute the hash of the file they receive and compare it against the published value.
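A file's digest can be computed incrementally, so even a multi-gigabyte download never needs to fit in memory. A minimal sketch using Python's `hashlib` (the path passed in would be the downloaded file):

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    # Feed the file to the hash in chunks; the resulting digest is
    # identical to hashing the entire file at once.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

The returned hex string is what gets compared against the vendor's published value.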

Digital Signatures

Public key algorithms operate on fixed-size values. Instead of signing an entire document, we sign its hash. The sender computes \(h = H(m)\) and encrypts \(h\) with their private key to create a signature. The recipient computes the hash of the received document and compares it with the decrypted signature. If they match, the document has not been altered.

Password Storage

Systems rarely store passwords directly. Instead, they store a salted hash of each password. When a user logs in, the system hashes the entered password with the same salt and compares the result. This means the system never needs to keep plaintext passwords. We will look at this when we discuss authentication.
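In practice, password storage uses a deliberately slow, salted construction rather than a bare hash. A minimal sketch with PBKDF2 from Python's standard library (the iteration count here is illustrative, not a recommendation):

```python
import hashlib
import hmac
import os
from typing import Optional

ITERATIONS = 600_000  # illustrative work factor; higher is slower for attackers

def hash_password(password: str, salt: Optional[bytes] = None) -> tuple:
    # A random salt ensures identical passwords produce different records.
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return salt, digest

def verify_password(password: str, salt: bytes, stored: bytes) -> bool:
    # Recompute with the stored salt and compare in constant time.
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return hmac.compare_digest(candidate, stored)
```

The system stores only the salt and the digest; the plaintext password never needs to be kept.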

Other Applications

Hashes appear in many protocols:

Message authentication codes (HMACs), which combine a hash with a secret key to authenticate messages.

Key derivation, where hashes stretch or transform keying material into usable keys.

Content-addressable storage, as in Git, where objects are named by their hash.

Merkle trees and blockchains, where hashes link blocks of data so that any change is detectable.

Commitment schemes, where publishing a hash commits to a value without revealing it.


Entropy and Predictability

A good hash function produces outputs that appear random. If the outputs are predictable or biased, attackers can exploit that structure. This is why the avalanche effect is so important. A tiny change in input, such as flipping one bit of a file, should completely change the hash value.
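Flipping a single input bit and counting how many output bits change makes the avalanche effect visible. A sketch (on average, about half of the 256 output bits flip):

```python
import hashlib

msg = bytearray(b"The quick brown fox jumps over the lazy dog")
h1 = hashlib.sha256(bytes(msg)).digest()

msg[0] ^= 0x01  # flip one bit of the first byte
h2 = hashlib.sha256(bytes(msg)).digest()

# Count differing bits between the two 256-bit digests.
diff = sum(bin(a ^ b).count("1") for a, b in zip(h1, h2))
print(f"{diff} of 256 bits changed")
```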

Entropy here refers to unpredictability. High-entropy outputs make it infeasible to guess or compress patterns in the data. If hashes leaked information about the input, they would not protect integrity.


Example: Hashing a File

Suppose we download an installation file and the publisher provides its SHA-256 hash:

3b645a1d4a5c1d5c8f2b915b26e20a... (64 hex characters)

After downloading, we run a hash program on the file. If the value matches, we can be confident the file is intact. If an attacker altered the file in transit, or if it was accidentally modified or only partially downloaded, the hash would not match.
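The check itself is a simple recompute-and-compare. A sketch, where `published_hex` stands in for the vendor's published value (`hmac.compare_digest` is a constant-time comparison; a public hash does not strictly require it, but it is a good habit):

```python
import hashlib
import hmac

def verify_download(data: bytes, published_hex: str) -> bool:
    # Recompute SHA-256 over the received bytes and compare with
    # the vendor's published digest.
    actual = hashlib.sha256(data).hexdigest()
    return hmac.compare_digest(actual, published_hex.lower())

good = hashlib.sha256(b"installer bytes").hexdigest()
print(verify_download(b"installer bytes", good))  # True
print(verify_download(b"tampered bytes", good))   # False
```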

However, if the attacker controls both the file and the hash published alongside it, this simple check does not help. That is why hashes are combined with cryptographic signatures in practice.


Limitations

Hashes provide a compact fingerprint but no protection against an adversary who can modify both the message and the hash. They are the building blocks of integrity, but must be combined with keys to resist attack. This leads us to the next topic: message authentication codes and digital signatures.


Next: Part 3: Integrity Mechanisms