pk.org: Computer Security/Lecture Notes

Public Key Cryptography and Integrity

Part 2 - Hash functions

Paul Krzyzanowski – 2025-09-19

Part 1: Public Key Cryptography
Part 2: Hash functions
Part 3: Integrity Mechanisms
Part 4: Diffie-Hellman
Part 5: Putting It All Together
Part 6: Quantum Attacks and Post-Quantum Cryptography


Hash Functions

Introduction

Public key algorithms gave us the ability to encrypt data and to sign values with a private key. However, they are inefficient if applied directly to large messages. Encrypting or signing a multi-megabyte file with RSA or ECC would be impractical. We need a way to reduce data to a compact form without losing the ability to detect changes. This is where hash functions come in.

A cryptographic hash function takes an input of arbitrary length and produces a fixed-size output, often called a digest. For example, SHA-256 maps any message into a 256-bit value. Even if the input is a book or a video file, the output will still be 256 bits. The hash acts as a fingerprint of the data.
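This fixed-size property is easy to see directly. A minimal sketch using Python's standard `hashlib` (the input strings are arbitrary examples):

```python
import hashlib

# Inputs of very different lengths all map to a 256-bit (32-byte) digest.
short = hashlib.sha256(b"hi").hexdigest()
long = hashlib.sha256(b"x" * 10_000_000).hexdigest()

print(len(short), len(long))  # both 64 hex characters = 256 bits
```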

Hashes are not new in computing. Operating systems and file systems use checksums and cyclic redundancy checks (CRCs) to detect errors in stored or transmitted data. Cryptographic hashes, however, require much stronger properties than simple error-detecting codes.

Properties of Cryptographic Hash Functions

A cryptographic hash function maps input data of any length into a fixed-size digest. For a function to be secure, it must provide several properties:

Pre-image resistance: given a digest \(h\), it must be infeasible to find any message \(m\) such that \(H(m) = h\). The function is one-way.

Second pre-image resistance: given a message \(m_1\), it must be infeasible to find a different message \(m_2\) such that \(H(m_1) = H(m_2)\).

Collision resistance: it must be infeasible to find any two distinct messages \(m_1\) and \(m_2\) such that \(H(m_1) = H(m_2)\).

Efficiency and determinism: the function must be fast to compute, and the same input must always produce the same digest.

These properties distinguish a cryptographic hash function from ordinary checksums. CRCs, for example, are excellent at detecting accidental bit errors but are linear and easy to manipulate. Cryptographic hashes must resist deliberate attack.

Collisions and Probability

The pigeonhole principle is a basic idea from mathematics: if you have more items than containers, at least one container must hold more than one item. For example, if you try to put 11 socks into 10 drawers, one drawer will have at least two socks.

In the context of hash functions, there are far more possible inputs than outputs. This means collisions (two inputs mapping to the same hash value) are inevitable. However, with a well-designed function such as SHA-256, the number of possible outputs is so enormous that collisions are astronomically unlikely to occur by chance.

The birthday paradox is a probability result that surprises many people. In a room of just 23 people, the chance that two share a birthday is already over 50%, even though there are 365 possible birthdays. The paradox arises because we are comparing many pairs at once.
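The 23-person figure can be checked directly by multiplying the probabilities that each successive person avoids all earlier birthdays. A quick sketch:

```python
def birthday_collision_probability(people: int, days: int = 365) -> float:
    # P(at least one shared birthday) = 1 - P(all birthdays distinct)
    p_distinct = 1.0
    for i in range(people):
        p_distinct *= (days - i) / days
    return 1.0 - p_distinct

print(birthday_collision_probability(23))  # ≈ 0.507, just over 50%
```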

For hash functions, this means that collisions can be found in roughly \(2^{n/2}\) attempts for an \(n\)-bit hash, not \(2^n\) as you might first think. For SHA-256, this still requires about \(2^{128}\) operations, which is infeasible with any foreseeable technology.
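The birthday bound can be watched in action by truncating SHA-256 to a small output. With 24 bits, \(2^{n/2} = 2^{12}\), so a collision should appear after only a few thousand attempts. A sketch (the truncation width is chosen purely for the demo; real SHA-256 is never truncated this way):

```python
import hashlib

def trunc_hash(data: bytes, nbytes: int = 3) -> bytes:
    # Truncate SHA-256 to 3 bytes (24 bits) so collisions are cheap to find.
    return hashlib.sha256(data).digest()[:nbytes]

seen = {}  # digest -> message that produced it
i = 0
while True:
    msg = i.to_bytes(8, "big")
    d = trunc_hash(msg)
    if d in seen:
        a, b = seen[d], msg  # two distinct inputs, same 24-bit hash
        break
    seen[d] = msg
    i += 1

print(f"collision after {i + 1} hashes")  # on the order of 2^12, not 2^24
```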

These ideas explain why collisions are guaranteed in theory but still practically unachievable in secure hash functions. They also highlight the importance of choosing functions with large output sizes.

Examples

Several hash functions have been widely used over time:

MD5: produces a 128-bit digest. Collisions can now be found in seconds, so it is unsuitable for any security purpose.

SHA-1: produces a 160-bit digest. A practical collision was demonstrated in 2017 (the SHAttered attack), and it has been deprecated for signatures and certificates.

SHA-2: a NIST-standardized family that includes SHA-256 and SHA-512. It remains the most widely deployed secure hash family.

SHA-3: standardized by NIST in 2015 and based on the Keccak algorithm. Its internal structure (a sponge construction) differs from SHA-2, making it a useful alternative.


Uses of Hash Functions

Integrity Checks

Hashes are used as fingerprints to detect accidental corruption. Software vendors often publish hash values alongside downloads. Users can recompute the hash of the file they receive and compare it against the published value.
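A file's digest can be computed incrementally, so even a multi-gigabyte download never needs to fit in memory. A minimal sketch using Python's `hashlib` (the path passed in would be the downloaded file):

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    # Feed the file to the hash in chunks; the resulting digest is
    # identical to hashing the entire file at once.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

The returned hex string is what gets compared against the vendor's published value.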

Digital Signatures

Public key algorithms operate on fixed-size values. Instead of signing an entire document, we sign its hash. The sender computes \(h = H(m)\) and encrypts \(h\) with their private key to create a signature. The recipient computes the hash of the received document and compares it with the decrypted signature. If they match, the document has not been altered.

Password Storage

Systems rarely store passwords directly. Instead, they store a salted hash of each password. When a user logs in, the system hashes the entered password with the same salt and compares the result. This means the system never needs to keep plaintext passwords. We will look at this when we discuss authentication.
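In practice, password storage uses a deliberately slow, salted construction rather than a bare hash. A minimal sketch with PBKDF2 from Python's standard library (the iteration count here is illustrative, not a recommendation):

```python
import hashlib
import hmac
import os
from typing import Optional

ITERATIONS = 600_000  # illustrative work factor; higher is slower for attackers

def hash_password(password: str, salt: Optional[bytes] = None) -> tuple:
    # A random salt ensures identical passwords produce different records.
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return salt, digest

def verify_password(password: str, salt: bytes, stored: bytes) -> bool:
    # Recompute with the stored salt and compare in constant time.
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return hmac.compare_digest(candidate, stored)
```

The system stores only the salt and the digest; the plaintext password never needs to be kept.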

Other Applications

Hashes appear in many protocols:

Message authentication codes (HMACs), which combine a hash with a secret key to authenticate messages.

Key derivation, where hashes stretch or transform keying material into usable keys.

Content-addressable storage, as in Git, where objects are named by their hash.

Merkle trees and blockchains, where hashes link blocks of data so that any change is detectable.

Commitment schemes, where publishing a hash commits to a value without revealing it.


Entropy and Predictability

A good hash function produces outputs that appear random. If the outputs are predictable or biased, attackers can exploit that structure. This is why the avalanche effect is so important. A tiny change in input, such as flipping one bit of a file, should completely change the hash value.
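Flipping a single input bit and counting how many output bits change makes the avalanche effect visible. A sketch (on average, about half of the 256 output bits flip):

```python
import hashlib

msg = bytearray(b"The quick brown fox jumps over the lazy dog")
h1 = hashlib.sha256(bytes(msg)).digest()

msg[0] ^= 0x01  # flip one bit of the first byte
h2 = hashlib.sha256(bytes(msg)).digest()

# Count differing bits between the two 256-bit digests.
diff = sum(bin(a ^ b).count("1") for a, b in zip(h1, h2))
print(f"{diff} of 256 bits changed")
```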

Entropy here refers to unpredictability. High-entropy outputs make it infeasible to guess or compress patterns in the data. If hashes leaked information about the input, they would not protect integrity.


Example: Hashing a File

Suppose we download an installation file and the publisher provides its SHA-256 hash:

3b645a1d4a5c1d5c8f2b915b26e20a... (64 hex characters)

After downloading, we run a hash program on the file. If the value matches, we can be confident the file is intact. If an attacker altered the file in transit, or if it was accidentally modified or only partially downloaded, the hash would not match.
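The check itself is a simple recompute-and-compare. A sketch, where `published_hex` stands in for the vendor's published value (`hmac.compare_digest` is a constant-time comparison; a public hash does not strictly require it, but it is a good habit):

```python
import hashlib
import hmac

def verify_download(data: bytes, published_hex: str) -> bool:
    # Recompute SHA-256 over the received bytes and compare with
    # the vendor's published digest.
    actual = hashlib.sha256(data).hexdigest()
    return hmac.compare_digest(actual, published_hex.lower())

good = hashlib.sha256(b"installer bytes").hexdigest()
print(verify_download(b"installer bytes", good))  # True
print(verify_download(b"tampered bytes", good))   # False
```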

However, if the attacker controls both the file and the hash published alongside it, this simple check does not help. That is why hashes are combined with cryptographic signatures in practice.


Limitations

Hashes provide a compact fingerprint but no protection against an adversary who can modify both the message and the hash. They are the building blocks of integrity, but must be combined with keys to resist attack. This leads us to the next topic: message authentication codes and digital signatures.


Next: Part 3: Integrity Mechanisms