Thilan Dissanayaka Cryptography April 15, 2020

Cryptographic Hashing — Fingerprinting Data

We’ve covered encryption — symmetric and asymmetric. Encryption is about keeping data secret. But there’s a different problem that comes up just as often: how do you prove that data hasn’t been tampered with?

If you download a Linux ISO, how do you know nobody modified it during transit? If a server stores your password, how does it verify your login without storing the actual password? If someone signs a document, how do you know the document hasn’t been altered since?

The answer to all of these is the same: cryptographic hash functions.

A hash function takes an input of any size and produces a fixed-size output (the hash, or digest). The same input always produces the same output. But — and this is the critical part — you can’t reverse the process. You can’t recover the input from the hash.

Think of it as a fingerprint. A fingerprint uniquely identifies a person, but you can’t reconstruct the person from the fingerprint.

Properties of Cryptographic Hash Functions

Not every hash function is a cryptographic hash function. CRC32 is a hash function, but it’s not cryptographic. A cryptographic hash must satisfy several properties:

1. Deterministic

The same input always produces the same output. No randomness.

SHA-256("hello") = 2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824
SHA-256("hello") = 2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824  (always)

2. Fixed Output Size

Regardless of input size, the output is always the same length.

SHA-256("a")                    = ca978112ca1bbdcafac231b39a23dc4da786eff8147c4e72b9807785afee48bb  (64 hex chars)
SHA-256("War and Peace full text") = 6d7f9e2a...  (still 64 hex chars)
SHA-256("")                     = e3b0c44298fc1c149afbf4c8996fb924...  (still 64 hex chars)

SHA-256 always outputs 256 bits (32 bytes, 64 hex characters), whether you hash a single byte or a terabyte.

3. Pre-image Resistance (One-Way)

Given a hash h, it should be computationally infeasible to find any input m such that hash(m) = h.

You can go forward (input to hash) but not backward (hash to input). This is what makes password hashing work — even if the hash is stolen, the attacker can’t reverse it to get the password.

4. Second Pre-image Resistance

Given an input m1, it should be infeasible to find a different input m2 such that hash(m1) = hash(m2).

If someone gives you a document and its hash, you shouldn’t be able to create a different document with the same hash. This protects integrity.

5. Collision Resistance

It should be infeasible to find any two different inputs m1 and m2 such that hash(m1) = hash(m2).

This is stronger than second pre-image resistance. Here, the attacker gets to choose both inputs.

6. Avalanche Effect

A tiny change in the input should produce a completely different hash — roughly half the output bits should flip.

SHA-256("hello")  = 2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824
SHA-256("hello.") = 5b41362b...c4ce39b7...  (completely different)
SHA-256("Hello")  = 185f8db32271fe25f561a6fc938b2e26...  (completely different)

Changing a single character — or even a single bit — produces an entirely unpredictable new hash.

Summary Table

Property	Meaning	Broken If…
Pre-image resistance	Can’t find input from hash	Attacker reverses a hash
Second pre-image resistance	Can’t find a different input with the same hash	Attacker forges a document matching a given hash
Collision resistance	Can’t find any two inputs with the same hash	Attacker creates two documents with the same hash

The Birthday Paradox and Collision Bounds

How hard is it to find a collision? Intuitively, if the hash output is n bits, there are 2^n possible hash values. You might think you’d need to try 2^n inputs to find a collision. But the birthday paradox says otherwise.

The birthday paradox: in a room of 23 people, there’s a ~50% chance that two share the same birthday. That’s surprisingly low — with 365 possible birthdays, you’d naively expect to need about 183 people.

The math generalizes: to find a collision in an n-bit hash, you only need about 2^(n/2) attempts, not 2^n.

Hash	Output Size	Brute-force Pre-image	Collision (Birthday)
MD5	128 bits	2^128	2^64
SHA-1	160 bits	2^160	2^80
SHA-256	256 bits	2^256	2^128
SHA-512	512 bits	2^512	2^256

This is why SHA-256 provides “128 bits of collision resistance” — the birthday bound halves the effective security.

A Tour of Hash Functions

MD5 — Message Digest 5 (Broken)

$ echo -n "hello" | md5sum
5d41402abc4b2a76b9719d911017c592

Status: Broken. Collisions were demonstrated in 2004 by Xiaoyun Wang. By 2008, researchers created rogue CA certificates using MD5 collisions. In 2012, the Flame malware used an MD5 collision to forge a Microsoft code-signing certificate.

Do not use MD5 for anything security-related. It’s still used for non-security checksums (file deduplication, cache keys), but never for integrity or authentication.

SHA-1 — Secure Hash Algorithm 1 (Broken)

$ echo -n "hello" | sha1sum
aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d

Status: Broken. Theoretical weaknesses were known since 2005. In 2017, Google and CWI Amsterdam demonstrated the first practical SHA-1 collision in the SHAttered attack — producing two different PDF files with the same SHA-1 hash. It required ~6,500 GPU-years of computation (about $110,000 on cloud compute at the time).

SHA-1 has been deprecated by browsers (since 2017) and certificate authorities. Git still uses SHA-1 for commit hashes, though it’s migrating to SHA-256.

SHA-2 Family (Current Standard)

Variant	Output Size	Internal State	Block Size
SHA-224	224 bits	256 bits	512 bits
SHA-256	256 bits	256 bits	512 bits
SHA-384	384 bits	512 bits	1024 bits
SHA-512	512 bits	512 bits	1024 bits

SHA-256 is the most widely used. It’s used in TLS certificates, Bitcoin mining, digital signatures, HMAC, password hashing (as a building block), and countless other applications.

$ echo -n "hello" | sha256sum
2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824

$ echo -n "hello" | sha512sum
9b71d224bd62f3785d96d46ad3ea3d73319bfbc2890caadae2dff72519673ca72323c3d99ba5c11d7c7acc6e14b8c5da0c4663475c2e5c3adef46f73bcdec043

Status: No practical weaknesses found. SHA-2 remains the workhorse of modern cryptography.

SHA-3 (Keccak)

SHA-3 was standardized in 2015 as the result of a NIST competition (similar to how AES replaced DES). The winner was Keccak, designed by Guido Bertoni, Joan Daemen (co-designer of AES), Michael Peeters, and Gilles Van Assche.

SHA-3 is not a replacement for SHA-2 — SHA-2 is still perfectly secure. SHA-3 exists as an alternative with a completely different internal structure (a sponge construction vs the Merkle-Damgard construction used by MD5/SHA-1/SHA-2). If SHA-2 were ever broken due to a structural flaw, SHA-3 wouldn’t be affected.

$ echo -n "hello" | openssl dgst -sha3-256
3338be694f50c5f338814986cdf0686453a888b84f424d792af4b9202398f392

SHA-3 also introduced SHAKE — extendable-output hash functions where you can request any output length.

BLAKE2 and BLAKE3

BLAKE2 (2012) and BLAKE3 (2020) are modern hash functions that are significantly faster than SHA-2 while maintaining equivalent security. BLAKE3 is parallelizable and can hash data at memory bandwidth speeds.

Speed comparison (cycles per byte):
SHA-256:   ~15 cpb
SHA-512:   ~10 cpb
BLAKE2b:   ~3.5 cpb
BLAKE3:    ~0.5 cpb (with SIMD)

BLAKE2 is used in Argon2 (password hashing), WireGuard (VPN), and many other modern systems.

How SHA-256 Works Internally

Let’s look inside SHA-256 to understand what a hash function actually does with your data.

The Merkle-Damgard Construction

SHA-256 (and MD5, SHA-1) uses the Merkle-Damgard construction:

Message → [Padding] → [Split into 512-bit blocks]
                                  |
         ┌────────────────────────┘
         v
    ┌─────────┐     ┌─────────┐     ┌─────────┐
    │  Block 1 │     │ Block 2  │     │ Block 3  │
    └────┬────┘     └────┬────┘     └────┬────┘
         │               │               │
    ┌────▼────┐     ┌────▼────┐     ┌────▼────┐
    │Compress │     │Compress │     │Compress │
    │ Function│     │ Function│     │ Function│
    │   (f)   │     │   (f)   │     │   (f)   │
    └────┬────┘     └────┬────┘     └────┬────┘
IV ──────┘    h1 ────────┘    h2 ────────┘
                                         │
                                         v
                                    Final Hash (256 bits)

Pad the message to a multiple of 512 bits
Split into 512-bit blocks
Process each block through a compression function, chaining the output of one block into the input of the next
The initial value (IV) is a set of fixed constants (the first 32 bits of the fractional parts of the square roots of the first 8 primes)

Padding

The message is padded to a multiple of 512 bits:

Append a single 1 bit
Append zeros until the message is 448 mod 512 bits
Append the original message length as a 64-bit integer

This ensures every message has a unique padded form.

The Compression Function

The compression function is the core of SHA-256. It takes:

A 256-bit chaining value (8 × 32-bit words: a, b, c, d, e, f, g, h)
A 512-bit message block

And produces a new 256-bit chaining value through 64 rounds.

Each round:

# SHA-256 round function (one of 64 rounds)
S1 = right_rotate(e, 6) XOR right_rotate(e, 11) XOR right_rotate(e, 25)
ch = (e AND f) XOR ((NOT e) AND g)
temp1 = h + S1 + ch + K[i] + W[i]

S0 = right_rotate(a, 2) XOR right_rotate(a, 13) XOR right_rotate(a, 22)
maj = (a AND b) XOR (a AND c) XOR (b AND c)
temp2 = S0 + maj

h = g
g = f
f = e
e = d + temp1
d = c
c = b
b = a
a = temp1 + temp2

Where:

K[i] — Round constants (the first 32 bits of the fractional parts of the cube roots of the first 64 primes)
W[i] — Message schedule words (derived from the 512-bit block through expansion)
All additions are modulo 2^32

After all 64 rounds, the resulting a,b,c,d,e,f,g,h are added to the input chaining value (modulo 2^32) to produce the output.

The key operations are:

Bitwise rotations and shifts — Provide diffusion
Ch (Choice) and Maj (Majority) — Nonlinear functions that mix the state
Round constants K[i] — Prevent symmetry and related patterns
Modular addition — Mixes bits across word boundaries

Where Hashes Are Used

1. Password Storage

We covered this in depth in the Authentication Deep Dive. Never store passwords in plaintext. Store hash(salt + password) instead. Use bcrypt, scrypt, or Argon2 — not raw SHA-256.

2. File Integrity

Verify that a downloaded file hasn’t been corrupted or tampered with:

$ sha256sum ubuntu-24.04-desktop-amd64.iso
e3b0c44298fc1c149afbf4c8996fb924...  ubuntu-24.04-desktop-amd64.iso

# Compare with the hash on the website

3. Digital Signatures

You don’t sign the entire document — you sign the hash. This is both faster (the hash is small and fixed-size) and necessary for the math of RSA signatures and ECDSA.

Sign:    signature = RSA_Encrypt(SHA-256(document), private_key)
Verify:  SHA-256(document) == RSA_Decrypt(signature, public_key)?

4. HMAC — Hash-based Message Authentication Code

HMAC combines a hash with a secret key to produce an authentication tag:

HMAC(K, m) = hash((K XOR opad) || hash((K XOR ipad) || m))

Where opad and ipad are fixed padding constants. HMAC provides both integrity and authentication — only someone with the key can produce a valid tag.

HMAC-SHA256 is used extensively in APIs, JWT signatures (HS256), and message authentication.

5. Key Derivation (HKDF)

Derive cryptographic keys from a shared secret or password:

HKDF-SHA256(input_key_material, salt, info) → derived_key

Used in TLS 1.3 to derive session keys from the Diffie-Hellman shared secret.

6. Blockchain / Proof of Work

Bitcoin mining is essentially a brute-force search for a value where:

SHA-256(SHA-256(block_header + nonce)) < target

Miners try billions of nonce values per second until they find one that produces a hash below the target difficulty. The one-way property of SHA-256 means there’s no shortcut — you just have to keep guessing.

7. Git

Every Git commit, tree, and blob is identified by its SHA-1 hash:

$ git log --oneline
f991604 Add two new posts on Identity Providers and IAM Security
310c108 Add tutorial on reversing loops with GDB

Those hex strings are the first 7 characters of the SHA-1 hash of each commit object. Git is migrating to SHA-256.

8. Merkle Trees

A Merkle tree hashes pairs of data blocks, then hashes the hashes, building a tree where the root hash represents the integrity of the entire dataset. Used in Git, blockchains, certificate transparency logs, and distributed filesystems.

         Root Hash
         /       \
    Hash(AB)    Hash(CD)
    /    \      /    \
 Hash(A) Hash(B) Hash(C) Hash(D)
    |       |       |       |
   Data A  Data B  Data C  Data D

Changing a single byte in Data C changes Hash(C), which changes Hash(CD), which changes the Root Hash. You can verify the integrity of any single block by checking its path to the root.

Real Attacks — When Hashing Goes Wrong

MD5 Collision — Rogue CA Certificate (2008)

Researchers created a rogue Certificate Authority certificate with the same MD5 hash as a legitimate certificate. They could then issue fake SSL certificates for any website, trusted by all browsers. This attack was demonstrated at 25C3 and led to the deprecation of MD5 in certificates.

SHAttered — SHA-1 Collision (2017)

Google and CWI Amsterdam produced two different PDF files with the same SHA-1 hash. The attack required about 9 × 10^18 SHA-1 computations — equivalent to ~6,500 GPU-years. The PDFs had different visual content but identical SHA-1 hashes.

This was a chosen-prefix collision — the most useful type for an attacker, because they can control the content of both files.

Length Extension Attacks

Hash functions based on the Merkle-Damgard construction (MD5, SHA-1, SHA-256) are vulnerable to length extension attacks. If you know hash(message) but not message, you can compute hash(message || padding || extension) without knowing the original message.

This breaks naive MAC constructions like hash(secret || message). An attacker can extend the message and compute a valid hash.

The fix: use HMAC (which processes the key twice) or use SHA-3/BLAKE2 (which aren’t vulnerable to length extension).

Timing Side Channels

Comparing hashes byte-by-byte with early exit (if hash[0] != expected[0]: return false) leaks information through timing. An attacker can determine how many bytes match by measuring response time.

The fix: always use constant-time comparison:

import hmac

# WRONG — timing leak
if computed_hash == expected_hash:
    return True

# RIGHT — constant-time comparison
if hmac.compare_digest(computed_hash, expected_hash):
    return True

Quick Reference — Which Hash to Use

Use Case	Recommended	Avoid
General integrity	SHA-256	MD5, SHA-1
Password storage	Argon2, bcrypt, scrypt	SHA-256 (too fast), MD5
HMAC (message auth)	HMAC-SHA256	HMAC-MD5
Digital signatures	SHA-256, SHA-384, SHA-512	SHA-1, MD5
Key derivation	HKDF-SHA256, Argon2	Raw SHA-256
Performance-critical	BLAKE3	SHA-256 (if speed matters)
Post-quantum safety	SHA-256+ (hashes resist quantum)	N/A

One important note: hash functions are quantum-resistant. Grover’s algorithm reduces pre-image resistance from 2^n to 2^(n/2), but SHA-256’s 2^128 post-Grover security is still more than sufficient. Unlike RSA and ECC, hash functions don’t need to be replaced for quantum computing.

Final Thoughts

Hash functions are the unsung heroes of cryptography. They don’t get the attention that encryption algorithms get, but they’re everywhere — in every TLS handshake, every Git commit, every blockchain block, every password database, every digital signature.

What makes them powerful is their simplicity. A hash function does one thing: take any input, produce a fixed-size fingerprint, and make it impossible to reverse. From that single primitive, we build password storage, file integrity, message authentication, digital signatures, proof-of-work, and content-addressed storage.

The evolution from MD5 (broken) to SHA-1 (broken) to SHA-2 (secure) to SHA-3 (insurance policy) shows that hash functions, like all cryptographic primitives, are not permanent. What’s secure today may not be secure tomorrow. But understanding how they work — the Merkle-Damgard construction, the compression function, the birthday bound — gives you the foundation to evaluate new designs as they emerge.

Thanks for reading!