Cryptographic Hashing — Fingerprinting Data
We’ve covered encryption — symmetric and asymmetric. Encryption is about keeping data secret. But there’s a different problem that comes up just as often: how do you prove that data hasn’t been tampered with?
If you download a Linux ISO, how do you know nobody modified it during transit? If a server stores your password, how does it verify your login without storing the actual password? If someone signs a document, how do you know the document hasn’t been altered since?
The answer to all of these is the same: cryptographic hash functions.
A hash function takes an input of any size and produces a fixed-size output (the hash, or digest). The same input always produces the same output. But — and this is the critical part — you can’t reverse the process. You can’t recover the input from the hash.
Think of it as a fingerprint. A fingerprint uniquely identifies a person, but you can’t reconstruct the person from the fingerprint.
Properties of Cryptographic Hash Functions
Not every hash function is a cryptographic hash function. CRC32 is a hash function, but it’s not cryptographic. A cryptographic hash must satisfy several properties:
1. Deterministic
The same input always produces the same output. No randomness.
SHA-256("hello") = 2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824
SHA-256("hello") = 2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824 (always)
2. Fixed Output Size
Regardless of input size, the output is always the same length.
SHA-256("a") = ca978112ca1bbdcafac231b39a23dc4da786eff8147c4e72b9807785afee48bb (64 hex chars)
SHA-256("War and Peace full text") = 6d7f9e2a... (still 64 hex chars)
SHA-256("") = e3b0c44298fc1c149afbf4c8996fb924... (still 64 hex chars)
SHA-256 always outputs 256 bits (32 bytes, 64 hex characters), whether you hash a single byte or a terabyte.
3. Pre-image Resistance (One-Way)
Given a hash h, it should be computationally infeasible to find any input m such that hash(m) = h.
You can go forward (input to hash) but not backward (hash to input). This is what makes password hashing work — even if the hash is stolen, the attacker can’t reverse it to get the password.
4. Second Pre-image Resistance
Given an input m1, it should be infeasible to find a different input m2 such that hash(m1) = hash(m2).
If someone gives you a document and its hash, you shouldn’t be able to create a different document with the same hash. This protects integrity.
5. Collision Resistance
It should be infeasible to find any two different inputs m1 and m2 such that hash(m1) = hash(m2).
This is stronger than second pre-image resistance. Here, the attacker gets to choose both inputs.
6. Avalanche Effect
A tiny change in the input should produce a completely different hash — roughly half the output bits should flip.
SHA-256("hello") = 2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824
SHA-256("hello.") = 5b41362b...c4ce39b7... (completely different)
SHA-256("Hello") = 185f8db32271fe25f561a6fc938b2e26... (completely different)
Changing a single character — or even a single bit — produces an entirely unpredictable new hash.
Summary Table
| Property | Meaning | Broken If… |
|---|---|---|
| Pre-image resistance | Can’t find input from hash | Attacker reverses a hash |
| Second pre-image resistance | Can’t find a different input with the same hash | Attacker forges a document matching a given hash |
| Collision resistance | Can’t find any two inputs with the same hash | Attacker creates two documents with the same hash |
The Birthday Paradox and Collision Bounds
How hard is it to find a collision? Intuitively, if the hash output is n bits, there are 2^n possible hash values. You might think you’d need to try 2^n inputs to find a collision. But the birthday paradox says otherwise.
The birthday paradox: in a room of 23 people, there’s a ~50% chance that two share the same birthday. That’s surprisingly low — with 365 possible birthdays, you’d naively expect to need about 183 people.
The math generalizes: to find a collision in an n-bit hash, you only need about 2^(n/2) attempts, not 2^n.
| Hash | Output Size | Brute-force Pre-image | Collision (Birthday) |
|---|---|---|---|
| MD5 | 128 bits | 2^128 | 2^64 |
| SHA-1 | 160 bits | 2^160 | 2^80 |
| SHA-256 | 256 bits | 2^256 | 2^128 |
| SHA-512 | 512 bits | 2^512 | 2^256 |
This is why SHA-256 provides “128 bits of collision resistance” — the birthday bound halves the effective security.
A Tour of Hash Functions
MD5 — Message Digest 5 (Broken)
Designed by Ronald Rivest in 1991. Output: 128-bit hash.
$ echo -n "hello" | md5sum
5d41402abc4b2a76b9719d911017c592
Status: Broken. Collisions were demonstrated in 2004 by Xiaoyun Wang. By 2008, researchers created rogue CA certificates using MD5 collisions. In 2012, the Flame malware used an MD5 collision to forge a Microsoft code-signing certificate.
Do not use MD5 for anything security-related. It’s still used for non-security checksums (file deduplication, cache keys), but never for integrity or authentication.
SHA-1 — Secure Hash Algorithm 1 (Broken)
Designed by the NSA, published in 1995. Output: 160-bit hash.
$ echo -n "hello" | sha1sum
aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d
Status: Broken. Theoretical weaknesses were known since 2005. In 2017, Google and CWI Amsterdam demonstrated the first practical SHA-1 collision in the SHAttered attack — producing two different PDF files with the same SHA-1 hash. It required ~6,500 GPU-years of computation (about $110,000 on cloud compute at the time).
SHA-1 has been deprecated by browsers (since 2017) and certificate authorities. Git still uses SHA-1 for commit hashes, though it’s migrating to SHA-256.
SHA-2 Family (Current Standard)
Designed by the NSA, published in 2001. The SHA-2 family includes several variants:
| Variant | Output Size | Internal State | Block Size |
|---|---|---|---|
| SHA-224 | 224 bits | 256 bits | 512 bits |
| SHA-256 | 256 bits | 256 bits | 512 bits |
| SHA-384 | 384 bits | 512 bits | 1024 bits |
| SHA-512 | 512 bits | 512 bits | 1024 bits |
SHA-256 is the most widely used. It’s used in TLS certificates, Bitcoin mining, digital signatures, HMAC, password hashing (as a building block), and countless other applications.
$ echo -n "hello" | sha256sum
2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824
$ echo -n "hello" | sha512sum
9b71d224bd62f3785d96d46ad3ea3d73319bfbc2890caadae2dff72519673ca72323c3d99ba5c11d7c7acc6e14b8c5da0c4663475c2e5c3adef46f73bcdec043
Status: No practical weaknesses found. SHA-2 remains the workhorse of modern cryptography.
SHA-3 (Keccak)
SHA-3 was standardized in 2015 as the result of a NIST competition (similar to how AES replaced DES). The winner was Keccak, designed by Guido Bertoni, Joan Daemen (co-designer of AES), Michael Peeters, and Gilles Van Assche.
SHA-3 is not a replacement for SHA-2 — SHA-2 is still perfectly secure. SHA-3 exists as an alternative with a completely different internal structure (a sponge construction vs the Merkle-Damgard construction used by MD5/SHA-1/SHA-2). If SHA-2 were ever broken due to a structural flaw, SHA-3 wouldn’t be affected.
$ echo -n "hello" | openssl dgst -sha3-256
3338be694f50c5f338814986cdf0686453a888b84f424d792af4b9202398f392
SHA-3 also introduced SHAKE — extendable-output hash functions where you can request any output length.
BLAKE2 and BLAKE3
BLAKE2 (2012) and BLAKE3 (2020) are modern hash functions that are significantly faster than SHA-2 while maintaining equivalent security. BLAKE3 is parallelizable and can hash data at memory bandwidth speeds.
Speed comparison (cycles per byte):
SHA-256: ~15 cpb
SHA-512: ~10 cpb
BLAKE2b: ~3.5 cpb
BLAKE3: ~0.5 cpb (with SIMD)
BLAKE2 is used in Argon2 (password hashing), WireGuard (VPN), and many other modern systems.
How SHA-256 Works Internally
Let’s look inside SHA-256 to understand what a hash function actually does with your data.
The Merkle-Damgard Construction
SHA-256 (and MD5, SHA-1) uses the Merkle-Damgard construction:
Message → [Padding] → [Split into 512-bit blocks]
|
┌────────────────────────┘
v
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Block 1 │ │ Block 2 │ │ Block 3 │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
┌────▼────┐ ┌────▼────┐ ┌────▼────┐
│Compress │ │Compress │ │Compress │
│ Function│ │ Function│ │ Function│
│ (f) │ │ (f) │ │ (f) │
└────┬────┘ └────┬────┘ └────┬────┘
IV ──────┘ h1 ────────┘ h2 ────────┘
│
v
Final Hash (256 bits)
- Pad the message to a multiple of 512 bits
- Split into 512-bit blocks
- Process each block through a compression function, chaining the output of one block into the input of the next
- The initial value (IV) is a set of fixed constants (the first 32 bits of the fractional parts of the square roots of the first 8 primes)
Padding
The message is padded to a multiple of 512 bits:
- Append a single
1bit - Append zeros until the message is 448 mod 512 bits
- Append the original message length as a 64-bit integer
This ensures every message has a unique padded form.
The Compression Function
The compression function is the core of SHA-256. It takes:
- A 256-bit chaining value (8 × 32-bit words:
a, b, c, d, e, f, g, h) - A 512-bit message block
And produces a new 256-bit chaining value through 64 rounds.
Each round:
# SHA-256 round function (one of 64 rounds)
S1 = right_rotate(e, 6) XOR right_rotate(e, 11) XOR right_rotate(e, 25)
ch = (e AND f) XOR ((NOT e) AND g)
temp1 = h + S1 + ch + K[i] + W[i]
S0 = right_rotate(a, 2) XOR right_rotate(a, 13) XOR right_rotate(a, 22)
maj = (a AND b) XOR (a AND c) XOR (b AND c)
temp2 = S0 + maj
h = g
g = f
f = e
e = d + temp1
d = c
c = b
b = a
a = temp1 + temp2
Where:
K[i]— Round constants (the first 32 bits of the fractional parts of the cube roots of the first 64 primes)W[i]— Message schedule words (derived from the 512-bit block through expansion)- All additions are modulo 2^32
After all 64 rounds, the resulting a,b,c,d,e,f,g,h are added to the input chaining value (modulo 2^32) to produce the output.
The key operations are:
- Bitwise rotations and shifts — Provide diffusion
- Ch (Choice) and Maj (Majority) — Nonlinear functions that mix the state
- Round constants K[i] — Prevent symmetry and related patterns
- Modular addition — Mixes bits across word boundaries
Where Hashes Are Used
1. Password Storage
We covered this in depth in the Authentication Deep Dive. Never store passwords in plaintext. Store hash(salt + password) instead. Use bcrypt, scrypt, or Argon2 — not raw SHA-256.
2. File Integrity
Verify that a downloaded file hasn’t been corrupted or tampered with:
$ sha256sum ubuntu-24.04-desktop-amd64.iso
e3b0c44298fc1c149afbf4c8996fb924... ubuntu-24.04-desktop-amd64.iso
# Compare with the hash on the website
3. Digital Signatures
You don’t sign the entire document — you sign the hash. This is both faster (the hash is small and fixed-size) and necessary for the math of RSA signatures and ECDSA.
Sign: signature = RSA_Encrypt(SHA-256(document), private_key)
Verify: SHA-256(document) == RSA_Decrypt(signature, public_key)?
4. HMAC — Hash-based Message Authentication Code
HMAC combines a hash with a secret key to produce an authentication tag:
HMAC(K, m) = hash((K XOR opad) || hash((K XOR ipad) || m))
Where opad and ipad are fixed padding constants. HMAC provides both integrity and authentication — only someone with the key can produce a valid tag.
HMAC-SHA256 is used extensively in APIs, JWT signatures (HS256), and message authentication.
5. Key Derivation (HKDF)
Derive cryptographic keys from a shared secret or password:
HKDF-SHA256(input_key_material, salt, info) → derived_key
Used in TLS 1.3 to derive session keys from the Diffie-Hellman shared secret.
6. Blockchain / Proof of Work
Bitcoin mining is essentially a brute-force search for a value where:
SHA-256(SHA-256(block_header + nonce)) < target
Miners try billions of nonce values per second until they find one that produces a hash below the target difficulty. The one-way property of SHA-256 means there’s no shortcut — you just have to keep guessing.
7. Git
Every Git commit, tree, and blob is identified by its SHA-1 hash:
$ git log --oneline
f991604 Add two new posts on Identity Providers and IAM Security
310c108 Add tutorial on reversing loops with GDB
Those hex strings are the first 7 characters of the SHA-1 hash of each commit object. Git is migrating to SHA-256.
8. Merkle Trees
A Merkle tree hashes pairs of data blocks, then hashes the hashes, building a tree where the root hash represents the integrity of the entire dataset. Used in Git, blockchains, certificate transparency logs, and distributed filesystems.
Root Hash
/ \
Hash(AB) Hash(CD)
/ \ / \
Hash(A) Hash(B) Hash(C) Hash(D)
| | | |
Data A Data B Data C Data D
Changing a single byte in Data C changes Hash(C), which changes Hash(CD), which changes the Root Hash. You can verify the integrity of any single block by checking its path to the root.
Real Attacks — When Hashing Goes Wrong
MD5 Collision — Rogue CA Certificate (2008)
Researchers created a rogue Certificate Authority certificate with the same MD5 hash as a legitimate certificate. They could then issue fake SSL certificates for any website, trusted by all browsers. This attack was demonstrated at 25C3 and led to the deprecation of MD5 in certificates.
SHAttered — SHA-1 Collision (2017)
Google and CWI Amsterdam produced two different PDF files with the same SHA-1 hash. The attack required about 9 × 10^18 SHA-1 computations — equivalent to ~6,500 GPU-years. The PDFs had different visual content but identical SHA-1 hashes.
This was a chosen-prefix collision — the most useful type for an attacker, because they can control the content of both files.
Length Extension Attacks
Hash functions based on the Merkle-Damgard construction (MD5, SHA-1, SHA-256) are vulnerable to length extension attacks. If you know hash(message) but not message, you can compute hash(message || padding || extension) without knowing the original message.
This breaks naive MAC constructions like hash(secret || message). An attacker can extend the message and compute a valid hash.
The fix: use HMAC (which processes the key twice) or use SHA-3/BLAKE2 (which aren’t vulnerable to length extension).
Timing Side Channels
Comparing hashes byte-by-byte with early exit (if hash[0] != expected[0]: return false) leaks information through timing. An attacker can determine how many bytes match by measuring response time.
The fix: always use constant-time comparison:
import hmac
# WRONG — timing leak
if computed_hash == expected_hash:
return True
# RIGHT — constant-time comparison
if hmac.compare_digest(computed_hash, expected_hash):
return True
Quick Reference — Which Hash to Use
| Use Case | Recommended | Avoid |
|---|---|---|
| General integrity | SHA-256 | MD5, SHA-1 |
| Password storage | Argon2, bcrypt, scrypt | SHA-256 (too fast), MD5 |
| HMAC (message auth) | HMAC-SHA256 | HMAC-MD5 |
| Digital signatures | SHA-256, SHA-384, SHA-512 | SHA-1, MD5 |
| Key derivation | HKDF-SHA256, Argon2 | Raw SHA-256 |
| Performance-critical | BLAKE3 | SHA-256 (if speed matters) |
| Post-quantum safety | SHA-256+ (hashes resist quantum) | N/A |
One important note: hash functions are quantum-resistant. Grover’s algorithm reduces pre-image resistance from 2^n to 2^(n/2), but SHA-256’s 2^128 post-Grover security is still more than sufficient. Unlike RSA and ECC, hash functions don’t need to be replaced for quantum computing.
Final Thoughts
Hash functions are the unsung heroes of cryptography. They don’t get the attention that encryption algorithms get, but they’re everywhere — in every TLS handshake, every Git commit, every blockchain block, every password database, every digital signature.
What makes them powerful is their simplicity. A hash function does one thing: take any input, produce a fixed-size fingerprint, and make it impossible to reverse. From that single primitive, we build password storage, file integrity, message authentication, digital signatures, proof-of-work, and content-addressed storage.
The evolution from MD5 (broken) to SHA-1 (broken) to SHA-2 (secure) to SHA-3 (insurance policy) shows that hash functions, like all cryptographic primitives, are not permanent. What’s secure today may not be secure tomorrow. But understanding how they work — the Merkle-Damgard construction, the compression function, the birthday bound — gives you the foundation to evaluate new designs as they emerge.
Thanks for reading!