CS Fundamentals

Hashing Explained: Hash Functions, Hash Tables, and Cryptographic Hashes

Hashing appears everywhere in software: in the dictionary and map types you use every day, in the password storage that protects user accounts, in the Git commit IDs you reference in pull requests. It is one concept with two very different families of application.

Published June 22, 2026

A hash function takes an input of arbitrary size and produces a fixed-size output, called a hash, digest, or hash value. The function is deterministic: the same input always produces the same output. Two different inputs that produce the same hash are called a collision. Beyond those shared properties, the requirements for a hash function diverge sharply depending on its purpose — which is why the hash function in your dictionary and the hash function protecting your passwords are completely different algorithms.

Hash tables: constant-time lookup through array indexing

The hash table (called a dict in Python, HashMap in Java, object or Map in JavaScript) is the data structure that provides O(1) average-case insertion, deletion, and lookup. That speed comes from converting the lookup key directly into an array index.

Internally, a hash table maintains a backing array of some capacity m. When you insert a key-value pair, the table computes hash(key) % m to get an index, and stores the pair there. When you look up a key, the same computation tells you exactly which slot to check. No search needed — it is a single arithmetic operation followed by a single array access.

The requirement for a good non-cryptographic hash function: it should distribute keys uniformly across the available slots. If all keys hash to the same slot, the hash table degenerates into a linked list and lookups become O(n). A good hash function makes that pathological case extremely unlikely for typical key distributions.

Collisions and how hash tables handle them

Because the output space (0 to m-1) is finite and the input space (all possible keys) is not, collisions are mathematically inevitable. A hash table with 100 slots and 100 keys will have collisions even with a perfect hash function, for the same reason that among 23 randomly chosen people, there is a better-than-even chance two share a birthday: the pigeonhole principle plus probability.

Two common collision resolution strategies:

Chaining: each slot holds a linked list (or a small dynamic array) of all key-value pairs that hashed to that slot. Insertion appends to the list; lookup traverses it comparing keys. When the load factor (number of entries divided by number of slots) is kept below about 0.75, the expected list length is short and lookups remain near-constant time.

Open addressing: all entries are stored directly in the backing array; there are no linked lists. When a collision occurs, the table probes subsequent slots (by linear probing, quadratic probing, or double hashing) until an empty slot is found. Lookup probes the same sequence until it finds the key or an empty slot. Open addressing has better cache performance than chaining because all data is in the contiguous backing array, but deletion becomes tricky (a deleted slot must be marked as "tombstone" rather than empty, or the probe sequence breaks for subsequent lookups).

Python's dict uses open addressing with a custom probing formula. Java's HashMap uses chaining, switching from a linked list to a red-black tree when a single bucket's chain exceeds 8 entries. Both maintain a load factor threshold and resize the backing array (rehashing all existing entries) when it is exceeded:

from sys import getsizeof

d = {}
for i in range(10):
    d[i] = i
    print(f'len={len(d)}, size={getsizeof(d)} bytes')

You will see the size jump at powers of two as Python resizes the backing array. Resizing is O(n) but happens infrequently enough that the amortized cost per insertion remains O(1).

Hash functions for data structures: properties and examples

For hash table use, speed is paramount. A hash function that takes a millisecond to compute defeats the purpose of O(1) lookup. The required properties are: deterministic output, good distribution (few collisions), and fast computation. Security properties such as collision resistance and preimage resistance are not needed.

Common non-cryptographic hash functions include MurmurHash3, xxHash, and FNV-1a. These process input in 4- or 8-byte chunks, applying multiply-and-shift operations optimized for CPU instruction pipelines. On modern hardware they can hash gigabytes per second. Python's built-in hash() function (for integers, strings, and tuples) uses SipHash-1-3 since Python 3.4, which is slightly slower than pure-speed hashes but provides a security property: it is keyed with a random seed at interpreter startup, which prevents hash flooding attacks where an adversary crafts inputs that all collide in a web server's request-parsing hash table, causing O(n) lookups and a denial-of-service.

Cryptographic hash functions: one-way and collision-resistant

A cryptographic hash function has additional hard requirements:

Preimage resistance: given a hash output h, it must be computationally infeasible to find any input x such that hash(x) = h.
Second preimage resistance: given an input x, it must be computationally infeasible to find a different input x' such that hash(x) = hash(x').
Collision resistance: it must be computationally infeasible to find any pair (x, x') where x ≠ x' but hash(x) = hash(x').

SHA-256 (from the SHA-2 family) and SHA-3 satisfy these properties. MD5 and SHA-1 do not — practical collision attacks exist for both, which is why they are no longer acceptable for security applications. Git, which uses SHA-1 for commit and object IDs, is in the process of migrating to SHA-256 for precisely this reason.

SHA-256 produces a 256-bit (32-byte) digest. In Python:

import hashlib

data = b'Hello, world!'
digest = hashlib.sha256(data).hexdigest()
print(digest)
# 315f5bdb76d078c43b8ac0064e4a0164612b1fce77c869345bfc94c75894edd3

Changing a single byte of the input produces a completely different digest — the avalanche effect. This makes SHA-256 suitable for data integrity checks (verifying that a downloaded file has not been tampered with), digital signatures, and constructing Merkle trees (used in Git, blockchains, and certificate transparency logs).

Password hashing: why bcrypt is not SHA-256

SHA-256 is too fast to store passwords. Modern GPUs can compute billions of SHA-256 hashes per second. An attacker who obtains a database of SHA-256-hashed passwords can try every password in a dictionary plus common mutations in minutes.

Password hashing requires a deliberately slow function. bcrypt, scrypt, and Argon2 are designed to be tunable: a cost parameter controls how much work the function does. By setting the cost high enough that each hash takes, say, 200 ms on current hardware, you make brute-force attacks 10,000× harder without meaningfully affecting login latency for legitimate users (who hash once per login).

These functions also incorporate a random per-password salt, stored alongside the hash. The salt ensures that two users with the same password produce different hashes, defeating precomputed rainbow tables. Argon2 (winner of the Password Hashing Competition in 2015) is the current recommended choice, with memory-hard computation that resists GPU and ASIC acceleration:

import argon2

ph = argon2.PasswordHasher()
hashed = ph.hash('correct-horse-battery-staple')
ph.verify(hashed, 'correct-horse-battery-staple')   # True
ph.verify(hashed, 'wrong-password')                 # raises VerifyMismatchError

Choosing the right hash for the job

The decision is straightforward once the use case is clear: for hash tables and caches, use the language's built-in hash (which is tuned for speed and distribution). For data integrity and content addressing (detecting accidental corruption, deduplication, Git objects), use SHA-256. For digital signatures and public-key infrastructure, use SHA-256 or SHA-3 as specified by the protocol. For password storage, use Argon2id with a memory parameter of at least 64 MB and a time parameter of at least 3 iterations. For HMAC-based message authentication (verifying that an API request was signed by someone who knows a secret key), use HMAC-SHA-256. Never use MD5 or SHA-1 for anything security-sensitive; their breakage is well-documented and the migration cost only grows the longer it is deferred.