WierdX: Programming Reference

How Git Stores Data: Objects, Trees, and Commits Under the Hood

Git is usually described as a distributed version control system, but the Pro Git book describes it more precisely as a content-addressable filesystem with a VCS user interface written on top. Understanding that filesystem is what makes branching, rebasing, and the reflog stop feeling like magic.

Published June 22, 2026

Most developers interact with Git through a handful of commands — commit, push, rebase, merge — without a clear picture of what changes on disk. That gap between command and effect produces confusion about detached HEAD state, fear of rebase, and mysterious failures when force-pushes go wrong. The object model resolves most of that confusion.

The object store: four types of objects, one addressing scheme

Every object Git stores lives in .git/objects/. Each object is identified by the SHA-1 hash of a short header (the object type and size) followed by its contents. Git can also use SHA-256 as an alternative hash, and the principle is identical. Because the address is derived from the content, two identical files anywhere in the repository's history share a single object, so Git deduplicates automatically. This is content-addressed storage.
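
To make the addressing scheme concrete, here is a minimal Python sketch of how a blob's ID can be computed without Git installed. The function names git_object_id and blob_id are illustrative, not part of any Git API; the sketch assumes the default SHA-1 object format.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """Return the SHA-1 ID for an object of the given type and body."""
    # Git hashes a small header ("<type> <size>" plus a NUL byte)
    # followed by the raw body, not the body alone.
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def blob_id(content: bytes) -> str:
    """ID of a blob: depends only on the bytes, never on a filename."""
    return git_object_id("blob", content)


if __name__ == "__main__":
    print(blob_id(b""))                 # ID of the empty blob
    print(blob_id(b"test content\n"))   # same bytes always give the same ID
```

The printed IDs should match what git hash-object --stdin reports for the same bytes, which is one way to check the sketch against a real installation.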

There are four object types:

Blob: the raw contents of a file, with no filename or metadata. The filename is stored in a tree object (see below), not in the blob. Two files with identical content but different names are stored as one blob object referenced twice.

Tree: a directory listing. Each entry in a tree object has a mode (file permissions, roughly), a name, and a SHA-1 reference to either a blob (for files) or another tree (for subdirectories). A tree is a snapshot of one directory's contents at a point in time.

Commit: a snapshot of the entire project tree as staged at commit time, with metadata. A commit object contains a reference to the top-level tree (representing the full directory structure), references to zero or more parent commits, the author name, email, and timestamp, the committer name, email, and timestamp (which differ from the author's when a patch is applied by someone else), and the commit message.

Tag: an annotated tag object wraps another object (usually a commit) with a tag name, tagger identity, date, and message. Lightweight tags are just a reference to a commit; they do not create a tag object.

You can inspect any object directly:

git cat-file -t HEAD           # prints object type: "commit"
git cat-file -p HEAD           # prints the commit object's contents
git cat-file -p HEAD^{tree}    # prints the tree at HEAD
git cat-file -p HEAD:src/main.c  # prints the blob for a specific file
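
The text that cat-file -p prints is a rendering of a simple byte format. The following Python sketch rebuilds tree and commit bodies by hand, assuming SHA-1 object IDs; the helper names (object_id, tree_body, commit_body) and the identity string are made up for illustration.

```python
import hashlib


def object_id(obj_type: str, body: bytes) -> str:
    """SHA-1 over the "<type> <size>\\0" header plus the body."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries) -> bytes:
    """Serialize (mode, name, hex_id) entries into a tree object's body."""
    def sort_key(entry):
        mode, name, _ = entry
        # Entries are ordered by name; directories compare as "name/".
        return (name + "/" if mode == "40000" else name).encode("utf-8")

    body = b""
    for mode, name, hex_id in sorted(entries, key=sort_key):
        # Each entry: ASCII mode, space, name, NUL, then 20 raw hash bytes.
        body += (mode.encode("ascii") + b" " + name.encode("utf-8")
                 + b"\x00" + bytes.fromhex(hex_id))
    return body


def commit_body(tree_id, parent_ids, author, committer, message) -> bytes:
    """Serialize the text form of a commit object."""
    lines = [f"tree {tree_id}"]
    lines += [f"parent {p}" for p in parent_ids]
    lines += [f"author {author}", f"committer {committer}", "", message]
    return ("\n".join(lines) + "\n").encode("utf-8")


if __name__ == "__main__":
    blob = object_id("blob", b"hello\n")
    tree = object_id("tree", tree_body([("100644", "hello.txt", blob)]))
    # Hypothetical identity: "Name <email> <unix time> <timezone>".
    who = "A U Thor <author@example.com> 1700000000 +0000"
    print(object_id("commit", commit_body(tree, [], who, who, "Initial commit")))
```

Note that the filename appears only in the tree body, and the commit refers to the tree by ID alone, which mirrors the division of labour described above.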

How a commit is a snapshot, not a diff

A common misconception is that Git stores diffs. It stores snapshots. When you commit a change to one file in a repository with 500 files, Git creates a new blob for the changed file, a new tree object for every directory between that file and the root (because tree objects reference their children by SHA-1 and the child's SHA-1 has changed), and a new commit object pointing to the new root tree. The 499 unchanged files are still referenced by the new trees: they point to the same blob objects as before, with no copying, and directories with no changes reuse their existing tree objects. The unchanged portions are shared.
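
The sharing described above can be observed in a toy in-memory object store. This is a sketch, not Git's implementation: the store is a plain dict, every file gets mode 100644, and the directory layout and file contents are invented for the example.

```python
import hashlib


def put(store, obj_type, body):
    """Store an object under its content-derived ID and return the ID."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\x00"
    oid = hashlib.sha1(header + body).hexdigest()
    store[oid] = (obj_type, body)   # writing the same content twice is a no-op
    return oid


def write_tree(store, directory):
    """Recursively store a nested dict {name: bytes | dict} as tree objects."""
    entries = []
    for name, value in directory.items():
        if isinstance(value, dict):
            entries.append((name + "/", b"40000", name, write_tree(store, value)))
        else:
            entries.append((name, b"100644", name, put(store, "blob", value)))
    entries.sort(key=lambda e: e[0].encode("utf-8"))
    body = b"".join(mode + b" " + name.encode("utf-8") + b"\x00" + bytes.fromhex(oid)
                    for _, mode, name, oid in entries)
    return put(store, "tree", body)


def snapshot_demo():
    """Commit-like snapshots before and after editing a single file."""
    store = {}
    v1 = {
        "README": b"docs\n",
        "docs": {"guide.md": b"guide\n"},
        "src": {"main.c": b"int main(void) { return 0; }\n", "util.c": b"/* util */\n"},
    }
    root1 = write_tree(store, v1)
    before = set(store)
    # Change only src/main.c; everything else is byte-for-byte identical.
    v2 = {**v1, "src": {**v1["src"], "main.c": b"int main(void) { return 1; }\n"}}
    root2 = write_tree(store, v2)
    added_types = sorted(store[oid][0] for oid in set(store) - before)
    return root1, root2, added_types


if __name__ == "__main__":
    print(snapshot_demo()[2])   # types of the objects the second snapshot added
```

Only the edited blob, the src tree, and the root tree are new in the second snapshot; README, util.c, and the whole docs tree resolve to IDs that were already in the store.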

The git diff output you see is computed on the fly by comparing the two commits' trees and the blobs that differ between them. It is not stored. Computing the diff between two arbitrary commits therefore requires loading both trees and comparing their entries; subtrees with identical SHA-1s can be skipped without being read, so the cost grows roughly with the number of changed files and the size of their contents. When many files have changed, that comparison can be slow, which is why operations like git log -p on a large revision range take time proportional to the range being rendered.
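
A rough sketch of that comparison, under simplifying assumptions: trees are modelled as Python dicts mapping a name to a (kind, id) pair, and the short IDs such as "t_src1" are placeholders rather than real hashes. The point is the early exit when two IDs match.

```python
def diff_trees(trees, old_id, new_id, prefix="", visited=None):
    """Return [(status, path)] for paths that differ between two trees.

    trees maps a tree ID to {name: (kind, id)}, where kind is "tree" or "blob".
    visited, if given, collects the (old, new) tree pairs actually opened.
    """
    if old_id == new_id:
        return []   # same ID means same content: skip the whole subtree
    if visited is not None:
        visited.append((old_id, new_id))
    old = trees.get(old_id, {})
    new = trees.get(new_id, {})
    changes = []
    for name in sorted(set(old) | set(new)):
        o, n = old.get(name), new.get(name)
        if o == n:
            continue   # identical entry, nothing to read
        path = prefix + name
        o_tree = o[1] if o and o[0] == "tree" else None
        n_tree = n[1] if n and n[0] == "tree" else None
        if o_tree or n_tree:
            changes += diff_trees(trees, o_tree, n_tree, path + "/", visited)
        o_blob = bool(o) and o[0] == "blob"
        n_blob = bool(n) and n[0] == "blob"
        if o_blob and n_blob:
            changes.append(("M", path))
        elif o_blob:
            changes.append(("D", path))
        elif n_blob:
            changes.append(("A", path))
    return changes


# Hypothetical repository: only src/main.c differs between the two roots.
TREES = {
    "root1": {"README": ("blob", "b_readme"), "docs": ("tree", "t_docs"), "src": ("tree", "t_src1")},
    "root2": {"README": ("blob", "b_readme"), "docs": ("tree", "t_docs"), "src": ("tree", "t_src2")},
    "t_docs": {"guide.md": ("blob", "b_guide")},
    "t_src1": {"main.c": ("blob", "b_main1"), "util.c": ("blob", "b_util")},
    "t_src2": {"main.c": ("blob", "b_main2"), "util.c": ("blob", "b_util")},
}

if __name__ == "__main__":
    opened = []
    print(diff_trees(TREES, "root1", "root2", visited=opened))
    print(opened)   # the docs tree never appears here
```

Producing the textual diff would then mean loading and comparing only the blobs reported as modified, which is where most of the time goes for large changes.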

Branches are just pointers

A branch in Git is, in the simplest case, a file in .git/refs/heads/ containing a single 40-character SHA-1. (Git may also consolidate refs into the single .git/packed-refs file, in which case the per-branch file is absent, but the idea is the same.) main is a ref whose value is the SHA-1 of the most recent commit on that branch. Creating a branch allocates no new objects; it creates a 41-byte file (40 hex characters plus a newline). Deleting a branch deletes that ref. The commit objects the branch pointed to remain in the object store; they are garbage-collected only once no reference (branch, tag, or reflog entry) reaches them.

cat .git/refs/heads/main
# 3a8f2c1d9e4b7a0f6c5d2e8b1a9f3e7d5c2b0a4f

cat .git/HEAD
# ref: refs/heads/main

HEAD normally contains the name of the current branch, not a SHA-1. When HEAD contains a SHA-1 directly (as it does after git checkout <sha1>), you are in detached HEAD state. No branch points at subsequent commits you create; once you switch away they are unreachable from any branch, although HEAD's reflog still lists them for a while. This is not dangerous if you know it is happening: running git branch new-branch from a detached HEAD creates a branch at the current commit before you lose track of it.
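
The difference between the two HEAD forms can be sketched in a few lines of Python. The state dict, resolve_head, and record_commit are illustrative stand-ins for the files under .git/, not a real API, and the commit IDs are shortened placeholders.

```python
def resolve_head(head_text, refs):
    """Return (commit_id, ref_name); ref_name is None when HEAD is detached."""
    text = head_text.strip()
    if text.startswith("ref: "):
        ref_name = text[len("ref: "):]
        # Symbolic HEAD: look the branch up (None if the branch has no commits).
        return refs.get(ref_name), ref_name
    return text, None   # HEAD holds a commit ID directly


def record_commit(state, new_commit_id):
    """Mimic what creating a commit does to HEAD and the refs."""
    _, ref_name = resolve_head(state["HEAD"], state["refs"])
    if ref_name is not None:
        state["refs"][ref_name] = new_commit_id   # branch advances, HEAD unchanged
    else:
        state["HEAD"] = new_commit_id             # detached: only HEAD moves


if __name__ == "__main__":
    attached = {"HEAD": "ref: refs/heads/main\n", "refs": {"refs/heads/main": "c1"}}
    record_commit(attached, "c2")
    print(attached)   # main now names c2

    detached = {"HEAD": "c1\n", "refs": {"refs/heads/main": "c1"}}
    record_commit(detached, "c2")
    print(detached)   # main still names c1; nothing but HEAD names c2
```

In the detached case, moving HEAD elsewhere leaves c2 with no ref pointing at it, which is the situation the paragraph above warns about.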

What merge and rebase do to the object graph

A merge commit is a commit with two (or, rarely, more) parent pointers. The history graph is a directed acyclic graph (DAG) whose edges point from each commit to its parents, so a merge commit is a node with out-degree two. The merge commit records that the histories of both parents are now unified, and its tree object is the result of combining the two parents' trees. The history of both branches is preserved in full.

Rebase rewrites commits. Given a feature branch with commits A→B→C on top of main, rebasing the branch onto an updated main creates new commits A'→B'→C' whose parent pointers, committer timestamps, and (potentially) content differ from the originals; the author name and author timestamp are carried over. The original commits A, B, C still exist in the object store and remain reachable through the reflog, but the branch pointer now points to C'. No history is destroyed, but it is rewritten, which is why force-pushing a rebased branch to a shared remote causes problems for anyone who had the old commits: their local history no longer matches the graph on the remote.
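
A toy model makes the contrast visible. In this sketch a commit is a dict hashed via JSON (not Git's real commit format), files are stored inline instead of in trees, and the rebase helper ignores deletions and conflicts; all names and timestamps are invented.

```python
import hashlib
import json


def commit(store, files, parents, message, author_time, committer_time):
    """Store a simplified commit record and return its content-derived ID."""
    record = {"files": files, "parents": list(parents), "message": message,
              "author_time": author_time, "committer_time": committer_time}
    cid = hashlib.sha1(json.dumps(record, sort_keys=True).encode("utf-8")).hexdigest()
    store[cid] = record
    return cid


def rebase(store, branch_commits, new_base, now):
    """Replay branch_commits (oldest first) onto new_base; return the new tip."""
    parent = new_base
    for cid in branch_commits:
        old = store[cid]
        old_parent_files = store[old["parents"][0]]["files"]
        # The change this commit made relative to its original parent.
        patch = {p: c for p, c in old["files"].items() if old_parent_files.get(p) != c}
        new_files = {**store[parent]["files"], **patch}
        # New parent and committer time give a new ID; author time is kept.
        parent = commit(store, new_files, [parent], old["message"], old["author_time"], now)
    return parent


def demo():
    store = {}
    base = commit(store, {"app.py": "v1"}, [], "base", 100, 100)
    a = commit(store, {"app.py": "v1", "feature.py": "a"}, [base], "A", 110, 110)
    b = commit(store, {"app.py": "v1", "feature.py": "b"}, [a], "B", 120, 120)
    main2 = commit(store, {"app.py": "v2"}, [base], "main work", 130, 130)
    rebased_tip = rebase(store, [a, b], main2, now=200)
    # A merge instead keeps both histories: one commit, two parents.
    merged = commit(store, {"app.py": "v2", "feature.py": "b"}, [main2, b], "merge", 210, 210)
    return store, a, b, main2, rebased_tip, merged


if __name__ == "__main__":
    store, a, b, main2, rebased_tip, merged = demo()
    print(rebased_tip != b, a in store and b in store)
    print(store[merged]["parents"] == [main2, b])
```

The rebased tip has the same message and author time as B but a different ID, while the original A and B are still in the store, which matches the "rewritten, not destroyed" description.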

The reflog: Git's safety net

Every time a branch pointer or HEAD moves, Git records the old and new SHA-1 in a reflog file: .git/logs/refs/heads/branchname for each branch and .git/logs/HEAD for HEAD itself. The reflog is local and is not pushed or fetched. It is the safety net for "destructive" operations.

git reflog             # shows HEAD's movement history
git reflog show main   # shows main branch pointer history

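If you accidentally delete a branch, lose commits to a bad reset, or rebase away commits you wanted to keep, HEAD's reflog shows the SHA-1 of where HEAD was before. (Deleting a branch also deletes that branch's own reflog, so HEAD's reflog is the place to look.) You can inspect the commits with git checkout <sha1>, which leaves you in detached HEAD state, or give them a name again with git branch recovered-branch <sha1>. Reflog entries expire after 90 days by default (30 days for entries that are unreachable from the current tip of the ref), so recovery is time-limited but usually feasible for weeks after the mistake.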
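
As an illustration of what the reflog records, the sketch below parses reflog-style lines and finds the commit a ref pointed to just before a reset. It assumes the on-disk layout commonly seen in .git/logs/ (old ID, new ID, identity, timestamp, timezone, then a tab and a message); the sample IDs and identity are fabricated placeholders, and parse_reflog and commit_before_last are hypothetical helpers.

```python
import re

REFLOG_LINE = re.compile(
    r"^(?P<old>[0-9a-f]{40}) (?P<new>[0-9a-f]{40}) "
    r"(?P<who>.*) (?P<time>\d+) (?P<tz>[+-]\d{4})"
    r"(?:\t(?P<message>.*))?$"
)


def parse_reflog(text):
    """Parse reflog file text (oldest entry first) into a list of dicts."""
    entries = []
    for line in text.splitlines():
        match = REFLOG_LINE.match(line)
        if match:
            entries.append(match.groupdict())
    return entries


def commit_before_last(entries, action_prefix):
    """ID the ref held just before the newest entry whose message starts
    with action_prefix (for example "reset:"), or None if there is none."""
    for entry in reversed(entries):
        if (entry["message"] or "").startswith(action_prefix):
            return entry["old"]
    return None


# Placeholder IDs: 40 repeated hex digits stand in for real hashes.
ZERO, A, B = "0" * 40, "a" * 40, "b" * 40
WHO = "Dev <dev@example.com> 1700000000 +0000"
SAMPLE = "\n".join([
    f"{ZERO} {A} {WHO}\tcommit (initial): first",
    f"{A} {B} {WHO}\tcommit: second",
    f"{B} {A} {WHO}\treset: moving to HEAD~1",
]) + "\n"

if __name__ == "__main__":
    lost = commit_before_last(parse_reflog(SAMPLE), "reset:")
    print(lost == B)   # the commit dropped by the reset
```

In practice git reflog does this reading for you; the sketch only shows that the "old" column of each entry is what makes recovery possible.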

Pack files: how Git handles large repositories

Loose objects in .git/objects/ accumulate over time. Git periodically runs git gc (garbage collection), triggered automatically by some commands, to pack them into binary pack files stored in .git/objects/pack/. A pack file stores objects compressed and delta-encoded: similar objects (like successive versions of the same file) are stored as a base plus a delta, reducing storage dramatically. For a typical source repository with many revisions of the same text files, the packed store is often a small fraction of the size of the loose one.
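
The benefit of storing a revision relative to a similar base can be demonstrated with a stand-in technique. The sketch below does not implement Git's pack delta format; it uses zlib's preset-dictionary feature to compress a second version of a file against the first, and compares that with compressing the second version alone. The synthetic file contents and helper names are invented.

```python
import hashlib
import zlib


def pseudo_source(lines: int, seed: str) -> bytes:
    """Deterministic, poorly compressible stand-in for a source file."""
    return b"".join(
        hashlib.sha1(f"{seed}-{i}".encode("ascii")).hexdigest().encode("ascii") + b"\n"
        for i in range(lines)
    )


def edit_line(data: bytes, index: int, new_line: bytes) -> bytes:
    """Return data with one line replaced."""
    parts = data.splitlines(keepends=True)
    parts[index] = new_line
    return b"".join(parts)


def compress_alone(data: bytes) -> bytes:
    return zlib.compress(data, 9)


def compress_against(base: bytes, data: bytes) -> bytes:
    """Compress data using base as a preset dictionary (delta-like)."""
    compressor = zlib.compressobj(level=9, zdict=base)
    return compressor.compress(data) + compressor.flush()


def decompress_against(base: bytes, packed: bytes) -> bytes:
    """Inverse of compress_against; needs the same base to reconstruct."""
    decompressor = zlib.decompressobj(zdict=base)
    return decompressor.decompress(packed) + decompressor.flush()


if __name__ == "__main__":
    base = pseudo_source(400, "v1")                 # about 16 KB, inside zlib's window
    edited = edit_line(base, 200, b"changed line\n")
    print(len(compress_alone(edited)), len(compress_against(base, edited)))
```

As with a real delta, the compact form is useless without its base, which is why pack files keep base objects alongside the deltas that depend on them.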

Understanding pack files also explains why git clone --depth 1 produces a much smaller download: Git constructs a pack file containing only the requested commit and the trees and blobs reachable from it, omitting all earlier history. A shallow clone loses the ability to compute diffs against older commits or run git blame on lines changed before the depth cutoff, but it is fine for CI builds that only need the current source tree.
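
The effect of a depth limit on what has to be transferred can be sketched as a bounded graph walk. This is a simplification that looks at commits only and uses a made-up linear history; reachable_commits is an illustrative helper, not how Git negotiates a fetch.

```python
from collections import deque


def reachable_commits(parents, tip, depth=None):
    """Commits reachable from tip; depth=1 keeps only the tip itself.

    parents maps a commit ID to the list of its parent IDs.
    """
    seen = {tip}
    queue = deque([(tip, 1)])
    while queue:
        commit_id, level = queue.popleft()
        if depth is not None and level >= depth:
            continue   # shallow boundary: parents are not followed
        for parent in parents.get(commit_id, []):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, level + 1))
    return seen


# Hypothetical history: c1 <- c2 <- c3 <- c4 (c4 is the branch tip).
HISTORY = {"c4": ["c3"], "c3": ["c2"], "c2": ["c1"], "c1": []}

if __name__ == "__main__":
    print(sorted(reachable_commits(HISTORY, "c4", depth=1)))
    print(sorted(reachable_commits(HISTORY, "c4")))
```

With depth=1 the walk stops at the tip, so only that commit's trees and blobs would need to be packed; without a limit every ancestor and everything it references is included.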

The object model also explains why git log is usually fast even on repositories with long histories: traversing the commit graph only requires reading commit objects and following their parent pointers, and commit objects are small. The slow operations are those that require inspecting blob contents at many commits, such as git log -S pattern (pickaxe search) or git blame on a heavily modified file, because they must unpack and compare file contents across many commits.