git under the hood: what happens when you switch branches
As developers, almost all of us use git. And we all know how to use it: add, commit, push, checkout… these
commands are part of our daily routine.
Like most developers, I had no idea how git works internally or what actually happens under the hood.
I wanted to learn more, so I read the Git Internals chapter of the git Book. I’d like to share with you what I learned.
I propose we explore this by starting from a very simple and very common case: switching branches. We’re going to pull on the thread of everything that happens under the hood to make the magic work.
Because it is a little bit magical, isn’t it? When you switch between two very different branches, git manages to change the contents of your entire working directory almost instantly, while mathematically guaranteeing that the state of the whole project is exactly the same as it is for anyone else on that same branch. How is that possible?
The scenario: git switch main
It all starts with HEAD
I’m on my my_feature branch and I switch to main. What actually happens?
The first link in the chain: a small file named HEAD changes its value. This file sits at the root of my .git
folder. It’s a very simple file, containing only one line and storing only one thing: the location of the file that
represents my current local branch.
In our case, the value of HEAD simply goes from:
ref: refs/heads/my_feature
to:
ref: refs/heads/main
That’s it. Switching branches, from HEAD’s point of view, means rewriting the single line contained in a text file.
The content is the path of the file that represents our branch.
(There’s a special case, the detached HEAD, where HEAD directly contains the fingerprint of a commit instead of a
branch reference. We’ll set that aside here.)
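To make the two possible shapes of HEAD concrete, here is a minimal Python sketch that classifies the single line stored in the file. The function name parse_head is my own, purely for illustration; it is not something git exposes.

```python
import re

def parse_head(text: str) -> tuple[str, str]:
    """Classify the single line stored in .git/HEAD (illustrative helper)."""
    line = text.strip()
    if line.startswith("ref: "):
        # Normal case: HEAD names the branch file, e.g. refs/heads/main.
        return ("branch", line[len("ref: "):])
    if re.fullmatch(r"[0-9a-f]{40}", line):
        # Detached HEAD: the line is itself a commit fingerprint.
        return ("detached", line)
    raise ValueError(f"unrecognized HEAD content: {line!r}")

print(parse_head("ref: refs/heads/my_feature\n"))
print(parse_head("ref: refs/heads/main\n"))
print(parse_head("9370f3ba267552ccca4a5d8870793fe8f6b6e7d2\n"))
```

Switching branches, seen from this angle, only changes what the first branch of the if returns.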
TL;DR
HEAD points to a branch.
A branch is just a pointer
And now, as promised, we’re going to pull on the thread and look together at what the file located at refs/heads/main
contains.
The file that represents my branch (refs/heads/main) is, in turn, a text file containing only one line and storing
only one value: the fingerprint of a commit, which looks something like this:
9370f3ba267552ccca4a5d8870793fe8f6b6e7d2
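That’s right: in git, a branch is just a pointer to a commit. Nothing more, nothing less. That’s exactly why creating a branch is near-instant: it just means writing a 40-character fingerprint into a new file. (git can also pack many branch pointers into a single .git/packed-refs file, but the principle is the same.)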
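Here is a small Python sketch of that idea, working on a throwaway directory that merely imitates the layout of .git. The helpers create_branch and resolve_head are hypothetical names of mine, and the sketch assumes loose ref files only (it ignores packed-refs).

```python
import tempfile
from pathlib import Path

def create_branch(git_dir: Path, name: str, commit_id: str) -> Path:
    """Create a branch by writing a commit fingerprint into refs/heads/<name>."""
    ref_path = git_dir / "refs" / "heads" / name
    ref_path.parent.mkdir(parents=True, exist_ok=True)
    ref_path.write_text(commit_id + "\n")  # 40 hex characters plus a newline
    return ref_path

def resolve_head(git_dir: Path) -> str:
    """Follow HEAD to its branch file and return the commit fingerprint."""
    head = (git_dir / "HEAD").read_text().strip()
    if not head.startswith("ref: "):
        return head  # detached HEAD already holds a fingerprint
    return (git_dir / head[len("ref: "):]).read_text().strip()

# A temporary directory standing in for .git (not a real repository).
fake_git_dir = Path(tempfile.mkdtemp())
create_branch(fake_git_dir, "main", "9370f3ba267552ccca4a5d8870793fe8f6b6e7d2")
(fake_git_dir / "HEAD").write_text("ref: refs/heads/main\n")
print(resolve_head(fake_git_dir))
```

In a real repository you would rather ask git itself, for example with git rev-parse HEAD, which also handles the cases this sketch skips.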
TL;DR
A branch points to a commit.
What is a commit?
Now we’ll see where this fingerprint comes from, and above all what it’s generated from.
A commit is… a text file. A very simple object that contains:
- the commit message (sometimes in two parts: a title then a description);
- the author, with their email and a timestamp;
- the committer, with their email and a timestamp;
- the fingerprint of a “root” tree object, which represents the root folder of my project;
- zero or more fingerprints of parent commits (none for the very first commit, two or more for a merge).
It looks like this:
tree af4a73f4f11f01ccd3528098bd0b7d1fe9887c20
parent 3783b39390accada85eb019477888af0d086ed54
author jeromeschwaederle <an.email@gmail.com> 1770672003 +0100
committer jeromeschwaederle <an.email@gmail.com> 1770672003 +0100

A commit message

You can also add a longer description.
You can see this for yourself with one of git’s “plumbing” commands, git cat-file:
$ git cat-file -p 9370f3ba267552ccca4a5d8870793fe8f6b6e7d2
Crucial point: the fingerprint is computed from the contents of this text file. As a result, if a single character were to change (a comma in the message or a second in the timestamp) the fingerprint would be completely different.
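Since a commit is plain text, it is easy to take apart. The sketch below is a simplified parser of my own (parse_commit is not a git API): it assumes the header lines are separated from the message by a blank line and ignores multi-line headers such as signatures.

```python
def parse_commit(text: str) -> dict:
    """Split a commit's text into header fields and message (simplified)."""
    headers, _, message = text.partition("\n\n")
    commit = {"parents": [], "message": message}
    for line in headers.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            commit["parents"].append(value)  # a merge commit has several
        else:
            commit[key] = value
    return commit

raw = (
    "tree af4a73f4f11f01ccd3528098bd0b7d1fe9887c20\n"
    "parent 3783b39390accada85eb019477888af0d086ed54\n"
    "author jeromeschwaederle <an.email@gmail.com> 1770672003 +0100\n"
    "committer jeromeschwaederle <an.email@gmail.com> 1770672003 +0100\n"
    "\n"
    "A commit message\n"
    "\n"
    "You can also add a longer description.\n"
)
commit = parse_commit(raw)
print(commit["tree"])
print(commit["parents"])
print(commit["message"].splitlines()[0])
```

The tree field is the one we will follow next.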
TL;DR
The commit is a small text file that associates metadata with a root “tree”.
The tree: the snapshot of your project
Let’s keep pulling on the thread. The commit points to a tree, but what is that?
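A tree object is, like a commit, a small object in git’s store. Unlike a commit, it is stored in a compact binary format, but git cat-file -p prints it as readable text.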
It solves the problem of storing file names and of grouping several files together.
A tree contains one or more entries. Each entry associates a mode, a type, a SHA-1 fingerprint, and a file (or folder) name:
$ git cat-file -p af4a73f4f11f01ccd3528098bd0b7d1fe9887c20
100644 blob a906cb2a4a904a152e80877d4088654daad0c859 README.md
100644 blob 8f94139338f9404f26296befa88755fc2598c289 aFile
040000 tree 99f1a6d12cb4b6f19c8655fca46c3ecf317074e0 aFolder
Two interesting things show up here:
- The mode looks like UNIX permissions, but git only uses a handful of them: 100644 for a normal file, 100755 for an executable, 120000 for a symbolic link, and 040000 for… a sub-folder.
- The last entry, aFolder, is not a blob but a pointer to another tree. In other words, a tree can contain sub-trees. It’s recursive, exactly like a folder hierarchy.
So a commit points to a root tree, which points to sub-trees and blobs. We have a complete hierarchy of the project’s state at a given moment. A commit is therefore a snapshot of your entire project, not a “diff”.
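For the curious, here is a Python sketch of how a tree’s entries can be serialized and read back. It follows my understanding of the on-disk layout (each entry is the mode, a space, the name, a null byte, then the raw 20-byte SHA-1, with sub-trees stored under mode 40000 without the leading zero). The function names are mine, and the sorting rule is a simplification.

```python
import hashlib

def serialize_tree(entries: list[tuple[str, str, str]]) -> bytes:
    """Build a tree body from (mode, name, hex fingerprint) entries."""
    def sort_key(entry):
        mode, name, _ = entry
        # git orders sub-trees as if their name ended with "/".
        return name.encode() + (b"/" if mode == "40000" else b"")
    body = b""
    for mode, name, hex_id in sorted(entries, key=sort_key):
        body += f"{mode} {name}".encode() + b"\0" + bytes.fromhex(hex_id)
    return body

def parse_tree(body: bytes) -> list[tuple[str, str, str]]:
    """Read (mode, name, hex fingerprint) entries back from a tree body."""
    entries, pos = [], 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\0", space)
        mode = body[pos:space].decode()
        name = body[space + 1:nul].decode()
        hex_id = body[nul + 1:nul + 21].hex()  # 20 raw bytes after the null
        entries.append((mode, name, hex_id))
        pos = nul + 21
    return entries

def tree_id(body: bytes) -> str:
    """Fingerprint of a tree: SHA-1 over 'tree <size>', a null byte, the body."""
    return hashlib.sha1(b"tree " + str(len(body)).encode() + b"\0" + body).hexdigest()

entries = [
    ("100644", "README.md", "a906cb2a4a904a152e80877d4088654daad0c859"),
    ("100644", "aFile", "8f94139338f9404f26296befa88755fc2598c289"),
    ("40000", "aFolder", "99f1a6d12cb4b6f19c8655fca46c3ecf317074e0"),
]
body = serialize_tree(entries)
for mode, name, hex_id in parse_tree(body):
    print(mode, hex_id, name)
```

Notice that the type column shown by git cat-file -p is not stored in the entry: in this sketch it would be derived from the mode.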
TL;DR
The tree is a small object that represents a folder. It associates names with content (blobs and sub-trees).
The blob: content, and nothing but content
One last link remains: what is a blob?
The blob stores only one thing: the contents of a file. Not its name (the tree takes care of that), not its date, not its permissions. Just the bytes.
$ git cat-file -p a906cb2a4a904a152e80877d4088654daad0c859
README
This is the content of a README.
How does git build a blob’s fingerprint? It takes the content, adds a small header (blob, followed by the size in
bytes, followed by a null byte), then computes the SHA-1 of the whole thing:
header = "blob 18\0"
everything to be hashed = "blob 18\0git under the hood"
sha1 = a61bbe0932fe6dde2d8f11348c5e26c076ed9191
$ printf 'blob 18\0git under the hood' | sha1sum
a61bbe0932fe6dde2d8f11348c5e26c076ed9191 -
produces the same signature as
$ printf 'git under the hood' | git hash-object --stdin
a61bbe0932fe6dde2d8f11348c5e26c076ed9191
The header and content are then compressed together with zlib and written to disk in .git/objects/. The sub-folder name corresponds to the
first 2 characters of the fingerprint, and the file name to the remaining 38:
.git/objects/a6/1bbe0932fe6dde2d8f11348c5e26c076ed9191
And, importantly, all of git’s objects (blobs, trees, commits, and annotated tags) are stored in exactly the same way. Only the header
changes: it starts with blob, tree, commit or tag depending on the case.
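The whole recipe (header, SHA-1, zlib, two-level path) fits in a few lines of Python. This is a sketch under the assumptions described above, writing into a temporary directory rather than a real .git/objects; the function names are my own.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path

def encode_object(kind: str, content: bytes) -> bytes:
    """Prefix the content with the header: kind, space, size, null byte."""
    return f"{kind} {len(content)}".encode() + b"\0" + content

def hash_object(kind: str, content: bytes) -> str:
    """The fingerprint is the SHA-1 of header plus content."""
    return hashlib.sha1(encode_object(kind, content)).hexdigest()

def write_object(objects_dir: Path, kind: str, content: bytes) -> Path:
    """Compress header plus content and store it under xx/yyyy..."""
    object_id = hash_object(kind, content)
    path = objects_dir / object_id[:2] / object_id[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(encode_object(kind, content)))
    return path

def read_object(path: Path) -> tuple[str, bytes]:
    """Decompress a stored object and split the header from the content."""
    raw = zlib.decompress(path.read_bytes())
    header, _, content = raw.partition(b"\0")
    kind, _size = header.decode().split(" ")
    return kind, content

# A temporary directory standing in for .git/objects.
objects_dir = Path(tempfile.mkdtemp()) / "objects"
path = write_object(objects_dir, "blob", b"git under the hood")
print(path.relative_to(objects_dir))
print(read_object(path))
```

Passing "tree" or "commit" as the kind is all it takes to store the other object types in this sketch, which mirrors the point made above: only the header changes.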
Pulling back up the thread: git is a content-addressable filesystem
We’ve gone all the way down. Let’s recap the chain we just walked through:
HEAD → refs/heads/main → commit → tree → (sub-trees) → blobs
git is, fundamentally, a content-addressable filesystem, with a version-control interface built on top of it.
What does that mean concretely? That the core of git is a simple key-value store. You give it some content, and it gives you back a key (the fingerprint) that you can later use to retrieve that content. The key is derived from the content.
That’s also why we talk about a repository (a store, a warehouse): it’s nothing more than a content store. The
.git/objects directory is your project’s database.
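Stripped of everything else, that key-value idea looks like the toy class below. It is an illustration of the principle only: ContentStore is a name I made up, it lives in memory, and it hashes the raw content without git’s header.

```python
import hashlib

class ContentStore:
    """A toy content-addressable store: the key is derived from the content."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, content: bytes) -> str:
        key = hashlib.sha1(content).hexdigest()
        self._objects[key] = content
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def __len__(self) -> int:
        return len(self._objects)

store = ContentStore()
key = store.put(b"git under the hood")
print(key, store.get(key))
```

You never choose the key yourself: the store computes it, and the same content always yields the same key.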
This very simple idea has enormous consequences:
- Deduplication is automatic. Two files with identical content, even with different names and in different folders, produce the same blob and are stored only once. The same goes between two commits: if a file doesn’t change, its blob is reused as is.
- Integrity is guaranteed by construction. A commit’s fingerprint depends on its tree and on its parent commits, the tree depends on its sub-trees, and those depend on their blobs. Everything is hashed in a cascade. It’s impossible to change a single byte in the history without changing every fingerprint all the way up to the commit. This structure is a Merkle tree (strictly speaking, a Merkle DAG). When two people have the same commit hash, they have the same project, byte for byte, barring a hash collision.
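Both consequences can be observed with a short Python sketch. It hashes a flat set of files into blob, tree, and commit fingerprints using the header-plus-content scheme described earlier. The snapshot function is hypothetical, handles no sub-folders, and leaves out the author, committer, and parent lines of a real commit.

```python
import hashlib

def object_id(kind: str, body: bytes) -> str:
    """SHA-1 over the header (kind, space, size, null byte) plus the body."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()

def snapshot(files: dict[str, bytes], message: str) -> dict:
    """Hash a flat set of files into blob, tree, and commit fingerprints."""
    blobs = {name: object_id("blob", data) for name, data in files.items()}
    tree_body = b"".join(
        f"100644 {name}".encode() + b"\0" + bytes.fromhex(blobs[name])
        for name in sorted(blobs)
    )
    tree = object_id("tree", tree_body)
    # Simplified commit body: only the tree line and the message.
    commit = object_id("commit", f"tree {tree}\n\n{message}\n".encode())
    return {"blobs": blobs, "tree": tree, "commit": commit}

before = snapshot({"a.txt": b"same\n", "b.txt": b"same\n", "c.txt": b"one\n"}, "msg")
after = snapshot({"a.txt": b"same\n", "b.txt": b"same\n", "c.txt": b"two\n"}, "msg")

# Deduplication: identical content gives one fingerprint, whatever the name.
print(before["blobs"]["a.txt"] == before["blobs"]["b.txt"])
# Cascade: changing c.txt changes the tree, and therefore the commit.
print(before["tree"] != after["tree"], before["commit"] != after["commit"])
```

Only c.txt gets a new blob in the second snapshot; a.txt and b.txt keep the fingerprint they already had.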
The magic of a branch switch, in summary
We can finally answer the question from the beginning. When I run git switch main, git:
- rewrites HEAD to point to refs/heads/main;
- reads the commit fingerprint stored in that file;
- reads the root tree of that commit;
- updates the index (the staging area) and materializes this hierarchy in my working directory.
And it’s fast because git has almost nothing to “compute”: everything is already stored, ready to use, indexed by its fingerprint. It only has to follow the pointers and write out the relevant blobs, and it can skip any file whose fingerprint is the same in both branches. The integrity guarantee, for its part, comes for free: it follows directly from the fact that each object is identified by the hash of its content.
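To tie the steps together, here is a toy model of a branch switch in Python. It is deliberately not git’s implementation: objects live in a plain dictionary with made-up short ids instead of SHA-1 fingerprints, the index is skipped, and files that should disappear from the working directory are not removed.

```python
import tempfile
from pathlib import Path

# Toy object store keyed by made-up ids (real git uses fingerprints).
objects = {
    "c1": {"type": "commit", "tree": "t1"},
    "t1": {"type": "tree", "entries": {"README.md": "b1", "aFolder": "t2"}},
    "t2": {"type": "tree", "entries": {"aFile": "b2"}},
    "b1": {"type": "blob", "data": b"This is the content of a README.\n"},
    "b2": {"type": "blob", "data": b"hello\n"},
}
refs = {"refs/heads/main": "c1"}

def materialize(tree_id: str, dest: Path) -> None:
    """Recursively write a tree's blobs and sub-trees into dest."""
    dest.mkdir(parents=True, exist_ok=True)
    for name, child_id in objects[tree_id]["entries"].items():
        child = objects[child_id]
        if child["type"] == "tree":
            materialize(child_id, dest / name)
        else:
            (dest / name).write_bytes(child["data"])

def switch(branch: str, worktree: Path) -> str:
    head = f"ref: refs/heads/{branch}"        # 1. rewrite HEAD
    commit_id = refs[head[len("ref: "):]]     # 2. read the branch's commit
    root_tree = objects[commit_id]["tree"]    # 3. read its root tree
    materialize(root_tree, worktree)          # 4. write the hierarchy out
    return head

worktree = Path(tempfile.mkdtemp())
head = switch("main", worktree)
print(head)
print(sorted(str(p.relative_to(worktree)) for p in worktree.rglob("*")))
```

Every step is a lookup followed by a write, which is the intuition behind the speed of the real command.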
Conclusion
git is simply a content store addressable by fingerprint, on top of which a few pointers (the branches, HEAD) and a
bit of metadata (the commits) have been laid.
It’s no accident that Linus Torvalds, in the very first version of git, described it as a “stupid content tracker”. “Stupid” in the noble sense: it doesn’t try to be clever, it simply stores content and retrieves it by its fingerprint. And it’s precisely this simplicity that makes it so robust.
The next time you switch branches, you’ll know exactly which thread you’re pulling on.