Git under the hood (minimal)
Git is a content-addressable, immutable, distributed database optimized for tracking filesystem snapshots.
Core ideas:
- Everything is identified by a cryptographic hash (SHA-1 or SHA-256)
- Data is immutable: new content ⇒ new hash ⇒ new object
- Commits store complete tree snapshots, not diffs
- History forms a Merkle DAG of commits
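Main object types:
- blob → raw file content (no filename, no mode)
- tree → directory (maps filenames and modes to blob/tree hashes)
- commit → metadata + pointer to a tree + parent commit(s)
- tag → annotated tag object: name, tagger, message + pointer to an object (usually a commit); lightweight tags are plain refs, not objects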
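To make the object types concrete, here is a minimal Python sketch of how tree and commit bodies are laid out in a SHA-1 repository. The helper names `tree_body` and `commit_body` and the author string are hypothetical, and the entry sort is simplified (real Git orders directory names as if they ended in "/").

```python
def tree_body(entries):
    """Serialize tree entries: '<mode> <name>\\0' followed by the raw 20-byte id.

    entries: iterable of (mode, name, hex_object_id), e.g.
    ("100644", "hello.txt", "<40 hex chars>") for a regular file or
    ("40000", "src", "<40 hex chars>") for a subdirectory.
    """
    out = b""
    # Simplification: plain byte-wise sort by name.
    for mode, name, hex_id in sorted(entries, key=lambda e: e[1].encode()):
        out += mode.encode() + b" " + name.encode() + b"\x00" + bytes.fromhex(hex_id)
    return out


def commit_body(tree_id, parent_ids, author, message):
    """Serialize a commit: one tree line, zero or more parent lines, metadata, message."""
    lines = [f"tree {tree_id}"]
    lines += [f"parent {p}" for p in parent_ids]
    lines.append(f"author {author}")
    lines.append(f"committer {author}")
    return ("\n".join(lines) + "\n\n" + message + "\n").encode()


BLOB_ID = "3b18e512dba79e4c8300dd08aeb37f8e728b8dad"  # id of the blob b"hello world\n"
tree = tree_body([("100644", "hello.txt", BLOB_ID)])
commit = commit_body("4b825dc642cb6eb9a060e54bf8d69288fbee4904", [],
                     "A. Author <author@example.com> 0 +0000", "initial")
print(commit.decode())
```

Note that the filename lives in the tree entry, not in the blob, which is why two files with identical content share one blob.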
Storage layout:
- Objects live under .git/objects/<2-char>/<38-char> (a 40-hex-digit SHA-1 id split 2 + 38; SHA-256 ids are 64 hex digits, split 2 + 62)
- Loose objects are zlib-compressed individually
- Packfiles group and delta-compress objects for efficiency
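The loose-object layout above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Git's implementation: the function names are hypothetical, and packfiles are not handled at all.

```python
import zlib
from pathlib import Path


def loose_object_path(git_dir, object_id):
    """Return where a loose object with this hex id would be stored."""
    return Path(git_dir) / "objects" / object_id[:2] / object_id[2:]


def encode_loose(obj_type, body):
    """Build the on-disk bytes of a loose object: zlib('<type> <size>\\0' + body)."""
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    return zlib.compress(header + body)


def decode_loose(raw):
    """Inverse of encode_loose: return (type, body)."""
    data = zlib.decompress(raw)
    # The header ends at the first NUL byte; the body may contain more NULs.
    header, _, body = data.partition(b"\x00")
    obj_type, size = header.decode().split(" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return obj_type, body
```

Splitting on the first two hex digits keeps any single directory from holding every object in the repository.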
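Hashes and integrity:
- Object id = hash("<type> <size>\0" + content), not the hash of the raw content alone
- Commit includes hash(tree) and hash(parent(s))
- Chain of hashes = tamper-evident Merkle DAG: changing any object changes every id that depends on it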
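As a quick check of how ids are derived, the sketch below hashes a short header ("<type> <size>" plus a NUL byte) followed by the content. The function name `git_object_id` is hypothetical; the two printed ids are the well-known values for this blob and for the empty tree.

```python
import hashlib


def git_object_id(obj_type, body, algo="sha1"):
    """Hash '<type> <size>\\0' + body, which is how Git derives object ids."""
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    return hashlib.new(algo, header + body).hexdigest()


print(git_object_id("blob", b"hello world\n"))
# 3b18e512dba79e4c8300dd08aeb37f8e728b8dad (matches `git hash-object` on the same bytes)
print(git_object_id("tree", b""))
# 4b825dc642cb6eb9a060e54bf8d69288fbee4904 (the empty tree)
```

Because the type is part of the hashed header, a blob and a tree with identical bytes still get different ids.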
References:
- .git/refs/heads/<branch> → latest commit hash (refs may also be stored packed in .git/packed-refs)
- .git/refs/tags/<tag> → tagged commit, or the tag object for an annotated tag
- HEAD → current branch ref (symbolic)
- Detached HEAD → points directly to a commit
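The difference between a symbolic and a detached HEAD shows up directly in the file contents. The following is a minimal sketch; `resolve_head` is a hypothetical helper that reads only loose ref files and ignores .git/packed-refs.

```python
from pathlib import Path


def resolve_head(git_dir):
    """Return (branch_ref_or_None, commit_id_or_None) for a repository.

    Simplification: only loose ref files are read; refs stored in
    .git/packed-refs are not looked up.
    """
    git_dir = Path(git_dir)
    head = (git_dir / "HEAD").read_text().strip()
    if head.startswith("ref: "):
        # Symbolic HEAD, e.g. "ref: refs/heads/main"
        ref = head[len("ref: "):]
        ref_file = git_dir / ref
        # An unborn branch (fresh repo, no commits yet) has no ref file.
        commit = ref_file.read_text().strip() if ref_file.exists() else None
        return ref, commit
    # Detached HEAD: the file holds a commit id directly.
    return None, head
```

Calling `resolve_head(".git")` inside a repository returns the current branch ref and its tip, or `(None, <commit id>)` when HEAD is detached.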
Index (staging area):
- .git/index maps paths → blob hashes + metadata
- Bridge between working directory and next commit
- Enables comparisons across three states: working dir vs index (unstaged changes) and index vs HEAD (staged changes)
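A toy model of the three states, assuming each one is just a mapping from path to blob id. The function names and the short ids are hypothetical, and untracked files are not distinguished from modified ones.

```python
def changed_paths(old, new):
    """Paths whose blob id differs between two {path: blob_id} snapshots."""
    return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))


def status(head, index, worktree):
    """Mimic the two comparisons behind `git status`."""
    return {
        "staged": changed_paths(head, index),        # like `git diff --cached`
        "unstaged": changed_paths(index, worktree),  # like `git diff`
    }


head = {"a.txt": "1111", "b.txt": "2222"}
index = {"a.txt": "1111", "b.txt": "3333"}     # b.txt changed and staged
worktree = {"a.txt": "4444", "b.txt": "3333"}  # a.txt edited but not staged
print(status(head, index, worktree))
# {'staged': ['b.txt'], 'unstaged': ['a.txt']}
```

Comparing blob ids rather than file contents is what makes these checks cheap: equal ids mean equal content.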
Graph model:
- Commits form a DAG: node = commit, edge = parent
- Merge = commit with multiple parents
- Rebase = rewrite DAG by creating new commits
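The graph model fits in a dictionary from commit id to parent ids. The sketch below uses made-up single-letter ids, and its `rebase` only models the graph rewrite (new nodes with new parents), not the replaying of file changes.

```python
# Toy history: commit id -> list of parent ids (hypothetical short ids).
#   A <- B <- D   (D is a merge of B and C)
#   A <- C <-/
parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}


def ancestors(commit, parents):
    """All commits reachable from `commit` via parent edges, including itself."""
    seen, stack = set(), [commit]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(parents[c])
    return seen


def is_merge(commit, parents):
    """A merge is simply a commit with more than one parent."""
    return len(parents[commit]) > 1


def rebase(commits, onto, parents):
    """Replay a linear list of commits on top of `onto` as new commits.

    Returns the new tip. The originals stay in `parents` untouched, which
    mirrors how rebase creates new objects rather than editing old ones.
    """
    tip = onto
    for c in commits:
        new_id = c + "'"  # stands in for the new hash
        parents[new_id] = [tip]
        tip = new_id
    return tip
```

For example, `rebase(["C"], "B", parents)` adds a commit `C'` whose parent is `B`, while the original `C` still points at `A`.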
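Remotes:
- Remote = peer repository (not a master)
- Fetch/push compare ref and commit hashes to work out which objects the other side is missing, then send only those
- Transfer uses packfiles, so it is delta-compressed; the smart HTTP transport is stateless per request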
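The idea of sending only missing objects can be illustrated at the commit level as a set difference over reachability. This is a deliberately simplified sketch with hypothetical names: the real protocol negotiates in rounds and also transfers the trees and blobs behind each commit.

```python
def reachable(tips, parents):
    """All commits reachable from any of `tips` via parent edges."""
    seen, stack = set(), list(tips)
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(parents[c])
    return seen


def commits_to_send(sender_parents, wants, haves):
    """Commits reachable from `wants` but not from anything the receiver has.

    `haves` the sender does not recognise are ignored, since it cannot
    walk history it has never seen.
    """
    known = reachable([h for h in haves if h in sender_parents], sender_parents)
    return reachable(wants, sender_parents) - known


sender = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"]}
print(sorted(commits_to_send(sender, wants=["D"], haves=["B"])))
# ['C', 'D']
```

Since an id pins down a commit and its entire history, knowing that the receiver has `B` is enough to skip everything `B` can reach.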
Architectural patterns:
- Content-addressable storage → object immutability
- Composite → trees containing blobs/subtrees
- Merkle DAG → commit integrity and verification
- Symbolic references → HEAD and branches
- Staging buffer → index as write cache
- Eventual consistency → decentralized sync
Mental model: Git = immutable key-value store + DAG of snapshots + symbolic refs.