GIT: Chapter 8 — Git Internals (Objects, SHA, Storage Model)


8.1 Introduction

Understanding Git's internals turns Git from a set of memorized commands into a predictable data model. At its core, Git is a content-addressable object store whose commits form a directed acyclic graph (DAG) of snapshots.

This chapter explores:

  • Git object model

  • SHA-based identity system

  • Object storage structure

  • Commit graph representation

  • Packfiles and compression

  • Plumbing commands

These concepts clarify why Git operations are fast, reliable, and cryptographically verifiable.


8.2 Git as a Content-Addressable Store

Git does not store files by name or location. Instead, it stores content objects indexed by cryptographic hashes.

Key properties:

  • Immutable storage

  • Deduplication

  • Integrity verification

  • Referential graph

If the content changes, the hash changes, so every distinct version of a file is stored as a distinct object.
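A minimal sketch of the idea, assuming a plain in-memory dictionary as the store. The class name ContentStore is hypothetical, and unlike Git this toy hashes the raw bytes without any header:

```python
import hashlib


class ContentStore:
    """Toy content-addressable store: the key is derived from the value."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        # Identical bytes always map to the same key, so nothing is stored twice.
        self._objects.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def __len__(self) -> int:
        return len(self._objects)


store = ContentStore()
k1 = store.put(b"hello\n")
k2 = store.put(b"hello\n")   # same content, same key (deduplication)
k3 = store.put(b"hello!\n")  # changed content, different key
print(k1 == k2, k1 == k3, len(store))  # True False 2
```

Because the key is computed from the value, an entry can never be edited in place: changing the bytes produces a new key and therefore a new entry.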


8.3 SHA Hashing in Git

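Git has historically used SHA-1 (a 160-bit digest, written as 40 hexadecimal characters). Newer versions also support SHA-256 as an opt-in object format (git init --object-format=sha256), but SHA-1 is still the default in Git 2.x.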

SHA characteristics

  • Fixed-length digest

  • Collision-resistant in practice (SHA-1 collisions have been demonstrated in research, which motivates the move to SHA-256)

  • Content-derived identity

Example hash:

e83c5163316f89bfbde7d9ab23ca2e25604af290

This hash uniquely identifies a Git object.


8.4 Git Object Types

Git uses four primary object types:

Object | Purpose
Blob   | File content
Tree   | Directory structure
Commit | Snapshot metadata
Tag    | Annotated reference

These objects form the Git storage backbone.


8.5 Blob Objects

Blob = Binary Large Object.

Stores:

  • Raw file content

  • No filename

  • No metadata

Creation example:

echo "Hello Git" | git hash-object -w --stdin

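Git:

  1. Computes the SHA over a short header (object type and size, terminated by a NUL byte) followed by the content

  2. Compresses the header and content with zlib

  3. Stores the result in the object database under .git/objects/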
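The three steps can be sketched in Python as follows. This is an illustration for SHA-1 repositories, not Git's own code; the function name hash_object and the object_db dictionary (standing in for .git/objects) are assumptions made for the example:

```python
import hashlib
import zlib

object_db = {}  # stand-in for .git/objects: object id -> compressed bytes


def hash_object(content: bytes, obj_type: str = "blob") -> str:
    # 1. Compute the SHA-1 over "<type> <size>\0" followed by the content.
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\0"
    data = header + content
    object_id = hashlib.sha1(data).hexdigest()
    # 2. Compress header + content with zlib.
    compressed = zlib.compress(data)
    # 3. Store the compressed bytes under the object id.
    object_db[object_id] = compressed
    return object_id


print(hash_object(b"test content\n"))
# d670460b4b4aece5915caf5c68d12f560a9fe3e4
```

The printed id is the same one that echo 'test content' | git hash-object --stdin reports, since the header and hash function match.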


8.6 Tree Objects

Tree objects represent directories.

Contain:

  • Blob references

  • Tree references

  • Filenames

  • File modes (for example 100644 for a regular file, 100755 for an executable, 040000 for a subdirectory)

Conceptually similar to a filesystem directory listing.

Inspect:

git cat-file -p <tree-hash>
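As a rough sketch of what a tree object contains on disk: each entry is the mode, a space, the name, a NUL byte, and the 20 raw bytes of the referenced object's SHA-1. The function name tree_id is hypothetical, and sorting by name alone is a simplification (Git's real ordering treats directory names specially):

```python
import hashlib


def tree_id(entries):
    """entries: iterable of (mode, name, hex_object_id) tuples."""
    body = b""
    # Simplification: sort by raw name bytes only.
    for mode, name, hex_id in sorted(entries, key=lambda e: e[1].encode()):
        body += mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(hex_id)
    header = b"tree " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# The empty tree has a well-known id.
print(tree_id([]))  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904

# One file entry pointing at the empty blob; the file name is illustrative.
empty_blob = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
print(tree_id([("100644", "README", empty_blob)]))
```

Renaming README changes the tree's id but not the blob's, which is why the filename lives in the tree rather than in the blob.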

8.7 Commit Objects

Commit objects define repository snapshots.

Contain:

  • Root tree pointer

  • Parent commit(s)

  • Author/committer metadata

  • Timestamp

  • Commit message

Commit graph structure creates project history.

Inspect:

git cat-file -p <commit-hash>
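A commit object is plain text: a tree line, zero or more parent lines, author and committer lines, a blank line, then the message. The sketch below assembles one and hashes it. The identity, timestamp, and the function name commit_id are made up for illustration:

```python
import hashlib


def commit_id(tree, parents, author, committer, message):
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    body = ("\n".join(lines) + "\n\n" + message).encode()
    header = b"commit " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


empty_tree = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
who = "A U Thor <author@example.com> 1700000000 +0000"  # hypothetical identity

root = commit_id(empty_tree, [], who, who, "initial commit\n")
child = commit_id(empty_tree, [root], who, who, "second commit\n")
print(root)
print(child)
```

Since the parent id is part of the hashed text, changing any ancestor changes the id of every commit that descends from it.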

8.8 Tag Objects

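Tags give stable names to specific commits, typically releases. Only annotated tags create a tag object; a lightweight tag is just a ref that points directly at a commit.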

Types:

Lightweight

  • Simple pointer

Annotated

  • Metadata (tagger, date)

  • Optional GPG signature

  • Message

Create annotated tag:

git tag -a v1.0 -m "release"

8.9 Object Storage Layout

Git stores objects inside:

.git/objects/

Structure:

.git/objects/ab/cdef123...

Directory name = first 2 hash characters
Filename = remaining characters

This fan-out keeps any single directory from accumulating a very large number of entries, which many filesystems handle poorly.
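The split is easy to express in code. A small sketch, where the helper name loose_object_path is hypothetical:

```python
from pathlib import PurePosixPath


def loose_object_path(hex_id: str) -> PurePosixPath:
    # First two hex characters name the directory, the rest name the file.
    return PurePosixPath(".git/objects") / hex_id[:2] / hex_id[2:]


print(loose_object_path("e83c5163316f89bfbde7d9ab23ca2e25604af290"))
# .git/objects/e8/3c5163316f89bfbde7d9ab23ca2e25604af290
```

With two hex characters there are at most 256 fan-out directories.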


8.10 Object Compression

Objects are stored using zlib compression.

Benefits:

  • Reduced disk usage

  • Efficient cloning

  • Network optimization

Git transparently decompresses during retrieval.
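To illustrate the read path, here is a sketch that compresses an object the way a loose object is laid out (header, NUL, payload) and then decodes it again. The function names are assumptions for the example, and the payload is arbitrary repetitive data:

```python
import zlib


def encode_loose(obj_type: str, content: bytes) -> bytes:
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\0"
    return zlib.compress(header + content)


def decode_loose(raw: bytes):
    data = zlib.decompress(raw)
    # The header ends at the first NUL byte; everything after it is payload.
    header, _, content = data.partition(b"\0")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match payload length")
    return obj_type, content


payload = b"the same line repeated\n" * 1000
raw = encode_loose("blob", payload)
print(len(payload), "bytes uncompressed,", len(raw), "bytes stored")
print(decode_loose(raw)[0])  # blob
```

Repetitive text compresses well, which is where the disk and network savings listed above come from.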


8.11 The Commit DAG

Git history forms a Directed Acyclic Graph.

Properties:

  • Directed parent relationships

  • No cycles

  • Multiple parents allowed (merge commits)

Example:

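A → B → C   (one branch)

D → E       (another branch)

The arrows here show chronological order; internally, each commit stores a pointer back to its parent(s). Merging the two branches produces a commit with two parents, C and E.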
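The graph can be modelled as a mapping from each commit to its parents. The sketch below reuses the letters from the example and assumes, purely for illustration, that D branches off A and that a merge commit M joins C and E (neither detail is stated in the example):

```python
# commit -> list of parent commits (children point back at parents)
parents = {
    "A": [],
    "B": ["A"],
    "C": ["B"],
    "D": ["A"],        # assumption: the second line of history forks from A
    "E": ["D"],
    "M": ["C", "E"],   # hypothetical merge commit with two parents
}


def ancestors(commit, graph):
    """Return every commit reachable by following parent pointers."""
    seen = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        for parent in graph[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("M", parents)))  # ['A', 'B', 'C', 'D', 'E']
print(sorted(ancestors("C", parents)))  # ['A', 'B']
```

Because pointers only run from child to parent and a commit's id depends on its parents' ids, a cycle cannot be constructed.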


8.12 HEAD, Refs, and Pointers

Git uses reference files:

.git/refs/heads/main

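Each such file contains the hash of the commit the ref points to. Refs may also be consolidated into a single file, .git/packed-refs.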

Key references:

  • HEAD → current branch (or a commit, when detached)

  • Branch refs → commit pointers

  • Tag refs → tagged commits (or tag objects, for annotated tags)

Inspect HEAD:

cat .git/HEAD

8.13 Detached HEAD Internals

When HEAD references a commit directly instead of a branch:

HEAD → commit-hash

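New commits advance HEAD itself; no branch pointer moves.

After you switch to another branch, those commits are no longer reachable from any branch and may eventually be garbage-collected unless you first create a branch or tag for them.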
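The two forms of HEAD can be told apart by reading the file. A sketch against a throwaway directory that imitates the layout of .git; the function name resolve_head is hypothetical, the commit hash is a placeholder, and packed-refs is deliberately not handled:

```python
import tempfile
from pathlib import Path


def resolve_head(git_dir: Path):
    """Return (ref_name, commit_hash); ref_name is None when HEAD is detached."""
    head = (git_dir / "HEAD").read_text().strip()
    if head.startswith("ref: "):
        ref = head[len("ref: "):]
        ref_file = git_dir / ref
        commit = ref_file.read_text().strip() if ref_file.exists() else None
        return ref, commit
    return None, head  # HEAD holds a raw commit hash


fake_commit = "e83c5163316f89bfbde7d9ab23ca2e25604af290"  # placeholder value

with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp)
    (git_dir / "refs" / "heads").mkdir(parents=True)
    (git_dir / "refs" / "heads" / "main").write_text(fake_commit + "\n")

    (git_dir / "HEAD").write_text("ref: refs/heads/main\n")
    attached = resolve_head(git_dir)

    (git_dir / "HEAD").write_text(fake_commit + "\n")
    detached = resolve_head(git_dir)

print(attached)  # ('refs/heads/main', 'e83c5163316f89bfbde7d9ab23ca2e25604af290')
print(detached)  # (None, 'e83c5163316f89bfbde7d9ab23ca2e25604af290')
```

In the attached case a new commit rewrites refs/heads/main; in the detached case only HEAD changes, which is why nothing else keeps those commits reachable.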


8.14 Index (Staging Area) Internals

The index is a binary file that records the snapshot the next commit will write.

Stored in:

.git/index

Functions:

  • Tracks staged content

  • Enables partial commits

  • Accelerates status comparisons

Conceptually:

Working directory → Index → Commit
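A toy model of that flow, assuming the index is just a mapping from path to blob id. The real .git/index is a binary format that also caches file stat data, and every name below is hypothetical:

```python
import hashlib


def blob_id(content: bytes) -> str:
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


working = {"a.txt": b"one\n", "b.txt": b"two\n"}  # working directory
index = {}     # staged snapshot: path -> blob id
commits = []   # each commit freezes a copy of the index


def stage(path):
    index[path] = blob_id(working[path])


def commit():
    commits.append(dict(index))


def unstaged_changes():
    return sorted(p for p, data in working.items() if index.get(p) != blob_id(data))


stage("a.txt")
stage("b.txt")
commit()

working["a.txt"] = b"one changed\n"
working["b.txt"] = b"two changed\n"
stage("a.txt")          # stage only one of the two modified files
commit()                # partial commit: b.txt keeps its old blob

print(unstaged_changes())  # ['b.txt']
```

The second commit records the new a.txt but the old b.txt, because a commit is built from the index rather than from the working directory.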

8.15 Packfiles

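Storing every object as a separate loose file is inefficient at scale.

Git therefore consolidates objects into packfiles, each accompanied by an index (.idx) file for fast lookup.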

Location:

.git/objects/pack/

Advantages:

  • Delta compression

  • Network transfer optimization

  • Storage efficiency

Create packfile:

git gc

8.16 Delta Compression

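Inside packfiles, Git can store an object as a delta against a similar object rather than as a full copy. Loose objects are always stored whole, and the base is chosen heuristically, so it is not necessarily the oldest version.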

Example:

  • Version 1 → base object (stored whole)

  • Version 2 → delta (stored as differences from the base)

Reduces redundancy significantly.
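The idea of a delta can be sketched with copy and insert instructions, which is the general shape of Git's pack deltas. This toy uses Python's difflib to find matching regions and is not Git's actual delta algorithm or encoding; the function names are hypothetical:

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes):
    """Describe target as copies from base plus literal inserts."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))       # offset and length in base
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))   # literal new bytes
        # "delete" needs no instruction: those base bytes are simply skipped
    return ops


def apply_delta(base: bytes, ops) -> bytes:
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


base = b"".join(b"line %d\n" % i for i in range(50))
target = base + b"one more line\n"

delta = make_delta(base, target)
literal_bytes = sum(len(op[1]) for op in delta if op[0] == "insert")
print(len(target), "bytes in full,", literal_bytes, "literal bytes in the delta")
assert apply_delta(base, delta) == target
```

Only the appended text has to be stored literally; the rest is expressed as references into the base object.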


8.17 Garbage Collection

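Git periodically packs loose objects and removes unreachable ones once they are older than a grace period (two weeks by default, controlled by gc.pruneExpire).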

Run manually:

git gc

Operations:

  • Pack loose objects into packfiles

  • Prune unreachable objects that are past the grace period

  • Pack refs and expire old reflog entries


8.18 Reachability Concept

Objects are retained if reachable from:

  • Branches

  • Tags

  • Reflog

Objects that cannot be reached from any of these become candidates for garbage collection.
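Reachability is a graph walk from the roots. A simplified sketch, in which the object names, ref names, and edges are all invented for the example:

```python
# object -> objects it references (commit -> parent and tree, tree -> blobs)
objects = {
    "commit2": ["commit1", "tree2"],
    "commit1": ["tree1"],
    "tree2": ["blob_a", "blob_b"],
    "tree1": ["blob_a"],
    "blob_a": [],
    "blob_b": [],
    "orphan_commit": ["orphan_tree"],   # e.g. left behind by a reset
    "orphan_tree": ["blob_a", "orphan_blob"],
    "orphan_blob": [],
}

# Roots: branch, tag, and reflog entries would all be listed here.
roots = {"refs/heads/main": "commit2", "refs/tags/v1.0": "commit1"}


def reachable(start_ids, graph):
    seen = set()
    stack = list(start_ids)
    while stack:
        obj = stack.pop()
        if obj in seen:
            continue
        seen.add(obj)
        stack.extend(graph[obj])
    return seen


live = reachable(roots.values(), objects)
garbage = sorted(set(objects) - live)
print(garbage)  # ['orphan_blob', 'orphan_commit', 'orphan_tree']
```

Note that blob_a survives even though an orphaned tree references it, because a live tree references it too.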


8.19 Reflog Internals

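The reflog records every change to where a ref (including HEAD) points. It is local to the repository, and entries expire after 90 days by default (30 days for entries that are no longer reachable from the ref).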

Location:

.git/logs/

Allows recovery of:

  • Deleted commits

  • Branch resets

  • Detached states

View reflog:

git reflog
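Each line in a file under .git/logs/ holds the old hash, the new hash, the identity, a timestamp with timezone, then a tab and a message. A parsing sketch, where the sample line and the function name are made up for illustration:

```python
def parse_reflog_line(line: str) -> dict:
    meta, _, message = line.rstrip("\n").partition("\t")
    old, new, rest = meta.split(" ", 2)
    # The identity may contain spaces, so split the last two fields from the right.
    identity, timestamp, tz = rest.rsplit(" ", 2)
    return {
        "old": old,
        "new": new,
        "identity": identity,
        "timestamp": int(timestamp),
        "tz": tz,
        "message": message,
    }


sample = (
    "0000000000000000000000000000000000000000 "
    "e83c5163316f89bfbde7d9ab23ca2e25604af290 "
    "A U Thor <author@example.com> 1700000000 +0000"
    "\tcommit (initial): first commit\n"
)

entry = parse_reflog_line(sample)
print(entry["new"], entry["message"])
```

The old hash on each line is what makes recovery possible: it names the commit the ref pointed to before the move.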

8.20 Plumbing vs Porcelain

Porcelain commands

  • User-friendly

  • Example: commit, push

Plumbing commands

  • Low-level internals

  • Example: hash-object, cat-file

Plumbing commands have stable interfaces, which makes them the better choice for scripting and debugging.


8.21 Writing Objects Manually

Create blob:

git hash-object -w file.txt

Read object:

git cat-file -p <hash>

List tree:

git ls-tree <tree-hash>

These expose Git’s storage layer.


8.22 Data Integrity Guarantees

Git ensures:

  • Tamper detection

  • Snapshot immutability

  • Historical traceability

If an object's content is altered, its recomputed hash no longer matches its name, and the corruption is detected (for example by git fsck).
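The check amounts to rehashing the stored bytes and comparing the result with the object's name. A sketch under the SHA-1 loose-object layout, with hypothetical function names and invented content:

```python
import hashlib
import zlib


def make_loose(content: bytes, obj_type: str = "blob"):
    data = f"{obj_type} {len(content)}".encode("ascii") + b"\0" + content
    return hashlib.sha1(data).hexdigest(), zlib.compress(data)


def verify(object_id: str, raw: bytes) -> bool:
    # Recompute the hash from the stored bytes and compare it with the name.
    return hashlib.sha1(zlib.decompress(raw)).hexdigest() == object_id


object_id, raw = make_loose(b"important data\n")

# Simulate tampering: same length, different bytes.
tampered = zlib.compress(zlib.decompress(raw).replace(b"important", b"malicious"))

print(verify(object_id, raw), verify(object_id, tampered))  # True False
```

Since commits name trees and trees name blobs by hash, verifying one commit id transitively covers the whole snapshot behind it.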


8.23 Performance Implications

Git internals enable:

  • Fast branching

  • Cheap cloning

  • Efficient merging

  • Scalable history

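These properties follow because most operations manipulate small pointers (refs) to immutable objects rather than copying files. Creating a branch, for example, writes a single small ref file.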


8.24 Practical Mental Model

Think of Git as:

Key-value database + DAG + working directory

Where:

  • Key = SHA

  • Value = object content


8.25 Summary

This chapter examined:

  • Content-addressable storage model

  • SHA identity mechanism

  • Blob, tree, commit, and tag objects

  • Object storage structure

  • Commit DAG

  • Index architecture

  • Packfiles and delta compression

  • Reachability and garbage collection

  • Plumbing command ecosystem

Understanding Git internals provides conceptual clarity for advanced workflows, debugging, repository recovery, and performance optimization.
