The Seamy Underside of Git

by Steven J. Owens (unless otherwise attributed)

This is a rough draft of a work in progress. I am learning git, and writing this as I go, to help me organize and understand it. There are probably errors. If you want to point out errors or misconceptions, please do :-). But remember that this is written by a git beginner, and more importantly, written for a git beginner. (Also, like any git tutorial this thing really should have a bunch of pretty graph diagrams.)

Introduction

This tutorial is about how git is put together under the hood, but it is not a practical tutorial. There are lots of fine practical tutorials out there (I recommend some in the appendices at the end) but this tutorial explains git by explaining the underlying mechanisms, and building up the larger git concepts from those underlying mechanisms. Which is pretty much the way git actually works, so I think it makes more sense to explain it that way. I know it makes more sense for me.

I will recommend reading one thing before reading this: Tom Preston-Werner's "The Git Parable". It's a quick and easy-going read, and I think it'll help you get a feel for the underlying logic of git.

The rest of this introduction is a bit of a roadmap of the tutorial, so you can skip right ahead and dive into the gory details, if you like.

First I'm going to talk about where you can find the list of URLs that I'm referring to (and some I don't refer to, but are good reads anyway), then terminology, and where the example code is from.

Then we'll start with a really quick conceptual overview, before we dive into the foundation of git, "The Objectstore and SHA-1 Objectnames".

Then we'll look at the individual "Object Files and Object Types", and see how git uses them as building blocks to implement the larger git concepts, like commits.

Then we'll get into the other scaffolding that git has, like the index, references (aka lightweight tags, aka not-objectstore-based-annotated-tags), symbolic references (aka HEAD), and so on, and so on.

Nuts & Bolts

I'm going to assume you're basically competent with a shell environment and that you can use any of the fine tutorials listed in the appendix "Some Starter Git Tutorials" to get git installed and working, and learn the basics of git commands.

I'll try to make sure I list all reference URLs from the body of this article in the appendix "Links About Git Internals".

Go to the appendix titled "Example Sources" to see how to download the code used in the examples, so you can follow along.

Terminology Inconsistency

There seems to be a lot of inconsistent, overloaded and sometimes ambiguous or downright contradictory use of terminology in the git world. This may be because git has evolved rapidly. It may be because people made some awkward choices about terminology. A good chunk of it is the classic clash between "how" terminology and "what" terminology.

However it happened, I'm going to try to stick to a consistent set of terminology, and since I haven't yet found a canonical definition of git terms, I'm just going to have to follow my own judgement there. Obviously I'll try to stick to common usage, as far as I can tell what that is, but as a rule of thumb I'm going to lean towards using the terms from the internal structure of git.

Conceptual Overview

Unlike CVS or SVN, where "the repo" is the central server and users have to check out copies, in git's world each user's "checkout" is in fact a full copy of all the data and is a fully functioning "repo". Technically speaking, no repo is inherently more important than any other; that's what makes git fully distributed and fully decentralized. Practically speaking, people usually set up some special repo instances to make it easier to coordinate things (the repo instances are just the same as all other repos, it's how people treat and use them that makes them special). See "The Git Server", below.

The files you directly edit with your favorite text editor (emacs if you're sane, vi if you're in league with Satan) are called the "working tree" or "working files" or "working copy" or "working directory", or "working filetree". I'm going to call it the working tree. The working tree is basically everything EXCEPT the .git subdirectory.

Unlike in SVN or CVS, there is only a single .git subdirectory per set of files; it is in the top directory of your working tree. Strictly speaking, the .git subdirectory and contents are the git repo. The .git subdirectory contains the history, and also the index/staging/cache area.

The index in particular seems to have a lot of names (the index, the cache, the directory cache, the current directory cache, the staging area, the staged files), but it's contained in the file .git/index, so I will always refer to it as the index. Linus has reportedly said that if you don't understand the index, you don't understand git, so it's going to be important to understand it.

Note that when I say the index is "contained", the file .git/index contains the metadata for the index, but the actual data, i.e. file contents for the files listed in the index, are in the files under .git/objects/*, aka the objectstore. More on this below.

In many other version control systems "commit" is a verb (i.e. commit the changes), and sometimes people use it as a noun too, i.e. saying "that commit" to refer to a set of changes. People use the word commit the same way in git, but also, in git a commit is an actual thing, a distinct object in the .git object store.

That commit object in turn contains SHA-1 hash values that identify other git objects, which in turn contain values that identify others - a whole tree of objects. Properly speaking you might call that a commit tree, but in common usage people tend to just shorten it to "commit". This, of course, makes the language when somebody refers to a commit ambiguous; you don't know whether "commit" refers to the entire set of changes, to the commit tree in the git object store, or to the commit object at the top of the commit tree. You have to infer from context.

The Git Server

So if every user has their own fully functional repo, is there NO server or centralized piece? Well, yeah, but only kinda.

It's a common convention to set up a "server" repo. This is a repo like any other, it has no special features, other than the fact that it's usually installed on a machine that is always online, so it's always available to all of the users. It has no special authority inherently, only the fact that everybody working together on a project agrees to push changes to it and pull changes from it.

When I said it has no special features, I lied slightly.

First, a server repo is typically a bare repo, which is basically the .git directory without the working tree. See "Bare Repos" below.

Second, a server repo can be just another set of files on the local file system or NFS-mounted drives, or can be accessed over ssh, in which case, yes, it's just a git repo almost identical to the .git directory in your git repo. But there are also additional applications, specialized http servers, to make it easier to manage giving multiple people access to the repo without giving them accounts on the server machine. Gitolite, Gitorious, GitBlit, Gitlab, Github, etc.

My impression is that the special http server approach seems to be the most popular, but the fine folks on freenode #git tell me that using git over ssh is more popular.

Twice now, I've set up a git repo on a server, accessing it via git-over-ssh (without gitolite or anything else). File permissions were tricky and a pain. In those cases it was just myself and another dev, who I trusted, so I just created a new user account that entirely owned the git repo, and we both shared the password for that account. While this is doable, obviously it's not a good idea as a general practice.

Note: What I didn't know at the time is that I could have set up that shared account to only run a special git shell (named, appropriately enough, "git-shell") that only allows users to do git commands, which would make this slightly safer/saner.

Note: The git world often uses remote as short-hand for "remote repo", in part because a "remote" is actually a thing in the git internals, specifically a human-friendly alias or shortcut for the full URL for a remote repo.

Bare Repos

The server repo is usually a bare repo, since nobody's actually working on stuff locally on it.

A lot of tutorials and such get sidetracked talking about "bare" repos early on. I think they want to get it in and get it out of the way early, which is fine, I can understand that impulse. But it tends to overemphasize it. So now I'm overemphasizing it in order to defuse that overemphasis. Such is life.

The point is that the .git subdirectory IS the repo, and is often referred to as "the repo", (though sometimes people are sloppy and use "repo" in a way that seems to refer to both the working tree and the .git). However, unlike CVS or SVN you don't need to have the working tree at all, since you're not actually messing with the files locally. The repo, contained in the .git directory all by itself, is sufficient to push changes in and pull changes out.

The main use for a bare repo is for a) emailing around sets of changes as bare repos and b) setting up a central server git repo where there's no everyday user, so there doesn't need to be a working tree.

A Quick List

There seems to be a pattern, in writing about this, where I repeatedly try to summarize things, and repeatedly realize that I need to summarize my summary. There's a clue to the fundamental nature of git there.

Further down, in "The Implications of SHA-1 objectnames", I mention that part of what makes git so slippery is that where other systems have explicitly defined behaviors, in git much of the substance arises naturally from the elegant combination of simple, core concepts. Until you realize this and stop looking for an explicitly defined structure at the heart of git, you'll keep tripping over your own preconceptions.

As I read the rest of this, I realize it's way too easy to get lost in the details. This is still a work in progress; someday I'll get to the point where I can rewrite it and have a concise introduction. In the meantime, I'm going to give a quick list of the different elements of git. To be clear, these are literally the things that make up git, the things that the git implementation is built from.

Big Concepts
- the .git directory, contains the repo
- the working tree, the files you edit
- the index, it mediates between the objectstore and the working tree.
Underlying Concepts
- .git/objects is the objectstore
  - blob objects, which contain source content
  - tag objects, which point to a commit object
  - commit objects, which point to a tree object
  - tree objects, which point to other trees and to blobs
- .git/index
Meta-meta-data
- .git/refs contains files that define aliases, or nicknames for other objects (mainly branches and commits)
  - .git/refs/tags, aka references
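  - .git/refs/heads contains one file per local branch, each holding the objectname of the most recent commit on that branch.
  - .git/refs/remotes is like .git/refs/heads, except the files record where the branches of a remote repo were the last time you fetched from it (the URL of the remote repo itself lives in .git/config).
- symbolic references are sort of like references, except they contain the name of another reference instead of an objectname.
  - .git/HEAD normally names the branch you currently have checked out, and so resolves to the most recent commit object of that branch.
- .git/branches is a legacy directory, an older way of storing shorthand for remote repo URLs; despite the name, your branches live under .git/refs/heads, not here.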
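
To make the refs part of that list a bit more concrete, here is a minimal sketch of how HEAD can be resolved by hand. It assumes the simple case where references are stored as loose files, and it builds a throwaway directory that mimics a .git directory rather than touching a real repo. The branch name "master" and the helper names are my own picks for illustration, and real git also consults .git/packed-refs, which this sketch ignores.

```python
import os
import tempfile


def resolve_head(git_dir):
    """Return (reference name, objectname) for HEAD, loose refs only.

    Simplified: .git/packed-refs is not consulted.
    """
    with open(os.path.join(git_dir, "HEAD")) as f:
        head = f.read().strip()
    if not head.startswith("ref: "):
        # Detached HEAD: the file holds an objectname directly.
        return None, head
    ref_name = head[len("ref: "):]
    with open(os.path.join(git_dir, *ref_name.split("/"))) as f:
        return ref_name, f.read().strip()


def make_fake_git_dir():
    """Build a temp directory holding just HEAD and one branch file."""
    git_dir = tempfile.mkdtemp()
    os.makedirs(os.path.join(git_dir, "refs", "heads"))
    with open(os.path.join(git_dir, "HEAD"), "w") as f:
        f.write("ref: refs/heads/master\n")  # "master" is an assumed branch name
    with open(os.path.join(git_dir, "refs", "heads", "master"), "w") as f:
        f.write("7a0e1fd009dd09a7e764d97da825911ca043b2b7\n")
    return git_dir


fake_git_dir = make_fake_git_dir()
print(resolve_head(fake_git_dir))
```

The point is only that HEAD is one small text file, the branch is another small text file, and following one to the other gets you an objectname.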

The Typical Git Workflow

I have a section down near the end titled "The Typical Git Workflow". You might find it helpful to skip down there, read that section to get a feel for how git works, then resume here. I originally had that here, but I decided it was getting in the way of getting down to the guts of git.

The Objectstore and SHA-1 Objectnames

Nearly everything in git is identified by an SHA-1 hash value. The SHA-1 hash is 160 bits, but it's displayed as a 40-character hexadecimal string, for example "254de922a6921e1b1745529aa97abff92bb14ef7".

This comes up a lot, because you often see these hex-strings in git command output, or have to type them into git commands as parameters. As it turns out, understanding where and why we're using these SHA-1s is really useful for understanding git, so we're going to talk about them here. And they're so cool, I'm also going to have another whole section, later, about "The Implications of SHA-1 objectnames" in git's design. You'll love it, it's a way of life.

For typing them as parameters, don't worry; in practice 7 characters of the SHA-1 hex string are almost always sufficient to uniquely identify something from the command line, and if it's not, git will tell you.

If you don't know what an SHA-1 hash is, go look it up now, and then come back here. The key point is that the SHA-1 hash is a cryptographically strong checksum; a way of looking at a chunk of data and calculating a number that uniquely identifies that chunk of data (or close enough that you're more likely to get struck by lightning than accidentally come up with the same number for two different chunks of data).

Note: SHA-1 is the name of the hash function, so it's "the SHA-1 hash function", and the value it produces is "the SHA-1 hash value" or just as often "the SHA-1 hash". In the rest of this document I'm just going to shorten that to "the SHA-1" most of the time. That seems to be what people do, most of the time.

.git/objects contains the git objectstore. This is a simple directory/file-based objectstore, i.e. a "content-addressable filesystem".

If you're familiar with the Maildir format, .git/objects is kind of like Maildir, only it uses cryptographic hashes for the filenames, so it's cooler (and more powerful, too).

You can think of .git/objects as a key/value store. In simple terms, it's a directory with a whole bunch of unique file names, where the file name is chosen for you by git. The filename is the key and the contents of the file are the value. The key is the SHA-1 hash that git calculates based on the contents. I said it's a directory full of files, but it's slightly more complicated. To avoid having a single directory with a brazillion files, which would slow down file system access, git groups the files in subdirectories named for the first two characters of the SHA-1 hashes.

I'm going to call this the objectstore, but I'm also going to frequently refer to it as ".git/objects", to emphasize what's really going on at a nuts & bolts level.

Each "object" in the git repo is stored as a file under .git/objects.

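The file name is the SHA-1 hash value of the object's contents, calculated before compression and including the short type/size header that git puts in front of the contents.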

Each file/object under .git/objects is referenced throughout the rest of git by that SHA-1 hash, a 160-bit value written as a 40-character hexadecimal representation.
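
Here is a small sketch of how an objectname can be calculated, as far as I understand it: git hashes a header of the form "type size" followed by a null byte, then the uncompressed contents. The function name is mine. The result for an empty blob is the well-known value e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, which should match what "git hash-object" reports for an empty file.

```python
import hashlib


def git_objectname(obj_type, content):
    """Compute the objectname for `content` (bytes) of the given type."""
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


empty_blob = git_objectname("blob", b"")
print(empty_blob)
print(len(empty_blob), "hex characters =", len(empty_blob) * 4, "bits")
```

Because the type is part of what gets hashed, a blob and a tree with identical bytes would still end up with different objectnames.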

The git world seems to have a variety of names for a SHA-1 value used as a reference this way, but we'll call that an objectname throughout the rest of this document, to avoid confusion. (Git has another thing called a reference, so we can't call the SHA-1 values references.)

To avoid having a single directory with a massive number of files in it, the files are bucketed into subdirectories, each subdirectory named by the first two characters of the filenames it contains.

For example, let's say you had three SHA-1 filenames:

  254de922a6921e1b1745529aa97abff92bb14ef7
  253f8c6f60deda93979470cba44d92c8b06095cd
  7d3e0ed3885686589f0137109f46acc8faf9e9c4

Git would create two subdirectories under .git/objects, named 25 and 7d:

  .git/objects/25/4de922a6921e1b1745529aa97abff92bb14ef7
  .git/objects/25/3f8c6f60deda93979470cba44d92c8b06095cd
  .git/objects/7d/3e0ed3885686589f0137109f46acc8faf9e9c4

Notice that git lops off the subdirectory characters from the file name. Personally I wouldn't have, but then I'm not Linus.
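
The bucketing rule is simple enough to write down in a few lines. This is just a sketch of the naming scheme described above; the function name and the default ".git" location are my own choices.

```python
import posixpath


def loose_object_path(objectname, git_dir=".git"):
    """Map a 40-character objectname to its loose-object file path."""
    if len(objectname) != 40:
        raise ValueError("expected a full 40-character objectname")
    # First two characters name the subdirectory, the other 38 name the file.
    return posixpath.join(git_dir, "objects", objectname[:2], objectname[2:])


for name in (
    "254de922a6921e1b1745529aa97abff92bb14ef7",
    "253f8c6f60deda93979470cba44d92c8b06095cd",
    "7d3e0ed3885686589f0137109f46acc8faf9e9c4",
):
    print(loose_object_path(name))
```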

Note: Like I said above, git is pretty smart about objectnames. In my examples I'm going to copy and paste the entire objectname, but you don't have to. You only need enough characters to uniquely identify the objectname. Usually 7 characters is enough.
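
To illustrate why a short prefix is usually enough, here is a toy version of prefix matching against the three example objectnames from above. This is not how git implements it internally, just the idea: a prefix works as long as exactly one objectname starts with it.

```python
def resolve_prefix(prefix, objectnames):
    """Return the single objectname starting with `prefix`, else raise."""
    matches = [name for name in objectnames if name.startswith(prefix)]
    if not matches:
        raise KeyError(f"no object matches {prefix!r}")
    if len(matches) > 1:
        raise KeyError(f"{prefix!r} is ambiguous: {len(matches)} candidates")
    return matches[0]


names = [
    "254de922a6921e1b1745529aa97abff92bb14ef7",
    "253f8c6f60deda93979470cba44d92c8b06095cd",
    "7d3e0ed3885686589f0137109f46acc8faf9e9c4",
]
print(resolve_prefix("254de92", names))  # unique, so it resolves
try:
    resolve_prefix("25", names)          # two objectnames start with 25
except KeyError as err:
    print(err)
```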

Object Files and Object Types

Each file under .git/objects contains compressed data (for example source code).

The first chunk of the compressed data is type and size in a null-terminated (\0) string.

There are four types of objects: blobs, trees, commits and tags (see Object Types below).

If you think about it, storing the type/size as part of the compressed data sorta doesn't make sense, unless it's really cheap/easy to decompress the first few bytes and get those details. OR, if data is cached somewhere else. As near as I can tell, it's mostly the second case. I'll explain that further down, after I explain the git object types.
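
Here is a sketch of what reading that header looks like, assuming the loose format is plain zlib compression over "type size", a null byte, and then the contents. To keep it self-contained it builds the compressed bytes in memory instead of opening a real file; with a real repo you would read the bytes from a file under .git/objects in binary mode. The function names are mine. The peek_header function shows that, at least with zlib, decompressing only the first few bytes is possible.

```python
import zlib


def make_loose_bytes(obj_type, content):
    """Build the compressed bytes of a loose object (for the demo only)."""
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\0"
    return zlib.compress(header + content)


def peek_header(compressed, max_bytes=64):
    """Decompress at most max_bytes and parse the 'type size' header."""
    start = zlib.decompressobj().decompress(compressed, max_bytes)
    header, _, _ = start.partition(b"\0")
    obj_type, size = header.decode("ascii").split(" ")
    return obj_type, int(size)


def read_object(compressed):
    """Decompress everything and split the header from the contents."""
    raw = zlib.decompress(compressed)
    header, _, content = raw.partition(b"\0")
    obj_type, size = header.decode("ascii").split(" ")
    assert int(size) == len(content)  # the header size counts content bytes
    return obj_type, content


loose = make_loose_bytes("blob", b"from app import db\n")
print(peek_header(loose))
print(read_object(loose))
```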

Objectstore Loose vs. Packed Format

The above describes the objects in loose format, the default. There's also packed.

Loose is a pretty brute-force approach. When a file's contents change, so does the SHA-1, so a new file is created under .git/objects, named with the new SHA-1. The old file just hangs around. Eventually you have dozens of copies of a given file, one for each version that ever existed.

Packed is done as a sort of garbage collection process. In fact there's a git command to do it, named "git gc". The "git gc" command reorganizes the brazillions of objectstore files so that for each working tree file there's a single full copy, the latest one, and then a bunch of deltas (or diffs) for all of the previous versions. Then it sticks them all in one file.

It looks like packed doesn't get used until you get a large repo, so let's skip it for now.

Note: In the appendix "Example Sources", I explain how to unpack a pack file, so you can look at the objectstore entries individually.

Note: Git gc will eventually be run by other git commands, based on the git configuration variable gc.auto (see the git gc man page link at git-scm.com, below). From one comment on this stackoverflow question: "From a quick survey of the source: merge, receive-pack, am, rebase --interactive, and svn call gc --auto directly. That's not a complete list, though, since other commands may call those." http://stackoverflow.com/questions/3532740/do-i-ever-need-to-run-git-gc-on-a-bare-repo/3533073#3533073

Further info at:

http://git-scm.com/docs/git-gc

http://schacon.github.io/gitbook/7_the_packfile.html

https://www.kernel.org/pub/software/scm/git/docs/technical/pack-format.txt

Object Types

Each file (or object) is one of four types: blob, tree, commit or tag.

I'm going to summarize these four types briefly, then provide examples and more details.

Git blob files contain binary data, usually compressed contents of files from the working tree (e.g. source code, etc). Sometimes blobs contain deltas of other blobs (see packed file format).

A tree file under .git/objects contains a list of entries, alphabetical by filename. Each tree entry can be an objectname to another tree file, which is how it builds up the tree structure, or an objectname to a blob file (i.e. a source file).

In "git ls-tree" or "git cat-file -p" output, each tree entry line is:

a mode (6-digit number, first 3 for type, second 3 for perms)
an object type string (tree or blob)
an objectname (SHA-1 value, 40 character hexadecimal)
a name string (the file name or subdirectory name)

Note: The above is the ls-tree output, but if you dump the tree file using the python script example below (see "Using Python to Decompress a commit Object") you'll see that the tree file doesn't actually contain an object type string. The object type is determined from the mode numbers. A mode number beginning with 1 means the SHA-1 hash points to a blob object; a mode number beginning with 4 means the SHA-1 hash points to a tree object.
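
For the curious, here is a sketch of parsing the raw contents of a tree object (the part after the type/size header). As far as I can tell, each entry is the mode as ASCII text, a space, the name, a null byte, and then the SHA-1 as 20 raw bytes rather than 40 hex characters. The demo builds two entries by hand, reusing objectnames from the example repo; the helper names are mine, and the mode-to-type rule only covers the two cases described above.

```python
def parse_tree(body):
    """Yield (mode, type, objectname, name) from a raw tree object body."""
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\0", space)
        mode = body[pos:space].decode("ascii")
        name = body[space + 1:nul].decode("utf-8")
        objectname = body[nul + 1:nul + 21].hex()  # 20 raw bytes -> 40 hex chars
        # Simplification: only the blob and tree cases are distinguished.
        obj_type = "tree" if mode.startswith("4") else "blob"
        yield mode, obj_type, objectname, name
        pos = nul + 21


def raw_entry(mode, name, objectname):
    """Build one raw tree entry (for the demo only)."""
    return f"{mode} {name}".encode("utf-8") + b"\0" + bytes.fromhex(objectname)


body = (
    raw_entry("100644", "README.md", "8c4ab1306beecbd315e88562cebbc685f4ebc0b5")
    + raw_entry("40000", "app", "bbede8653896980e9a155ece1a0c7dc18e215a5f")
)
for entry in parse_tree(body):
    print(*entry)
```

Note that the raw mode for a tree is stored as 40000, without the leading zero that ls-tree prints.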

The most appropriate name for a tree object, really, is "directory", since that's what the tree object is modeling. Like a directory, a git tree object contains a list of entries, each of which describes another file, either a data file or another directory (or in git's case, tree) file. If the tree object is git's equivalent of the unix directory, the SHA-1 is equivalent to the unix inode that points to the file.

However, I can see why calling them "directory" objects might be confusing, so I guess we can't call them git directory objects. The name "tree" is both ambiguous and misleading, but it looks like we're stuck with it. I'll try to consistently use the phrases "tree object" and "commit tree" to distinguish between a git tree object and a full tree-of-trees-and-blobs.

Note: "branch node", as opposed to "leaf node" (which is what the blobs are in a git commit tree), might be more accurate, but of course then we get into conflict with git branch terminology.

Note: One of the gotchas of git is that to get git to notice a subdirectory, you have to put a file in it. According to the git wiki this is because the git index doesn't handle directories, only files. Therefore directories only get included in commit trees by implication - when they contain a file that needs to be included in the commit tree. The wiki FAQ implies this does not have to be this way, but "nobody competent enough to make the change to allow empty directories has cared enough about this situation to remedy it."

https://git.wiki.kernel.org/index.php/GitFaq#Can_I_add_empty_directories.3F

A tag is just a way to define a human-readable alias or shortcut that points to an objectname. A tag object contains:

objectname (SHA-1 value)
an object type string
a tag name string
the tagger name string
a message (which may contain a PGP signature).

Tags "generally" point to a commit, according to Pro Git by Scott Chacon and Ben Straub, but in theory could point to any objectname.

A commit represents the starting point (the root node, in a local sense) of a snapshot of your project at the time of commit. What "snapshot of your project" actually means is a little tricky, so I'm going to give commits a whole section of their own, further down. For now, a commit file in the objectstore contains:

a tree objectname (SHA-1 value)
a parent commit objectname (SHA-1 value)
an author name string with timestamp and timezone
a committer name string with timestamp and timezone
a commit message string

I'll get more deeply into commits, further below.
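
Since a commit object is just a few lines of text, it's easy to pick apart. Here is a minimal sketch that splits commit text into its header fields and its message, assuming the simple layout listed above: header lines, a blank line, then the message. The sample text is the flask-tracking commit used in the examples further down, copied as it appears in this document. Real commits can carry extra headers, some spanning several lines, which this sketch does not handle.

```python
def parse_commit(text):
    """Split commit text into a dict of header fields and the message."""
    head, _, message = text.partition("\n\n")
    fields = {"parent": []}
    for line in head.split("\n"):
        key, _, value = line.partition(" ")
        if key == "parent":
            fields["parent"].append(value)  # a merge commit has several
        else:
            fields[key] = value
    return fields, message.rstrip("\n")


commit_text = (
    "tree 544d9a7b2813d178ac2c4b55c8d0039dd060917a\n"
    "parent 12ef64e47bd1c2e66d744f2d3e7be5a05b58f2b7\n"
    "author Michael Herman  1386106756 -0700\n"
    "committer Michael Herman  1386106756 -0700\n"
    "\n"
    "add screenshot\n"
)
fields, message = parse_commit(commit_text)
print(fields["tree"])
print(fields["parent"])
print(message)
```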

Note: See "The Object Database" section of chapter 7 in the git user's manual for more specifics on types:

https://www.kernel.org/pub/software/scm/git/docs/user-manual.html#git-concepts

Git Commit Message Conventions

The git commit message is a human-readable line of text that describes the commit. There are all sorts of tutorials and discussions about the "right" or "best" way to write a git commit message. Here's one that people seem to like:

http://chris.beams.io/posts/git-commit/

However, not all git commit message text is meant solely for human consumption.

There are at least two formalized messages, call them message flags: "squash!" (often called autosquash because "--autosquash" is the git rebase command line parameter relevant to it) and "fixup!". These message flags can be automatically interpreted by the git rebase command, when you use rebase interactively. For more on this, see "Git Squash And Fixup" further down, (which will also briefly discuss the rebase command).

Commit-ish and Tree-ish Objects

The above are the four object store types, but while we're at it, I should give you a heads up that sooner or later in the git docs you're going to run into "tree-ish" and "commit-ish".

Don't worry about them right now, but there are several git commands that want, for example, a tree parameter, but will also accept a commit or a tag and "do something useful", i.e. parse them and get to the obvious tree associated with them. The git docs refer to these parameters as "tree-ish".

There are similar commands that want a commit parameter but will accept other parameters and find the obvious commit associated with them. The git docs refer to these parameters as "commit-ish".

See:

https://www.kernel.org/pub/software/scm/git/docs/#_identifier_terminology

http://stackoverflow.com/questions/23303549/what-are-commit-ish-and-tree-ish-in-git

http://stackoverflow.com/questions/4044368/what-does-tree-ish-mean-in-git

Who's Keeping Track?

Now that you've seen tree objects and commit objects, it's time to answer the question I asked up in "Object Files and Object Types".

Most of the time, by the time git gets to a file in .git/objects, git already knows what it's looking for. Git follows a commit to the top tree object in a commit tree, and from the tree objects it gets the objectnames for the blobs. By the time git gets to the blob file, it's reading it solely for the purpose of getting the file contents, either to do a diff or to load the file contents into your working tree.

Note: I'll explain how git found the commit objectname further down, when we get into detail on commits (and on branches, and HEAD).

The .git/index file gets involved, too. Git uses the index to generate the commit trees. When you use a command like "git checkout", git loads a commit tree into the working tree - it overwrites the files in your working tree directories, renames them, removes files, etc.

When it does all this, git also loads the index with a set of metadata to match: one entry in index for each file in the working tree. That set of entries is also what git uses to look at your changes, and to generate the new commit tree when you make your next commit. I'll get into this more, further down, in the section on the index.

ObjectStore Type Examples

Use "git log" to list the commits in flask-tracking and pick a commit to look at.

puff@redbitter:~/git/flask-tracking$ git log
[...excess output elided...]
commit 7a0e1fd009dd09a7e764d97da825911ca043b2b7
Author: Michael Herman 
Date:   Tue Dec 3 14:39:16 2013 -0700

    add screenshot

:q
puff@redbitter:~/git/flask-tracking$

Use "git cat-file -p commitobjectname" to print the commit contents and get the objectname for its tree.

puff@redbitter:~/git/flask-tracking$ git cat-file -p 7a0e1fd009dd09a7e764d97da825911ca043b2b7
tree 544d9a7b2813d178ac2c4b55c8d0039dd060917a
parent 12ef64e47bd1c2e66d744f2d3e7be5a05b58f2b7
author Michael Herman  1386106756 -0700
committer Michael Herman  1386106756 -0700

add screenshot
puff@redbitter:~/git/flask-tracking$

Use "git cat-file -p treeobjectname" to list the entries for the commit's tree.

puff@redbitter:~/git/flask-tracking$ git cat-file -p 544d9a7b2813d178ac2c4b55c8d0039dd060917a
100755 blob a8261fa6c20863cca7fb5f24349fafaf018f01d5	.gitignore
100644 blob 8c4ab1306beecbd315e88562cebbc685f4ebc0b5	README.md
100755 blob c4c60f291c6e799a7058d36b40cbcfd31fbc06c9	app.db
040000 tree bbede8653896980e9a155ece1a0c7dc18e215a5f	app
100755 blob 8a49c0df044da752a8805ec651f240ccb7732431	config.py
040000 tree b8b549203bb081cbeff6edac3fb2865080425da3	docs
100755 blob f50bb713e7e2283b3d9417c9008a42af7a7648f9	requirements.txt
100755 blob 956db43fbcfbc2fc97be2a7573e1739ec8bde4b3	run.py
040000 tree d306aead76ba97a67caaf9c29ac709f0364fe966	screenshots
100755 blob 75d27856ace50f07262a8e618e5c5c29cb4e02aa	shell.py
100644 blob 056117365db6859008fe4079d7d39d33f06fb52b	test.db
puff@redbitter:~/git/flask-tracking$

Pick a tree entry that points to a subtree, and use "git cat-file -p secondtreeobjectname" to list the entries for the subtree.

puff@redbitter:~/git/flask-tracking$ git cat-file -p  bbede8653896980e9a155ece1a0c7dc18e215a5f
100755 blob 49af881397b84c5eef2fa029beec99d35ae4f2df	__init__.py
100644 blob 9915a653d45ec2be3833e9d5c747762776b82c1b	bases.py
100755 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391	constants.py
100644 blob a4adce1c7f2fc526b367d7fb14431bbccaf284be	mixins.py
040000 tree 6b9da9e3995a9ecacb56526d04c3b7f10a9de87a	templates
040000 tree 1c9a97bb12cde881efbdc246fb9cdc61b38f00e8	tracking
040000 tree 1fa8efa7534c6493e77b95d1a59dbaeb5fc8cf1c	users
puff@redbitter:~/git/flask-tracking$

Pick a blob entry in the second tree and use "git cat-file -p blobobjectname" to list the source code in that blob.

puff@redbitter:~/git/flask-tracking$ git cat-file -p  a4adce1c7f2fc526b367d7fb14431bbccaf284be
from app import db
  
class CRUDMixin(object):
    __table_args__ = {'extend_existing': True}
  
    id = db.Column(db.Integer, primary_key=True)
  
    @classmethod
    def get_by_id(cls, id):
        if any(
            (isinstance(id, basestring) and id.isdigit(),
             isinstance(id, (int, float))),
        ):
            return cls.query.get(int(id))
        return None
  
    @classmethod
    def create(cls, **kwargs):
        instance = cls(**kwargs)
        return instance.save()
  
    def update(self, commit=True, **kwargs):
        for attr, value in kwargs.iteritems():
            setattr(self, attr, value)
        return commit and self.save() or self
  
    def save(self, commit=True):
        db.session.add(self)
        if commit:
            db.session.commit()
        return self
  
    def delete(self, commit=True):
        db.session.delete(self)
        return commit and db.session.commit()
puff@redbitter:~/git/flask-tracking$

The Typical Git Workflow, Revisited

Now that you know a bit more about the underlying objectstore and object types, skip back down to "The Typical Git Workflow" and take a look at "Take Two".

Commits

As I said above, the commit is the starting point for a snapshot of your project at some point in time (the point where you ran the commit command). The commit object contains an objectname for a tree object, that tree object then contains a list of more tree objectnames (or blob objectnames) and so on, until you've assembled the whole tree graph of objects that make up the snapshot.
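
Here is a toy model of that walk, to show how little machinery it takes. The "objectstore" is just a Python dict and the objects are plain Python values, so the hashing here is NOT git's real format; it only mimics the idea that every object is stored under a SHA-1 of its contents and refers to other objects by objectname. All names and file contents are made up for illustration.

```python
import hashlib

store = {}  # objectname -> (type, payload); a stand-in for .git/objects


def put(obj_type, payload):
    """Store an object under a SHA-1 of its type and payload (toy format)."""
    name = hashlib.sha1(f"{obj_type}:{payload!r}".encode("utf-8")).hexdigest()
    store[name] = (obj_type, payload)
    return name


def walk(tree_name, prefix=""):
    """Yield (path, blob objectname) for every file under a tree object."""
    _, entries = store[tree_name]
    for entry_type, name, objectname in entries:
        if entry_type == "tree":
            yield from walk(objectname, prefix + name + "/")
        else:
            yield prefix + name, objectname


readme = put("blob", "hello")
mixins = put("blob", "class CRUDMixin: ...")
app = put("tree", (("blob", "mixins.py", mixins),))
root = put("tree", (("blob", "README.md", readme), ("tree", "app", app)))
commit = put("commit", {"tree": root, "parent": None})

commit_type, commit_fields = store[commit]
snapshot = dict(walk(commit_fields["tree"]))
print(sorted(snapshot))
```

Starting from nothing but the commit's objectname, the walk recovers every path in the snapshot and the objectname of the blob holding each file's contents.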

Note: The various books, tutorials and articles I've read seem about 50/50 in whether they say a commit is (or represents) a "set of changes" or a "snapshot". The second term, "snapshot" is more accurate in terms of what git actually does, but the vast majority of the time, that snapshot is used to calculate a change set, since most of the time what we care about is "what changed". People get sloppy in their language and start saying a commit represents a set of changes.

Now take a closer look at the commit from the example above:

puff@redbitter:~/git/flask-tracking$ git cat-file -p 7a0e1fd009dd09a7e764d97da825911ca043b2b7
tree 544d9a7b2813d178ac2c4b55c8d0039dd060917a
parent 12ef64e47bd1c2e66d744f2d3e7be5a05b58f2b7
author Michael Herman  1386106756 -0700
committer Michael Herman  1386106756 -0700

add screenshot
puff@redbitter:~/git/flask-tracking$

Note that author and committer are two separate fields. Sometimes on a project (often, on open source projects) there are a lot of people who develop and contribute code changes (the author), but a smaller group of people responsible for vetting the changes and committing them to the repository (the committer).

The timestamp on the author and committer lines is in unix time format (i.e. seconds since the first second of 1970, UTC). The time zone is in RFC 2822 format, basically an offset from UTC written as a sign followed by hours and minutes; -0400 means that if it's UTC 8am, local time is 4am. Of course git log and similar tools parse the timestamp and timezone for you and produce human-readable datetime values.
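
If you want to do that conversion yourself, here is a short sketch using the timestamp and offset from the example commit. The function name is mine. The result agrees with the "Tue Dec 3 14:39:16 2013 -0700" that git log showed for this commit.

```python
from datetime import datetime, timedelta, timezone


def parse_git_time(timestamp, offset):
    """Convert e.g. ('1386106756', '-0700') to a timezone-aware datetime."""
    sign = -1 if offset.startswith("-") else 1
    hours, minutes = int(offset[1:3]), int(offset[3:5])
    tz = timezone(sign * timedelta(hours=hours, minutes=minutes))
    return datetime.fromtimestamp(int(timestamp), tz)


when = parse_git_time("1386106756", "-0700")
print(when.isoformat())  # prints 2013-12-03T14:39:16-07:00
```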

See:

https://github.com/git/git/blob/master/Documentation/date-formats.txt

http://www.rfc-base.org/txt/rfc-2822.txt

The commit's tree objectname points to the root tree object of the commit.

The idea is that each commit identifies (or rather in git lingo actually each commit is) a tree of source files and subdirectories, a snapshot of your project for that commit.

Rather than have a real duplicate set of directories and filenames for each commit, the content of each file in your project is stored as a blob in the objectstore, and the commit uses a structure built out of tree objects and blob objectnames to represent the hierarchy of files.

Each commit has an objectname pointing to its "parent" commit, the commit that was the latest commit before you started changing files. (Except for the very first commit which, of course, has no parent.) The "parent" is automatically set to the objectname of whatever the last commit was when this commit was created.

Note: Strictly speaking, the parent is set to the objectname that the symbolic reference HEAD resolves to at the time of the commit command. We haven't gotten to references yet, let alone symbolic references, so for now just remember that HEAD always resolves to the last commit that was created in whatever branch you're currently in.

Note: Also strictly speaking, a commit can have more than one parent objectname. Most of the time it's one parent, but in a case where changes are merged in from elsewhere, there can be multiple parents.
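
Since a commit object is just a few header lines, a blank line, and a message, it's easy to pick apart by hand. The following Python 3 sketch is my own illustration (parse_commit is a made-up name), and it assumes the simple case shown above: every header fits on one line. Real commits can carry extra multi-line headers that this doesn't handle. It collects parent lines into a list, because a commit can have zero, one, or several of them.

```python
def parse_commit(text):
    """Split the text of a commit object into header fields and message."""
    header, _, message = text.partition("\n\n")
    commit = {"parents": [], "message": message}
    for line in header.split("\n"):
        key, _, value = line.partition(" ")
        if key == "parent":
            commit["parents"].append(value)  # merges have several of these
        else:
            commit[key] = value
    return commit


# The commit from the example above, as git cat-file -p prints it
# (the author lines are copied as they appear in the listing).
sample = (
    "tree 544d9a7b2813d178ac2c4b55c8d0039dd060917a\n"
    "parent 12ef64e47bd1c2e66d744f2d3e7be5a05b58f2b7\n"
    "author Michael Herman  1386106756 -0700\n"
    "committer Michael Herman  1386106756 -0700\n"
    "\n"
    "add screenshot\n"
)
commit = parse_commit(sample)
print(commit["tree"])
print(commit["parents"])
```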

Live Fast, Die Young, Leave a Good-Looking Commit Log

Or: Git Rebase, Squash and Fixup

Once I got past the basic nuts and bolts tutorials for git, I found out what the real git aficionados talk about - philosophies of git workflow, how to use git branching tactically and strategically, and how you keep your git commit log pretty.

At first I was a bit frustrated (well, I still am) and I thought "dammit, when I got to this point with CVS and subversion I was ready to get back to developing!" Then I remembered that of course CVS and subversion are simpler, because they use a simpler, centralized model.

A big player in these git philosophy debates is the git rebase command. I'm not going to get into the details of rebase here (mostly because I'm not confident I understand it well enough yet) but here's a quick overview. Rebase is named so because the "base" is the parent commit that the branch was started from. Rebase changes which commit that is: it re-creates the branch's commits on top of a different base commit, leaving the old base commit itself untouched.

Let's put this situation in terms of using subversion or CVS. Let's take an example where you fall off the "update early, update often" wagon:

You get a fresh checkout and you start working on your new feature. For whatever reason, you end up taking a while before you can update and commit your new feature. By the time you're finished, your checkout is so far behind the latest and greatest that actually updating, merging, and committing are going to be a huge pain in the ass.

But you know that your new feature code doesn't actually modify any of the other code behavior. So instead, you create another checkout, a fresh checkout of the latest and greatest code, and carefully copy changes over from your other checkout, and then commit them from the fresh checkout.

Now compare this situation to git. In git, you start a new branch to work on your code. Time goes by while you're working on it, and you're ready to merge your branch back in. But other work has been committed to the main branch, and your changes aren't really dependent on the earlier version you originally branched from.

So why can't you just update your branch to the latest version, and commit your changes in as if you just started the branch today?

With rebase, you can do just that.

And along the way, rebase can use commit message flags ("squash!" and "fixup!", described below) to clean up the commit log so it's easier to understand.

Like I said above, there is much discussion about when to rebase, why to rebase, how to rebase, etc, and I certainly can't claim to fully understand it yet. One easier-to-understand case that Scott Chacon talks about in Pro Git is where you want to send some changes to a branch you don't have access to. This lets you do all the merging work up front and bundle it up to send to the maintainer, who can then just apply it all.

Most of the discussions seem to center on using rebase on a private branch (sometimes called a topic branch or a feature branch) to clean up a bunch of tactical or incremental commits (once again, the nomenclature in the git world is anything but unified). These are commits that you did while working on the new feature. Call them "undo-level commits". Now you're ready to push all of those commits out to the central repo, but you don't want everybody and their brother seeing all of your dirty laundry, so you want to clean up the commit log, remove your hesitations and gyrations, and make it all neat and orderly.

The commit message flags "squash!" and "fixup!" make this cleanup task easier. From the git rebase man page, --autosquash section: http://jk.gs/git-rebase.html

--autosquash
--no-autosquash
 
    When the commit log message begins with "squash! ..." (or "fixup!
    ..."), and there is a commit whose title begins with the same ...,
    automatically modify the todo list of rebase -i so that the commit
    marked for squashing comes right after the commit to be modified,
    and change the action of the moved commit from pick to squash (or
    fixup). Ignores subsequent "fixup! " or "squash! " after the
    first, in case you referred to an earlier fixup/squash with git
    commit --fixup/--squash.
 
    This option is only valid when the --interactive option is used.
 
    If the --autosquash option is enabled by default using the
    configuration variable rebase.autosquash, this option can be used
    to override and disable this setting.

Also see: https://technosorcery.net/blog/2010/02/07/fun-with-the-upcoming-1-7-release-of-git-rebase---interactive---autosquash/ https://technosorcery.net/blog/2012/08/05/updated-git-rebase-interactive-autosquash-commit-preparation/

If I'm understanding this right (always a risky proposition), fixup is for trivial cleanup commits after you do your main commit, things like fixing stupid typos, or commenting out/removing that debug line you forgot. You use a message flag like this:

fixup! SomePreviousCommitMessageText

This tells git's rebase command (when you or somebody else uses it later, in interactive mode) to group these commits together with that commit, and offer the option of automatically combining them. When it does the combining, the original commit message is left as it is, unlike the "squash!" message flag.

A "squash! SomePreviousCommitMessageText" message flag, on the other hand, both combines the commits and gives you the chance to write a new commit message.

And as you can see from the man pages and the second of the two technosorcery.net links above, git also provides command line options to let you do the squash/fixup programmatically. For example, if the HEAD commit - the latest commit - had a commit message of:

Everything's squared away, yessir, squaaaaaared away.

You can enter this command:

git commit --squash HEAD

And git will figure out what the HEAD commit message is, ("Everything's squared away, yessir, squaaaaaared away.") and automatically set this new commit's message to:

squash! Everything's squared away, yessir, squaaaaaared away.
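
To get a feel for what --autosquash does with those message flags, here is a toy Python 3 model of the reordering described in the man page excerpt above. This is only my sketch of the idea, not git's actual code: the function name autosquash_todo and the short commit names (a1, b2, ...) are made up, and real git is smarter about matching (it can also match on commit hashes, for instance).

```python
def autosquash_todo(commits):
    """Build a rebase todo list from (name, subject) pairs, oldest first.

    A commit whose subject is "fixup! X" or "squash! X" is moved to sit
    right after the earlier commit whose subject begins with X, and its
    action changes from pick to fixup or squash.
    """
    groups = []  # each group: a picked commit followed by its fixups/squashes
    for name, subject in commits:
        action, _, target = subject.partition("! ")
        if action in ("fixup", "squash"):
            match = next((g for g in groups if g[0][2].startswith(target)), None)
            if match is not None:
                match.append((action, name, subject))
                continue
        groups.append([("pick", name, subject)])
    return [" ".join(line) for group in groups for line in group]


commits = [
    ("a1", "Add login form"),
    ("b2", "Add logout link"),
    ("c3", "fixup! Add login form"),
    ("d4", "squash! Add logout link"),
]
for line in autosquash_todo(commits):
    print(line)
# pick a1 Add login form
# fixup c3 fixup! Add login form
# pick b2 Add logout link
# squash d4 squash! Add logout link
```

A flagged commit with no matching earlier commit is simply left as a pick, in its original position.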

The Index

There are often times you have a whole bunch of changes to commit, but really it's several different subsets of changes, each of which you'd like to commit as one atomic commit.

To do this under SVN you had to explicitly list each filename in your commit command, or use a GUI that let you pick out a set of changes.

With git, you have the index. You can go through your working tree, looking at each file that has been changed, and one by one add them to the index with "git add". When you have all of one set of changes added to the index, you commit them.

Note: The index is also sometimes called the staging area or cache. However, the file that actually contains it is named .git/index, so we're always going to call it the index.

In the section on "Commits" above, I said that the commit tree lists a snapshot of all of the files in your working tree (that git knows about) at the time of commit.

Which sorta brings up the question of what defines "what git knows about".

Which in turn means "all of the files you told git about with 'git add'".

Which in turn means "all of the files listed in .git/index".

When you check out a branch, git rearranges your working tree to contain all the files in the latest version of that branch. Git also loads a list of all of those files into .git/index. Git also preloads the .git/index with data from the stat system call (see "man 2 stat"), which helps git be a little faster and more efficient at keeping track of what's changed.

But the main purpose of the index is as a manifest of all of the stuff that's going to go into a commit tree. When you finish working on a feature, you use some git commands like "git add" to update this manifest. When it's ready, you use "git commit" to build the commit and the commit tree.

Inside .git/index

The file .git/index is a binary file that contains a list (alphabetical by pathname) of:

mode
objectname (40 character hexadecimal SHA-1)
stage number
pathname

Stage number is usually 0, unless it's being used for merge/conflict purposes, when the index has to keep track of multiple SHA-1 values for a single pathname. Then it's 1, 2 or 3. See

http://alblue.bandlem.com/2011/10/git-tip-of-week-index-revisited.html

puff@redbitter:~/git/flask-tracking$ git ls-files --stage
100755 a8261fa6c20863cca7fb5f24349fafaf018f01d5 0	.gitignore
100644 46179834849a6710a679df7ad117e4c5b525bf7d 0	README.md
100755 c4c60f291c6e799a7058d36b40cbcfd31fbc06c9 0	app.db
100755 49af881397b84c5eef2fa029beec99d35ae4f2df 0	app/__init__.py
100644 9915a653d45ec2be3833e9d5c747762776b82c1b 0	app/bases.py
100755 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0	app/constants.py
100644 a4adce1c7f2fc526b367d7fb14431bbccaf284be 0	app/mixins.py
100644 af5d921f6c9ef88ec27cef13384f65f075c4d2ea 0	app/templates/400.html
[...]
100644 056117365db6859008fe4079d7d39d33f06fb52b 0	test.db
puff@redbitter:~/git/flask-tracking$
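
Each line of that output is the mode, objectname and stage number separated by spaces, then a tab, then the pathname. Here's a small Python 3 sketch that splits a line back into those four fields. IndexEntry and parse_stage_line are names I made up for the illustration, and it assumes plain pathnames (git quotes unusual characters in paths unless you ask it not to).

```python
from collections import namedtuple

IndexEntry = namedtuple("IndexEntry", "mode objectname stage path")


def parse_stage_line(line):
    """Parse one line of "git ls-files --stage" output."""
    meta, _, path = line.rstrip("\n").partition("\t")
    mode, objectname, stage = meta.split(" ")
    return IndexEntry(mode, objectname, int(stage), path)


entry = parse_stage_line(
    "100644 46179834849a6710a679df7ad117e4c5b525bf7d 0\tREADME.md"
)
print(entry.path, entry.stage)  # README.md 0
```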

This is a nicely detailed description of the actual .git/index file format:

https://github.com/git/git/blob/master/Documentation/technical/index-format.txt
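
As a first taste of that format: as I read the document above, the file starts with a 12-byte header made of the signature "DIRC", a version number, and the number of entries, with the numbers stored as big-endian 32-bit integers. Here's a Python 3 sketch that reads just that header. read_index_header is my own name, and the sample bytes are synthetic so the sketch runs anywhere.

```python
import struct


def read_index_header(data):
    """Return (version, entry_count) from the first 12 bytes of an index file."""
    # "!" selects network (big-endian) byte order.
    signature, version, count = struct.unpack("!4sII", data[:12])
    if signature != b"DIRC":
        raise ValueError("not a git index file")
    return version, count


# Synthetic header for illustration: version 2, 10 entries.
sample = b"DIRC" + struct.pack("!II", 2, 10)
print(read_index_header(sample))  # (2, 10)
```

To try it on a real repo, pass it the bytes of the index file, e.g. read_index_header(open(".git/index", "rb").read()).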

A lot of the fields in the index entries are cached values from stat (see "man 2 stat"), including:

last time a file's metadata changed
last time a file's data changed
device
inode
mode
unix permissions
uid
gid
file size

This saves git having to call stat on every file in your working directory all the time.

This stackoverflow answer applies the index-format.txt info in a bit more of a human readable fashion.

http://stackoverflow.com/questions/4084921/what-does-the-git-index-contain-exactly/25806452#25806452

Note: assume_unchanged is a flag you can manually set with git update-index. It's a speed optimization for certain systems with really slow implementations of the stat function. It lets you exclude files from being checked for modifications until you manually unset it. See: http://git-scm.com/docs/git-update-index

This looks like it has more discussion, it's probably a local copy of a man page from somewhere else, I'll try to get back here and change this link to a more authoritative source: https://www.mankier.com/1/git-update-index

References (lightweight tags) and Symbolic References (HEAD)

References do something very similar to tags, i.e. map a human-friendly name to an objectname. Symbolic references add an extra level of indirection, a reference to a reference, and are basically only used to implement HEAD (see below).

Let's talk about references first:

References are files under .git/refs. There are three subdirectories of .git/refs:

.git/refs/heads
.git/refs/remotes
.git/refs/tags

Each reference is a file in one of these directories that contains an objectname (a SHA-1 hash). The filename defines the human-friendly name string. The objectname refers to whatever git object the reference is supposed to point to.

In a new .git, you'll find one reference created by default:

.git/refs/heads/master

puff@redbitter:~/git/flask-tracking$ cat .git/refs/heads/master 
e792016decbd6d548ae43a5d300c9e6ca5a425ee
puff@redbitter:~/git/flask-tracking$

As near as I can tell, those three types of references (heads, remotes, tags) are hard-coded into git. You can certainly create a reference by hand (the git world prefers you use "git update-ref" to create them), but other than putting it in the heads/remotes/tags directories, I don't know if there's anything you can do with a reference.

Some documents, including Pro Git, call references "tags"; they call the tag object in the objectstore an annotated tag, and the .git/refs based reference a lightweight tag. Pro Git says it's generally recommended to use annotated tags, not references/lightweight tags.

Unlike annotated tags, references:

are stored differently, not in the objectstore
don't have an objectname themselves (though references map to an objectname)
don't have a tagger name string
don't have a message string

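The "git tag" command creates tag objects if invoked with -a, -s or -u, otherwise "git tag" creates a reference in .git/refs/tags. "man git-tag" says that "lightweight tags are meant for private or temporary object labels". I suspected that also meant that tags in the objectstore are synchronized by git push/pull/clone/etc while reference tags are not, but as far as I can tell it isn't that simple: clone and fetch copy both kinds of tag, and push sends neither unless you ask it to (for example with --tags, or with --follow-tags, which only sends annotated tags).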

Now let's look at symbolic references. A symbolic reference is a meta-reference, a reference to a reference; instead of containing an objectname, the file contains the file path to another reference. The path starts with the "refs" directory. Git resolves the symbolic reference to the reference, and then resolves the reference to the objectname.

For example:

puff@redbitter:~/git/flask-tracking$ cat .git/HEAD
ref: refs/heads/master
puff@redbitter:~/git/flask-tracking$ cat .git/refs/heads/master
e792016decbd6d548ae43a5d300c9e6ca5a425ee
puff@redbitter:~/git/flask-tracking$
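
The two-step lookup in that example can be sketched in a few lines of Python 3. This is my own illustration, not how git itself is written: resolve_ref is a made-up name, and it assumes every reference is a plain file under the .git directory, like the ones shown above. (Real repos can also keep references packed into a single file, which this sketch ignores.)

```python
import os


def resolve_ref(git_dir, name="HEAD", max_depth=5):
    """Follow "ref: ..." lines until a file holds an objectname."""
    for _ in range(max_depth):
        with open(os.path.join(git_dir, name)) as f:
            content = f.read().strip()
        if not content.startswith("ref: "):
            return content  # a plain reference: this is the objectname
        name = content[len("ref: "):]  # a symbolic reference: follow it
    raise ValueError("too many levels of symbolic reference")
```

Calling resolve_ref(".git") from the top of a working tree reads .git/HEAD, sees "ref: refs/heads/master", then reads .git/refs/heads/master and returns the objectname stored there.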

According to a comp.version-control.git post by the lead git committer, Junio C. Hamano, git uses symbolic references in only two places:

  .git/HEAD 
  .git/refs/remotes/someremotename/HEAD

(http://comments.gmane.org/gmane.comp.version-control.git/166765)

People have experimented with symbolic references (see the comp.version-control.git thread above) but those are not valid uses of symbolic references, and how git will handle them is unpredictable.

http://git-scm.com/docs/git-symbolic-ref says:

"In the past, .git/HEAD was a symbolic link pointing at refs/heads/master. When we wanted to switch to another branch, we did ln -sf refs/heads/newbranch .git/HEAD, and when we wanted to find out which branch we are on, we did readlink .git/HEAD. But symbolic links are not entirely portable, so they are now deprecated and symbolic refs (as described above) are used by default."

HEAD

The symbolic reference HEAD is contained in the file named .git/HEAD, and the contents of that file look something like:

ref: refs/heads/master

In this case, master is the current branchname. You'll see "master" often in the git world, because it's the default name of the default branch you're in when you first create a git repo. A helpful guy on IRC freenode grepped the git source for laughs, and found that the string "master" only has two interesting occurrences in the git source code; when you clone a git repo and when you init a git repo.

The value of refs/heads/master is stored in the file named .git/refs/heads/master, and is an objectname (SHA-1 value) that points to the last commit that was created for that branch.

To recap: you can create your own references and symbolic references, but every branch always has a reference (with the same name as the branch) that points at the latest commit in the branch. That same-named reference is referred to as the head reference. There is always a repo-wide symbolic reference named "HEAD" that points to the currently selected branch's head reference.

Each commit has pointers to previous commits ("parent" commits; the very first commit doesn't have a parent).

The HEAD Reference

Every branch has a name (the default name for the first branch is "master").
Every branch has a reference with the same name as the branch.
The branch name reference points to the latest commit in that branch.
The branch name reference is called the head reference.
The entire repo has a symbolic reference "HEAD".
The HEAD symbolic reference points to the head reference for the currently selected branch.

HEAD^ is the parent of HEAD;
HEAD^^ is the grandparent of HEAD;
HEAD~3 is the great grandparent of HEAD
HEAD~4 is 4 generations up, the great-great grandparent of HEAD
etc...

Merge commits may have more than one parent, so add a number to indicate which parent, i.e.:

HEAD^1 for the first parent
HEAD^2 for the second parent
etc.
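
Here's a toy Python 3 model of how those suffixes walk the commit graph. It is only a sketch under my own assumptions: the graph is a plain dictionary mapping each commit to its list of parents, the single-letter commit names are invented, and the function handles nothing beyond the ^ and ~ suffixes described above.

```python
import re


def resolve(rev, refs, parents):
    """Resolve names like "HEAD~2" or "HEAD^2" on a toy commit graph.

    refs maps a name to a commit; parents maps a commit to its parent list.
    """
    name, suffix = re.match(r"([^~^]+)(.*)$", rev).groups()
    commit = refs[name]
    for op, num in re.findall(r"([~^])(\d*)", suffix):
        n = int(num) if num else 1
        if op == "~":
            for _ in range(n):  # n steps back, always via the first parent
                commit = parents[commit][0]
        elif n != 0:  # ^0 means the commit itself
            commit = parents[commit][n - 1]  # one step, to the n-th parent
    return commit


# M is a merge commit with parents C and X; both descend from B, then A.
parents = {"M": ["C", "X"], "C": ["B"], "X": ["B"], "B": ["A"], "A": []}
refs = {"HEAD": "M"}
print(resolve("HEAD^", refs, parents))   # C
print(resolve("HEAD^2", refs, parents))  # X
print(resolve("HEAD~3", refs, parents))  # A
```

Note the difference the toy makes visible: ~3 means three steps back along first parents, while ^3 would mean one step back to the third parent.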

The Typical Git Workflow, Re-Revisited

Once again, we've gone a little further into understanding git object types, so skip down to "The Typical Git Workflow", below, and read "Take Three", then come back here and resume.

Using Python to Decompress a commit Object

Based on an example by thegitguys.com:

First use "git log" to find a commit objectname, in this case 86f274a5f3fd2220e11544a3004a68b3cf44f57f.

Then cd into .git/objects and into the 86 subdirectory, then run python and use python's zlib to decompress the contents of the file named f274a5f3fd2220e11544a3004a68b3cf44f57f.

puff@redbitter:~/git/flask-tracking$ cd .git/objects/86
puff@redbitter:~/git/flask-tracking/.git/objects/86$ python
Python 2.7.6 (default, Mar 22 2014, 22:59:56) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> fb = open("f274a5f3fd2220e11544a3004a68b3cf44f57f", "rb")
>>> line = fb.read()
>>> import zlib
>>> zlib.decompress(line)
'commit 424\x00tree 16c7995218fea00bea97e60de16c8bf0eccadf3f\nparent 245d42b5dc9ae2a11860e644129c79e041567920\nauthor John Q Public  1409497211 -0400\ncommitter John Q Public  1409497211 -0400\n\nAdded some extra typos.\n'
>>>

Here's the same exercise again with a different commit, this time without leaving the top of the working tree. First use "git log" to find a commit objectname, in this case "e2e74a6bf7fbe028ccebada48efc5fc18b360aab".

Remember, git buckets the files into directories, so we want the directory ".git/objects/e2" and the file "e74a6bf7fbe028ccebada48efc5fc18b360aab".

Then run python and use python's zlib to decompress the contents of the file:

puff@redbitter:~/git/flask-tracking$ python
Python 2.7.6 (default, Mar 22 2014, 22:59:56) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> fb = open(".git/objects/e2/e74a6bf7fbe028ccebada48efc5fc18b360aab", "rb")
>>> line = fb.read()
>>> import zlib
>>> zlib.decompress(line)
'commit 185\x00tree 70f08b6a9b77c502a3fe381bba0250b2dffd12c1\nauthor Michael Herman  1376873215 -0700\ncommitter Michael Herman  1376873215 -0700\n\nInitial commit\n'

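You can see that the decompressed file starts with a short header (the object type, the size of the content in bytes, and a NUL byte, shown as \x00) followed by simple text that uses newlines to separate the elements. Straightforward enough. Now let's look at the tree for objectname "70f08b6a9b77c502a3fe381bba0250b2dffd12c1".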

>>> fb = open(".git/objects/70/f08b6a9b77c502a3fe381bba0250b2dffd12c1", "rb")
>>> line = fb.read()
>>> zlib.decompress(line)
'tree 37\x00100644 README.md\x00[\x7fN\xee\xf9\xed\xe2\xa6r\x06\x04$r\x07\x95\x90\x1eO\xca\xa8'

The entries in a tree object store each objectname as 20 raw bytes rather than as 40 hexadecimal characters, which is why the end of that output looks like line noise. From a shell, "git cat-file -p" prints the same tree readably:

puff@redbitter:~/git/flask-tracking/.git/objects/70$ git cat-file -p 70f08b6a9b77c502a3fe381bba0250b2dffd12c1
100644 blob 5b7f4eeef9ede2a672060424720795901e4fcaa8	README.md

The blob for README.md can be opened the same way. Remember to split the objectname into the two-character directory and the 38-character filename, and watch out for stray characters: the path is ".git/objects/5b/7f4eeef9ede2a672060424720795901e4fcaa8". (My first attempts failed with "IOError: [Errno 2] No such file or directory" because I left out the subdirectory and picked up a stray trailing "x" on the objectname.)
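
Putting the pieces of this section together, here's a Python 3 sketch (the sessions above are Python 2) that splits a decompressed object into its header and body, and then walks the entries of a tree body. The function names are mine, and to keep it runnable anywhere it rebuilds the one-entry tree from above in memory instead of reading from .git/objects.

```python
import zlib


def parse_object(compressed):
    """Return (type, body) for the raw bytes of an object file."""
    raw = zlib.decompress(compressed)
    header, _, body = raw.partition(b"\x00")
    kind, size = header.split(b" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match content")
    return kind.decode(), body


def parse_tree(body):
    """Return a list of (mode, name, objectname) for a tree body."""
    entries = []
    pos = 0
    while pos != len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\x00", space)
        mode = body[pos:space].decode()
        name = body[space + 1:nul].decode()
        objectname = body[nul + 1:nul + 21].hex()  # 20 raw bytes as hex
        entries.append((mode, name, objectname))
        pos = nul + 21
    return entries


# The tree shown above, rebuilt by hand: one entry for README.md.
body = b"100644 README.md\x00" + bytes.fromhex(
    "5b7f4eeef9ede2a672060424720795901e4fcaa8"
)
compressed = zlib.compress(b"tree 37\x00" + body)
kind, parsed = parse_object(compressed)
print(kind, parse_tree(parsed))
# tree [('100644', 'README.md', '5b7f4eeef9ede2a672060424720795901e4fcaa8')]
```

On a real repo you would pass parse_object the bytes read from a file under .git/objects instead of the hand-built sample.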

The Implications of SHA-1 objectnames

As a natural consequence of using SHA-1 values for filenames and for internal references (objectnames) git gets lots of nifty things.

Programmers sometimes like to talk about "elegance", which is a concept picked up from the math world. A common example (perhaps the example) of elegance is simple concepts that combine to create usefully complex (but not complicated) systems. Unix is a classic instance of this. The more I get into git's guts, the more I see the elegance of git, because much - maybe most - of git's behavior is a consequence of relatively simple concepts like using SHA-1 hashes to name the objects in the object store.

I suspect this is also why "getting" git can be so slippery for a lot of people (me included). Where other systems have behaviors and implications explicitly defined, in git they arise naturally from the combination of the core concepts. "There's no there, there", to borrow a phrase, but in this case I mean that until you realize this and stop looking for an explicitly defined structure at the heart of git, you'll keep tripping over your own preconceptions.

Related, a lot of "getting" git is getting comfortable with thinking about your codebase as a big graph, and thinking about manipulating that graph. Some of this is understanding the theory, and some of this is nitty gritty details, but I suspect a lot of it is just internalizing that way of looking at things. It's sort of like learning to work with relational databases. A relation - using a column containing a common value to connect one table to another - makes straightforward sense, but the bigger picture starts to really click after you've gone through a few examples of decomposing problems into tables joined by relations.

Let's look at some of the nifty things that using SHA-1 hash values for git objectnames gives us:

Blobs

First, a natural reduction of data redundancy, since any duplicate files all get stored under the same SHA-1 value filename. Until they diverge, at which point they naturally become separate files, because the SHA-1 values won't match anymore.

Second, since the SHA-1 values are used internally by git as objectnames, you get a sort of built-in referential integrity. The file contents always have to match the SHA-1 value, the SHA-1 value literally can't point to the wrong file. If the file contents get changed, the SHA-1 value no longer matches.
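
You can check this for yourself, because a blob's objectname is nothing more than the SHA-1 of a short header plus the file contents. Here's a minimal Python 3 sketch (blob_objectname is my own name for it):

```python
import hashlib


def blob_objectname(content):
    """SHA-1 of a blob: a "blob SIZE" header, a NUL byte, then the content."""
    header = b"blob " + str(len(content)).encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


print(blob_objectname(b""))
# e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

# Same content, same objectname, no matter what the file is called.
print(blob_objectname(b"hello\n") == blob_objectname(b"hello\n"))  # True
```

That first value is the objectname of an empty file. It is also the objectname listed for app/constants.py in the "git ls-files --stage" output in the section on the index, which suggests that file is empty. Every empty file in every git repo shares that one blob.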

Trees

But wait, there's more! That natural normalization and referential integrity also show up in trees, and when pulling and pushing data between repos.

Remember, tree objects are a list of objectnames for either blobs (data, usually code) or other tree objects, call them sub-trees, which in turn contain more blobs and sub-sub-tree objects, and so on.

All of those are identified by SHA-1 objectnames. If you change the data in any blob in the tree, you change the SHA-1, and that has an effect that ripples up to the top of the tree. Let's say you have a commit tree with four levels, and the tree objects are A, B, C, and D:

When you change the content in a blob object, the blob's SHA-1 has to change.

Now tree object D has an entry for that blob, that entry has to change to use the blob's new SHA-1.

But guess what? Tree object D's SHA-1 is calculated from the data inside it, which includes that blob's entry and the new SHA-1.

So now tree object D's SHA-1 changes...

And tree object D is listed in an entry in tree object C.

So tree object C's entry for tree object D changes

Which means tree object C's SHA-1 changes

And rinse and repeat..

All the way up to tree object A, at the top of the graph.

Which, of course, is pointed to by the commit object, which needs a new SHA-1, which is fine, because it's a new commit object.

Again, you see the referential integrity that the SHA-1s just naturally bring to the situation.

Also, you see the decreased data redundancy; when you have two commits, each with its own tree, each commit's tree will actually share not just the data blob objects, but any sub-trees whose SHA-1 is the same. Let's say that tree D in the above example lists that blob, but also lists 5 other tree objects. Those 5 other tree objects stay the same and are shared by both the new commit tree and the old commit tree.
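
Here's a Python 3 sketch of that ripple effect and of the sharing. It's a simplified model that I put together for illustration: the names objectname and write_tree are mine, a "working tree" is just nested dictionaries, every file gets mode 100644, and entries are sorted by plain name order. It hashes blobs and trees in the same general way git does, but I haven't checked its output against a real repo.

```python
import hashlib


def objectname(kind, body):
    """SHA-1 of an object: "KIND SIZE", a NUL byte, then the body."""
    header = kind.encode() + b" " + str(len(body)).encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def write_tree(node, store):
    """node is bytes (file content) or a dict of name to node (directory).

    Records every object in store and returns (mode, objectname).
    """
    if isinstance(node, bytes):
        name = objectname("blob", node)
        store[name] = node
        return "100644", name
    body = b""
    for entry in sorted(node):  # simplification: plain name order
        mode, child = write_tree(node[entry], store)
        body += mode.encode() + b" " + entry.encode() + b"\x00"
        body += bytes.fromhex(child)
    name = objectname("tree", body)
    store[name] = body
    return "40000", name


old = {"a": {"b": {"file.txt": b"one\n"}, "other": {"x.txt": b"same\n"}}}
new = {"a": {"b": {"file.txt": b"two\n"}, "other": {"x.txt": b"same\n"}}}
old_store, new_store = {}, {}
old_root = write_tree(old, old_store)[1]
new_root = write_tree(new, new_store)[1]
shared = set(old_store) & set(new_store)
print(old_root != new_root)  # True: the change rippled up to the root
print(len(shared))           # 2: the untouched blob and its tree are shared
```

Changing one file gives new objectnames to that blob and to every tree above it, while the untouched "other" directory and its file keep theirs and would be stored only once.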

Repo to Repo

SHA-1s come into play in repo-to-repo interactions too. The SHA-1 uniqueness extends across the entire world, so for all practical purposes the only way I can have a git object with the same SHA-1 as yours is if the data in my git object is identical to the data in your git object. So I can just copy items from your git repo into my git repo and know that the data won't collide - if anything in your repo has the exact same SHA-1, then it contains the exact same data. Of course, git doesn't just overwrite the data, instead it uses that to avoid copying data it already has. An early version of git (git-pasky) used rsync to pull data from remote git objectstores. Current versions use git's own transfer protocols, which work out which objects the other side is missing and send only those.

Objectnames and File Renames

Since git is objectname-centric, not filename-centric, git picks up on file renames without being told.
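
The simplest case is easy to sketch: if a path disappears from one tree and a new path shows up in the next tree with the very same objectname, the content didn't change, only the name did. Here's a Python 3 illustration of that idea. The function name exact_renames is mine, the renamed file README.rst is hypothetical, and this only covers exact matches; as I understand it, git can also spot renames where the content changed a little, which this sketch doesn't attempt.

```python
def exact_renames(old_tree, new_tree):
    """Each tree is a dict of path to objectname. Returns (old, new) path pairs."""
    removed = {p: o for p, o in old_tree.items() if p not in new_tree}
    added = {p: o for p, o in new_tree.items() if p not in old_tree}
    by_object = {}
    for path, name in sorted(removed.items()):
        by_object.setdefault(name, []).append(path)
    renames = []
    for path, name in sorted(added.items()):
        if by_object.get(name):  # same content under a new path
            renames.append((by_object[name].pop(0), path))
    return renames


old_tree = {
    ".gitignore": "a8261fa6c20863cca7fb5f24349fafaf018f01d5",
    "README.md": "8c4ab1306beecbd315e88562cebbc685f4ebc0b5",
}
new_tree = {
    ".gitignore": "a8261fa6c20863cca7fb5f24349fafaf018f01d5",
    "README.rst": "8c4ab1306beecbd315e88562cebbc685f4ebc0b5",  # hypothetical rename
}
print(exact_renames(old_tree, new_tree))  # [('README.md', 'README.rst')]
```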

An Example using git log and git cat-file -p

Use git log to find one of the commits in this project.

puff@redbitter:~/git$ cd flask-tracking/
puff@redbitter:~/git/flask-tracking$ git log
...
commit 5e35a59a8a872826409ce78bbd312b2b47d6d5ec
Author: Sean Vieira 
Date:   Tue Oct 8 07:55:40 2013 -0400

    Adding more error handlers
:q
puff@redbitter:~/git/flask-tracking$ 
puff@redbitter:~/git/flask-tracking$ git cat-file -p 5e35
tree bb3efe2dea53bcf1a0ac628174de9ee774077654
parent 4b6d809019a5f59c9a6bcf85a565d0f740ba16fd
author Sean Vieira  1381233340 -0400
committer Sean Vieira  1381233340 -0400

Adding more error handlers
puff@redbitter:~/git/flask-tracking$

Now use "git cat-file -p" to pretty-print the tree, identified by the first four characters of its objectname, "bb3e"

puff@redbitter:~/git/flask-tracking$ git cat-file -p bb3e
100755 blob a8261fa6c20863cca7fb5f24349fafaf018f01d5	.gitignore
100644 blob 8c4ab1306beecbd315e88562cebbc685f4ebc0b5	README.md
100755 blob c4c60f291c6e799a7058d36b40cbcfd31fbc06c9	app.db
040000 tree 1265e29cc0f8b427b649521401b16b34261cd7dc	app
100755 blob 8a49c0df044da752a8805ec651f240ccb7732431	config.py
040000 tree b8b549203bb081cbeff6edac3fb2865080425da3	docs
100755 blob 45cb7c62315e6e8ed4e1e20cd33c3cb74adfb77b	requirements.txt
100755 blob a97a148ebb9c28991aa7c74f21c3b3990218be83	run.py
100755 blob 79bcab7e18963b6131ce6f6d14f108334b494624	shell.py
100644 blob 056117365db6859008fe4079d7d39d33f06fb52b	test.db
puff@redbitter:~/git/flask-tracking$

Backwards Arrows

A git repo is a hierarchy, a big tree graph. That's obvious enough. But what wasn't as obvious to me, until I read enough, is that the tree is upside down. Or at least my expectations were upside down.

Starting from the outside, looking at the repo and the branch names and everything, it's easy to think I have this branch. And there's the latest commit in this branch. That commit has an SHA-1 objectname that points to a tree. That tree has SHA-1 objectnames that point to sub-trees "underneath" it. And that's how my upside down expectation got built.

But in reality git's tree is much more like a family tree. If you draw out every single tree in the repo, you'd notice that the tree is actually descending from the very first commit in the repo. The latest commit is actually the bottom of the tree, and in fact you can have several latest commits, one for each branch.

The Typical Git Workflow

I have this here to give you some feel for how git works, but I don't want you to get bogged down in it. I'm going to go over this several times, each time adding more detail.

The Typical Git Workflow, Take One

You begin by creating a repo and working tree, either by creating the directory yourself and issuing "git init" from inside it, or by using "git clone" to copy an existing repo and working tree.
You do some work, modify some files in your working tree, create some new files.
You get to a stopping point - the feature is done, or the bug is fixed. Now it's time to commit.
You use "git status" to see what's changed. You look at the list of new and changed files and identify a set of changes that all belong together.
You use "git add" to tell git about the new files. You can do these all on one line, or you can do them one line per file.
You also use "git add" to tell git about the changed files.
You use "git status" again. The files you just added with "git add" now show up as a separate list, "Changes to be committed". They no longer show up in the rest of "git status".

Note: Using "git add" for changes may seem a little confusing, we'll talk more about that further down.
You use "git commit" to commit the changes.
So far, so normal. This is where it gets a little interesting for people new to git and decentralized revision control. At this point you've fully committed the data, but it's still all on your own machine, in your own development environment. Nobody else knows about it or has a copy of the new changes yet.
The next step is that you push or they pull the new changes. In a nutshell, you make sure that the new data in your .git repo is somewhere that the other developers can get at it, and then ask them to do a "git pull" from it.
This can be as simple as your git project being on a shared drive and the other developers having read access on it. But a more common pattern is that you push the changes to a "server" git repo with the "git push" command.
You issue a "pull request"; you tell the other developers about the change, through a mechanism outside of git proper.
That can be email, or it can be the github pull request feature, or it can be beaning your coworker in the back of the head with a nerf ball from across the room.
This notifies them that there are changes waiting to be pulled.
They pull with "git pull", and then they merge. More on pull & merge later.

The Typical Git Workflow, Take Two

Let's look at the steps again, this time with a little more detail on what's happening behind the scenes

You begin by creating a repo and working tree, either by creating the directory yourself and issuing "git init" or by using "git clone" to copy an existing repo and working tree.

Any way you slice it, when you're done with this step, you have a directory with a .git subdirectory, which contains the objectstore (.git/objects) and the index, and one or more branches (usually the default "master" branch), and all the other bits and pieces that make git work.
You do some work, modify some files in your working tree, create some new files.
You get to a stopping point - the feature is done, or the bug is fixed. Now it's time to commit.
You use "git status" to see what's changed. You look at the list of new and changed files and identify a set of changes that all belong together.
You use "git add" to tell git about the new files. You can do these all on one line, or you can do them one line per file.

This step actually creates a new blob object in the objectstore (.git/objects); that object contains the file data. This step also adds an entry to .git/index that has the new objectname and the filepath to the original source file.

You also use "git add" to tell git about any changed files.

This step also creates new blob objects in the objectstore, with the new version of the file data, and updates the .git/index entry for each changed file so that it has the new objectname alongside the filepath of where the contents of the blob came from.

Using "git add" for changes may seem a little confusing, but you have to realize that git never really changes blob objects in the objectstore. If you tell it a given file in your working tree has different contents, git creates a new blob object, with a new SHA-1 objectname, and updates the index to point to the new objectname. Git has to do it this way because, after all, the SHA-1 has to reflect the contents of the blob object.

Note: the index still has only one entry for each filepath; "git add" swaps in the new objectname. The old blob stays in the objectstore, where earlier commits still point to it.

You use "git status" again. The files you just added with "git add" now show up as a separate list, "Changes to be committed". They no longer show up in the rest of "git status".
You use "git commit" to commit the changes.

I'll get into this step in more detail when we revisit it after the next section, where I get into the details of commits.
The next step is that you push or they pull the new changes. In a nutshell, you make sure that the new data in your .git repo is somewhere that the other developers can get at it, and then ask them to do a "git pull" from it.

This can be as simple as your git project being on a shared drive and the other developers having read access on it. But a more common pattern is that you push the changes to a "server" git repo with the "git push" command.
You issue a "pull request"; you tell the other developers about the change, through a mechanism outside of git proper.

That can be email, or it can be the github pull request feature, or it can be beaning your coworker in the back of the head with a nerf ball from across the room.

This notifies them that there are changes waiting to be pulled.
They pull with "git pull", and then they merge. More on this later.

Typical Git Workflow Take Three

Once more into the breach, let's now look at the workflow with more detail on what happens at the commit step.

You create a repo and working tree, either with "git init" or "git clone".
You do some work, modify some files in your working tree, create some new files.
You get to a stopping point - the feature is done, or the bug is fixed. Now it's time to commit.
You use "git status" to see what's changed. You look at the list of new and changed files and identify a set of changes that all belong together.
You use "git add" to tell git about the new files. You can do these all on one line, or you can do them one line per file.
You also use "git add" to tell git about the changed files.
You use "git status" again. The files you just added with "git add" now show up as a separate list, "Changes to be committed". They no longer show up in the rest of "git status".
You use "git commit" to commit the changes.

This step is the git workhorse; it does a bunch of stuff:

It creates a new commit object with author, committer and commit message fields.
It copies the current branch entry's latest commit objectname into the parent slot of the new commit object.
It sets the branch entry's latest commit to the objectname of the new commit object.
It creates a tree of tree objects that represents all of the files recorded in the index; the blob objects those trees point to were already created by "git add".
It sets the new commit object's tree slot to the top tree object.

TODO: the git commit step uses info from the index to build the tree, so of course it also modifies the index. Remember to come back here and explain that.
You push the changes to a "server" git repo with "git push".
You issue a "pull request"; you tell the other developers about the change, through a mechanism outside of git proper.
They pull with "git pull", and then they merge. More on this later.
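The commit step in the workflow above can be sketched in a few lines of Python. This is only an illustration under some assumptions: the branch dictionary is a stand-in for the branch entry, the author line and timestamps are made up, and both commits reuse the objectname of the empty tree instead of building a real tree from the index.

```python
import hashlib


def object_name(obj_type: str, body: bytes) -> str:
    """SHA-1 over the type/size header, a NUL byte, and the object body."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def build_commit(tree, parent, author, committer, message) -> bytes:
    """Assemble the text body of a commit object."""
    lines = [f"tree {tree}"]
    if parent is not None:  # the very first commit has no parent
        lines.append(f"parent {parent}")
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    return ("\n".join(lines) + "\n\n" + message + "\n").encode("utf-8")


# Hypothetical values, for illustration only.
EMPTY_TREE = object_name("tree", b"")  # a tree object with no entries
WHO = "A U Thor <author@example.com> 1400000000 +0000"

branch = {"master": None}  # stand-in for the current branch entry
bodies = []

for message in ("first commit", "second commit"):
    # The branch's latest commit goes into the new commit's parent slot...
    body = build_commit(EMPTY_TREE, branch["master"], WHO, WHO, message)
    bodies.append(body)
    # ...and then the branch is moved to the new commit's objectname.
    branch["master"] = object_name("commit", body)

print(bodies[1].decode("utf-8"))
print(branch["master"])
```

Note that the parent objectname is part of the commit's own contents, so it is also part of what gets hashed into the new commit's objectname.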

Some Random Digressions

In Praise of Concreteness

Technical documents invent a worldview and terminology for a set of concepts, and often try to keep the discussion abstract, divorced from the underlying implementation. The problem is, that approach requires using language and explanations that are very awkward for humans to read and understand. I'm going to go to the opposite extreme and talk about the implementation, and how that leads to the abstract concepts.

My Rationale For Writing This

Here's a little background digression, from my second attempt at this introduction.

I started reading about git and quickly got lost in the handwavy, "and then a miracle occurs" nature of the various tutorials and books. This generally bugs the hell out of me, and this time was no exception.

I understand why it happens: it's a lot to swallow at once, and most of the time, most people - including me - just want to scream "stop talking about all that background crap and just show me how to do it!"

The problem is, with git you're not just learning "how to do it", you're also learning what the hell "it" is.

The "just show me how" approach works fine with git, as long as you're just using it as a very shallow replacement for subversion or CVS. The problem is, you quickly have to start dealing with concepts that are beyond subversion or CVS.

Links for Starter Git Tutorials

I've given some examples of git commands as we went, mainly to give you a specific and concrete illustration of the stuff we're talking about. Those examples won't even begin to give you a normal set of git commands for everyday git usage. Here are URLs for some tutorials I've found useful. Also see the section below, "Links for Git Best Practices."

The Git SVN Crash Course is a good starting point.

https://git.wiki.kernel.org/index.php/GitSvnCrashCourse

This quickie little tutorial is a good quick overview of commonly used commands:

https://www.kernel.org/pub/software/scm/git/docs/v1.4.4/tutorial.html

This is a nice little tutorial with a bunch of simple examples of different tasks:

http://www.ralfebert.de/tutorials/git/

Another nice summary of the "everyday" commands. It doesn't go very deep into the commands, but it lists them by the type of user, i.e. the role that user has, which is kind of a useful glimpse into how projects are run with git:

https://www.kernel.org/pub/software/scm/git/docs/v1.4.4/everyday.html

Links For Git Internals

I'll try to make sure this contains a complete list of any URLs I mention above. There may also be some extra URLs that I didn't get around to mentioning.

Chapter 7 at http://schacon.github.io/gitbook/ has several excellent essays on different aspects of the .git internals. Unfortunately they're not the comprehensive overview that I'd like.

http://schacon.github.io/gitbook/7_the_git_index.html

http://schacon.github.io/gitbook/7_browsing_git_objects.html

http://schacon.github.io/gitbook/7_git_references.html

http://schacon.github.io/gitbook/7_raw_git.html

http://jk.gs/gitrepository-layout.html

http://git-scm.com/book/en/v1/Git-Branching-What-a-Branch-Is

http://git-scm.com/book/en/v2/Git-Internals-Git-Objects

http://git-scm.com/book/en/v2/Git-Internals-Plumbing-and-Porcelain

http://git-scm.com/book/en/v2/Git-Internals-Git-References

http://eagain.net/articles/git-for-computer-scientists/

https://www.kernel.org/pub/software/scm/git/docs/technical/racy-git.txt

http://www.gitguys.com/topics/glossary/

https://www.kernel.org/pub/software/scm/git/docs/v1.4.4/tutorial-2.html

This is a nice tutorial that sort of combines everyday use with an explanation of what's going on behind the scenes (in the core):

https://www.kernel.org/pub/software/scm/git/docs/v1.4.4/core-tutorial.html

There are actually quite a few interesting tidbits scattered throughout the git man pages; for example, at one point it says that git pull is equivalent to git fetch followed by git merge.

https://github.com/git/git/blob/master/Documentation/gitrepository-layout.txt

Here're the git docs on what's in the index file:

https://github.com/git/git/blob/master/Documentation/technical/index-format.txt

I was going to write a simple script to parse the index file and pretty print it, but Sean B. Palmer beat me to it. I downloaded it and hacked it up a bit to print the output in an HTML table, which I use in the examples above: https://github.com/sbp/gin

https://github.com/git/git/blob/master/Documentation/technical/racy-git.txt https://github.com/git/git/blob/master/Documentation/date-formats.txt

This one is odd, not sure where it fits into the grand scheme of git, but making a note of it here: https://github.com/git/git/blob/master/Documentation/gitattributes.txt

Could come in handy. To quote, "Git will sometimes need credentials from the user in order to perform operations; for example, it may need to ask for a username and password in order to access a remote repository over HTTP. This manual describes the mechanisms Git uses to request these credentials, as well as some features to avoid inputting these credentials repeatedly.": https://github.com/git/git/blob/master/Documentation/gitcredentials.txt

The source for the core git tutorial link above, I think: https://github.com/git/git/blob/master/Documentation/gitcore-tutorial.txt

This has some "git for CVS users" at the beginning that I didn't find that helpful, but later on it has some useful info on tools git has for importing CVS projects. https://github.com/git/git/blob/master/Documentation/gitcvs-migration.txt

What's nice about this one is that it lays out an ontology of git commands for single-developer situations, multi-developer situations, the integrator role, and administrator role: https://github.com/git/git/blob/master/Documentation/giteveryday.txt

Noting this here for later, when I get to writing the section on remotes: https://github.com/git/git/blob/master/Documentation/gitremote-helpers.txt

http://git-scm.com/2011/07/11/reset.html

http://blog.plover.com/prog/git-reset.html

ProGit Cliff notes: http://lostechies.com/jasonmeridth/2010/04/05/quot-pro-git-quot-cliff-notes/

Hey, this would have been neat to find before I spent a lot of time writing the above, which in many ways just reconstructs this. On the other hand, it hints at getting into the gory details of the index, but doesn't really: https://github.com/git/git/blob/master/Documentation/gittutorial-2.txt

http://stackoverflow.com/questions/1450348/git-equivalents-of-most-common-mercurial-commands/1450641#1450641

http://stackoverflow.com/questions/4084921/what-does-the-git-index-contain-exactly/25806452#25806452

http://lists-archives.com/git/425812-rationale-for-git-s-way-to-manage-the-index.html

http://thread.gmane.org/gmane.comp.version-control.git/46341

http://thread.gmane.org/gmane.comp.version-control.git/32452/focus=32610

http://lists-archives.com/git/427450-rationale-for-git-s-way-to-manage-the-index.html

This article is from the dawn of git, dated April 2005, when the index was called the "directory cache". I'm sure git has gone through a lot of evolution since then, but this is still interesting to me: http://lwn.net/Articles/131657/

Some interesting history of the evolution of git terminology and command names: http://agileotter.blogspot.com/2014/12/my-itsy-bitsy-contribution-to-git.html

Some interesting discussion about various approaches to git workflow, especially oriented towards beginner gotchas: https://news.ycombinator.com/item?id=2970149 http://rypress.com/tutorials/git/rewriting-history http://rypress.com/tutorials/git/plumbing http://rypress.com/tutorials/git/rebasing

Neat! This guy goes through the git repo for git itself and talks about the history and evolution of git. I hope he continues and gets deeper into it. http://fabiensanglard.net/git_code_review/history.php

To quote: "A Visual Git Reference This page gives brief, visual reference for the most common commands in git. Once you know a bit about how git works, this site may solidify your understanding. If you're interested in how this site was created, see my GitHub repository." http://marklodato.github.io/visual-git-guide/index-en.html

"In this post I will try to explain the underlying commands and to a level the internal working of the git system involved when making a 'commit'." http://beatofthegeek.com/2014/01/git-commit-illustrated-simplicity.html http://www.quora.com/How-does-git-stash-work

The Git Tree Object Format

http://stackoverflow.com/questions/14790681/format-of-git-tree-objec

The format of a tree object:
 
tree [content size]\0[Entries having references to other trees and blobs]
 
The format of each entry having references to other trees and blobs:
 
[mode] [file/folder name]\0[SHA-1 of referencing blob or tree]
 
I wrote a script deflating tree objects. It outputs as follows:
 
tree 192\0
40000 octopus-admin\0 a84943494657751ce187be401d6bf59ef7a2583c
40000 octopus-deployment\0 14f589a30cf4bd0ce2d7103aa7186abe0167427f
40000 octopus-product\0 ec559319a263bc7b476e5f01dd2578f255d734fd
100644 pom.xml\0 97e5b6b292d248869780d7b0c65834bfb645e32a
40000 src\0 6e63db37acba41266493ba8fb68c76f83f1bc9dd
 
The number 1 as the first character of a mode shows that it is a reference to a blob/file. In the example above, pom.xml is a blob and the others are trees.
 
Note that I added new lines and spaces after \0 for the sake of pretty printing. Normally all the content has no new lines. Also I converted 20 bytes (i.e. the SHA-1 of referencing blobs and trees) into hex string to visualize better.
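Here's a rough Python sketch of the same idea, working from the format described above. It assumes you already have the decompressed bytes of a tree object; parse_tree and make_entry are names I made up, and the sample bytes are rebuilt from two of the entries in the listing rather than read from a real repo.

```python
def parse_tree(raw: bytes):
    """Split a decompressed tree object into (mode, name, hex objectname) entries."""
    header, _, body = raw.partition(b"\0")
    kind, size = header.split(b" ")
    if kind != b"tree" or int(size) != len(body):
        raise ValueError("not a well-formed tree object")

    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)        # end of the mode
        nul = body.index(b"\0", space)       # end of the file/folder name
        mode = body[pos:space].decode("ascii")
        name = body[space + 1:nul].decode("utf-8")
        sha = body[nul + 1:nul + 21].hex()   # 20 raw bytes, shown as 40 hex digits
        entries.append((mode, name, sha))
        pos = nul + 21
    return entries


def make_entry(mode: str, name: str, hex_sha: str) -> bytes:
    """Hypothetical helper: build one raw entry so the sketch has input to parse."""
    return mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0" + bytes.fromhex(hex_sha)


body = (make_entry("100644", "pom.xml", "97e5b6b292d248869780d7b0c65834bfb645e32a")
        + make_entry("40000", "src", "6e63db37acba41266493ba8fb68c76f83f1bc9dd"))
raw = b"tree " + str(len(body)).encode("ascii") + b"\0" + body

for mode, name, sha in parse_tree(raw):
    kind = "tree" if mode == "40000" else "not a tree"
    print(mode, kind, sha, name)
```

The parser walks the body by position rather than splitting on NUL bytes, because the 20 raw SHA-1 bytes can themselves contain a NUL.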

A good discussion of the reflog:

http://gitready.com/intermediate/2009/02/09/reflog-your-safety-net.html

Links for Git Best Practices

Doing it "the right way" is different in git, so here are some URLs for tutorials I found useful: http://sethrobertson.github.io/GitBestPractices/ http://sethrobertson.github.io/GitFixUm/fixup.html

Here's a great big flow chart that lays out a workflow for what to do when you've got a mess. I wouldn't advise blindly entering the commands, but it looks like a good starting point to figuring out what you need to read up on:

http://justinhileman.info/article/git-pretty/git-pretty.png

http://justinhileman.info/article/changing-history/

http://nvie.com/posts/a-successful-git-branching-model/

https://sandofsky.com/blog/git-workflow.html

http://scottchacon.com/2011/08/31/github-flow.html

http://stackoverflow.com/questions/612580/how-does-git-solve-the-merging-problem

Wincent Colaiuta has a blog post with some interesting insights into Git merge and other topics, quoting from and commenting on a mailing list discussion between Linus Torvalds and Bram Cohen (of Bittorrent fame but here speaking as developer of Codeville):

http://www.wincent.com/a/about/wincent/weblog/archives/2007/07/alookback_bra.php

And the discussion in question:

http://www.gelato.unsw.edu.au/archives/git/0504/2153.html

git-wtf is a neat ruby script to show you the current state of your git repo:

http://git-wt-commit.rubyforge.org/#git-wtf

On writing good git commit messages:

http://chris.beams.io/posts/git-commit/

An interesting discussion of rebase vs. merge:

http://stackoverflow.com/questions/804115/when-do-you-use-git-rebase-instead-of-git-merge

One answer links to Linus' own comments:

http://thread.gmane.org/gmane.comp.video.dri.devel/34739/focus=34744

This has an extensive and involved answer but it's worth reading. I need to go back and reread this a couple more times.

http://stackoverflow.com/questions/3329943/git-branch-fork-fetch-merge-rebase-and-clone-what-are-the-differences/9204499#9204499

Another extensive discussion of rebase:

http://mettadore.com/2011/05/06/a-simple-git-rebase-workflow-explained/

Git From the Bottom Up:

https://jwiegley.github.io/git-from-the-bottom-up/ https://github.com/jwiegley/git-from-the-bottom-up

Branching. I've been told that to really "get git" you have to not just understand branching, but embrace branching. I haven't yet, but this looks like a good tutorial:

http://www.git-tower.com/learn/git/ebook/command-line/branching-merging/branching-can-change-your-life

Links for Example Sources

Here are URLs to download the sources I use in the above examples:

Most of the examples I use above come from the git project for the python flask-tracking app, from this nifty tutorial:

https://realpython.com/blog/python/python-web-applications-with-flask-part-i/

The git repo for flask-tracking is at:

https://github.com/mjhea0/flask-tracking

To make your own git clone of flask-tracking so you can follow along, do:

puff@redbitter:~/git$ git clone https://github.com/mjhea0/flask-tracking.git

When you first clone this repo, the .git/objects will be in packed format, which isn't much fun because we can't go exploring individual object files. To unpack them, follow the advice here: http://stackoverflow.com/questions/16972031/how-to-unpack-all-objects-of-a-git-repository

Basically these two steps:

move the .git/objects/pack directory out of .git/, for example to your home directory, then

run the git command "git unpack-objects < ~/pack/packfile.pack" to unpack the packfile contents back into .git/objects.

puff@redbitter:~/git$ cd flask-tracking
puff@redbitter:~/git/flask-tracking$ ls -l .git/objects/
total 8
drwxrwxr-x 2 puff puff 4096 Nov 11 16:32 info
drwxrwxr-x 2 puff puff 4096 Nov 11 16:32 pack
puff@redbitter:~/git/flask-tracking$ ls -l .git/objects/pack
total 200
-r--r--r-- 1 puff puff   9976 Nov 11 17:26 pack-250887822b23d6aca52910105eb74bbc2f102825.idx
-r--r--r-- 1 puff puff 188914 Nov 11 17:26 pack-250887822b23d6aca52910105eb74bbc2f102825.pack
puff@redbitter:~/git/flask-tracking$ mv .git/objects/pack/ .
puff@redbitter:~/git/flask-tracking$ git unpack-objects < pack/pack-250887822b23d6aca52910105eb74bbc2f102825.pack  
Unpacking objects: 100% (318/318), done.
puff@redbitter:~/git/flask-tracking$ ls .git/objects/
01  0a  12  18  1f  26  2c  32  3b  43  49  57  5e  65  6c  75  7e  83  8a  91  9b  a2  a9  af  b7  bf  c6  cc  d1  d7  e1  e6  eb  f2  f8  ff
05  0b  14  19  21  28  2d  36  3c  44  4b  59  5f  66  6e  76  7f  85  8b  95  9d  a3  aa  b0  b8  c2  c7  cd  d2  d8  e2  e7  ed  f3  f9  info
06  0c  15  1a  23  29  2e  37  3e  45  4f  5b  62  69  6f  78  80  86  8c  96  9e  a4  ab  b1  ba  c3  c8  ce  d3  d9  e3  e8  ee  f4  fa
07  0f  16  1b  24  2a  2f  39  3f  46  51  5c  63  6a  70  79  81  88  8d  98  9f  a6  ac  b5  bb  c4  ca  cf  d4  db  e4  e9  f0  f5  fd
09  11  17  1c  25  2b  31  3a  41  48  54  5d  64  6b  74  7a  82  89  8e  99  a1  a8  ae  b6  be  c5  cb  d0  d5  de  e5  ea  f1  f6  fe
puff@redbitter:~/git/flask-tracking$
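Once the objects are unpacked, each one is an individual file you can poke at. Here's a minimal Python sketch of reading one, assuming the loose object layout seen in the listing above (a two-hex-digit directory, then a file named with the remaining 38 digits) and zlib-compressed contents. The function names are mine, and write_loose_blob only exists so the sketch can run in a scratch directory instead of a real repo.

```python
import hashlib
import os
import tempfile
import zlib


def loose_object_path(objects_dir: str, objectname: str) -> str:
    """The first two hex digits name the subdirectory, the rest name the file."""
    return os.path.join(objects_dir, objectname[:2], objectname[2:])


def read_loose_object(objects_dir: str, objectname: str):
    """Return (type, body) for one loose object file."""
    with open(loose_object_path(objects_dir, objectname), "rb") as f:
        raw = zlib.decompress(f.read())
    header, _, body = raw.partition(b"\0")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return obj_type, body


def write_loose_blob(objects_dir: str, data: bytes) -> str:
    """Hypothetical helper: store a blob so there is something to read back."""
    raw = b"blob " + str(len(data)).encode("ascii") + b"\0" + data
    name = hashlib.sha1(raw).hexdigest()
    path = loose_object_path(objects_dir, name)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(raw))
    return name


with tempfile.TemporaryDirectory() as objects_dir:
    name = write_loose_blob(objects_dir, b"hello\n")
    obj_type, body = read_loose_object(objects_dir, name)
    print(name[:2], name[2:], obj_type, body)
```

To try it on the unpacked flask-tracking clone, you would point read_loose_object at .git/objects and pass in an objectname from "git log".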

Cheatsheet

There are many and better tutorials out there, but I'm sticking this list of git commands here for my own quick reference.

Command	Description
git init	Create a new repo (.git) in the current directory.
git clone https://githuburl	Create a local copy of a repo on github
git clone https://githuburl newdirectoryname	git clone with a specified output directory name
git log	list the commits
git cat-file -p objectname	pretty-print the contents of this object
git unpack-objects < pack/packfilename.pack	Expand a packfile (after moving it out of .git) into individual objects in .git/objects
git ls-files --stage	list all the files in the index
git status	List the status of the working tree
git add path	add/stage file or files in directory, recursively
git commit	commit staged files
git rm path	remove file or directory from the working tree and the index
git mv oldpath newpath	move file or directory to new location
git diff	show diff of changes in working tree
git diff path	show diff of changes for path



git branch	List local branches
git branch -r	List remote branches
git branch -a	List all branches
git checkout existingbranchname	Make your working tree files identical to the set defined by branchname and set your current branch to branchname.
git checkout -b newbranchname	Create branch newbranchname, based on the current branch, and set your current branch to newbranchname.
git checkout -b newbranchname existingbranchname	Create branch newbranchname, based on existingbranchname, and set your current branch to newbranchname.
git remote add origin repoURL	add a remote named origin that points at the repo at repoURL
git fetch origin	download new objects from the remote named origin and update the local remote-tracking branches (origin/branchname)

This is an interesting take on a cheatsheet:

http://ndpsoftware.com/git-cheatsheet.html#loc=index;

Credits

Of course, credit is due to Linus Torvalds and all of the authors of various git tutorials and books that informed the above.

Also the fine and helpful folks on freenode #git, including (in no particular order):

Seveas, offby1, thiago, milki, kadoban, sitaram, Eugene, bheesham, mattcen, ikke, J1G, and certainly others I've managed to lose track of in the process.

Random Tidbits

I'm adding this here to keep track of nifty bits of detail that I come across, which don't really fit anywhere else.

Git's tree object sorting

For git commit trees to share tree objects, tree objects have to have the same hashcode, which means they must have not only the same entries, but those entries in the same order. The link above is to the point in the git code that calls qsort().
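As far as I can tell, the ordering git uses is a plain byte-by-byte comparison of the entry names, with one wrinkle: a directory is compared as if its name ended in "/". Here's a small Python sketch of that rule; tree_sort_key is my own name for it, and the entries are made up.

```python
def tree_sort_key(entry):
    """Sort key for a (mode, name) tree entry, compared byte by byte."""
    mode, name = entry
    raw = name.encode("utf-8")
    # Directories (mode 40000) are compared as if the name ended with "/".
    return raw + b"/" if mode == "40000" else raw


# Hypothetical entries: one directory and three files with similar names.
entries = [
    ("40000", "foo"),
    ("100644", "foo0"),
    ("100644", "foo.txt"),
    ("100644", "foo-bar"),
]

ordered = sorted(entries, key=tree_sort_key)
for mode, name in ordered:
    print(mode, name)
```

The directory foo lands after foo-bar and foo.txt but before foo0, because "/" sorts after "-" and "." and before "0". A plain sort on the bare names would have put foo first, and the resulting tree object would hash differently.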