Categories: Programming, Git
Introduction
This article presents the fundamental data structures and algorithms that underlie the Git version control system. This isn’t a tutorial on how to use Git, but after understanding these principles other Git manuals and tutorials may be easier to grasp.
The excellent book “Pro Git” is available online, and has similar information in Chapter 9, Git Internals. This article is significantly shorter than Pro Git Chapter 9, presents the concepts in a somewhat different order, has less implementation-specific detail, and tries to show the link between common Git commands and the effects on the underlying data structures. If you are learning Git, I would suggest skimming this article first, then reading Pro Git while using this information as supporting background when Git commands start to seem more like voodoo than science.
Chapter 22 of Git Immersion covers similar territory to the ‘Git Internals’ chapter of the official book. If, after reading this article, you are still hungry for Git implementation details then this is also a good source to continue with.
The alternative to learning the concepts first is instead to treat Git as a magic black box; exactly the approach taken by Git-Magic (although even that book ends with a chapter about the underlying details). However, if you are a programmer (and if you are using Git then you probably are) then learning the concepts first really will help.
Git Objects
Git maintains a persistent pool of nodes keyed by checksum (ie a key-value store mapping checksum => node), which it calls “objects”. Each node (“object”) has:
- an object type
- type-specific fields
- content
Object types:
- Blob - holds a block of user data (eg the contents of a file, but not filename or other metadata)
- Tree - holds a map of (filename => checksum) - and therefore implicitly (filename => object)
- Commit - holds (commit-message, author/committer details, list of parent-checksums, root-dir-object-checksum)
- Tag - holds (tagname, commit-object-checksum)
While it is useful to know that these objects exist, only the “commit” object type is usually directly referenced from the git command-line. Except in advanced cases, tree/blob objects are only implicitly accessed via the “commit” object they are attached to. Tag objects are referenced by their “tagname”.
Note that “object” is not meant in the sense of object-oriented, but rather in the ordinary English language meaning of “thing”, “item”. Because Git “objects” are often logically linked together using checksums as references/pointers, “objects” are often better thought of as nodes in a graph.
Storing the Objects
In the simplest case, the map of checksum=>object is stored in the filesystem with filename=checksum and file-contents=object. Then loading the object with a specific checksum is simply a matter of opening the file with that name. Of course this doesn’t scale very well; the first optimisation is to distribute files across multiple subdirectories, each named after the first two characters of the checksums it holds. To save disk-space, object file content is compressed with zlib. A Git “packfile” is an alternative storage approach for sets of objects.
The checksums are calculated using the SHA1 algorithm, and are long enough that it is extremely unlikely that two objects in the same repository will have the same checksum.
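As a rough illustration of the storage scheme described above, the following Python sketch computes an object id and the path of the corresponding object file. It assumes Git’s loose-object layout, in which the checksum covers a short header (the type, a space, the content length and a NUL byte) followed by the content; that header is an implementation detail not otherwise covered in this article, and the function names are invented for the example.

```python
import hashlib
import zlib

def git_object_id(obj_type: str, content: bytes) -> str:
    """SHA-1 over the assumed header (type, space, length, NUL) plus the content."""
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()

def loose_object_path(object_id: str) -> str:
    """Path under .git: the first two hex characters name the subdirectory."""
    return f"objects/{object_id[:2]}/{object_id[2:]}"

def encode_loose_object(obj_type: str, content: bytes) -> bytes:
    """Bytes written to the object file: header plus content, compressed with zlib."""
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return zlib.compress(header + content)

blob_id = git_object_id("blob", b"hello world\n")
print(blob_id)                     # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
print(loose_object_path(blob_id))  # objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad
```

The same id should be reported by git hash-object for a file with identical contents, which is a handy way to check the sketch against a real installation.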
Representing a Filesystem using Git Objects
Given the “blob” and “tree” object types, a hierarchical filesystem can be created. A Git tree object can be considered equivalent to a unix directory inode, and similarly maps names to object-ids. The major difference is that traditional inodes have ids that are not particularly significant - an inode might have id 17 or 18994, it is just an internal implementation detail. In Git, the id is a checksum of the contents which means:
- the same data can potentially be referenced from multiple places (like unix hard-links)
- the same data has the same id even on different computers
- it is easy to find duplicated contents
- the contents of a file cannot be changed in place
The last is the most significant; if a file needs to be updated, then a new blob object is created with the new contents. Because the checksum has changed, the tree object that referred to the old blob then needs to be updated - which means a new tree object must be created, the old contents copied, and the checksum of the old file replaced with the checksum (ie key) of the new version. And if that tree object is referred to by a parent tree object, then that needs to be updated too. The effect therefore “bubbles up” to the top of the directory tree, resulting in a new root tree (directory) object. The old root tree (directory) object still exists - ie the “previous version” of the filesystem can still be accessed if you start from the right point.
This “bubble up” effect occurs in other filesystems, notably BTRFS and Subversion’s filesystem. In fact, this part of Git is pretty similar to Subversion.
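The “bubble up” behaviour can be sketched with a toy content-addressed store. This is only an illustration: the store is an in-memory dict, objects are serialised as JSON rather than in Git’s real format, and the names put and write_file are invented for the example.

```python
import hashlib
import json

store = {}  # checksum -> (object type, payload); a toy format, not Git's

def put(obj_type, payload):
    """Store an object under the checksum of its serialised form."""
    data = json.dumps([obj_type, payload], sort_keys=True).encode()
    key = hashlib.sha1(data).hexdigest()
    store[key] = (obj_type, payload)
    return key

def write_file(tree_key, path, text):
    """Return the key of a NEW tree in which `path` holds `text`.

    Existing objects are never modified; every tree on the way up to the
    root is re-created with the new child checksum.
    """
    entries = dict(store[tree_key][1]) if tree_key else {}
    name, _, rest = path.partition("/")
    if rest:
        entries[name] = write_file(entries.get(name), rest, text)
    else:
        entries[name] = put("blob", text)
    return put("tree", entries)

root_v1 = write_file(None, "src/main.c", "int main() {}")
root_v1 = write_file(root_v1, "docs/readme", "hello")
root_v2 = write_file(root_v1, "src/main.c", "int main() { return 0; }")

print(root_v1 != root_v2)                                # the root changed
print(store[root_v1][1]["docs"] == store[root_v2][1]["docs"])  # untouched subtree is shared
```

Both roots remain in the store, so the older version of the filesystem is still reachable from root_v1.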
It may seem weird to use a filesystem to hold a store of (checksum->object), then use those objects to define a filesystem, but the difference is that the filesystem built from object nodes has some unusual properties:
- copy-on-write (and therefore versioned);
- consistent across multiple hosts
The pool of objects can also hold objects that are not part of any “filesystem”, ie things that need to be persistent but which are not expected to ever be represented on disk as files. Examples include commits and signed tags. These are addressable only by their checksums.
Checking out files
A node-based filesystem can be “checked out”, ie converted to a traditional filesystem, by simply starting at a tree object. For each (filename,fileattrs,checksum) entry in that tree object:
- get the object with that checksum
- if the object is of type “blob” then create a file with the specified name and attributes and write the data contents of the object into it.
- if the object is of type “tree” then create a directory with the specified filename and recursively process the list of (filename=>checksum) entries in that tree (directory) object.
The files that result from “checking out” a version are called a “working copy”.
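The checkout procedure above might look roughly like this in Python. It is a sketch under simplifying assumptions: objects live in an in-memory dict using a toy JSON serialisation, file attributes are ignored, and put and checkout are hypothetical helper names.

```python
import hashlib
import json
import os
import tempfile

store = {}  # checksum -> (object type, payload); a toy format, not Git's

def put(obj_type, payload):
    data = json.dumps([obj_type, payload], sort_keys=True).encode()
    key = hashlib.sha1(data).hexdigest()
    store[key] = (obj_type, payload)
    return key

def checkout(tree_key, target_dir):
    """Materialise the tree object `tree_key` as real files under `target_dir`."""
    os.makedirs(target_dir, exist_ok=True)
    _, entries = store[tree_key]
    for name, key in entries.items():
        obj_type, payload = store[key]
        path = os.path.join(target_dir, name)
        if obj_type == "blob":
            with open(path, "w") as f:
                f.write(payload)
        else:
            checkout(key, path)  # a tree: create the directory and recurse

readme = put("blob", "hello\n")
main_c = put("blob", "int main() { return 0; }\n")
root = put("tree", {"README": readme, "src": put("tree", {"main.c": main_c})})

with tempfile.TemporaryDirectory() as workdir:
    checkout(root, workdir)
    checked_out = sorted(
        os.path.relpath(os.path.join(dirpath, name), workdir)
        for dirpath, _, files in os.walk(workdir)
        for name in files
    )
print(checked_out)
```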
Switching to another checkout
Once a specific version of the filesystem has been checked out, it is possible to then perform git checkout {checksum} on a different tree object. The work done to switch versions can be made very efficient by comparing the tree objects of the current and new checkouts; for example if the root tree object of the new checkout points to any “subdir” tree object whose checksum is the same in the currently checked-out version, then there is no need to process that subdir; it has the same files in both versions. Similarly, if any “file” entry in a tree object has the same checksum in the current and new versions, then there is no need to update the file contents from the blob object.
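A minimal sketch of that comparison is shown below. It walks two trees in parallel and stops descending as soon as two checksums match. The store layout and the helper names (put, kind, changed_paths) are assumptions made for the example and do not correspond to Git’s internals.

```python
import hashlib
import json

store = {}  # checksum -> (object type, payload); a toy format, not Git's

def put(obj_type, payload):
    data = json.dumps([obj_type, payload], sort_keys=True).encode()
    key = hashlib.sha1(data).hexdigest()
    store[key] = (obj_type, payload)
    return key

def kind(key):
    """Object type for a checksum, or None when the entry is absent."""
    return store[key][0] if key else None

def changed_paths(old_key, new_key, prefix=""):
    """Yield file paths that differ between two trees, skipping identical subtrees."""
    if old_key == new_key:
        return  # same checksum means same content: no need to descend
    old = store[old_key][1] if old_key else {}
    new = store[new_key][1] if new_key else {}
    for name in sorted(set(old) | set(new)):
        o, n = old.get(name), new.get(name)
        if o == n:
            continue
        if "tree" in (kind(o), kind(n)):
            yield from changed_paths(o if kind(o) == "tree" else None,
                                     n if kind(n) == "tree" else None,
                                     prefix + name + "/")
        if "blob" in (kind(o), kind(n)):
            yield prefix + name

docs = put("tree", {"guide": put("blob", "read me first")})
old_root = put("tree", {"docs": docs, "src": put("tree", {"main.c": put("blob", "v1")})})
new_root = put("tree", {"docs": docs, "src": put("tree", {"main.c": put("blob", "v2")})})
print(list(changed_paths(old_root, new_root)))  # ['src/main.c']
```

Only the files yielded here would need rewriting in the working copy; the docs subtree is never even opened.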
The current state of the Git “index” file may also help speed up checkouts; see the section on the Index file later in this article.
Commits
A commit is stored as an object (node) in a repository’s pool of objects, with its own object-type and associated fields.
A commit always contains a reference to (ie the checksum of) a tree object that is the root dir of the filesystem that is the result of this commit. Therefore, given the id (checksum) of a commit, checking out the state of the software at that point is trivial (see above).
A commit object of course has an internal field that contains a commit message.
The commit object also internally stores the checksums of zero or more parent commits, ie the keys of other objects of type ‘commit’. Zero parents is a special case - normally only for the first commit to a new repository. One parent is the normal case and indicates “linear” development. Multiple parents occur in the case of a (non-fast-forward) merge.
Note that (unlike some other version-control systems) Git does not directly store “diffs” or patches, but instead stores ‘snapshots’ of the state of the filesystem at each commit; the differences between two commits are computed from the snapshots when necessary. Git does internally use diffs within its “packfiles” (effectively zip-files containing many object files), but this is just an internal space-optimisation detail.
A Normal (Linear) Commit
When git commit is run, the list of modified files (relative to the most recent checkout, ie the ‘parent’) is determined. Then the following new objects are added to the Git storage:
- a new blob object for each added or modified file
- new tree objects for all directories containing added or modified files (containing the checksums of the above file objects)
- new tree objects for directories containing the above modified tree objects, etc - ie the “bubble up” process described above, resulting eventually in a brand new “root” tree object
- a commit object, containing:
  - the checksum of the new “root” tree object, ie a filesystem describing the state after this commit
  - the checksum of the parent commit object
  - the name of the patch author/committer, date committed, etc
  - the commit message
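To make the contents of a commit object concrete, here is a hedged Python sketch that serialises one and computes its checksum. It assumes Git’s textual commit layout (tree, parent, author and committer lines, a blank line, then the message) wrapped in the usual type-and-length header. The author identity and timestamps are made up, the timezone is fixed at +0000 for simplicity, and commit_object is a name invented for the example.

```python
import hashlib

def commit_object(tree, parents, author, timestamp, message):
    """Serialise a commit in the assumed Git layout; return (checksum, raw bytes)."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines.append(f"author {author} {timestamp} +0000")
    lines.append(f"committer {author} {timestamp} +0000")
    body = ("\n".join(lines) + "\n\n" + message + "\n").encode()
    raw = f"commit {len(body)}".encode() + b"\x00" + body
    return hashlib.sha1(raw).hexdigest(), raw

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # id of a tree with no entries
author = "A. Developer <dev@example.com>"                # hypothetical identity

first, _ = commit_object(EMPTY_TREE, [], author, 1700000000, "initial commit")
second, raw = commit_object(EMPTY_TREE, [first], author, 1700000100, "second commit")
print(raw.split(b"\x00", 1)[1].decode())
```

Because the parent checksum is part of the serialised text, changing anything in an ancestor changes the checksum of every commit that descends from it.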
A Merge Commit
Merging of two branches of development is done in the following steps:
- checkout the “target” commit (usually the head of some branch)
- git merge some other commit (usually the head of some other branch that diverged from the primary some time in the past)
- if merging resulted in conflicts for some file, then manually modify things until all files have the desired state.
- do git commit
As an example, here is an initial graph of commits:
A = B = C = D
     \
      \= E = F
Assuming a user checks out D, then merges the branch with HEAD=F into it, and makes some manual fixups to resolve conflicts, the result of git commit is then the following:
A = B = C = D = G
     \          /
      \= E = F /
If you are used to SVN, CVS or similar then the above looks odd. The commit structure is no longer linear (ie E/F are not put “on top of” D); instead there are now multiple paths from the head commit (G) backwards in time. This is the critical difference between SVN/CVS and Git. This structure does make viewing history in a linear fashion (git log) a little odd and unpredictable. However the repository truly reflects the way the code was developed. This approach also works better when doing repeated merges, or where A..D is not a “centralised trunk branch” but instead something that others pull commits from.
The inbuilt git help (git help merge) describes the merge algorithm in full detail. However to summarize, the algorithm applied by Git for the merge is approximately this:
- find the most recent common ancestor (the “merge base”) of the two branches being merged (in this case, B).
- work out what changed on each branch since the common ancestor (ie between B and D, and between B and F), and combine both sets of changes on top of the head of the “to” branch (ie D).
A change that was made identically on both branches (eg because a commit was previously “cherry-picked” from one to the other) causes no problem, because both sides agree on the result.
If both branches changed the same part of a file in different ways, then remember that file as conflicted.
If no conflicts occurred, then a git commit is automatically triggered.
If any conflicts occurred, then the user is presented with a list of the problem files, and must manually edit the files to the appropriate state then mark the problem as resolved (by calling git add) before finally calling git commit.
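One way to picture the decision made for each file during a merge is a three-way comparison against the common ancestor. The sketch below works at whole-file granularity on maps of path to checksum, whereas Git also merges changes within a file; the function name and the sample checksums are invented for illustration.

```python
def three_way_merge(base, ours, theirs):
    """Merge {path: checksum} maps; return (merged map, list of conflicting paths)."""
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:        # both sides agree (including "both deleted")
            result = o
        elif o == b:      # only "theirs" changed this path
            result = t
        elif t == b:      # only "ours" changed this path
            result = o
        else:             # both changed it differently: needs manual resolution
            conflicts.append(path)
            result = o    # keep our version provisionally
        if result is not None:
            merged[path] = result
    return merged, conflicts

base   = {"README": "r1", "main.c": "m1", "util.c": "u1"}
ours   = {"README": "r2", "main.c": "m1", "util.c": "u2"}
theirs = {"README": "r1", "main.c": "m3", "util.c": "u3", "new.c": "n1"}

merged, conflicts = three_way_merge(base, ours, theirs)
print(merged)
print(conflicts)  # ['util.c']
```

Here README takes our change, main.c and new.c take theirs, and util.c is reported as a conflict for the user to fix before committing.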
The git commit operation then simply:
- creates new blob/tree objects that store a filesystem matching what is currently in the “working directory” (well, actually the index)
- creates a commit-object which has:
  - checksum of the new “root tree object” created above
  - checksum of the commit-object for parent0 (object D)
  - checksum of the commit-object for parent1 (object F)
  - the name of the patch author/committer, date committed, etc
  - the commit message
At this point, development can then continue further based on either F or G (or indeed, any other object if desired).
Merges can also involve more than two branches (sometimes called an “octopus” merge), eg merging commits F,X,Y and Z into D. In this case, the git merge processing (find ancestors, combine changes) is applied to F,X,Y and Z in turn, and the “commit object” points to D and to all of them as parents.
A special case of merge is where the branch structure looks like this:
A = B = C (branch1)
         \ D = E (branch2)
In this situation, merging patches D/E back into branch1 does not need to create a merge-commit; all that is necessary is to mark commit E as being the head of branch1 (ie write E’s checksum into the reference-file for branch1). The result is just as if D/E had been developed on branch1 in the first place. This is called a “fast forward merge”. And in fact, any merge can be transformed into this style by rebasing the “from” branch before merging it.
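The fast-forward test boils down to an ancestry check: if the head of the “to” branch is an ancestor of the head of the “from” branch, only the reference needs to move. A small sketch follows, using single letters in place of checksums and hypothetical names (is_ancestor, merge) that are not Git’s own.

```python
def is_ancestor(parents, maybe_ancestor, commit):
    """True if `maybe_ancestor` is reachable from `commit` via parent links."""
    stack, seen = [commit], set()
    while stack:
        current = stack.pop()
        if current == maybe_ancestor:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return False

def merge(parents, refs, target, source):
    """Fast-forward `target` to `source` when possible; otherwise report what is needed."""
    if is_ancestor(parents, refs[source], refs[target]):
        return "already up to date"
    if is_ancestor(parents, refs[target], refs[source]):
        refs[target] = refs[source]  # just move the branch reference
        return "fast-forward"
    return "merge commit needed"

# Letters stand in for commit checksums.
parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"], "E": ["D"]}
refs = {"branch1": "C", "branch2": "E"}
print(merge(parents, refs, "branch1", "branch2"))  # fast-forward
print(refs["branch1"])                             # E
```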
FYI, fast-forward merging is very common when using git pull to retrieve updates from a remote repository; git fetch updates a “remote tracking branch”, and that branch is then merged into a local development branch. If no commits have been added to the local branch since the previous update from the remote repo, then fast-forwarding is all that is needed. Pull/Fetch operations are discussed below, so don’t worry if this paragraph doesn’t make sense yet.
A “fast forward” merge does look much more like the kind of thing that users of SVN/CVS/etc are used to (a linear history). And in fact, when using the “git-svn” bridge between Git and svn, “fast forward merges” are the only ones that can be pushed to an SVN repository.
A normal branch can be converted into the above format (suitable for a fast-forward merge) via the git rebase operation; see later.
Cherry-picking commits
It is possible to merge just one patch from another branch, without merging that patch’s ancestors.
The resulting commit introduces the same change as the source commit, plus any fixups needed. There is no explicit link back to the “picked” commit; instead, some Git tools (eg git cherry and git rebase) compare a checksum of the diff that each commit introduces (its “patch-id”) to find equivalent changes, and simply skip them.
Note however that with normal (and fast-forward) merges, the original commits are completely untouched. It is therefore a fairly simple process to determine which commits in one branch (or a remote repository) have already been merged into some other branch. Cherry-picking instead creates a new commit which happens to introduce a very similar change. Operations such as rebase then need special code to skip cherry-picked commits.
Intermission
With just the above features, a fully-functional version control system exists; ie all that is really needed is:
- data structures: blob/tree/commit objects findable by sha1 checksums
- operations: checkout, merge, commit
- some way to specify the parent of a commit
The system is already a full distributed version-control system; the rest is just helpful “sugar” and performance optimisations on top of this base (ok, references and efficient remote fetches are very useful additions). In fact, originally the rest of Git was built using shellscripts; nowadays the primary implementation is in C but the point is the same - the core of Git is very elegantly small and simple.
For the curious, you can inspect a Git repo directly to find object checksums, then use various commands (git cat-file, git ls-tree, git show) to get info on that object. In particular, it is interesting to run git log to find a commit#, then use the following commands to view it, its associated tree, etc.:
- git cat-file -t {checksum} - prints the object type (commit, tree, blob, tag)
- git cat-file -p {checksum} - prints the contents of an object
- git ls-tree {checksum} - prints the contents of a tree object in slightly prettier format than cat-file
- git show --raw {checksum} - works for some objects but not others. Particularly useful for ‘commit’ objects, where it shows the (filename, oldchecksum, newchecksum) tuples that describe what changes the commit actually made relative to the previous commit’s filesystem. From this information, a “patchfile” can be quickly created by Git.
Because objects are compressed with ‘zlib’, running zlib-flate -uncompress < .git/objects/AA/BBBBBB.... can also be interesting; this works pretty well on commit and blob objects but not so well on tree objects (which store the checksums of their entries in binary form).
Note by the way that the process of creating a branch hasn’t been mentioned. That’s because a new commit can specify any commit object in the repository as its parent; when there is no other commit that has the same parent then this is traditional linear development, and when multiple commits share the same parent then a “branch point” exists. Support for “branching” is therefore an almost trivial property of the design of the repository. Think of the repository as holding one or more “directed acyclic graphs” of commits. A commit that has no parent is the initial object (node) of a graph, and each subsequent commit object has zero or more “upstream” objects. Occasionally the graph will diverge (tree-like) and sometimes the branches will meet again (merge). Normally a repository will have only one “initial” commit (with no parent) but that isn’t a firm rule. And normally the graph will have a fairly low branching factor (ie linear development is more common than branching).
The following sections describe some of the most common “helpful sugar” commands, and how they interact with the repository. As noted in the introduction, this isn’t meant to be a full Git tutorial or manual; Git has many commands with many options. However understanding the way common commands manipulate repository state will help in understanding the more complex cases.
Remote Repositories
Forgetting for the moment about “references” (addressed below), and performance optimisations, cloning an existing repo and fetching commits from another repo are trivial.
To clone a repo, just get a bitcopy of the original. All the history is there, in the objects of the repository.
Importing (“fetching”) commits from one repository into another is almost as trivial:
- get a bitcopy of the repo being imported
- for each checksum->object mapping (ie each file) in the imported repo
  - verify that the file contents checksum is correct (security)
  - if the checksum doesn’t already exist in the local repo, then install the file there
The result is that the local repo now holds both the previous commits and any commits that were in the remote repo but not yet local. Because the SHA1 “names” for objects uniquely identify the content, it is trivial to see whether a particular version of a file (or version of a directory of files) already exists in the local repo or not. In fact, a blind copy that overwrites existing files will still work; commits that exist in the local repo point to their parent(s) via checksum values, which don’t change. So as long as objects are not removed from the repo, all will be fine. Even if the remote repository contained files that are bit for bit identical with files in the local repo (eg because both had separately pulled files from some other repo), this common data will have the same checksum regardless of how it reached each repository, and so there is no confusion.
In addition, commit objects that were imported will point to objects representing the filesystem resulting from that commit. Because directories are objects with checksums, any common subdirs will have the same checksum. So a patch that touches one file deep within the tree will be just as efficiently represented in the local repository as in the remote repo from which the commit was imported - if the local system has copies of the directories/files already then they are not duplicated.
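The naive import procedure described above can be sketched in a few lines of Python. Each repository is modelled as a dict from checksum to raw bytes, and the checksum is taken over the raw bytes alone (a simplification of Git’s real object format); store_object and fetch_objects are hypothetical names.

```python
import hashlib

def store_object(repo, data: bytes) -> str:
    """Add an object to a repo (a dict of checksum -> bytes) and return its key."""
    key = hashlib.sha1(data).hexdigest()
    repo[key] = data
    return key

def fetch_objects(local, remote):
    """Copy objects missing from `local`, verifying each checksum first."""
    added = []
    for key, data in remote.items():
        if hashlib.sha1(data).hexdigest() != key:
            raise ValueError(f"object {key} does not match its checksum")
        if key not in local:
            local[key] = data
            added.append(key)
    return added

local, remote = {}, {}
shared = store_object(local, b"common file contents")
store_object(remote, b"common file contents")       # same bytes, so same key
new_key = store_object(remote, b"a file only the remote has")

print(fetch_objects(local, remote) == [new_key])    # only the missing object is copied
print(sorted(local) == sorted(remote))
```

Running the fetch a second time copies nothing, since every checksum is already present locally.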
The standard git clone/git fetch commands do optimise the process a little, and track some metadata to make the operations more convenient; see later.
Git References
While it is possible to create commits by explicitly providing the checksum of the parent commit (or commits for a merge), it isn’t very convenient. And using checksums to specify what tree to check out (eg to switch “branches”) is equally clumsy. So Git adds the concept of a “reference”; a reference has a simple human-readable name, and its value is either a checksum or the name of another reference. References are stored simply as a file (under directory .git) whose name is the human-friendly reference name, and whose contents are either the checksum, or ref: {some-other-ref-file-name}. You’ll find some in the .git subdirectory of any repository; just poke around.
Note that after a Git “packing” pass (which optimises performance and storage-space), the contents of reference files are merged into a .git/packed-refs file; if you can’t find a reference you are looking for, then check there!
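Resolving a reference to a checksum can be sketched as follows. The reference files are modelled as a dict from file name to file contents rather than read from disk, packed-refs is ignored, and resolve_ref and the sample checksum are assumptions made purely for illustration.

```python
def resolve_ref(files, name, max_depth=5):
    """Follow 'ref: ' links until a raw checksum is found."""
    for _ in range(max_depth):
        value = files[name].strip()
        if not value.startswith("ref:"):
            return value                       # a raw checksum
        name = value[len("ref:"):].strip()     # the name of another reference file
    raise ValueError("too many levels of symbolic references")

# File names relative to .git, mapped to their contents.
files = {
    "HEAD": "ref: refs/heads/master\n",
    "refs/heads/master": "3b18e512dba79e4c8300dd08aeb37f8e728b8dad\n",
}
print(resolve_ref(files, "HEAD"))
print(resolve_ref(files, "refs/heads/master"))
```

The depth limit is only there to stop the sketch looping forever if two reference files point at each other.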
Reference Types
References belong to the following categories:
- local tag reference
- branch reference
- remote branch reference
- HEAD reference
A local tag reference is trivial: a name points to the checksum of a commit that is important to the developer for some reason. A new tag can be created via git tag {name}, which simply creates a file with the specified name containing the checksum of the currently-checked-out (HEAD) commit. A tree can be “checked out” of the repository by specifying the human-readable name instead of the equivalent checksum. Checking out a tag also sets the HEAD reference; see below.
A branch reference is not much more complicated than a tag. Again, a file in the .git directory whose name is the “branch name” contains the checksum of a commit object, and “checking out a branch” is equivalent to checking out a tag. And just like a tag, git branch {name} just creates a file with the specified name and the checksum of the currently-checked-out commit. The difference from a tag is in the way the HEAD reference is updated (see later).
The HEAD reference is, just like other references, a file (named .git/HEAD) containing a raw checksum or the name of another reference (ref: {filename}). It has the following effect:
- After a tag has been checked out, the HEAD reference contains the raw checksum that the tag reference file had, ie it points to the same commit. Doing git commit creates a new commit object with the checksum from HEAD as the parent. The contents of the HEAD file are updated to the checksum of the newly created commit; the tag file is not updated. This is not a recommended operation - it is valid, but has effectively created a new commit object for which no nice human-readable name exists. It is therefore quite easy to “lose track of” this commit in the repo and be unable to find it again. Generally, commits should be done only after checking out an object using a “branch name”, not a “tag name”. Note: when the HEAD file contains a raw checksum, it is said to be a “detached head”.
- After a branch has been checked out, the HEAD reference contains ref: {name of branch-file} (and as noted above, the branch-file will contain the checksum). When a commit is created, the parent of the commit is the checksum from the file that HEAD references (ie the “tip of the branch”). In addition, the contents of the branch file are updated to have the checksum of the newly created commit. In effect, the branch tip has “moved forward” - and HEAD still points to the branch (contains the branch name, not a raw checksum).
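The two cases can be captured in a short sketch of how references change when a commit is recorded. References are modelled as a dict, short strings stand in for checksums, and record_commit is a hypothetical helper rather than anything Git provides.

```python
def record_commit(refs, new_checksum):
    """Update references after a commit has been created; return the parent checksum."""
    head = refs["HEAD"]
    if head.startswith("ref: "):
        branch = head[len("ref: "):]
        parent = refs[branch]
        refs[branch] = new_checksum   # branch tip moves; HEAD still names the branch
    else:
        parent = head
        refs["HEAD"] = new_checksum   # detached head: only HEAD itself moves
    return parent

# Case 1: a branch is checked out.
on_branch = {"HEAD": "ref: refs/heads/master", "refs/heads/master": "c1"}
print(record_commit(on_branch, "c2"), on_branch)

# Case 2: a tag was checked out, so HEAD holds a raw checksum.
detached = {"HEAD": "c1", "refs/tags/v1.0": "c1"}
print(record_commit(detached, "c2"), detached)
```

In the second case nothing except HEAD refers to c2, which is why such commits are easy to lose track of.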
A remote branch reference is one that points to an object imported from some other repository; in all other ways it is a standard reference-file that holds the raw checksum of an object in the local pool of objects. All reference files associated with a specific remote repo are stored in a directory specifically for that remote repository, so the names don’t clash with local branches; their names are of form reponame/refname, eg origin/master. Although these reference files hold a valid object checksum, Git never lets HEAD contain the name of a remote branch reference: checking one out directly puts the raw checksum into HEAD (a “detached head”) rather than the reference name. To work on that code, a local branch or tag reference must be made to point to the same commit that the remote branch reference points to (which is most easily done with the merge command). This restriction is so that remote branch references remain in-sync with the repository they are drawn from; their value is important when fetching updates from the remote repository. Because HEAD never holds the name of a remote branch reference file, these references can never be committed to (ie updated as a result of git commit). You can still manually check out a commit object using the checksum in the reference file, and then create a normal tag or branch for it (or use git merge). What is important, though, is that remote branch references never get updated when doing commits to the local repo; the checksum in a “remote branch reference” is only updated by a git fetch operation (or by a successful git push to that repository).
Remote Repositories and Git Fetch
An existing Git repository can be informed of the existence of an external repo via the git remote add {name} {url} command. This:
- adds a mapping for name=>url into .git/config
- creates a directory .git/refs/remotes/{name} to hold reference files copied from the remote repository
Importing of commits from other repos was partially covered above. As described there, it is simply a matter of copying the desired commit/file objects from the remote repo into the local one. To avoid having to copy all objects from the remote repository for each update, the existing reference files are used to determine what commits for each branch have already been imported - hence the need to keep these reference files ‘pure’ (see previous section).
When a repository is cloned, some of the above is automatically configured, using a standard naming convention. In particular:
- a local branch named ‘master’ is created
- the cloned repository is registered in .git/config under name ‘origin’
- a mapping from local branch ‘master’ to ‘origin/master’ is also stored in the .git/config file. This sets the default parameters for the git pull command when the local ‘master’ branch is checked out. The same thing can be set manually via the --track option to some commands.
The command git fetch {reponame} looks for an entry in the .git/config file for that repository to find the URL of the repo, and then in the same file finds the branches that are being tracked. It then reads the checksums from the reference files in .git/refs/remotes/{reponame}/ to see which objects it already has, and passes this info to the remote repo. The remote repo then passes back all objects that have been added to the specified branches since the provided commit checksums. After these objects have been added to the local pool of objects:
- file .git/FETCH_HEAD is updated to list the “fetched branches”, and their checksums
- the reference-files for the remote repository are updated with the checksum of the latest commit to each branch
- new reference-files are added if the remote repository has new branches
As noted previously, Git does not allow the HEAD file to contain the name of a file under .git/refs/remotes, so a branch stored under the “remotes” directory cannot be checked out as a branch. However you can merge the contents of this branch into some other branch, and check that out - which is the usual way of working with remote code. This is in fact what git pull does by default.
Remote Tracking Branches
As noted, Git has a single mechanism for tracking branches: a reference file. The only differences between a branch “from a remote repository” and one “from the local repository” are:
- the branch references for remote repos are stored in a directory whose name is the “remote repository name”, eg “.git/refs/remotes/origin”. Note that the contents of the reference file are still just a checksum which points to a commit-node in the local store of objects.
- if git checkout is given a remote reference, eg git checkout origin/master, then HEAD is not updated to point to origin/master but instead is set to contain the raw checksum. A warning message about “detached HEAD” is then given. Any commit made in this state then effectively is in an “anonymous” branch, and does not cause the remote branch reference to be updated.
For convenience, Git does support a way of “linking” a pair of (local,remote) branches together though; when this is done then the local branch is said to “track” the remote branch, and:
- a git pull when on the local branch will trigger git fetch to update the linked remote branch, and then trigger a git merge to merge any new commits fetched from the remote repository into the current branch.
- a git push when on the local branch will send any commits on the local branch which are not on the linked remote branch to the associated remote repository.
When a repository is initially cloned, some tracking branches are set up. They can also be set up manually with various options to git branch and similar commands.
Pruning branches and commits
At the logical level, removing a branch is simply a matter of deleting the reference file, meaning the branch is no longer accessible by name. However the objects still exist in the repository. There is a tool available which scans the repository for all commits which are neither pointed to by a named reference nor an ancestor of such a commit. These objects can then be removed from the repository.
On the other hand, there are also various tools and tutorials around for finding such objects and giving them names again - so presumably there are cases where people have found that “recovering” such objects has been important. See the reflog section below for further information on this topic.
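The scan for removable commits is essentially a graph reachability problem, which can be sketched as below. Letters stand in for checksums and the function names are invented; a real repository has further sources of reachability (such as the audit trail of reference changes) that this sketch ignores.

```python
def reachable_commits(parents, refs):
    """All commits reachable from any named reference by following parent links."""
    seen, stack = set(), list(refs.values())
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen

def prunable_commits(parents, refs):
    """Commits that no reference points to, directly or through descendants."""
    return set(parents) - reachable_commits(parents, refs)

# A-B-C-D on master, with E-F branching off B.
parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"], "E": ["B"], "F": ["E"]}

print(prunable_commits(parents, {"master": "D", "topic": "F"}))  # set()
print(sorted(prunable_commits(parents, {"master": "D"})))        # ['E', 'F']
```

Giving F a name again (for example by re-creating the topic reference) is all it takes to make E and F safe from removal.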
Deleting commits which are at the “tip” of a branch, ie which no other commits point to as their parent is easy - the branch reference just needs to be updated to point to the preceding commit. See the various forms of “git reset”. Deleting commits elsewhere is trickier; see the Git tutorials listed in the reference section.
File Renames and Removals
Git doesn’t explicitly track renames; instead it assumes that if two files have N% of lines in common, then they are the ‘same’ file.
There is an explicit git mv {src} {dst} command, but it is effectively an alias for mv {src} {dst}; git rm {src}; git add {dst}
This approach has some advantages and some disadvantages. The most significant advantage is that development tools do not need to be aware of the version control system; they can move files around using normal file operations, and Git will figure out at commit-time whether “renames” have taken place. This is significantly better than Subversion for example, which gets very confused if tools move files without explicitly using svn mv (also known as svn rename).
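The “N% of lines in common” idea can be sketched roughly as follows. This is not Git’s actual similarity algorithm; it simply counts shared lines, and the 50% threshold, the function names and the sample files are all assumptions chosen for the example.

```python
from collections import Counter

def similarity(old_text: str, new_text: str) -> float:
    """Fraction of lines the two files have in common (0.0 to 1.0)."""
    old_lines, new_lines = old_text.splitlines(), new_text.splitlines()
    common = sum((Counter(old_lines) & Counter(new_lines)).values())
    return common / max(len(old_lines), len(new_lines), 1)

def detect_renames(deleted, added, threshold=0.5):
    """Pair each deleted path with the most similar added path above the threshold."""
    renames = {}
    for old_path, old_text in deleted.items():
        best = max(added, key=lambda p: similarity(old_text, added[p]), default=None)
        if best is not None and similarity(old_text, added[best]) >= threshold:
            renames[old_path] = best
    return renames

deleted = {"util.c": "int a;\nint b;\nint c;\nint d;\n"}
added = {
    "helpers.c": "int a;\nint b;\nint c;\nint e;\n",  # 3 of 4 lines shared
    "notes.txt": "todo\n",
}
print(detect_renames(deleted, added))  # {'util.c': 'helpers.c'}
```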
The Index (aka the Staging Area)
When “git checkout” is run for the first time, Git starts from a “root” tree object in the repository, and iterates over its entries, turning the “blob” objects it points to into files, and recursively processing “tree” objects. At the same time it builds an “index” file, being a fairly simple list of all the files that it checks out as (filename, checksum, status) tuples; try “strings .git/index” to see all the checked-out files.
Implementing the git status
command (which shows which local files have been modified since they were checked out) is simply a matter of iterating over the contents of the working directory and comparing it with the index; any files on disk whose checksums don’t match their corresponding entry in the index are “locally modified and not added”.
To include a changed or new file in a commit, the git add command is needed. This immediately stores the current state of the specified file as a blob in the object store, and then updates the entry for the file’s pathname in the index file to specify the new status (staged for commit) and the checksum of the new blob object. A git commit then can quickly find the files that have been modified; it must create new tree objects for each directory containing modified files (the “bubble up” process described earlier), but the blobs representing files themselves are already in the repository.
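The interplay between working directory, index and object store can be modelled roughly as follows. This is a toy sketch, not how Git is implemented: the Repo class and its method names are invented for the example, the working directory is a plain dict, and the “status” field and file-timestamp shortcuts of the real index are left out. The blob checksum calculation (SHA-1 over a short header plus the content) does match what Git does.

```python
import hashlib

def blob_checksum(data: bytes) -> str:
    """Checksum of a blob: SHA-1 over a 'blob <size>' header, a NUL byte, then the content."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()

class Repo:
    def __init__(self):
        self.objects = {}   # checksum -> blob content (the object store)
        self.index = {}     # filename -> checksum of the blob last checked out or added

    def checkout(self, files: dict) -> dict:
        """Fill the index from a snapshot and return a fresh working directory."""
        for name, data in files.items():
            checksum = blob_checksum(data)
            self.objects[checksum] = data
            self.index[name] = checksum
        return dict(files)

    def status(self, workdir: dict) -> list:
        """List files whose current checksum differs from their index entry."""
        return sorted(
            name for name, data in workdir.items()
            if self.index.get(name) != blob_checksum(data)
        )

    def add(self, workdir: dict, name: str) -> None:
        """Store the file's current content as a blob and record it in the index."""
        data = workdir[name]
        checksum = blob_checksum(data)
        self.objects[checksum] = data
        self.index[name] = checksum

repo = Repo()
workdir = repo.checkout({"a.txt": b"one\n", "b.txt": b"two\n"})
workdir["a.txt"] = b"one, edited\n"
print(repo.status(workdir))   # ['a.txt']
repo.add(workdir, "a.txt")
print(repo.status(workdir))   # []
```

After the add, the blob for the edited file is already in the object store, so a later commit only has to build new tree objects.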
The git reset command has many purposes, but one is to restore modified entries in the index file to the original values. If a form of the command which does not affect the working copy is used, then it effectively reverses the effect of git add - ie sets the index back to its original state but leaves local modifications in place.
The effect of the index is to act as a “staging area” where you can choose exactly which modified files from the working copy will be included in the next commit; very useful for deliberately not including debugging changes or changes intended for inclusion in different commits.
If git add is used several times, then “orphaned” blob objects will exist within the database (containing the intermediate forms of the file). These will get cleaned up during Git’s next “garbage collection” run.
As noted earlier, git checkout {branch} can be very quick when switching between related branches because Git knows what the differences between the two versions are, and can skip a lot of processing. The index file helps here, as Git knows exactly what files it already has written into the local directory.
The Reflog
Git keeps an “audit trail” that tracks changes to all references, ie all those files under .git that hold either a checksum or the name of another reference-file, like:
- the HEAD reference (which points to the currently checked-out branch, eg .git/refs/heads/master)
- local branch references (which contain the checksum of the “tip” commit on that branch, eg .git/refs/heads/master)
- remote branch references
- tag references (by default these are only logged when core.logAllRefUpdates is set to always)
Examples of operations that change a reference file are:
- git commit: updates the reference file for the currently-checked-out branch to point to the new commit
- git checkout foo: updates the HEAD reference file to point to the foo reference file
- git fetch origin: after copying objects from the remote origin repo to the local one, this updates the files under refs/remotes/origin/.
Interestingly, commits are recorded not only in the reflog information for the current branch, but also in the reflog information for the HEAD reference (even though technically the HEAD reference value didn’t change; it still just has the name of the current branch).
The logs are useful because reference files aren’t themselves versioned (there is only one copy, unlike with objects). Various Git helper commands use the reflog to help you recover from accidental errors; as long as the objects are still available, the checksums in the logs can help track things down. This history information can also be used for interesting statistical analyses and similar purposes. In fact, the ways the reflog can be used are rather impressive - and very complicated. See descriptions in some of the excellent online Git references (eg those listed in the References section of this article).
The reflog is not shared across repositories; it is just for the local repository.
Entries in the reflog are kept for 90 days by default (30 days for entries that are no longer reachable from the current tip of the reference), and when doing “garbage collection” of objects, Git considers objects pointed to by entries in the reflog to still be “referenced”, ie they are not garbage-collected. Therefore when a branch is deleted, the commits that were on that branch are not garbage-collected until the reflog entries that mention them have expired. To really discard objects, see the documentation for git prune and particularly the --no-reflogs option for git fsck (which git prune invokes).
Note that when a branch is deleted (git branch -D bar), then the reflog information for that branch is also immediately deleted. This makes restoring a deleted branch a little trickier than it could have been. Hopefully you recently switched to that branch and did a commit; in this case, the reflog for the HEAD reference will have information about that commit, and you can simply recreate the branch by providing the commit’s checksum as a commandline parameter.
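As a rough sketch of how such a recovery could be automated, the following Python parses reflog lines (old checksum, new checksum, identity and time, then a tab and a message) and looks for the last time HEAD was switched away from a given branch. The helper names and the sample data are made up for the example; in a real repository the lines would come from .git/logs/HEAD.

```python
from typing import NamedTuple, Optional

class ReflogEntry(NamedTuple):
    old: str        # checksum the reference held before the change
    new: str        # checksum the reference holds after the change
    who_when: str   # identity, timestamp and timezone
    message: str    # eg "commit: ..." or "checkout: moving from X to Y"

def parse_reflog_line(line: str) -> ReflogEntry:
    """Split one reflog line into its fields; the message follows a tab."""
    meta, _, message = line.rstrip("\n").partition("\t")
    old, new, who_when = meta.split(" ", 2)
    return ReflogEntry(old, new, who_when, message)

def last_tip_of(entries, branch: str) -> Optional[str]:
    """Return the commit HEAD was on when `branch` was last switched away from."""
    prefix = f"checkout: moving from {branch} to "
    for entry in reversed(entries):   # the log file is stored oldest-first
        if entry.message.startswith(prefix):
            return entry.old
    return None

# Made-up checksums and identity, purely for illustration.
A, B = "a" * 40, "b" * 40
sample = [
    f"{A} {A} Jo Dev <jo@example.com> 1700000000 +0000\tcheckout: moving from master to bar",
    f"{A} {B} Jo Dev <jo@example.com> 1700000100 +0000\tcommit: work on bar",
    f"{B} {A} Jo Dev <jo@example.com> 1700000200 +0000\tcheckout: moving from bar to master",
]
entries = [parse_reflog_line(line) for line in sample]
print(last_tip_of(entries, "bar"))
```

The checksum printed is the one to pass to git branch bar {checksum} in order to recreate the branch.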
Git Log
The git log command shows the history for any specific commit (HEAD, ie the tip commit of the current branch, is the default). When showing “linear” history, the behaviour is obvious. However when non-fastforward-merges occur, behaviour gets a little tricky and the output may come in an unexpected order.
Assume the repository looks like the following:
A = B = C = D = G = H
     \          /
      \= E = F /
By default, git log lists commits newest-first, according to the date on which they were created. So whether the order is H-G-D-C-F-E-B-A or H-G-F-E-D-C-B-A or even H-G-D-F-C-E-B-A depends upon the “committed at” date stored within each commit. And when commits are being transferred around between repositories, the date never changes: it is always the date at which the commit was made to its original repository (otherwise, the commit checksum would change).
The git log command has dozens of options for messing with the displayed data and the order in which it is output. See the manuals for full details.
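The date-driven ordering can be sketched with a priority queue: start at the requested commit, and repeatedly output whichever pending commit has the newest date, then queue its parents. This is a simplified model, not Git's real traversal code, and the commit times below are invented so that the merged-in branch interleaves with the main line.

```python
import heapq

# Hypothetical graph matching the diagram: name -> (commit time, parents).
commits = {
    "A": (1, []),
    "B": (2, ["A"]),
    "E": (3, ["B"]),
    "C": (4, ["B"]),
    "F": (5, ["E"]),
    "D": (6, ["C"]),
    "G": (7, ["D", "F"]),   # the merge commit has two parents
    "H": (8, ["G"]),
}

def log(start: str) -> list:
    """Walk history from `start`, always emitting the newest pending commit."""
    seen = {start}
    heap = [(-commits[start][0], start)]   # negate the time: heapq is a min-heap
    order = []
    while heap:
        _, name = heapq.heappop(heap)
        order.append(name)
        for parent in commits[name][1]:
            if parent not in seen:
                seen.add(parent)
                heapq.heappush(heap, (-commits[parent][0], parent))
    return order

print("-".join(log("H")))   # H-G-D-F-C-E-B-A
```

Changing only the commit times in the table (not the parent links) is enough to produce the other orderings.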
Combining Patches and Rebasing
If you have created a branch and added half-a-dozen commits to it, then want to merge back into a master branch but don’t want the full workings to be seen, then you can use “interactive rebasing” to reorder or merge various commits.
You may also have created a branch some time ago, and want to submit commits based on more recent code. This can be done with git rebase, which effectively takes the changes on the current branch, and then recreates the branch based on the tip of the specified “rebase onto” branch and reapplies the patches.
Rebasing can be very useful, but must only be used on patches that have never been exported to another repository; rewriting public history causes all sorts of confusion.
Git’s rebase algorithm is roughly as follows:
- create temporary reference ORIG_HEAD pointing to head of target branch
- for each commit in target branch which is not yet in source branch
  - take the associated patch from the commit and store it as a file .git/rebase-apply/NNNN
- set HEAD to checksum of the head commit of source branch (aka “upstream” because rebasing is usually done when this branch is maintained by someone else, and you’ve branched it to do some work and now want to submit “clean” patches back).
- check out HEAD commit (effectively puts working copy in “detached head mode”, as HEAD contains a checksum not a branch-name)
- for each patch file .git/rebase-apply/NNNN
  - apply patch to HEAD
  - if a conflict occurred, then pause the rebase, let the user resolve the conflicts (ie edit the affected files) and then run git rebase --continue.
  - create a commit with the current change using HEAD as the parent, then set HEAD = new commit (ie normal commit behaviour). This effectively creates a new “anonymous” branch, starting from the current head of the source branch.
- after all patches are applied then set target branch to reference the current HEAD commit, and set HEAD to reference the target branch. This effectively discards the old target branch and instead makes the same name point to a new branch that splits off from source branch at its current HEAD and contains the same patches as the old branch (tweaked interactively as needed).
- delete the .git/rebase-apply directory.
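The steps above can be modelled with a toy object store. The sketch below is an assumption-laden simplification: commits are plain records hashed with SHA-1 via JSON, a “patch” is just a map of filename to new content (so conflicts cannot occur), and references live in a dict rather than in files under .git. It does show the key point: the replayed commits get new parents and therefore new checksums.

```python
import hashlib
import json

store = {}   # checksum -> commit record (a toy object store)

def commit(parent, patch: dict, message: str) -> str:
    """Store a commit record and return its checksum."""
    record = {"parent": parent, "patch": patch, "message": message}
    checksum = hashlib.sha1(json.dumps(record, sort_keys=True).encode()).hexdigest()
    store[checksum] = record
    return checksum

def history(tip) -> list:
    """Return the checksums from the root commit up to `tip`."""
    chain = []
    while tip is not None:
        chain.append(tip)
        tip = store[tip]["parent"]
    return chain[::-1]

def rebase(refs: dict, target: str, source: str) -> None:
    """Replay the commits that only `target` has on top of the tip of `source`."""
    refs["ORIG_HEAD"] = refs[target]                  # safety copy of the old tip
    upstream = set(history(refs[source]))
    todo = [c for c in history(refs[target]) if c not in upstream]
    patches = [(store[c]["patch"], store[c]["message"]) for c in todo]
    head = refs[source]                               # "detached head" at source tip
    for patch, message in patches:
        head = commit(head, patch, message)           # same patch, new parent
    refs[target] = head                               # finally move the branch

base = commit(None, {"a.txt": "1"}, "base")
refs = {
    "master": commit(base, {"b.txt": "2"}, "upstream work"),
    "topic": commit(base, {"c.txt": "3"}, "my work"),
}
old_topic = refs["topic"]
rebase(refs, "topic", "master")
print(store[refs["topic"]]["parent"] == refs["master"])   # True
```

Because the branch reference is only moved in the last line of rebase, stopping part-way leaves the original branch untouched.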
The ORIG_HEAD file still exists and points to the head of that old (obsolete) branch, just in case something went wrong; you could for example copy this value into the target branch reference file to “cancel” the rebase. Or create a new tag that references that commit, or similar. However this file gets overwritten on the next rebase (and also by other commands that move HEAD abruptly, such as git merge and git reset), so it is only a temporary safety measure. Because it isn’t a “real” reference, it probably doesn’t block garbage-collection of “dead” repository objects either.
Because the target branch reference is not updated until all of the patches have been applied, it is possible at any time that a conflict occurs to:
- skip the current patch (git rebase --skip)
- cancel the rebasing completely (git rebase --abort), in which case the temporary references are simply discarded; the source and target branch references still point to their original locations.
In fact, simply deleting the .git/rebase-apply directory and doing git checkout {branch} is sufficient to “cancel” a rebase - although it is advisable to use the normal commands instead.
Special handling exists to avoid reapplying “cherry-picked” patches; any commit whose patch-file is identical to the patch-file of a commit already in the source (upstream) branch is skipped, even if the commit-message is different.
Tag Objects
As described above, tag references are simply local (name=>checksum) mappings stored in a local file. These are shared between repositories (copied during git fetch, optionally uploaded during git push etc). However they are not versioned objects; when a tag reference is modified there is no way (except possibly looking in a reflog file) to see the previous value. And there is no associated meta-data on them; changes to a tag cannot be traced back to any person or a time.
It is sometimes useful to have tags (ie names for commits) whose creation is properly tracked in the repository. A special Git “tag” object type exists for this purpose. A tag object points to a single object (normally a commit), and has a name, a tagger and date, a description, and optionally a cryptographic signature. Because the signature covers the tag object, the tag object contains the checksum of the tagged commit, that commit contains the checksums of its parents and its filetree, and these in turn contain checksums of their dependent data items, a single signature really “locks down” the entire history which led to that point. If you can validate the signature on the tag, then you know who created the tag, when they created the tag, and that the version they tagged has not been modified since the tag was created.
Tag objects are created with git tag -a {name} or git tag -s {name}; the new object is stored in the object pool, and a tag reference pointing to it is created. These tags are listed in the output of git tag just like the “lightweight” tag references, but git show {tagname} clearly shows the difference: for lightweight tags, what is shown is simply the commit that the tag points to, while for “tag objects” the tag object data is shown, including who, when, and the tagged commit’s checksum.
Tag objects aren’t part of the commit history (ie no commit has a tag as its parent) so they don’t show up in git log output, but the useful git describe command will show the most recent tag for the current working copy.
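To make the “locking down” argument concrete, here is a small Python sketch of how an unsigned tag object is laid out and checksummed. The field layout follows Git's tag object format; the function names, the tagger identity and the commit checksums are made up for the example, and signing is left out entirely.

```python
import hashlib

def hash_object(kind: str, body: bytes) -> str:
    """SHA-1 over a '<type> <size>' header, a NUL byte, then the body."""
    header = f"{kind} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()

def tag_body(commit: str, name: str, tagger: str, message: str) -> bytes:
    """Lay out the fields of an annotated tag that points at a commit."""
    return (
        f"object {commit}\n"
        "type commit\n"
        f"tag {name}\n"
        f"tagger {tagger}\n"
        "\n"
        f"{message}\n"
    ).encode()

# Made-up values, purely for illustration.
tagger = "Jo Dev <jo@example.com> 1700000000 +0000"
good = hash_object("tag", tag_body("a" * 40, "v1.0", tagger, "First release"))
other = hash_object("tag", tag_body("b" * 40, "v1.0", tagger, "First release"))
print(good != other)   # True: pointing the tag at another commit changes its checksum
```

Since the tagged commit's checksum is part of the tag body, a signature over that body indirectly covers everything the commit's checksum covers.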
The Stash
As noted in many Git guides, it is possible to ‘temporarily’ save work with the git stash save command (newer Git versions prefer the equivalent git stash push), work on something else, then restore your in-progress work with git stash pop. What they often don’t bother to mention is that git stash save simply does a real commit; however rather than moving the “tip” of the current branch to point to that commit (which is what normally happens when a commit is done), it leaves the current branch reference file alone, and simply stores the checksum of the new commit on a stack (well, writes it to the reference file .git/refs/stash; the reflog for that reference, .git/logs/refs/stash, keeps the older values, which gives the effect of a stack). The commit’s message is automatically generated as “WIP on {branchname}: {prevcommit-id} {prevcommit-msg}”.
The commits stored on the stack can be referred to with a slightly odd syntax “stash@{n}”, but otherwise they are rather like ordinary tags; git show stash@{0} will display the actual commit created. Real tags can be created pointing to them, and similar tricks if you feel like it; you can even check one out directly (but as it isn’t a branch, Git warns about being in a ‘detached head’ state). When git stash pop is run, it simply restores the index and current workspace using the data from that ‘anonymous’ commit object, and removes its entry from the stash reflog (setting .git/refs/stash back to the previous entry, if there is one). As nothing now points to the commit any more, it is eventually garbage-collected.
Git Security
Because checksums are used, Git is inherently resistant to tampering. Modifying a blob object’s contents without updating its checksum will cause checks to fail during git fetch. Modifying its checksum to match is simply equivalent to creating a branch - which means the bad code is not on the branch any more, and will be ignored.
Signed tags also help (see above).
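The reason a modified blob cannot hide is that checksums are nested: the tree contains the blob's checksum, and the commit contains the tree's checksum. The following Python sketch shows the chain. The object header and tree entry layout follow Git's formats, but the commit body is deliberately stripped down (a real commit also has parent, author and committer lines), every file is assumed to be a regular file in a single directory, and the function names are invented for the example.

```python
import hashlib

def raw_checksum(kind: bytes, body: bytes) -> bytes:
    """Raw 20-byte SHA-1 over '<type> <size>', a NUL byte, then the body."""
    return hashlib.sha1(kind + b" %d\0" % len(body) + body).digest()

def tree_body(entries: dict) -> bytes:
    """Each tree entry is '<mode> <name>', a NUL byte, then the raw checksum."""
    return b"".join(
        b"100644 " + name.encode() + b"\0" + checksum
        for name, checksum in sorted(entries.items())
    )

def snapshot(files: dict, message: str) -> str:
    """Hash the blobs, then the tree holding them, then a commit holding the tree."""
    blobs = {name: raw_checksum(b"blob", data) for name, data in files.items()}
    tree = raw_checksum(b"tree", tree_body(blobs))
    # Simplified commit body: only the tree line and the message.
    commit = b"tree " + tree.hex().encode() + b"\n\n" + message.encode() + b"\n"
    return raw_checksum(b"commit", commit).hex()

original = snapshot({"main.c": b"int main(void) { return 0; }\n"}, "initial")
tampered = snapshot({"main.c": b"int main(void) { return 1; }\n"}, "initial")
print(original != tampered)   # True: one changed byte changes the commit checksum
```

So a tampered file either fails its own checksum test, or forces new tree and commit checksums, which no existing branch or signed tag refers to.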
While some version control systems provide the ability to restrict read access to some parts of the repository for some users, this is not possible with Git. Firstly, it would destroy one of Git’s fundamental features: that all repositories are complete mirrors of each other, and only a workflow convention marks some repositories as being more important than others. Secondly, it is technically difficult to allow somebody to clone an existing repository, yet not send all of the nodes; each commit references the root of a tree, and at the leaves of the tree are all the “blob” file objects, with checksums that should match their contents. For nodes to not exist, or for their contents to not match their checksums, would cause major problems for Git’s consistency checks. For all these reasons, access control for read operations is not supported; if you need to prevent read access to some files by unauthorised users, then the only option is to put those files in a different Git repository.
However, implementing write access control to parts of a repository is quite possible. A Git repository supports “hook scripts” which are invoked on various operations, including push and pull. A quick search of the internet should locate hook scripts which can be used to prevent remote users from performing a “push” from their repository to the restricted one if any commit includes specific paths.
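The core of such a hook is small. The Python sketch below shows only the decision logic, under several assumptions: the restricted path prefixes and function names are invented, and the list of changed paths is passed in directly. A real server-side pre-receive hook would read the “old new refname” lines from standard input, obtain the changed paths with something like git diff --name-only {old} {new}, and exit with a non-zero status to reject the push.

```python
# Hypothetical policy: path prefixes that may not be changed by a push.
RESTRICTED_PREFIXES = ("release/", "keys/")

def parse_ref_updates(lines):
    """Each line a pre-receive hook reads has the form '<old> <new> <refname>'."""
    updates = []
    for line in lines:
        if line.strip():
            old, new, ref = line.split()
            updates.append((old, new, ref))
    return updates

def forbidden_paths(changed_paths, prefixes=RESTRICTED_PREFIXES):
    """Return the changed paths that fall under a restricted prefix."""
    return sorted(p for p in changed_paths if p.startswith(prefixes))

# Made-up example input standing in for the hook's stdin and the diff output.
updates = parse_ref_updates(["%s %s refs/heads/master\n" % ("a" * 40, "b" * 40)])
bad = forbidden_paths(["src/main.c", "keys/prod.pem"])
print(updates[0][2], bad)   # refs/heads/master ['keys/prod.pem']
```

If the list of forbidden paths is non-empty, the hook would print an explanation and refuse the push.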
Git Terminology
The following term definitions are useful:
Treeish ==> anything that identifies a tree (directory) object: the checksum of a tree object itself, or a commit or tag that leads to one
Comparing Git to SVN
Git’s filesystem snapshots and SVN’s tree are fairly similar in many respects. Both have immutable files/directories, and a new commit results in a new tree where changes to leaves “bubble up” to produce a new “root” tree object for the filesystem.
And like Git, an svn “commit” record contains an identifier of the root tree object for the filesystem tree that results from that patch. However unlike Git, SVN’s directory/file identifiers are arbitrary integers, not checksums. So when importing a commit from some other system it is difficult/impossible to know which files have been modified and which already exist in the local repository.
In SVN, creating a branch is equivalent to creating a directory within the filesystem. The disadvantage of this is that there is no “graph” of commits representing the relationship between branches. This in turn makes proper merging difficult.
References
Information about learning Git in general:
- Git SCM - The official Git manual
- Git Ready - Tutorial and cookbook for Git, including advanced commands
Information about Git internals:
- Git for Computer Scientists - eagain.net - a somewhat similar article to this one.