Categories: Programming, Git
The Use Case
I recently had to create a single Git repo (company internal) holding a mirror of several other projects (from Github). While Git can do this, the default behaviour is not optimal; it is difficult for others to view the uploaded branches, and tag names from remote repos are mixed into the primary tag namespace leading to rather confusing results.
To be more specific, I needed to clone a central internal repo, pull down all code from several remote Github repos, and then push all the retrieved code back to the central internal repo where it can be accessed by the internal build tools.
Usually, when internal software depends on an external open-source component, the compiled version (binary) of that component is available via some public artifact repository. A company-internal Maven repository manager may be used as a mirror/proxy for external repos, ensuring that a company-local copy exists for every binary artifact used in internal builds. Builds of internal software then remain valid (and reproducible) even if the external repositories disappear (or in the unusual case where a single artifact is removed, or even changed).
However in some cases it is necessary to copy the source code of an external project rather than just the resulting binary:
- some open-source projects do not produce binary artifacts (eg kafka-connect connectors, which is my particular use-case)
- sometimes it is necessary to patch/modify the original code for local purposes.
Basic Git Remote Functionality
Importing code from a remote repo into a local Git repo is trivial: git remote and git fetch do the hard work. However there are two problems:
- while git fetch on a remote nicely namespaces remote branches, it does not handle remote tags so elegantly, and
- remote definitions are inherently local to the repo in which they are declared; you cannot push a remote-definition up to a central repo from which colleagues can pull them. Code, branch and tag definitions can be pushed - but not remote definitions.
When all that is desired is to import a single specific branch or tag from a remote repo into a central (shared) repo then the workflow is fairly easy:
- clone the central repo
- define a remote referencing the desired remote repo
- git fetch the required branch or tag
- make a suitable local name for that remote branch or tag (ie create an alternative name pointing at the same commit)
- push the alternative name to the central repo
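The steps above can be sketched as a small script. This is only an illustration under assumptions: the throwaway repos "external" (standing in for the Github project) and "central.git" (standing in for the internal repo), the remote name "ext" and the branch name "ext-main" are all made up for the demo.

```shell
set -eu
work=$(mktemp -d) && cd "$work"
# Identity for the throwaway commit (hypothetical values).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

# Stand-ins for the external project and the central internal repo.
git init -q external
git -C external symbolic-ref HEAD refs/heads/main
git -C external commit -q --allow-empty -m "external work"
git init -q --bare central.git

# 1. clone the central repo
git clone -q central.git importer
cd importer
# 2. define a remote referencing the external repo
git remote add ext "$work/external"
# 3. fetch only the required branch
git fetch -q ext main
# 4. create a normal local branch pointing at the same commit
git branch --no-track ext-main ext/main
# 5. push the new name to the central repo
git push -q origin ext-main

git ls-remote "$work/central.git"
```

The final command should list refs/heads/ext-main in the central repo.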
When the goal is to import all of the branches/tags from the remote repo, then push references to those objects into the central repo in a way which is easily readable and usable by colleagues, then things are a little trickier - the standard behaviour of git remote and git fetch doesn’t work so well for that.
How Git Remote Works
First, an explanation of what git remote and git fetch really do…
The command git remote add {remotename} {remoteurl} defines a local alias for the URL of an external git repository. Technically, it just creates some entries in file .git/config which look something like this:
[remote "remote1"]
url = git://remotehost/remoterepo
fetch = +refs/heads/*:refs/remotes/remote1/*
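The same entries can be inspected without opening the file. A minimal sketch, assuming a scratch repo named "demo" and reusing the example remote name and URL from above (nothing is contacted, since git remote add only writes config):

```shell
set -eu
work=$(mktemp -d) && cd "$work"
git init -q demo && cd demo

# Define the remote; this only writes to .git/config.
git remote add remote1 git://remotehost/remoterepo

# Print every config key belonging to that remote.
git config --get-regexp '^remote\.remote1\.'
```

The fetch entry is the "refspec" that tells git fetch where to store what it downloads; the rest of this article is about changing it.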
The command git fetch {remotename} will retrieve the head commit of every branch in the remote repository, and of course all commits in their history. For each fetched branch, Git stores the commit-id in a file under .git/refs/remotes/{remotename}/{branchname}. Running git branch only shows “local” branches (whose commit-ids are stored under .git/refs/heads/{branchname}). Running git branch -r instead shows branches whose commit-ids are stored under .git/refs/remotes; the branches are shown together with the remote-name which shows where the branch was fetched from.
Note that when the local repo was created as a clone of another repo, then there is already one remote defined, named origin, which is stored in the same way as any other remote.
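To see where fetched branches end up, here is a small sketch using throwaway repos. The repo names "external" and "local", the remote name "ext" and the branch names are assumptions made for the demo.

```shell
set -eu
work=$(mktemp -d) && cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

# A stand-in external repo with two branches.
git init -q external
git -C external symbolic-ref HEAD refs/heads/main
git -C external commit -q --allow-empty -m "first commit"
git -C external branch dev

# A fresh local repo that fetches from it.
git init -q local && cd local
git remote add ext "$work/external"
git fetch -q ext

git branch       # no local branches exist yet
git branch -r    # the fetched branches, prefixed with the remote name
git for-each-ref --format='%(refname)' refs/remotes
```

Depending on the Git version, the remote-tracking list may also contain an ext/HEAD entry alongside ext/dev and ext/main.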
The Problems with Git Remote
Although git fetch {remotename} does fetch branches nicely, they cannot be pushed so easily to a central repo. The default behaviour of command git push is to push the current branch to its “upstream repo”; git push --all will instead push every local branch (everything under .git/refs/heads) to the origin repo. However “push all” ignores those remote branches obtained via git fetch {remote} (which are under .git/refs/remotes). Command git push --mirror pushes everything under .git/refs/* to the origin repo, which at first appears to be exactly what we want - but colleagues who simply clone that origin repo, or git fetch to download from it, will not receive those “remote branches”, only stuff under .git/refs/heads. It is possible to download such remote-branches from an origin repo, but it is non-trivial and what we really want for this use-case is for others to be able to see the imported copies of remote code without complex steps. In short, remote branches are tricky even in the repo where the remotes were defined, and doubly tricky when trying to share them with others via a shared upstream repo.
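The following sketch reproduces the problem with throwaway repos. All names here ("external", "central.git", "importer", "colleague", remote "ext") are invented for the demo; only the git push and git clone behaviour is the point.

```shell
set -eu
work=$(mktemp -d) && cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

# Stand-ins for the external project and the central internal repo.
git init -q external
git -C external symbolic-ref HEAD refs/heads/main
git -C external commit -q --allow-empty -m "external work"
git init -q --bare central.git
git -C central.git symbolic-ref HEAD refs/heads/main

# Clone the central repo, add one internal commit, fetch the external code.
git clone -q central.git importer && cd importer
git symbolic-ref HEAD refs/heads/main
git commit -q --allow-empty -m "internal work"
git remote add ext "$work/external"
git fetch -q ext

# "push all" sends only refs/heads/*.
git push -q --all origin
git ls-remote "$work/central.git"

# --mirror also sends refs/remotes/ext/*, so the server now has them ...
git push -q --mirror origin
git ls-remote "$work/central.git"

# ... but a plain clone still receives only the server's refs/heads/*.
git clone -q "$work/central.git" "$work/colleague"
git -C "$work/colleague" for-each-ref --format='%(refname)'
```

The last listing should contain nothing from the "ext" remote, even though the second git ls-remote shows it on the server.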
In addition, git remote and git fetch handle tags in remote repos differently from branches in remote repos. By default git fetch {remotename} downloads every branch, plus only those tags which point into the history being fetched. Command git fetch --tags {remotename} will download all tags as well as the branches (note that git fetch --all does something different: it fetches from all defined remotes). Whereas the remote branches are stored separately from local branches under .git/refs/remotes, any fetched tags are just mixed in with the local tags under .git/refs/tags. That is good in that pushing the tags to the upstream repo happens easily with git push --tags. However the bad part is that such tags are not “namespaced” in any way; there is no indication of where they came from. When the tags are just labels like “v1.0” then that is not very helpful. And when the upstream repo should host clones of multiple external projects, then the danger of tagname-clashes is significant.
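A small sketch of the clash, assuming two made-up external projects "projA" and "projB" which both carry a tag named v1.0, fetched into one repo via remotes "a" and "b":

```shell
set -eu
work=$(mktemp -d) && cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

# Two unrelated stand-in projects, each with its own v1.0 tag.
for p in projA projB; do
  git init -q "$p"
  git -C "$p" commit -q --allow-empty -m "$p release"
  git -C "$p" tag v1.0
done

# Fetch both into one repo using the default remote configuration.
git init -q local && cd local
git remote add a "$work/projA"
git remote add b "$work/projB"
git fetch -q a
git fetch -q b

# Branches are namespaced per remote; tags share one flat namespace.
git for-each-ref --format='%(refname)' refs/remotes refs/tags
```

The listing shows separate remote branches for "a" and "b" but only a single refs/tags/v1.0, with nothing to say which project it belongs to.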
Note: after fetching, the downloaded objects can be stored in “packfiles” under .git/objects/pack, and branch and tag references can be stored in the single file .git/packed-refs rather than as individual files, ie looking in .git/refs doesn’t always provide the whole picture.
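For that reason it is safer to ask Git for the list of references than to look at the directory. A minimal sketch, assuming a scratch repo named "demo" and the default file-based reference storage; git pack-refs is used here only to force the packed situation:

```shell
set -eu
work=$(mktemp -d) && cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

git init -q demo && cd demo
git commit -q --allow-empty -m "first commit"
git tag v1.0

# Move all references into the single packed-refs file.
git pack-refs --all
cat .git/packed-refs

# for-each-ref reports references wherever they are stored.
git for-each-ref --format='%(refname)'
```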
My Solution
My solution for defining a repository containing a clone of multiple external repos is as follows:
- clone the upstream repo
git remote add {remotename} {remoteurl}
git config --unset-all remote.{remotename}.fetch
git config --add remote.{remotename}.tagOpt --no-tags
git config --add remote.{remotename}.fetch +refs/heads/*:refs/heads/{remotealias}/*
git config --add remote.{remotename}.fetch +refs/tags/*:refs/tags/{remotealias}/*
git fetch {remotename}
git push --all
git push --tags
Value remotealias can be the same as remotename if desired.
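Here is the whole procedure run end to end against throwaway repos, as a sketch rather than a finished tool. The repos "external", "central.git", "importer" and "colleague" are made up, "ext" is used as both remotename and remotealias, and the tags are pushed with a separate git push --tags because git push --all covers only branches.

```shell
set -eu
work=$(mktemp -d) && cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

# Stand-ins for the external project and the central internal repo.
git init -q external
git -C external symbolic-ref HEAD refs/heads/main
git -C external commit -q --allow-empty -m "external work"
git -C external tag v1.0
git init -q --bare central.git

# Clone the upstream repo and define the remote.
git clone -q central.git importer && cd importer
git remote add ext "$work/external"

# Replace the default refspec and tag handling with namespaced ones.
git config --unset-all remote.ext.fetch
git config --add remote.ext.tagOpt --no-tags
git config --add remote.ext.fetch '+refs/heads/*:refs/heads/ext/*'
git config --add remote.ext.fetch '+refs/tags/*:refs/tags/ext/*'

# Fetch, then publish branches and tags to the central repo.
git fetch -q ext
git push -q --all origin
git push -q --tags origin
git ls-remote "$work/central.git"

# A colleague's plain clone sees everything, prefixed with "ext/".
git clone -q -b ext/main "$work/central.git" "$work/colleague"
git -C "$work/colleague" branch -r
git -C "$work/colleague" tag
```

The refspecs are quoted so that the shell does not try to expand the asterisks.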
This solution simply overrides some of the default behaviour of git fetch. It causes the head of each branch to be fetched, but disables the default fetching of tags, and instead of creating the branch under refs/remotes/{remotename} it creates it under refs/heads/{remotealias}. Then it causes all tags to be fetched, and creates the local equivalents under refs/tags/{remotealias}. Both branches and tags are now namespaced in the same way.
I mentioned earlier that when a single branch or tag from a remote repo is needed, then that specific branch/tag can be fetched, and then a local branch or tag can be made which points at the desired commit. This solution is really just an automated version of that approach - rather than allowing git fetch to register remote branches under .git/refs/remotes it causes git fetch to immediately register them under the normal branch base dir. And rather than allowing git fetch to register tags under the normal tag base dir, it causes them to be created with a specific prefix. The result is that those imported branches/tags are not “special” like remote-branches are by default, but are normal branches which can be pushed/fetched just like the others - just with special names which indicate their source.
The disadvantage of this approach is that it is not so easy to push changes back to the real upstream project. However that was never part of this use-case.