[llvm-dev] [RFC] One or many git repositories?

Wed Jul 20 16:39:44 PDT 2016

Dear all,

I would like to (re-)open a discussion on the following specific question:

  Assuming we are moving the llvm project to git, should we
  a) use multiple git repositories, linked together as subrepositories
of an umbrella repo, or
  b) use a single git repository for most llvm subprojects.

The current proposal assembled by Renato follows option (a), but I
think option (b) will be significantly simpler and more effective.
Moreover, I think the issues raised with option (b) are either
incorrect or can be reasonably addressed.

Specifically, my proposal is that all LLVM subprojects that are
"version-locked" (and/or use the common CMake build system) live in a
single git repository.  That probably means all of the main llvm
subprojects other than the test-suite and maybe libc++.  From looking
at the repository today that would be: llvm, clang, clang-tools-extra,
lld, polly, lldb, llgo, compiler-rt, openmp, and parallel-libs.

Let's first talk about the advantages of a single repository.  Then
we'll address the disadvantages raised.

At a high level, one repository is simpler than multiple repos that
must be kept in sync using an external mechanism.  The submodules
solution requires nontrivial automation to maintain the history of
commits in the umbrella repo (which we need if we want to bisect, or
even just build an old revision of clang), but no such mechanisms are
required if we have a single repo.

Similarly, it's possible to make atomic API changes across subprojects
in a single repo; we simply can't do that with the submodules proposal.
And working with llvm release branches becomes much simpler.

In addition, the single repository approach ties branches that contain
changes to subprojects (e.g. clang) to a specific version of llvm
proper.  This means that when you switch between two branches that
contain changes to clang, you'll automatically check out the right
llvm bits.

Although we can do this with submodules too, a single repository makes
it much easier.

As a concrete example, suppose you are working on some changes in
clang.  You want to commit the changes, then switch to a new branch
based on tip of head and make some new changes.  Finally you want to
switch back to your original branch.  And when you switch between
branches, you want to get an llvm that's in sync with the clang in
your working copy.

Here's how I'd do it with a monolithic git repository, option (b):

  git commit # old-branch
  git fetch
  git checkout -b new-branch origin/master
  # hack hack hack
  git commit # new-branch
  git checkout old-branch

Here's how I'd do it with option (a), submodules.  I've used git -C
here to make it explicit which repo we're working in, but in real life
I'd probably use cd.

  # First, commit to two branches, one in your clang repo and one in your
  # master repo.
  git -C tools/clang commit # old-branch, clang submodule
  git commit # old-branch, master repo
  # Now fetch the submodule and check out head.  Start a new branch in the
  # umbrella repo.
  git submodule foreach git fetch
  git checkout -b new-branch origin/master
  git submodule update
  # Start a new branch in the clang repo pointing to the current head.
  git -C tools/clang checkout -b new-branch
  # hack hack hack
  # Commit both branches.
  git -C tools/clang commit # new-branch
  git commit # new-branch
  # Check out the old branch.
  git checkout old-branch
  git submodule update

This is twice as many git commands, and almost three times as much
typing, to do the same thing.

Indeed, this is so complicated that I expect many developers wouldn't
bother, and would continue to develop the way we currently do.  They
would thus remain unable to create clang branches that include an llvm
revision.  :(

There are real simplifications and productivity advantages to be had
by using a single repository.  They will affect essentially every
developer who makes changes to subprojects other than LLVM proper,
cares about release branches, bisects our code, or builds old
revisions.

So that's the first part, what we have to gain by using a monolithic
repository.  Let's address the downsides.

If you'll bear with a hypothetical: Imagine you could somehow make the
monolithic repository behave exactly as the N separate repositories
do today.  If so, that would be the best of both worlds: Those of us
who want a monolithic repository could have one, and those of us who
don't would be unaffected.  Whatever downsides you were worried about
would evaporate in a mist of rainbows and puppies.

It turns out this hypothetical is very close to reality.  The key is
git sparse checkouts [1], which let you check out only some files or
directories from a repository.  Using this facility, if you don't like
the switch to a monolithic repository, you can set up your git so
you're (almost) entirely unaffected by it.

If you want to check out only llvm and clang, no problem. Just set up
your .git/info/sparse-checkout file appropriately.  Done.
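
As a rough sketch of what "appropriately" might look like: the script
below builds a throwaway stand-in for the monolithic repo and then
restricts the working tree to llvm and clang.  The toy repo, its
top-level llvm/, clang/, lld/ layout, and the README files are
assumptions made for illustration only; the core.sparseCheckout
setting, the sparse-checkout file, and git read-tree are the standard
mechanism.

```shell
#!/bin/sh
set -e

# Toy stand-in for the monolithic repo (hypothetical layout and files).
work=$(mktemp -d)
git init -q "$work/mono"
cd "$work/mono"
git config user.email "you@example.com"
git config user.name "Example"
for proj in llvm clang lld; do
  mkdir "$proj"
  echo "$proj" > "$proj/README"
done
git add .
git commit -q -m "Import subprojects"

# Turn on sparse checkout and list the directories to keep.
git config core.sparseCheckout true
mkdir -p .git/info
printf '%s\n' '/llvm/' '/clang/' > .git/info/sparse-checkout

# Re-read HEAD so the working tree is trimmed to the sparse patterns.
git read-tree -mu HEAD

ls
```

In a real clone you would skip the setup half and run only the last
three git-related steps, adjusting the patterns to the directories you
care about.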

If you want to be able to have two different revisions of llvm and
clang checked out at once (maybe you want to update your clang bits
more often than you update your llvm bits), you can do that too.  Make
one sparse checkout just of llvm, and make another sparse checkout
just of clang.  Symlink the clang checkout to llvm/tools/clang.
That's it.  The two checkouts can even share a common .git dir, so you
don't have to fetch and store everything twice.
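
Here is one way this could be set up, sketched against a throwaway
repo.  It uses git worktree (git 2.5 or later) for the shared .git
dir, which is my assumption about the mechanism; other tools would
work too.  The directory names llvm-co and clang-co, the branch name
clang-work, and the toy files are all hypothetical.

```shell
#!/bin/sh
set -e

# Toy stand-in for the monolithic repo (hypothetical layout and files).
work=$(mktemp -d)
git init -q "$work/llvm-co"
cd "$work/llvm-co"
git config user.email "you@example.com"
git config user.name "Example"
mkdir -p llvm/tools clang
echo llvm > llvm/tools/README
echo clang > clang/README
git add .
git commit -q -m "Import subprojects"

# Second checkout on its own branch, sharing the first checkout's .git dir.
git worktree add -b clang-work "$work/clang-co" >/dev/null 2>&1

# Sparse checkout is switched on once; the pattern file is per checkout.
git config core.sparseCheckout true

# First checkout: llvm only.
sparse=$(git rev-parse --git-path info/sparse-checkout)
mkdir -p "$(dirname "$sparse")"
echo '/llvm/' > "$sparse"
git read-tree -mu HEAD

# Second checkout: clang only.
cd "$work/clang-co"
sparse=$(git rev-parse --git-path info/sparse-checkout)
mkdir -p "$(dirname "$sparse")"
echo '/clang/' > "$sparse"
git read-tree -mu HEAD

# Make the clang checkout visible where the llvm build expects it.
ln -s "$work/clang-co/clang" "$work/llvm-co/llvm/tools/clang"
```

Each checkout can then sit on a different branch or revision, and the
symlink shows up in the llvm checkout as an untracked file.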

As far as I can tell, the only overhead of the monolithic repository
is the extra storage in .git.  But this is quite small in the scheme
of things.

The .git dir for the existing monolithic repository [2] is 1.2GB.  By
way of comparison, my objdir for a release build of llvm and clang is
3.5GB, and a full checkout (workdir + .git dirs) of llvm and clang is
0.65GB.

If the 1.2GB really is a problem for you (or more likely, your
automated infrastructure), a shallow clone [3] takes this down to 90MB.
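
To illustrate what a shallow clone keeps, here is a small sketch that
runs entirely against a throwaway local repo, so it needs no network.
The repo, its three commits, and file.txt are made up for the
example; only --depth=1 comes from [3].

```shell
#!/bin/sh
set -e

# Toy source repo with a short history (hypothetical contents).
work=$(mktemp -d)
git init -q "$work/src"
cd "$work/src"
git config user.email "you@example.com"
git config user.name "Example"
for i in 1 2 3; do
  echo "$i" > file.txt
  git add file.txt
  git commit -q -m "Commit $i"
done

# Use a file:// URL: a plain local path would ignore --depth.
git clone -q --depth=1 "file://$work/src" "$work/shallow"

# Count the commits reachable in each repo.
git -C "$work/src" rev-list --count HEAD
git -C "$work/shallow" rev-list --count HEAD
```

The shallow clone has the full working tree for the tip commit but
only that one commit of history.  If you need more later, git fetch
--unshallow retrieves the rest.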

The critical point to me in all this is that it's easy to set up the
monolithic repository to appear like it's a bunch of separate repos.
But it is impossible, insofar as I can tell, to do the opposite.  That
is, option (b) is strictly more powerful than option (a).

Renato has understandably pointed out that the current proposal is
pretty far along, so please speak up now if you want to make this
happen.  I think we can.

Regards,
-Justin

[1] Git sparse checkouts were introduced in git 1.7, in 2010. For more
info, see http://jasonkarns.com/blog/subdirectory-checkouts-with-git-sparse-checkout/.
As far as I can tell, sparse checkouts work fine on Windows, but you
have to use git-bash; see http://stackoverflow.com/q/23289006.
[2] https://github.com/llvm-project/llvm-project
[3] git clone --depth=1 https://github.com/llvm-project/llvm-project.git