[llvm-dev] [RFC] One or many git repositories?

Fri Jul 22 17:33:24 PDT 2016

On Fri, Jul 22, 2016 at 01:08:18PM -0700, Richard Smith via llvm-dev wrote:
> Having read through the entire thread and thought about this for a while,
> here are my thoughts:
> 
>  * A single monolithic repository has quite a lot of advantages, some
> because of what it is (for instance, you can make atomic cross-project
> commits), and some because of what it isn't (keeping the repositories
> separate creates synchronization problems for version-locked components,
> and it's not clear to me that we have a good answer for these problems).
> 
>  * A single repository from which we can build a complete LLVM toolchain,
> without requiring checking out a dozen components in seemingly-random
> locations, would be valuable. The default behavior for someone checking out
> and building the LLVM project should be that they get a complete,
> fully-functional toolchain.
> 
>  * We need to preserve and maintain the easy ability to mix and match LLVM
> components with other components (other C runtime libraries, C++ ABI
> libraries, C++ standard libraries, linkers, debuggers, ...). That means
> that it needs to be obvious what the boundaries of the optional components
> are, which means that the current project layout (the one implied by the
> build system) is not good enough for a monolithic repository (LLVM tests
> will fail if you don't check out llvm/tools/opt, but we presumably want to
> explicitly support not checking out llvm/tools/clang) -- unless we have
> extensive documentation covering this, and even then there are likely to be
> discoverability issues.
> 
> However, the move to git and the reorganization need not be done at the
> same time, and it seems vastly easier to reorganize *after* we move to a
> monolithic git repository -- it would then be essentially trivial for each
> person with organizational ideas to move the code around in their
> monolithic git repository, push it somewhere where we can all look at it,
> and for us to then make an informed choice about the layout, with a
> concrete example in front of us. Then we push the selected new layout; git
> supports this really nicely if all the parts are already in a single
> repository.
> 

I am also in favor of using a monolithic repo.  We are currently
using the monolithic llvm-project repo[1] for some of our automated
testing, and it is much easier to deal with than the separate repos,
especially in our case, where we always build a complete toolchain
(for us this means llvm, lld, and clang).

-Tom

[1] https://github.com/llvm-project/llvm-project

> So here's what I would suggest:
> 
> - we move to a monolithic git repository on github
> 
> - this monolithic repository contains all the LLVM subprojects necessary to
> build a complete toolchain, including libc++ and other pieces that are not
> version-locked to llvm or clang
> 
> - the initial structure exactly matches the current layout implied by the
> build system (clang in tools/clang, lld in tools/lld, compiler-rt in
> runtimes/compiler-rt, libc++ in projects/libcxx, and so on)
> 
> - after we transition to git, interested parties assemble and upload to
> github patches reorganizing the project structure, and we have another
> discussion about principles for the restructuring (including forming solid
> guidance for how to organize future additions to LLVM), with reference to
> the patches so we can look at the proposed new layout; we pick one and
> commit it
> 
> The goal would be to have the new layout entirely settled by the time 4.0
> branches.
> 
> On Wed, Jul 20, 2016 at 4:39 PM, Justin Lebar via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
> 
> > Dear all,
> >
> > I would like to (re-)open a discussion on the following specific question:
> >
> >   Assuming we are moving the llvm project to git, should we
> >   a) use multiple git repositories, linked together as subrepositories
> > of an umbrella repo, or
> >   b) use a single git repository for most llvm subprojects.
> >
> > The current proposal assembled by Renato follows option (a), but I
> > think option (b) will be significantly simpler and more effective.
> > Moreover, I think the issues raised with option (b) are either
> > incorrect or can be reasonably addressed.
> >
> > Specifically, my proposal is that all LLVM subprojects that are
> > "version-locked" (and/or use the common CMake build system) live in a
> > single git repository.  That probably means all of the main llvm
> > subprojects other than the test-suite and maybe libc++.  From looking
> > at the repository today that would be: llvm, clang, clang-tools-extra,
> > lld, polly, lldb, llgo, compiler-rt, openmp, and parallel-libs.
> >
> > Let's first talk about the advantages of a single repository.  Then
> > we'll address the disadvantages raised.
> >
> > At a high level, one repository is simpler than multiple repos that
> > must be kept in sync using an external mechanism.  The submodules
> > solution requires nontrivial automation to maintain the history of
> > commits in the umbrella repo (which we need if we want to bisect, or
> > even just build an old revision of clang), but no such mechanisms are
> > required if we have a single repo.
> >
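
To make the bisect point concrete, here is a rough sketch (not from
the original proposal). The repository, file names, and commit
messages are invented for illustration; the only assumption is that
llvm/ and clang/ sit side by side in one repository, so a single
commit hash pins both.

```shell
#!/bin/sh
set -eu

# Throwaway single repository holding two subprojects (hypothetical layout).
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
mkdir llvm clang

# Helper: commit everything with a fixed identity so no global config is needed.
commit() {
    git add .
    git -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"
}

echo "v1" > llvm/api
echo "a" > clang/user
commit "r1: baseline"
echo "b" > clang/user
commit "r2: clang change"
echo "v2" > llvm/api
commit "r3: llvm API change"
echo "c" > clang/user
commit "r4: clang change"

# Bisect between the known-bad tip and the known-good baseline.  Every step
# checks out llvm/ and clang/ together, so there is no pairing to work out.
git bisect start HEAD HEAD~3 >/dev/null
# The test command exits 0 (good) while llvm/api still contains "v1".
git bisect run grep -q v1 llvm/api >/dev/null

# When the run finishes, refs/bisect/bad names the first bad commit.
first_bad_subject=$(git log -1 --format=%s refs/bisect/bad)
git bisect reset >/dev/null 2>&1
echo "first bad commit: $first_bad_subject"
```

With submodules, the same search would first need an umbrella history
that records which clang revision goes with which llvm revision.
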
> > Similarly, it's possible to make atomic API changes across subprojects
> > in a single repo; we simply can't do that with the submodules proposal.
> > And working with llvm release branches becomes much simpler.
> >
> > In addition, the single repository approach ties branches that contain
> > changes to subprojects (e.g. clang) to a specific version of llvm
> > proper.  This means that when you switch between two branches that
> > contain changes to clang, you'll automatically check out the right
> > llvm bits.
> >
> > Although we can do this with submodules too, a single repository makes
> > it much easier.
> >
> > As a concrete example, suppose you are working on some changes in
> > clang.  You want to commit the changes, then switch to a new branch
> > based on tip of head and make some new changes.  Finally you want to
> > switch back to your original branch.  And when you switch between
> > branches, you want to get an llvm that's in sync with the clang in
> > your working copy.
> >
> > Here's how I'd do it with a monolithic git repository, option (b):
> >
> >   git commit # old-branch
> >   git fetch
> >   git checkout -b new-branch origin/master
> >   # hack hack hack
> >   git commit # new-branch
> >   git checkout old-branch
> >
> > Here's how I'd do it with option (a), submodules.  I've used git -C
> > here to make it explicit which repo we're working in, but in real life
> > I'd probably use cd.
> >
> >   # First, commit to two branches, one in your clang repo and one in your
> >   # umbrella repo.
> >   git -C tools/clang commit # old-branch, clang submodule
> >   git commit # old-branch, umbrella repo
> >   # Now fetch the submodule and check out head.  Start a new branch in the
> >   # umbrella repo.
> >   git submodule foreach git fetch
> >   git checkout -b new-branch origin/master
> >   git submodule update
> >   # Start a new branch in the clang repo pointing to the current head.
> >   git -C tools/clang checkout -b new-branch
> >   # hack hack hack
> >   # Commit both branches.
> >   git -C tools/clang commit # new-branch
> >   git commit # new-branch
> >   # Check out the old branch.
> >   git checkout old-branch
> >   git submodule update
> >
> > This is twice as many git commands, and almost three times as much
> > typing, to do the same thing.
> >
> > Indeed, this is so complicated that I expect many developers wouldn't
> > bother and would continue to develop the way we currently do.  They
> > would thus continue to be unable to create clang branches that include
> > an llvm revision.  :(
> >
> > There are real simplifications and productivity advantages to be had
> > by using a single repository.  They will affect essentially every
> > developer who makes changes to subprojects other than LLVM proper,
> > cares about release branches, bisects our code, or builds old
> > revisions.
> >
> >
> > So that's the first part, what we have to gain by using a monolithic
> > repository.  Let's address the downsides.
> >
> > If you'll bear with a hypothetical: Imagine you could somehow make the
> > monolithic repository behave exactly like the N separate repositories
> > work today.  If so, that would be the best of both worlds: Those of us
> > who want a monolithic repository could have one, and those of us who
> > don't would be unaffected.  Whatever downsides you were worried about
> > would evaporate in a mist of rainbows and puppies.
> >
> > It turns out this hypothetical is very close to reality.  The key is
> > git sparse checkouts [1], which let you check out only some files or
> > directories from a repository.  Using this facility, if you don't like
> > the switch to a monolithic repository, you can set up your git so
> > you're (almost) entirely unaffected by it.
> >
> > If you want to check out only llvm and clang, no problem. Just set up
> > your .git/info/sparse-checkout file appropriately.  Done.
> >
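
For anyone who has not set up a sparse checkout before, a minimal
sketch follows. It uses a throwaway repository with a made-up
three-directory layout (llvm, clang, lld) rather than the real
llvm-project tree, and the classic core.sparseCheckout mechanism that
the .git/info/sparse-checkout file belongs to.

```shell
#!/bin/sh
set -eu

# Throwaway stand-in for the monolithic repo (hypothetical layout).
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
mkdir llvm clang lld
echo "llvm" > llvm/README
echo "clang" > clang/README
echo "lld" > lld/README
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m "Import subprojects"

# Turn on sparse checkout and list the directories to keep in the work tree.
git config core.sparseCheckout true
mkdir -p .git/info
printf '%s\n' '/llvm/' '/clang/' > .git/info/sparse-checkout

# Re-read HEAD so the work tree is trimmed to the listed paths.
git read-tree -mu HEAD

# lld/ is gone from the work tree but is still present in history.
ls
```

Adding or removing a line in the pattern file and re-running
git read-tree -mu HEAD changes which subprojects are visible.
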
> > If you want to be able to have two different revisions of llvm and
> > clang checked out at once (maybe you want to update your clang bits
> > more often than you update your llvm bits), you can do that too.  Make
> > one sparse checkout just of llvm, and make another sparse checkout
> > just of clang.  Symlink the clang checkout to llvm/tools/clang.
> > That's it.  The two checkouts can even share a common .git dir, so you
> > don't have to fetch and store everything twice.
> >
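
One possible way to get two checkouts sharing a single .git dir is
git worktree. That is my assumption, since the mail does not name a
tool. The sketch below needs a git recent enough to have
"git worktree add --no-checkout" and again uses a throwaway repository
with a made-up layout and made-up directory names (llvm-only,
clang-only).

```shell
#!/bin/sh
set -eu

# Throwaway stand-in for the monolithic repo (hypothetical layout).
top=$(mktemp -d)
git init -q "$top/llvm-only"
cd "$top/llvm-only"
mkdir -p llvm/tools clang
echo "llvm" > llvm/tools/README
echo "clang" > clang/README
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m "Import subprojects"

# Main work tree: keep only llvm/.
git config core.sparseCheckout true
mkdir -p .git/info
echo '/llvm/' > .git/info/sparse-checkout
git read-tree -mu HEAD

# Second work tree backed by the same object store, initially unpopulated.
git worktree add --detach --no-checkout "$top/clang-only" HEAD >/dev/null 2>&1

# Give the second work tree its own pattern file, then populate it.
cd "$top/clang-only"
sparse_file=$(git rev-parse --git-path info/sparse-checkout)
mkdir -p "$(dirname "$sparse_file")"
echo '/clang/' > "$sparse_file"
git read-tree -mu HEAD

# Make the clang sources visible where the llvm build expects them.
ln -s "$top/clang-only/clang" "$top/llvm-only/llvm/tools/clang"
```

Each work tree can then be moved to a different revision
independently, while fetches only have to happen once.
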
> > As far as I can tell, the only overhead of the monolithic repository
> > is the extra storage in .git.  But this is quite small in the scheme
> > of things.
> >
> > The .git dir for the existing monolithic repository [2] is 1.2GB.  By
> > way of comparison, my objdir for a release build of llvm and clang is
> > 3.5GB, and a full checkout (workdir + .git dirs) of llvm and clang is
> > 0.65GB.
> >
> > If the 1.2GB really is a problem for you (or more likely, your
> > automated infrastructure), a shallow clone [3] takes this down to 90MB.
> >
> > The critical point to me in all this is that it's easy to set up the
> > monolithic repository to appear like it's a bunch of separate repos.
> > But it is impossible, insofar as I can tell, to do the opposite.  That
> > is, option (b) is strictly more powerful than option (a).
> >
> >
> > Renato has understandably pointed out that the current proposal is
> > pretty far along, so please speak up now if you want to make this
> > happen.  I think we can.
> >
> > Regards,
> > -Justin
> >
> > [1] Git sparse checkouts were introduced in git 1.7, in 2010. For more
> > info, see
> > http://jasonkarns.com/blog/subdirectory-checkouts-with-git-sparse-checkout/
> > As far as I can tell, sparse checkouts work fine on Windows, but you
> > have to use git-bash, see http://stackoverflow.com/q/23289006.
> > [2] https://github.com/llvm-project/llvm-project
> > [3] git clone --depth=1 https://github.com/llvm-project/llvm-project.git
> > _______________________________________________
> > LLVM Developers mailing list
> > llvm-dev at lists.llvm.org
> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >
