[llvm-dev] [RFC] One or many git repositories?

Fri Jul 22 13:08:18 PDT 2016

Having read through the entire thread and thought about this for a while,
here are my thoughts:

 * A single monolithic repository has quite a lot of advantages, some
because of what it is (for instance, you can make atomic cross-project
commits), and some because of what it isn't (keeping the repositories
separate creates synchronization problems for version-locked components,
and it's not clear to me that we have a good answer for these problems).

 * A single repository from which we can build a complete LLVM toolchain,
without having to check out a dozen components in seemingly random
locations, would be valuable. The default behavior for someone checking out
and building the LLVM project should be that they get a complete,
fully-functional toolchain.

 * We need to preserve and maintain the easy ability to mix and match LLVM
components with other components (other C runtime libraries, C++ ABI
libraries, C++ standard libraries, linkers, debuggers, ...). That means
that it needs to be obvious what the boundaries of the optional components
are, which means that the current project layout (the one implied by the
build system) is not good enough for a monolithic repository (LLVM tests
will fail if you don't check out llvm/tools/opt, but we presumably want to
explicitly support not checking out llvm/tools/clang) -- unless we have
extensive documentation covering this, and even then there are likely to be
discoverability issues.

However, the move to git and the reorganization need not be done at the
same time, and it seems vastly easier to reorganize *after* we move to a
monolithic git repository -- it would then be essentially trivial for each
person with organizational ideas to move the code around in their
monolithic git repository, push it somewhere where we can all look at it,
and for us to then make an informed choice about the layout, with a
concrete example in front of us. Then we push the selected new layout; git
supports this really nicely if all the parts are already in a single
repository.
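
As a rough sketch of what such a layout experiment might look like, the
commands below build a throwaway repository, move tools/clang to the top
level with git mv, and check that history still follows the file. The
repository name layout-demo, the file Driver.cpp, and the branch name
layout-proposal are made up for illustration; only the tools/clang path
comes from the layout discussed above.

```shell
set -e
# Throwaway stand-in for the monolithic repository (hypothetical contents).
demo=$(mktemp -d)
git init -q "$demo/layout-demo"
cd "$demo/layout-demo"
git config user.name "Demo User"
git config user.email "demo@example.invalid"

mkdir -p tools/clang
echo "int main() { return 0; }" > tools/clang/Driver.cpp
git add tools/clang
git commit -q -m "Add clang in its current location"

# Propose a new layout on a branch: clang moves to the top level.
git checkout -q -b layout-proposal
git mv tools/clang clang
git commit -q -m "Move tools/clang to clang"

# History follows the file across the rename.
git log --follow --oneline -- clang/Driver.cpp
```

Anyone could push a branch like layout-proposal somewhere public, and the
rest of us could inspect the resulting tree directly instead of arguing
about a description of it.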

So here's what I would suggest:

- we move to a monolithic git repository on github

- this monolithic repository contains all the LLVM subprojects necessary to
build a complete toolchain, including libc++ and other pieces that are not
version-locked to llvm or clang

- the initial structure exactly matches the current layout implied by the
build system (clang in tools/clang, lld in tools/lld, compiler-rt in
runtimes/compiler-rt, libc++ in projects/libcxx, and so on)

- after we transition to git, interested parties assemble and upload to
github patches reorganizing the project structure, and we have another
discussion about principles for the restructuring (including forming solid
guidance for how to organize future additions to LLVM), with reference to
the patches so we can look at the proposed new layout; we pick one and
commit it

The goal would be to have the new layout entirely settled by the time 4.0
branches.

On Wed, Jul 20, 2016 at 4:39 PM, Justin Lebar via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> Dear all,
>
> I would like to (re-)open a discussion on the following specific question:
>
>   Assuming we are moving the llvm project to git, should we
>   a) use multiple git repositories, linked together as subrepositories
> of an umbrella repo, or
>   b) use a single git repository for most llvm subprojects.
>
> The current proposal assembled by Renato follows option (a), but I
> think option (b) will be significantly simpler and more effective.
> Moreover, I think the issues raised with option (b) are either
> incorrect or can be reasonably addressed.
>
> Specifically, my proposal is that all LLVM subprojects that are
> "version-locked" (and/or use the common CMake build system) live in a
> single git repository.  That probably means all of the main llvm
> subprojects other than the test-suite and maybe libc++.  From looking
> at the repository today that would be: llvm, clang, clang-tools-extra,
> lld, polly, lldb, llgo, compiler-rt, openmp, and parallel-libs.
>
> Let's first talk about the advantages of a single repository.  Then
> we'll address the disadvantages raised.
>
> At a high level, one repository is simpler than multiple repos that
> must be kept in sync using an external mechanism.  The submodules
> solution requires nontrivial automation to maintain the history of
> commits in the umbrella repo (which we need if we want to bisect, or
> even just build an old revision of clang), but no such mechanisms are
> required if we have a single repo.
>
> Similarly, it's possible to make atomic API changes across subprojects
> in a single repo; we simply can't do that with the submodules proposal.
> And working with llvm release branches becomes much simpler.
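
A minimal sketch of an atomic cross-project change, using a toy repository
rather than the real tree. The files llvm/api.h and clang/use.cpp and the
function emitModule are hypothetical; the point is only that one commit
can carry both the API change and the matching update to its caller.

```shell
set -e
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git config user.name "Demo User"
git config user.email "demo@example.invalid"

# Hypothetical API: llvm declares a function that clang calls.
mkdir -p llvm clang
echo "void emitModule();" > llvm/api.h
echo "emitModule();" > clang/use.cpp
git add .
git commit -q -m "Initial import"

# Change the API and update its caller in one commit.
echo "void emitModule(int optLevel);" > llvm/api.h
echo "emitModule(2);" > clang/use.cpp
git commit -q -a -m "Add optLevel parameter to emitModule and update clang"

# A single commit now records both sides of the change.
git show --stat --format=%s HEAD
```

With separate repositories the same change would be two commits in two
histories, with a window in which the checked-out pair does not build.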
>
> In addition, the single repository approach ties branches that contain
> changes to subprojects (e.g. clang) to a specific version of llvm
> proper.  This means that when you switch between two branches that
> contain changes to clang, you'll automatically check out the right
> llvm bits.
>
> Although we can do this with submodules too, a single repository makes
> it much easier.
>
> As a concrete example, suppose you are working on some changes in
> clang.  You want to commit the changes, then switch to a new branch
> based on tip of head and make some new changes.  Finally you want to
> switch back to your original branch.  And when you switch between
> branches, you want to get an llvm that's in sync with the clang in
> your working copy.
>
> Here's how I'd do it with a monolithic git repository, option (b):
>
>   git commit # old-branch
>   git fetch
>   git checkout -b new-branch origin/master
>   # hack hack hack
>   git commit # new-branch
>   git checkout old-branch
>
> Here's how I'd do it with option (a), submodules.  I've used git -C
> here to make it explicit which repo we're working in, but in real life
> I'd probably use cd.
>
>   # First, commit to two branches, one in your clang repo and one in your
>   # master repo.
>   git -C tools/clang commit # old-branch, clang submodule
>   git commit # old-branch, master repo
>   # Now fetch the submodule and check out head.  Start a new branch in the
>   # umbrella repo.
>   git submodule foreach git fetch
>   git checkout -b new-branch origin/master
>   git submodule update
>   # Start a new branch in the clang repo pointing to the current head.
>   git -C tools/clang checkout -b new-branch
>   # hack hack hack
>   # Commit both branches.
>   git -C tools/clang commit # new-branch
>   git commit # new-branch
>   # Check out the old branch.
>   git checkout old-branch
>   git submodule update
>
> This is twice as many git commands, and almost three times as much
> typing, to do the same thing.
>
> Indeed, this is so complicated that I expect many developers wouldn't
> bother, and would continue to develop the way we currently do.  They
> would thus continue to be unable to create clang branches that include
> an llvm revision.  :(
>
> There are real simplifications and productivity advantages to be had
> by using a single repository.  They will affect essentially every
> developer who makes changes to subprojects other than LLVM proper,
> cares about release branches, bisects our code, or builds old
> revisions.
>
>
> So that's the first part, what we have to gain by using a monolithic
> repository.  Let's address the downsides.
>
> If you'll bear with a hypothetical: Imagine you could somehow make the
> monolithic repository behave exactly as the N separate repositories
> do today.  If so, that would be the best of both worlds: Those of us
> who want a monolithic repository could have one, and those of us who
> don't would be unaffected.  Whatever downsides you were worried about
> would evaporate in a mist of rainbows and puppies.
>
> It turns out this hypothetical is very close to reality.  The key is
> git sparse checkouts [1], which let you check out only some files or
> directories from a repository.  Using this facility, if you don't like
> the switch to a monolithic repository, you can set up your git so
> you're (almost) entirely unaffected by it.
>
> If you want to check out only llvm and clang, no problem. Just set up
> your .git/info/sparse-checkout file appropriately.  Done.
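
For concreteness, here is one way the sparse-checkout setup might look.
This is a sketch against a toy repository created on the spot, so the
upstream and sparse directory names and the README.txt files are
placeholders; it assumes the subprojects sit in top-level llvm, clang and
lldb directories, which may not match the final layout.

```shell
set -e
work=$(mktemp -d)

# Hypothetical stand-in for the monolithic repository, with three subprojects.
git init -q "$work/upstream"
cd "$work/upstream"
git config user.name "Demo User"
git config user.email "demo@example.invalid"
for project in llvm clang lldb; do
  mkdir "$project"
  echo "$project sources" > "$project/README.txt"
done
git add .
git commit -q -m "Import subprojects"

# Clone everything, then narrow the working tree to llvm and clang.
git clone -q "$work/upstream" "$work/sparse"
cd "$work/sparse"
git config core.sparseCheckout true
mkdir -p .git/info
printf '%s\n' '/llvm/' '/clang/' > .git/info/sparse-checkout
git read-tree -mu HEAD   # re-apply HEAD to the working tree using the patterns

ls   # lldb is no longer present in the working tree
```

The leading slash anchors each pattern at the repository root, so a nested
directory that happens to be named llvm or clang is not pulled in. The
history for lldb is still fetched and stored in .git; only the working
tree is narrowed.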
>
> If you want to be able to have two different revisions of llvm and
> clang checked out at once (maybe you want to update your clang bits
> more often than you update your llvm bits), you can do that too.  Make
> one sparse checkout just of llvm, and make another sparse checkout
> just of clang.  Symlink the clang checkout to llvm/tools/clang.
> That's it.  The two checkouts can even share a common .git dir, so you
> don't have to fetch and store everything twice.
>
> As far as I can tell, the only overhead of the monolithic repository
> is the extra storage in .git.  But this is quite small in the scheme
> of things.
>
> The .git dir for the existing monolithic repository [2] is 1.2GB.  By
> way of comparison, my objdir for a release build of llvm and clang is
> 3.5GB, and a full checkout (workdir + .git dirs) of llvm and clang is
> 0.65GB.
>
> If the 1.2GB really is a problem for you (or more likely, your
> automated infrastructure), a shallow clone [3] takes this down to 90MB.
>
> The critical point to me in all this is that it's easy to set up the
> monolithic repository to appear like it's a bunch of separate repos.
> But it is impossible, insofar as I can tell, to do the opposite.  That
> is, option (b) is strictly more powerful than option (a).
>
>
> Renato has understandably pointed out that the current proposal is
> pretty far along, so please speak up now if you want to make this
> happen.  I think we can.
>
> Regards,
> -Justin
>
> [1] Git sparse checkouts were introduced in git 1.7, in 2010. For more
> info, see
> http://jasonkarns.com/blog/subdirectory-checkouts-with-git-sparse-checkout/
> As far as I can tell, sparse checkouts work fine on Windows, but you
> have to use git-bash; see http://stackoverflow.com/q/23289006.
> [2] https://github.com/llvm-project/llvm-project
> [3] git clone --depth=1 https://github.com/llvm-project/llvm-project.git