[llvm-dev] [RFC] One or many git repositories?

Fri Jul 22 13:41:19 PDT 2016

Hi, Piotr.

> If you do some light stuff like clang-tidy, that doesn't often require syncing with clang, but you still want to have the most recent checks, then I don't see a solution in a monolithic repository.

Please see my original e-mail, in the paragraph that begins "If you
want to be able to have two different revisions of llvm and clang
checked out at once".

This describes a workflow that would allow you to update clang-tidy
without updating all of llvm.  I think this would address the issue
you raise.
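
For what it's worth, here is a rough sketch of that setup as a shell
function.  Treat it as an illustration under assumptions, not a tested
recipe for the real repository: it assumes git 2.9 or newer (for
"git worktree add --no-checkout"), it assumes the monolithic repo has
top-level llvm/ and clang/ directories, and the names split_checkouts,
llvm-wt and clang-wt are made up for this example.

```shell
#!/bin/sh
# Sketch: two sparse checkouts of one monolithic repo that share a single
# .git dir, so the clang bits can be updated without touching the llvm bits.
# Assumptions: git >= 2.9, and top-level llvm/ and clang/ directories.

# split_checkouts REPO_URL DEST   (hypothetical helper)
#   DEST/llvm-wt  : main clone, only llvm/ checked out
#   DEST/clang-wt : linked worktree, only clang/ checked out
split_checkouts() {
    repo_url=$1
    dest=$2
    mkdir -p "$dest" || return 1

    # Main clone; defer the checkout until the sparse pattern is in place.
    git clone --quiet --no-checkout "$repo_url" "$dest/llvm-wt" || return 1
    git -C "$dest/llvm-wt" config core.sparseCheckout true
    mkdir -p "$dest/llvm-wt/.git/info"
    echo '/llvm/' > "$dest/llvm-wt/.git/info/sparse-checkout"
    git -C "$dest/llvm-wt" read-tree -mu HEAD || return 1

    # Second working tree backed by the same .git dir, detached at HEAD.
    # The relative path is resolved from the -C directory.
    git -C "$dest/llvm-wt" worktree add --detach --no-checkout \
        ../clang-wt HEAD || return 1

    # Each worktree has its own sparse-checkout file; ask git where it is.
    (
        cd "$dest/clang-wt" || exit 1
        sparse_file=$(git rev-parse --git-path info/sparse-checkout)
        mkdir -p "$(dirname "$sparse_file")"
        echo '/clang/' > "$sparse_file"
        git read-tree -mu HEAD
    ) || return 1

    # Make the clang checkout visible where the build system expects it.
    mkdir -p "$dest/llvm-wt/llvm/tools"
    ln -s ../../../clang-wt/clang "$dest/llvm-wt/llvm/tools/clang"
}

# Example (needs network access, so it is left commented out):
#   split_checkouts https://github.com/llvm-project/llvm-project.git ~/src
```

After that, something like "git fetch" followed by
"git checkout --detach origin/master" inside clang-wt should move only
the clang files forward; the llvm worktree stays at whatever revision
you last checked out there.  The symlink shows up as an untracked file
in the llvm worktree, so you may want to list it in .git/info/exclude.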

I grant that setting this up would require a one-time but nonzero
amount of work from developers like you.  But then the question is
whether we should optimize to spare a few developers that one-time
cost, or for advantages that affect the vast majority of us in our
work every day.

-Justin

On Fri, Jul 22, 2016 at 1:22 PM, Piotr Padlewski
<piotr.padlewski at gmail.com> wrote:
> And the same thing happens to IDEs - I would not like to spend the next 10-15
> minutes updating symbols in my IDE, which would also drain my battery. So
> basically what happens is you pay for what you don't use, which is not the C++
> way :P
>
> 2016-07-22 13:18 GMT-07:00 Piotr Padlewski <piotr.padlewski at gmail.com>:
>>
>> I have one reason why we should not move to a monolithic repository - If you
>> do some light stuff like clang-tidy, that doesn't often require syncing with
>> clang, but you still want to have the most recent checks, then I don't see a
>> solution in a monolithic repository.
>> And this is a real issue if you only have a 2 or 4 core laptop to do work.
>> And I guess the build system won't solve the problem, just a small
>> change in some llvm file will result in recompiling many files that
>> clang-tidy depends on.
>>
>> 2016-07-22 13:08 GMT-07:00 Richard Smith via llvm-dev
>> <llvm-dev at lists.llvm.org>:
>>>
>>> Having read through the entire thread and thought about this for a while,
>>> here are my thoughts:
>>>
>>>  * A single monolithic repository has quite a lot of advantages, some
>>> because of what it is (for instance, you can make atomic cross-project
>>> commits), and some because of what it isn't (keeping the repositories
>>> separate creates synchronization problems for version-locked components, and
>>> it's not clear to me that we have a good answer for these problems)
>>>
>>>  * A single repository from which we can build a complete LLVM toolchain,
>>> without requiring checking out a dozen components in seemingly-random
>>> locations, would be valuable. The default behavior for someone checking out
>>> and building the LLVM project should be that they get a complete,
>>> fully-functional toolchain.
>>>
>>>  * We need to preserve and maintain the easy ability to mix and match
>>> LLVM components with other components (other C runtime libraries, C++ ABI
>>> libraries, C++ standard libraries, linkers, debuggers, ...). That means that
>>> it needs to be obvious what the boundaries of the optional components are,
>>> which means that the current project layout (the one implied by the build
>>> system) is not good enough for a monolithic repository (LLVM tests will fail
>>> if you don't check out llvm/tools/opt, but we presumably want to explicitly
>>> support not checking out llvm/tools/clang) -- unless we have extensive
>>> documentation covering this, and even then there are likely to be
>>> discoverability issues.
>>>
>>> However, the move to git and the reorganization need not be done at the
>>> same time, and it seems vastly easier to reorganize *after* we move to a
>>> monolithic git repository -- it would then be essentially trivial for each
>>> person with organizational ideas to move the code around in their monolithic
>>> git repository, push it somewhere where we can all look at it, and for us to
>>> then make an informed choice about the layout, with a concrete example in
>>> front of us. Then we push the selected new layout; git supports this really
>>> nicely if all the parts are already in a single repository.
>>>
>>> So here's what I would suggest:
>>>
>>> - we move to a monolithic git repository on github
>>>
>>> - this monolithic repository contains all the LLVM subprojects necessary
>>> to build a complete toolchain, including libc++ and other pieces that are
>>> not version-locked to llvm or clang
>>>
>>> - the initial structure exactly matches the current layout implied by the
>>> build system (clang in tools/clang, lld in tools/lld, compiler-rt in
>>> runtimes/compiler-rt, libc++ in projects/libcxx, and so on)
>>>
>>> - after we transition to git, interested parties assemble and upload to
>>> github patches reorganizing the project structure, and we have another
>>> discussion about principles for the restructuring (including forming solid
>>> guidance for how to organize future additions to LLVM), with reference to
>>> the patches so we can look at the proposed new layout; we pick one and
>>> commit it
>>>
>>> The goal would be to have the new layout entirely settled by the time 4.0
>>> branches.
>>>
>>> On Wed, Jul 20, 2016 at 4:39 PM, Justin Lebar via llvm-dev
>>> <llvm-dev at lists.llvm.org> wrote:
>>>>
>>>> Dear all,
>>>>
>>>> I would like to (re-)open a discussion on the following specific
>>>> question:
>>>>
>>>>   Assuming we are moving the llvm project to git, should we
>>>>   a) use multiple git repositories, linked together as subrepositories
>>>> of an umbrella repo, or
>>>>   b) use a single git repository for most llvm subprojects.
>>>>
>>>> The current proposal assembled by Renato follows option (a), but I
>>>> think option (b) will be significantly simpler and more effective.
>>>> Moreover, I think the issues raised with option (b) are either
>>>> incorrect or can be reasonably addressed.
>>>>
>>>> Specifically, my proposal is that all LLVM subprojects that are
>>>> "version-locked" (and/or use the common CMake build system) live in a
>>>> single git repository.  That probably means all of the main llvm
>>>> subprojects other than the test-suite and maybe libc++.  From looking
>>>> at the repository today that would be: llvm, clang, clang-tools-extra,
>>>> lld, polly, lldb, llgo, compiler-rt, openmp, and parallel-libs.
>>>>
>>>> Let's first talk about the advantages of a single repository.  Then
>>>> we'll address the disadvantages raised.
>>>>
>>>> At a high level, one repository is simpler than multiple repos that
>>>> must be kept in sync using an external mechanism.  The submodules
>>>> solution requires nontrivial automation to maintain the history of
>>>> commits in the umbrella repo (which we need if we want to bisect, or
>>>> even just build an old revision of clang), but no such mechanisms are
>>>> required if we have a single repo.
>>>>
>>>> Similarly, it's possible to make atomic API changes across subprojects
>>>> in a single repo; we simply can't do that with the submodules proposal.
>>>> And working with llvm release branches becomes much simpler.
>>>>
>>>> In addition, the single repository approach ties branches that contain
>>>> changes to subprojects (e.g. clang) to a specific version of llvm
>>>> proper.  This means that when you switch between two branches that
>>>> contain changes to clang, you'll automatically check out the right
>>>> llvm bits.
>>>>
>>>> Although we can do this with submodules too, a single repository makes
>>>> it much easier.
>>>>
>>>> As a concrete example, suppose you are working on some changes in
>>>> clang.  You want to commit the changes, then switch to a new branch
>>>> based on tip of head and make some new changes.  Finally you want to
>>>> switch back to your original branch.  And when you switch between
>>>> branches, you want to get an llvm that's in sync with the clang in
>>>> your working copy.
>>>>
>>>> Here's how I'd do it with a monolithic git repository, option (b):
>>>>
>>>>   git commit # old-branch
>>>>   git fetch
>>>>   git checkout -b new-branch origin/master
>>>>   # hack hack hack
>>>>   git commit # new-branch
>>>>   git checkout old-branch
>>>>
>>>> Here's how I'd do it with option (a), submodules.  I've used git -C
>>>> here to make it explicit which repo we're working in, but in real life
>>>> I'd probably use cd.
>>>>
>>>>   # First, commit to two branches, one in your clang repo and one in your
>>>>   # master repo.
>>>>   git -C tools/clang commit # old-branch, clang submodule
>>>>   git commit # old-branch, master repo
>>>>   # Now fetch the submodules and check out head.  Start a new branch in the
>>>>   # umbrella repo.
>>>>   git submodule foreach git fetch
>>>>   git checkout -b new-branch origin/master
>>>>   git submodule update
>>>>   # Start a new branch in the clang repo pointing to the current head.
>>>>   git -C tools/clang checkout -b new-branch
>>>>   # hack hack hack
>>>>   # Commit both branches.
>>>>   git -C tools/clang commit # new-branch
>>>>   git commit # new-branch
>>>>   # Check out the old branch.
>>>>   git checkout old-branch
>>>>   git submodule update
>>>>
>>>> This is twice as many git commands, and almost three times as much
>>>> typing, to do the same thing.
>>>>
>>>> Indeed, this is so complicated I expect that many developers wouldn't
>>>> bother, and will continue to develop the way we currently do.  They
>>>> would thus continue to be unable to create clang branches that include
>>>> an llvm revision.  :(
>>>>
>>>> There are real simplifications and productivity advantages to be had
>>>> by using a single repository.  They will affect essentially every
>>>> developer who makes changes to subprojects other than LLVM proper,
>>>> cares about release branches, bisects our code, or builds old
>>>> revisions.
>>>>
>>>>
>>>> So that's the first part, what we have to gain by using a monolithic
>>>> repository.  Let's address the downsides.
>>>>
>>>> If you'll bear with a hypothetical: Imagine you could somehow make the
>>>> monolithic repository behave exactly like the N separate repositories
>>>> work today.  If so, that would be the best of both worlds: Those of us
>>>> who want a monolithic repository could have one, and those of us who
>>>> don't would be unaffected.  Whatever downsides you were worried about
>>>> would evaporate in a mist of rainbows and puppies.
>>>>
>>>> It turns out this hypothetical is very close to reality.  The key is
>>>> git sparse checkouts [1], which let you check out only some files or
>>>> directories from a repository.  Using this facility, if you don't like
>>>> the switch to a monolithic repository, you can set up your git so
>>>> you're (almost) entirely unaffected by it.
>>>>
>>>> If you want to check out only llvm and clang, no problem. Just set up
>>>> your .git/info/sparse-checkout file appropriately.  Done.
>>>>
>>>> If you want to be able to have two different revisions of llvm and
>>>> clang checked out at once (maybe you want to update your clang bits
>>>> more often than you update your llvm bits), you can do that too.  Make
>>>> one sparse checkout just of llvm, and make another sparse checkout
>>>> just of clang.  Symlink the clang checkout to llvm/tools/clang.
>>>> That's it.  The two checkouts can even share a common .git dir, so you
>>>> don't have to fetch and store everything twice.
>>>>
>>>> As far as I can tell, the only overhead of the monolithic repository
>>>> is the extra storage in .git.  But this is quite small in the scheme
>>>> of things.
>>>>
>>>> The .git dir for the existing monolithic repository [2] is 1.2GB.  By
>>>> way of comparison, my objdir for a release build of llvm and clang is
>>>> 3.5GB, and a full checkout (workdir + .git dirs) of llvm and clang is
>>>> 0.65GB.
>>>>
>>>> If the 1.2GB really is a problem for you (or more likely, your
>>>> automated infrastructure), a shallow clone [3] takes this down to 90MB.
>>>>
>>>> The critical point to me in all this is that it's easy to set up the
>>>> monolithic repository to appear like it's a bunch of separate repos.
>>>> But it is impossible, insofar as I can tell, to do the opposite.  That
>>>> is, option (b) is strictly more powerful than option (a).
>>>>
>>>>
>>>> Renato has understandably pointed out that the current proposal is
>>>> pretty far along, so please speak up now if you want to make this
>>>> happen.  I think we can.
>>>>
>>>> Regards,
>>>> -Justin
>>>>
>>>> [1] Git sparse checkouts were introduced in git 1.7, in 2010. For more
>>>> info, see
>>>> http://jasonkarns.com/blog/subdirectory-checkouts-with-git-sparse-checkout/.
>>>> As far as I can tell, sparse checkouts work fine on Windows, but you
>>>> have to use git-bash, see http://stackoverflow.com/q/23289006.
>>>> [2] https://github.com/llvm-project/llvm-project
>>>> [3] git clone --depth=1 https://github.com/llvm-project/llvm-project.git
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev