[llvm-dev] [RFC] One or many git repositories?

Wed Jul 27 13:32:13 PDT 2016

Thanks for elaborating, Chris.

> Case Study 1 - Simple development on a sub-project

I explicitly addressed this workflow in my original e-mail.  I know it
was a while ago, but it sounds like it may be worth a read if you
haven't checked it out.

In the mail I described how to use sparse checkouts to create a
repository structure that functions virtually identically to what you
have today.  It takes a few copy-pastable commands to set up.  If
these few commands are a pain, we can write a script and check it in
to llvm.
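
For concreteness, here is a rough sketch of what such a setup can look
like.  It is not the exact command list from the original e-mail: the
throwaway repository and its llvm/clang/compiler-rt layout are
assumptions made so the example runs on its own.  The sparse-checkout
steps themselves use only stock git (core.sparseCheckout and
git read-tree).

```shell
#!/bin/sh
set -e

# Build a throwaway stand-in for the monorepo (hypothetical layout).
work=$(mktemp -d)
git init -q "$work/monorepo"
cd "$work/monorepo"
mkdir llvm clang compiler-rt
echo "llvm" > llvm/README
echo "clang" > clang/README
echo "compiler-rt" > compiler-rt/README
git add .
git -c user.name=demo -c user.email=demo@example.com \
    commit -q -m "initial import"

# Clone it as a developer would.
git clone -q "$work/monorepo" "$work/sparse"
cd "$work/sparse"

# The three sparse-checkout commands:
# 1. enable the feature for this clone,
git config core.sparseCheckout true
# 2. list the directories to keep in the working tree,
mkdir -p .git/info
echo "compiler-rt/" > .git/info/sparse-checkout
# 3. re-read HEAD so the working tree matches the patterns.
git read-tree -mu HEAD

# Show what is left in the working tree.
ls
```

After this, the history of every sub-project is still in .git, but
only the listed directories appear on disk.  Adding another line to
.git/info/sparse-checkout and re-running the read-tree step brings
another sub-project back.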

> Case Study 2 - Working on a sub-project in isolation across many platforms

I am less clear on what exactly this is about, but it seems to me that
a sparse checkout would mitigate most or all of the issues you raise
here, as well.  Again, a sparse checkout is three copy-pasteable
commands.

> We should be conscious of the impact to downstream users in making infrastructure changes like this.

I agree.  The proposal to continue the read-only llvm-mirror
repositories will help minimize the effect on read-only downstream
consumers.

> I think our loose coupling is a feature even if it makes some workflows harder.

If this is something that you want in your checkout of the monorepo,
it is something you can have using sparse checkouts.  It takes a small
amount of one-time work on your part when you clone the repo.  If that's
a problem, we can reduce it to running a single command.

I understand that running a single command still isn't zero cost to
you.  I also understand that you may not see the benefit that others
see in the monorepo.  That's cool.  But those of us who do want a
monorepo have no way to get it today, whereas those who want a
multirepo can get something that behaves very similarly by configuring
their monorepo.

On Wed, Jul 27, 2016 at 12:50 PM, Chris Bieneman <beanz at apple.com> wrote:
>
>> On Jul 27, 2016, at 10:21 AM, Justin Lebar <jlebar at google.com> wrote:
>>
>> Thanks for your thoughts, Chris.
>>
>>> As supporting evidence of this, I was discussing this thread around the office yesterday and had quite a few people responding something along the lines of “they’re proposing what?”.
>>
>> I hope they'll join us in this thread.
>>
>> Ultimately a survey is going to be strongly biased in favor of "don't
>> change anything".  There is a strong psychological bias to weight
>> losses more than gains, so if one doesn't engage with the issue, it's
>> only natural to conclude "keep it as similar as possible to what it is
>> today -- that is safe."  But that line of thinking does not
>> necessarily lead us to the best outcome.
>
> I don’t agree with this assertion. I believe that if you put forth multiple proposals, and have an articulate discussion of the merits and costs of each solution you can create a survey that can help inform decision making. I suppose we can agree to disagree.
>
>>
>> We've heard in this thread from a lot of developers about how a monorepo
>> would improve their workflow.  I would love to hear from some
>> developers who are actually affected in the way you describe, rather
>> than just considering the hypothetical.
>>
>> My expectation is that the effect of the monorepo on said developers
>> would be relatively small -- we're talking about 1 GB of disk space.  I
>> understand that there's a "yuck" factor to this, but inasmuch as there
>> aren't other concrete effects, this is just change aversion.  And
>> essentially all of the other effects of the monorepo can be hidden via
>> sparse checkouts, as we've discussed.
>>
>> Maybe I am wrong.  But I don't think we're going to get to the bottom
>> of it without actually engaging with people who are actually affected
>> in the way you posit.
>
> Ok, let me describe a few workflows I’ve used in the last year that are (in my mind) adversely impacted by a mono-repo.
>
> Case Study 1 - Simple development on a sub-project
>
> I build LLVM + Clang + Compiler-RT using the just-built Clang to build Compiler-RT. I iterate on some complicated Compiler-RT changes over a period of a day. Once my Compiler-RT changes are done I rebase the compiler-rt repo, rebuild compiler-rt then commit.
>
> With a mono-repo rebasing the checkout means rebasing the whole tree. So, either I have to wrangle some crazy git or CMake foo, or when I run “ninja compiler-rt” after the rebase it will rebuild LLVM and Clang too. That kinda sucks.
>
> What this example illustrates to me is that today we have loosely coupled projects with an occasional rev lock. Moving to a mono-repo enforces a tight coupling that isn’t strictly required today.
>
> Case Study 2 - Working on a sub-project in isolation across many platforms
>
> I did a lot of work on Compiler-RT last year that had no direct dependency on any other LLVM project. During the development I was working with a Compiler-RT checkout and a build directory of just Compiler-RT. Every once in a while (or every other day as it were) I would make a change that would break a configuration that I wasn’t directly developing on. My workflow for handling those cases was:
>
> (1) Spin up a VM on a VPS that closely matched the configuration I broke
> (2) Check out Compiler-RT
> (3) Reproduce, debug, fix the failure
> (4) Commit the patch from the VM
>
> In a mono-repository doing this would require checking out *all* sub-projects, not just Compiler-RT. I imagine this probably isn’t a common workflow, but it is one I use that would be adversely impacted by needing to check out a full LLVM. Now, you might say I could check out the sub-project mirror, but then I can’t commit from the VM, which kinda sucks.
>
>
>>
>>> While admittedly you do get a linear history with using the mono-repository, that isn’t the only way to solve the problem, and I don’t really think that the benefit (not needing to write some tooling) justifies the increased burden applied to contributors that don’t use the full LLVM family of projects.
>>
>> I think the trade-off you're considering here (cost to developers who
>> use llvm plus a version-locked subrepo vs. cost to developers who
>> don't want an llvm clone) is the right one.
>
> I actually think there are *a lot* more considerations we need to be making for an infrastructure change like this. While it is true that our SCM hosting strategy primarily impacts developers, it also impacts our users. We should be conscious of the impact to downstream users in making infrastructure changes like this. That is part of why the idea of a survey holds appeal to me; it would give us the opportunity to get feedback from a much wider audience than the current “people on llvm-dev who haven’t been scared away”.
>
>> But as someone who has
>> extensively used git submodules and repo (a wrapper script), I
>> strongly disagree with the judgement that a monorepo would not be a
>> significant improvement.
>>
>> Our primary disagreement, I think, is over how much cost there is to
>> "writing some tooling".  To me, this is a significant barrier standing
>> in the way of developer productivity.  Here at Google I did a quick
>> survey, and more than half of us don't have scripts of the sort that
>> Justin Bogner described.  We are all just floundering around rebasing
>> clang and llvm until it compiles.  It *sucks*.
>
> I actually think we’re both talking about solutions that require tooling, and while we *could* be disagreeing over how much effort each tooling initiative would require (I think they’re pretty close, so I don’t care to have that argument), my actual disagreement with your proposal is that it is a change that impacts developers and users universally and I don’t think that it is justified. Simply put, I don’t feel that the benefits are substantial enough to warrant the kind of disruptive change you’re proposing.
>
>>
>> I suggest that saying that all of these developers are "doing it
>> wrong" is not helpful.
>
> Maybe I’m missing something, but I don’t think I said anyone was “doing it wrong”. Bisecting across multiple git repositories isn’t a great experience. But neither is bisecting across a half dozen separate folders in an SVN repository. Both the submodule solution and the mono-repo solution solve this problem equivalently well.
>
>>  Not everyone has the git and python/bash chops
>> to write the necessary scripts.  Not everyone has the personality to
>> obsessively script around stuff, or the desire to maintain said
>> scripts.  Not everyone works on llvm/clang so much that it's worth
>> adopting a special-snowflake workflow.  And some of us -- myself
>> included -- have extensive git scripts which work with the standard
>> git workflow but would be completely broken by adding a custom level
>> of indirection around git.
>>
>> When put this way, maybe it's clear that it's actually a niche set of
>> people for whom "script around the brokenness" is a good solution.
>
> I’m not sure what “brokenness” you’re referring to. We have a collection of loosely connected projects by design. As a result of that intentional design certain workflows will be impacted. I don’t think that is brokenness. I think our loose coupling is a feature even if it makes some workflows harder.
>
> -Chris
>
>>
>> As I've said a bunch of times above, we have to weigh a cost paid by
>> all of us every time we type a command that starts with "git" --
>> something we do tens or hundreds of times a day -- versus the one-time
>> cost of asking people to download 1 GB of data.
>>
>> On Wed, Jul 27, 2016 at 9:47 AM, Chris Bieneman via llvm-dev
>> <llvm-dev at lists.llvm.org> wrote:
>>> I’m just now catching up on this massive thread after being on vacation last
>>> week, and I have a few thoughts I’d like to share.
>>>
>>> First and foremost please don’t consider lack of dissent on the thread as
>>> presence of consensus. The various git-related threads on LLVM-dev lately
>>> have been so active and contentious that I think a lot of people are zoning
>>> out on the conversations. As supporting evidence of this, I was discussing
>>> this thread around the office yesterday and had quite a few people
>>> responding something along the lines of “they’re proposing what?”.
>>>
>>> I think it would be great for us to have several different proposals for how
>>> the git-transition could work, and have a survey to get people’s opinions. I
>>> know this has been discussed repeatedly, and I want to put in my vote in
>>> favor of having a survey that takes into account multiple different
>>> approaches.
>>>
>>> WRT the actual proposal in this thread, I’m strongly opposed to a
>>> mono-repository. While I understand the argument that the full clone’s cost
>>> on disk space is minimal compared to an LLVM object directory, what about
>>> contributors that contribute to the smaller runtimes projects but *not*
>>> to LLVM or Clang? A contributor that only contributes to libcxx or
>>> compiler-rt being forced to do a full clone of all the LLVM projects in
>>> order to push a patch kinda sucks.
>>>
>>> I want to point out a few workflows people may not be considering.
>>>
>>> Clang can be built against an installed LLVM. I know this workflow is used
>>> by some people because I’ve broken it in the past and had to fix it. With a
>>> mono-repo this workflow gets a bit more complicated because you’d need to do
>>> sparse checkouts, and it probably means we should just nuke the workflow
>>> entirely because there is no real value added by having it.
>>>
>>> Compiler-RT’s sanitizers are used with GCC; no LLVM required. While for the
>>> common use case maintaining sparse repository mirrors would limit impact of
>>> this on users, should any GCC user want to contribute to Compiler-RT, you’re
>>> forcing them to clone a much larger repository than necessary.
>>>
>>> The same problem with Compiler-RT’s sanitizers also applies to libcxx,
>>> libcxxabi, libunwind, and potentially any other runtime library projects
>>> that we may create in the future.
>>>
>>> Beyond all that I want to point out that the git multi-repository story is
>>> basically the same thing we have today with SVN except for the absence of a
>>> monotonically increasing number that corresponds across repositories. While
>>> admittedly you do get a linear history with using the mono-repository, that
>>> isn’t the only way to solve the problem, and I don’t really think that the
>>> benefit (not needing to write some tooling) justifies the increased burden
>>> applied to contributors that don’t use the full LLVM family of projects.
>>>
>>> I think we have some pretty strong evidence in the form of the github fork
>>> counts (https://github.com/llvm-mirror/) that most people aren’t using all
>>> of the LLVM projects. In fact, by that evidence Clang (the second most
>>> popular project) is forked less than 2/3 as many times as LLVM.
>>>
>>> -Chris
>>>
>>>
>>> On Jul 26, 2016, at 11:31 AM, Renato Golin via llvm-dev
>>> <llvm-dev at lists.llvm.org> wrote:
>>>
>>> On 26 July 2016 at 19:28, Sanjoy Das via llvm-dev
>>> <llvm-dev at lists.llvm.org> wrote:
>>>
>>> Even if it were possible, I would still keep my upstream checkout
>>> separate just as a safety measure, to keep from sending private stuff
>>> upstream by accident.
>>>
>>>
>>> Just FYI, this is our (Azul's) workflow as well, and for similar
>>> reasons.
>>>
>>>
>>> Same here.
>>>
>>> cheers,
>>> --renato
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>
>>>
>