[llvm-dev] [RFC] One or many git repositories?

Tue Aug 9 13:38:24 PDT 2016

> On Aug 9, 2016, at 11:27 AM, Justin Lebar via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> 
>> (2) If I’m stuck using git-svn I kinda feel like there is no real point in changing anything.
> 
> No real point *for you specifically*.
> 
> But the vast majority of people would not be stuck using git-svn.  And
> in addition the LLVM project would not be stuck using svn, with all
> the baggage, hosting issues, workflow issues (for people other than
> you), etc.
> 
> The bar by which this proposal should be measured is not "is it a net
> gain for beanz?"  :)  I think we'd be thrilled with a "meh" from your
> corner.
Justin, I don’t think this conversation is really going anywhere.  Renato already mentioned talking about this at the conference, and there has also been talk of a survey.  I think we need those to see how the community actually feels about the proposals here.

Chris may be the only vocal advocate of an alternative to your proposal, but then there are people like me who are quiet because we are waiting for the survey to appear.  I would have been much more vocal if I thought we were actually going to adopt the monorepo, but for now I believe it is still only a proposal.

Full disclosure, I don’t want a monorepo.  I think it optimizes for the use case where people want to bisect, and I don’t think it’s reasonable to push a monorepo on everyone for the sake of those who want to bisect.  The submodules repo has already been demonstrated as one potential solution to this, which would allow those who want to bisect to do so while everyone else continues to work more or less as they do today.

In terms of the proposals, I think you, Mehdi, Chris, and a number of others have proven that there is almost no technical solution beyond our reach.  What we do have are proposals which optimize for different use cases.  Given this, I think the most useful thing from my point of view (and hopefully that of others) would be for those advocating each solution to actually give short examples of each of the different use cases and how to support them.

For example:

Monorepo, pushing a change to compiler-rt:
1: git commit …
2: git pull --rebase
3: test
4a: git push /* no commits to any other project, so the push works */.  Goto 5
4b: git push /* someone committed to some other project in the monorepo, so the push is rejected */.  Goto 2
5: Done
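
As a shell sketch, the loop above might look like the following. It assumes a plain `git push` to the monorepo is allowed; `push_with_rebase`, `RUN_TESTS`, and `MAX_ATTEMPTS` are hypothetical names made up for illustration, not part of any proposal.

```shell
#!/bin/sh
# Sketch of the monorepo push loop (steps 2 to 5 above).
# RUN_TESTS is a hypothetical placeholder for the real test command.
: "${RUN_TESTS:=true}"
# MAX_ATTEMPTS is a hypothetical cap so the loop cannot spin forever.
: "${MAX_ATTEMPTS:=5}"

push_with_rebase() {
    attempt=1
    while [ "$attempt" -le "$MAX_ATTEMPTS" ]; do
        # Step 2: replay local commits on top of the latest upstream state.
        git pull --rebase || return 1
        # Step 3: re-run the tests on the rebased tree.
        $RUN_TESTS || return 1
        # Step 4: a rejected push means someone else pushed first (4b),
        # so go around again; a successful push ends the loop (4a).
        if git push; then
            echo "pushed after $attempt attempt(s)"
            return 0
        fi
        attempt=$((attempt + 1))
    done
    echo "gave up after $MAX_ATTEMPTS attempts" >&2
    return 1
}
```

The number of trips around the loop is the cost this scenario is meant to expose: each retry repeats the rebase and the test run, even when the other commit touched an unrelated project.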

I know that this example appears negative in the case where someone else committed to another project and a rebase is required, but that’s exactly the point.  It shows that this particular scenario is potentially a problem compared to today and/or other proposals.  A similar workflow could (should) be written for the sparse checkout monorepo, GitHub monorepo with svn, and submodules cases.  The submodules case will likely show that bisecting is more complex than on the monorepo, while pushing is simpler.
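
To illustrate the bisecting side of the submodules case, here is a rough sketch of a single bisect step. It assumes an umbrella repository that records each project as a git submodule; `bisect_step` and `RUN_TESTS` are hypothetical names, and the real build-and-test command is left as a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: one bisect step in an umbrella repository whose
# sub-projects are recorded as git submodules (an assumed layout).
# RUN_TESTS is a placeholder for the real build-and-test command.
: "${RUN_TESTS:=true}"

bisect_step() {
    # Move every submodule to the revision pinned by the umbrella commit
    # that bisect has just checked out. Status 125 asks "git bisect run"
    # to skip a commit that cannot be tested.
    git submodule update --init --recursive || return 125
    # Status 0 marks the commit good; a status from 1 to 127, other than
    # 125, marks it bad.
    $RUN_TESTS
}
```

A wrapper script that calls `bisect_step` can be handed to `git bisect run` from the umbrella repository. On a monorepo the `git submodule update` line would not be needed, which is the kind of extra complexity I would expect the submodules workflow to show.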

Similarly, the submodules workflow probably isn’t capable of a single commit to llvm and clang in the revlock case while the monorepo is, but we as a community need to decide whether we want to optimize for that or not.  I don’t have any data to suggest that revlock commits are frequent/infrequent or even a problem in general, and I don’t think we should optimize for that case unless it’s worth doing so.

Only by actually showing the use cases we care about can the community make an educated decision about what these proposals actually mean to our daily workflow.  We can then choose what we are optimizing for.  I personally want to have a very simple list of repos to clone from (or just one!) and for pushing to be easy, because those are the actions I perform the most often.  Others will have different use cases they care about and they can choose the proposal which suits them best.

Cheers,
Pete
> 
> On Tue, Aug 9, 2016 at 11:22 AM, Chris Bieneman <beanz at apple.com> wrote:
>> 
>> On Aug 9, 2016, at 10:08 AM, Mehdi Amini <mehdi.amini at apple.com> wrote:
>> 
>> 
>> On Aug 8, 2016, at 6:02 PM, Chris Bieneman <beanz at apple.com> wrote:
>> 
>> 
>> 
>> On Aug 8, 2016, at 5:09 PM, Mehdi Amini <mehdi.amini at apple.com> wrote:
>> 
>> 
>> On Jul 27, 2016, at 12:50 PM, Chris Bieneman via llvm-dev
>> <llvm-dev at lists.llvm.org> wrote:
>> 
>> 
>> On Jul 27, 2016, at 10:21 AM, Justin Lebar <jlebar at google.com> wrote:
>> 
>> Thanks for your thoughts, Chris.
>> 
>> As supporting evidence of this, I was discussing this thread around the
>> office yesterday and had quite a few people responding something
>> along the lines of “they’re proposing what?”.
>> 
>> 
>> I hope they'll join us in this thread.
>> 
>> Ultimately a survey is going to be strongly biased in favor of "don't
>> change anything".  There is a strong psychological bias to weight
>> losses more than gains, so if one doesn't engage with the issue, it's
>> only natural to conclude "keep it as similar as possible to what it is
>> today -- that is safe."  But that line of thinking does not
>> necessarily lead us to the best outcome.
>> 
>> 
>> I don’t agree with this assertion. I believe that if you put forth multiple
>> proposals, and have an articulate discussion of the merits and costs of each
>> solution you can create a survey that can help inform decision making. I
>> suppose we can agree to disagree.
>> 
>> 
>> We've heard in thread from a lot of developers about how a monorepo
>> would improve their workflow.  I would love to hear from some
>> developers who are actually affected in the way you describe, rather
>> than just considering the hypothetical.
>> 
>> My expectation is that the effect of the monorepo on said developers
>> would be relatively small -- we're talking about 1gb of disk space.  I
>> understand that there's a "yuck" factor to this, but inasmuch as there
>> aren't other concrete effects, this is just change aversion.  And
>> essentially all of the other effects of the monorepo can be hidden via
>> sparse checkouts, as we've discussed.
>> 
>> Maybe I am wrong.  But I don't think we're going to get to the bottom
>> of it without actually engaging with people who are actually affected
>> in the way you posit.
>> 
>> 
>> Ok, let me describe a few workflows I’ve used in the last year that are (in
>> my mind) adversely impacted by a mono-repo.
>> 
>> Case Study 1 - Simple development on a sub-project
>> 
>> I build LLVM + Clang + Compiler-RT using the just-built Clang to build
>> Compiler-RT. I iterate on some complicated Compiler-RT changes over a period
>> of a day. Once my Compiler-RT changes are done I rebase the compiler-rt
>> repo, rebuild compiler-rt then commit.
>> 
>> With a mono-repo rebasing the checkout means rebasing the whole tree. So,
>> either I have to wrangle some crazy git or CMake foo, or when I run “ninja
>> compiler-rt” after the rebase it will rebuild LLVM and Clang too. That kinda
>> sucks.
>> 
>> What this example illustrates to me is that today we have loosely coupled
>> projects with an occasional rev lock. Moving to a mono-repo enforces a tight
>> coupling that isn’t strictly required today.
>> 
>> Case Study 2 - Working on a sub-project in isolation across many platforms
>> 
>> I did a lot of work on Compiler-RT last year that had no direct dependency
>> on any other LLVM project. During the development I was working with a
>> Compiler-RT checkout and a build directory of just Compiler-RT. Every once
>> in a while (or every other day as it were) I would make a change that would
>> break a configuration that I wasn’t directly developing on. My workflow for
>> handling those cases was:
>> 
>> (1) Spin up a VM on a VPS that closely matched the configuration I broke
>> (2) Checkout Compiler-RT
>> (3) Reproduce, debug, fix the failure
>> (4) Commit the patch from the VM
>> 
>> In a mono-repository doing this would require checking out *all*
>> sub-projects, not just Compiler-RT. I imagine this probably isn’t a common
>> workflow, but it is one I use that would be adversely impacted by needing to
>> checkout a full LLVM. Now, you might say I could check out the sub-project
>> mirror, but then I can’t commit from the VM, which kinda sucks.
>> 
>> 
>> So for the “I spin a VM and want to make a commit but don’t want to download
>> a few hundred MBs with a git clone” story, it turns out that the github
>> bridge with SVN helps to optimize with a “lean” checkout:
>> 
>> I fork the unified repo here:
>> https://github.com/joker-eph/llvm-project/commits/master and then:  svn co
>> https://github.com/joker-eph/llvm-project/trunk/compiler-rt
>> 
>> So that’s a net “no regression” compared to the current state :)
>> 
>> 
>> Is the github SVN interface's "co" magically as fast as a git clone?
>> 
>> 
>> $ time svn co  https://github.com/joker-eph/llvm-project/trunk/compiler-rt
>> ….
>> real 0m8.539s user 0m0.919s  sys 0m1.917s
>> $ time git clone https://github.com/joker-eph/compiler-rt.git
>> real 0m5.487s user 0m1.208s sys 0m0.825s
>> 
>> 
>> That’s actually not terrible! Color me impressed.
>> 
>> 
>> 
>> If not, it is a performance regression because today I use git clone and
>> git-svn on my VMs just like on my physical machines, and either way it adds
>> some crazy complexity.
>> 
>> 
>> No problem, I get it, exactly the same workflow as today:
>> 
>> 
>> Yep. Which isn’t bad. I do however have two concerns.
>> 
>> (1) What happens if we move to pull request-based workflows? Do we still
>> support this workflow?
>> (2) If I’m stuck using git-svn I kinda feel like there is no real point in
>> changing anything. I dislike this workflow less than the earlier proposals,
>> but I see no reason to move to this instead of staying on SVN (other than
>> the hosting issues which could be solved in other ways).
>> 
>> -Chris
>> 
>> 
>> # Clone from the single read-only git repo
>> $ git clone https://github.com/joker-eph/compiler-rt.git
>> …
>> # Configure the SVN remote and initialize the svn metadata
>> $ cd compiler-rt
>> $ git svn init https://github.com/joker-eph/llvm-project/trunk/compiler-rt --username=
>> $ git config svn-remote.svn.fetch :refs/remotes/origin/master
>> $ git svn rebase -l
>> ...
>> # Remove an empty file and commit with git
>> $ git rm empty
>> $ git commit -m "remove empty file"
>> # commit/push with svn to the unified git repo
>> $ git svn dcommit
>> Committing to https://github.com/joker-eph/llvm-project/trunk/compiler-rt
>> ...
>> D empty
>> Committed r354148
>> 
>> 
>> Here is the commit:
>> https://github.com/joker-eph/llvm-project/commit/5f7e977c8cf3c33153d91be9b556143b49911ebe
>> 
>> 
>> --
>> Mehdi
>> 
>> 
>> 
>> While admittedly you do get a linear history with using the mono-repository,
>> that isn’t the only way to solve the problem, and I don’t really think that
>> the benefit (not needing to write some tooling) justifies the increased
>> burden applied to contributors that don’t use the full LLVM family of
>> projects.
>> 
>> 
>> I think the trade-off you're considering here (cost to developers who
>> use llvm plus a version-locked subrepo vs. cost to developers who
>> don't want an llvm clone) is the right one.
>> 
>> 
>> I actually think there are *a lot* more considerations we need to be making
>> for an infrastructure change like this. While it is true that our SCM
>> hosting strategy primarily impacts developers, it also impacts our users. We
>> should be conscious of the impact to downstream users in making
>> infrastructure changes like this. That is part of why the idea of a survey
>> holds appeal to me; it would give us the opportunity to get feedback from a
>> much wider audience than the current “people on llvm-dev who haven’t been
>> scared away”.
>> 
>> But as someone who has
>> extensively used git submodules and repo (a wrapper script), I
>> strongly disagree with the judgement that a monorepo would not be a
>> significant improvement.
>> 
>> Our primary disagreement, I think, is over how much cost there is to
>> "writing some tooling".  To me, this is a significant barrier standing
>> in the way of developer productivity.  Here at Google I did a quick
>> survey, and more than half of us don't have scripts of the sort that
>> Justin Bogner described.  We are all just floundering around rebasing
>> clang and llvm until it compiles.  It *sucks*.
>> 
>> 
>> I actually think we’re both talking about solutions that require tooling,
>> and while we *could* be disagreeing over how much effort each tooling
>> initiative would require (I think they’re pretty close, so I don’t care to
>> have that argument), my actual disagreement with your proposal is that it is
>> a change that impacts developers and users universally and I don’t think
>> that it is justified. Simply put, I don’t feel that the benefits are
>> substantial enough to warrant the kind of disruptive change you’re
>> proposing.
>> 
>> 
>> I suggest that saying that all of these developers are "doing it
>> wrong" is not helpful.
>> 
>> 
>> Maybe I’m missing something, but I don’t think I said anyone was “doing it
>> wrong”. Bisecting across multiple git repositories isn’t a great experience.
>> But neither is bisecting across a half dozen separate folders in an SVN
>> repository. Both the submodule solution and the mono-repo solution solve
>> this problem equivalently well.
>> 
>> Not everyone has the git and python/bash chops
>> to write the necessary scripts.  Not everyone has the personality to
>> obsessively script around stuff, or the desire to maintain said
>> scripts.  Not everyone works on llvm/clang so much that it's worth
>> adopting a special-snowflake workflow.  And some of us -- myself
>> included -- have extensive git scripts which work with the standard
>> git workflow but would be completely broken by adding a custom level
>> of indirection around git.
>> 
>> When put this way, maybe it's clear that it's actually a niche set of
>> people for whom "script around the brokenness" is a good solution.
>> 
>> 
>> I’m not sure what “brokenness” you’re referring to. We have a collection of
>> loosely connected projects by design. As a result of that intentional design
>> certain workflows will be impacted. I don’t think that is brokenness. I
>> think our loose coupling is a feature even if it makes some workflows
>> harder.
>> 
>> -Chris
>> 
>> 
>> As I've said a bunch of times above, we have to weigh a cost paid by
>> all of us every time we type a command that starts with "git" --
>> something we do tens or hundreds of times a day -- versus the one-time
>> cost of asking people to download 1gb of data.
>> 
>> On Wed, Jul 27, 2016 at 9:47 AM, Chris Bieneman via llvm-dev
>> <llvm-dev at lists.llvm.org> wrote:
>> 
>> I’m just now catching up on this massive thread after being on vacation last
>> week, and I have a few thoughts I’d like to share.
>> 
>> First and foremost please don’t consider lack of dissent on the thread as
>> presence of consensus. The various git-related threads on LLVM-dev lately
>> have been so active and contentious that I think a lot of people are zoning
>> out on the conversations. As supporting evidence of this, I was discussing
>> this thread around the office yesterday and had quite a few people
>> responding something along the lines of “they’re proposing what?”.
>> 
>> I think it would be great for us to have several different proposals for how
>> the git-transition could work, and have a survey to get people’s opinions. I
>> know this has been discussed repeatedly, and I want to put in my vote in
>> favor of having a survey that takes into account multiple different
>> approaches.
>> 
>> WRT the actual proposal in this thread, I’m strongly opposed to a
>> mono-repository. While I understand the argument that the full clone’s cost
>> on disk space is minimal compared to an LLVM object directory, what about
>> contributors who contribute to the smaller runtime projects but *not*
>> to LLVM or Clang? A contributor that only contributes to libcxx or
>> compiler-rt being forced to do a full clone of all the LLVM projects in
>> order to push a patch kinda sucks.
>> 
>> I want to point out a few workflows people may not be considering.
>> 
>> Clang can be built against an installed LLVM. I know this workflow is used
>> by some people because I’ve broken it in the past and had to fix it. With a
>> mono-repo this workflow gets a bit more complicated because you’d need to do
>> sparse checkouts, and it probably means we should just nuke the workflow
>> entirely because there is no real value added by having it.
>> 
>> Compiler-RT’s sanitizers are used with GCC; no LLVM required. While for the
>> common use case maintaining sparse repository mirrors would limit impact of
>> this on users, should any GCC user want to contribute to Compiler-RT, you’re
>> forcing them to clone a much larger repository than necessary.
>> 
>> The same problem with Compiler-RT’s sanitizers also applies to libcxx,
>> libcxxabi, libunwind, and potentially any other runtime library projects
>> that we may create in the future.
>> 
>> Beyond all that I want to point out that the git multi-repository story is
>> basically the same thing we have today with SVN except for the absence of a
>> monotonically increasing number that corresponds across repositories. While
>> admittedly you do get a linear history with using the mono-repository, that
>> isn’t the only way to solve the problem, and I don’t really think that the
>> benefit (not needing to write some tooling) justifies the increased burden
>> applied to contributors that don’t use the full LLVM family of projects.
>> 
>> I think we have some pretty strong evidence in the form of the github fork
>> counts (https://github.com/llvm-mirror/) that most people aren’t using all
>> of the LLVM projects. In fact, by that evidence Clang (the second most
>> popular project) is forked less than 2/3 as many times as LLVM.
>> 
>> -Chris
>> 
>> 
>> On Jul 26, 2016, at 11:31 AM, Renato Golin via llvm-dev
>> <llvm-dev at lists.llvm.org> wrote:
>> 
>> On 26 July 2016 at 19:28, Sanjoy Das via llvm-dev
>> <llvm-dev at lists.llvm.org> wrote:
>> 
>> Even if it were possible, I would still keep my upstream checkout
>> separate just as a safety measure, to keep from sending private stuff
>> upstream by accident.
>> 
>> 
>> Just FYI, this is our (Azul's) workflow as well, and for similar
>> reasons.
>> 
>> 
>> Same here.
>> 
>> cheers,
>> --renato
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev