[llvm-dev] [RFC] One or many git repositories?

Thu Jul 28 14:07:01 PDT 2016

Chris,

What I notice in your latest e-mail -- and I don't know if this is
intentional, so sorry if I'm reading too much into it -- is that the
language has switched from "an unwarranted and unacceptable burden" to
"a burden":

> Also, yes, an extra 400 MB of disk space when the repository for libcxx is only ~20 MB is a big deal to me. You’re not talking about a 10% or 20% increase in repository size, you’re talking about a 20x increase in repository size. That is a burden.
>
> To me, needing to run a script to do sparse checkouts is also a burden. Similarly I think that running a script to bisect a submodule repository (which is my proposal) is also a burden.

I have to admit that I'm pleased to see this softening of language,
because it's much easier for me to agree with this.  Yes, 400mb has a
nonzero cost (although, to nitpick, I don't think that the
multiplicative increase in space is germane).  Yes, running a script
has a nonzero cost.  Totally agree.
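
To make the nitpick concrete, here is a rough sketch of the arithmetic, using only the approximate sizes quoted in this thread. The function name and the dictionary are purely illustrative. The point it shows is that the absolute cost stays the same while the multiplier depends entirely on which baseline you divide by.

```python
# Approximate sizes quoted in this thread, in MB. These are not measurements.
EXTRA_MB = 400  # "an extra 400 MB" of history pulled in by the monorepo

BASELINES_MB = {
    "libcxx alone": 20,           # "~20 MB"
    "all runtime projects": 100,  # "less than 100MB"
}


def relative_increase(baseline_mb: float, extra_mb: float) -> float:
    """Return the extra size expressed as a multiple of the baseline size."""
    return extra_mb / baseline_mb


for name, size in BASELINES_MB.items():
    ratio = relative_increase(size, EXTRA_MB)
    print(f"{name}: +{EXTRA_MB} MB absolute, {ratio:.0f}x relative")
```

Whichever baseline you pick, the disk still fills up by the same ~400 MB.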

It sounds like we also agree that these costs should be weighed
against the benefits accrued to others, and furthermore that we'll
ultimately want to get wider input from the community about how
much to weigh (to use synecdoche) your 400mb versus my workflow
convenience.

> The only thing a monorepo gets you that strictly isn’t possible without it is the ability to commit to multiple projects in a single commit. Personally I don’t think that is a big enough justification, but that is my opinion, not a fact.

When pushing a set of multiple patches to one subproject, we very
explicitly want every patch in the sequence to build.  So to me, it
seems like we should want this same property for changes that affect
multiple subprojects.  Choosing a repository structure such that it's
impossible to achieve this in general seems Bad.
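
As a sketch of what "every patch in the sequence builds" means mechanically, something like the following could walk a range of commits and report the first one that fails to build. It assumes a single repository, so that one commit hash identifies the full source state. The function names are hypothetical and the build command is whatever you use locally.

```python
import subprocess
from typing import Callable, Iterable, List, Optional


def commits_in_range(rev_range: str) -> List[str]:
    """List the commit hashes in rev_range, oldest first."""
    out = subprocess.run(
        ["git", "rev-list", "--reverse", rev_range],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.split()


def builds_at(commit: str, build_cmd: List[str]) -> bool:
    """Check out commit (detached HEAD) and report whether build_cmd succeeds."""
    subprocess.run(["git", "checkout", "--quiet", commit], check=True)
    return subprocess.run(build_cmd).returncode == 0


def first_broken(commits: Iterable[str],
                 builds: Callable[[str], bool]) -> Optional[str]:
    """Return the first commit that fails to build, or None if all of them build."""
    for commit in commits:
        if not builds(commit):
            return commit
    return None
```

A call might look like `first_broken(commits_in_range("origin/master..HEAD"), lambda c: builds_at(c, ["ninja", "-C", "build"]))`, where the range and the ninja invocation are placeholders. Note that this leaves the working tree on a detached HEAD. With the subprojects in separate repositories there is no single hash to check out for a cross-project change, which is the property I'm worried about.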

But that's just me.

> While it is true that Clang developers may want or need the runtime libraries, the runtime library developers frequently don’t need clang. I really don’t want a solution that makes the lives of Clang developers easier at the expense of other subprojects unless it is strictly necessary and for a common “greater good”.

I too want to work towards a common greater good.  We may disagree
about "strictly necessary", but maybe we can set that aside for now.

It does seem that, although you may not be crazy about the monorepo,
you wouldn't come out swinging against it if it didn't include the
runtime libraries.  I'd call that major progress.
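
For reference, the "script to do sparse checkouts" discussed in this thread could be as small as the sketch below. It assumes a monorepo with one top-level directory per subproject (a hypothetical layout, nothing is decided) and uses git's long-standing `core.sparseCheckout` mechanism. The helper names are made up for illustration.

```python
import subprocess
from pathlib import Path
from typing import Iterable, List


def sparse_patterns(subprojects: Iterable[str]) -> List[str]:
    """Build one sparse-checkout pattern per top-level subproject directory."""
    return [f"/{name.strip('/')}/" for name in subprojects]


def enable_sparse_checkout(repo: Path, subprojects: Iterable[str]) -> None:
    """Restrict the working tree of an existing clone to the given directories."""
    subprocess.run(
        ["git", "-C", str(repo), "config", "core.sparseCheckout", "true"],
        check=True,
    )
    info_dir = repo / ".git" / "info"
    info_dir.mkdir(parents=True, exist_ok=True)
    patterns = "\n".join(sparse_patterns(subprojects)) + "\n"
    (info_dir / "sparse-checkout").write_text(patterns)
    # Re-read HEAD so files outside the patterns leave the working tree.
    subprocess.run(["git", "-C", str(repo), "read-tree", "-mu", "HEAD"], check=True)
```

Running `enable_sparse_checkout(Path("llvm-monorepo"), ["libcxx"])` once after cloning would leave only that directory checked out (the path is a placeholder). It trims the working tree only. The full history still lives under `.git`, so it does nothing for the disk-space objection.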

On Thu, Jul 28, 2016 at 1:41 PM, Chris Bieneman <beanz at apple.com> wrote:
>
> On Jul 28, 2016, at 12:05 PM, Justin Lebar <jlebar at google.com> wrote:
>
> The decision of whether or not to include these projects
> affects only read-write consumers of these projects -- of which there
> are relatively few people.
>
>
> Maybe there are few, but the impact is not insignificant. Also I think the
> opinions of the read-write consumers of the sub-projects being included
> should count for a lot
>
>
> I agree.
>
> as a read-write consumer I don’t like this proposal if it includes the
> runtime libraries.
>
>
> Point well-taken.
>
> The existence of subproject mirrors requires someone to write and maintain
> the tooling to keep those mirrors updated,
>
>
> I think you will find on this thread no shortage of people willing to
> maintain said mirrors in exchange for getting a monorepo as the
> canonical source of truth.
>
>
> Ok. Money where my mouth is time.
>
> Submodule repo:
>
> https://github.com/llvm-beanz/llvm-submodules
>
> Bot auto-updating it:
>
> http://beanz-bot.com:8180/jenkins/job/submodule-update/
>
> If we go down this path improvements can be made to the bot so that each
> submodule update commit only includes one submodule update. That would be
> fairly simple to add.
>
>
> and those mirrors will have all the technical hurdles and drawbacks that a
> submodule repository would have.
>
>
> I don't understand this.  The point of the mirrors is to allow people
> to use a read-only multirepo workflow.  I agree that if one chose to
> do so, one would incur all of the drawbacks of a multirepo workflow,
> but...that's the point?  Maybe I'm missing something.
>
>
> What I’m referring to is that since we don’t have the ability to run
> server-side hooks on github the submodule repositories will have some
> complications because they can’t automatically be updated, and the
> infrastructure to do so would have multiple points of failure.
>
> This limitation in github hosting was discussed in at least one of the
> github related threads.
>
>
> The question here is: Do you make downstream single project users work off
> potentially unreliable mirrors, or do you make the people who need a
> mono-repo experience work off a potentially unreliable submodule repo?
>
>
> I agree with the gist of this question, but I want to refine the
> trade-off a bit.
>
> With a monorepo, downstream single-project users actually have two
> options.  They can work off the mirrors, or they can just download the
> whole thing.  So with the monorepo, downstream single-project users
> are not forced to work off noncanonical mirrors.  They are only
> "forced" to do so if they are unable or unwilling to download a 500mb
> repo and throw away most of it.  Which I think may actually be
> relatively few people.  But what do I know?
>
>
> I think we have evidence that many of our projects are used in isolation by
> relatively large numbers of users. Whether or not those users would be
> sufficiently inconvenienced to do something about a mono-repo is a harder
> thing to know.
>
> In the submodule approach this isn’t really an issue because users will
> continue to work as they always have with the per-project repositories, and
> the developers who need bisecting capabilities can clone the submodule repo,
> which can also be used as read-write for making changes to the subprojects.
>
>
> Anyway my answer to this question has been and still is, that a
> monorepo is strictly more powerful than a multirepo.
>
>
> For one thing, we can atomically commit across subprojects using a
> monorepo.  On IRC I've had a bunch of people just begging me for this.
>
> Putative scripts that allow monorepo users to commit to the multirepo
> would not be able to translate cross-cutting commits into a single
> commit in the umbrella repository without cooperation from the script
> that translates commits to the multirepos into commits in the umbrella
> repository (that's the one that contains all the multirepos as git
> subrepositories).  It's possible -- it's Turing-complete -- but it
> would be very complicated.
>
> Still more complicated would be writing a script that would allow
> monorepo users to push to putative try bots that are based off the
> multirepo.  Again anything is possible, but I have written and
> maintained similar software in the past (for a significantly simpler
> setup) and it was fragile as heck, and again this is going to require
> extensive cooperation between us and the multirepo --> umbrella repo
> script.
>
>
> For cross-repository changes I am fairly certain you could construct
> something that can be pushed to a try bot based on the submodule repository.
> There is no technical reason that shouldn’t work, and I don’t even think the
> scripting around that would be terribly complicated. Admittedly that is more
> complicated than just writing a pull request to a single repository, but I
> suspect not much. I may look into that.
>
>
> In contrast, as discussed earlier, if people want a multirepo-like
> setup based on the monorepo, we can reduce this to a single command
> run once when the repository is cloned.  It ends up being far less
> fragile, and requiring far fewer (actually, zero) tricks on the server
> side.
>
>
> The only thing a monorepo gets you that strictly isn’t possible without it
> is the ability to commit to multiple projects in a single commit. Personally
> I don’t think that is a big enough justification, but that is my opinion,
> not a fact.
>
>
> Instead, let’s talk about DragonEgg.
>
>
> +1.
>
> The DragonEgg project is, as far as I can tell, abandoned, but it is still
> an LLVM project that is tightly coupled to LLVM versions. So it meets
> criterion #1. I think it fails to meet criterion #2 because DragonEgg is
> basically abandoned and provides no real value to the community. Even though
> the burden of a dead project on the mono-repo is minuscule, I think there is
> no good reason to include DragonEgg.
>
>
> If DragonEgg is abandoned, I think we should keep the history in our
> repository and just delete it from head.
>
> My argument for keeping it in our history is: Suppose we go with a
> monorepo, and suppose at some point in the future, some other LLVM
> project -- say, lld -- became abandoned.  Would we rewrite our
> monorepo history to erase all trace of lld, because it no longer
> provides value to us?
>
> No, right?  lld's history is part of our history.  We'd just delete it
> from head and move on with our lives.
>
> My arguments are from the perspective of someone working on the runtime
> library projects: for us, being included in the llvm mono-repo is a
> significant burden. While the full history of LLVM is around 500MB, the full history
> of *all* the runtime projects is less than 100MB.  Developers working on
> libcxx or compiler-rt should not need to clone LLVM, and run commands to do
> sparse checkouts. That is more burden than we should incur. Further the
> setup cost of doing multiple sparse checkouts in order to approximate the
> workflows we have today with decoupled projects is, IMO, unnecessary and
> unreasonable.
>
>
> OK, just to make sure I understand your point here, because this is
> important, you are saying that you object to including libcxx and
> compiler-rt in the llvm monorepo because:
>
> * It would consume an additional ~400mb of disk space, and
> * It's unnecessary and unreasonable to ask libcxx etc. developers to
> run a script when they check out the monorepo if they want a sparse
> checkout and/or a setup that mirrors the multirepo.
>
>
> I'm not trying to put words in your mouth or subtly change what you're
> saying, so please let me know if I didn't get that right.
>
>
> I have a lot of arguments against the runtime libraries being included.
> First and foremost, they don’t meet the “tightly coupled” criteria. Also,
> yes, an extra 400 MB of disk space when the repository for libcxx is only
> ~20 MB is a big deal to me. You’re not talking about a 10% or 20% increase
> in repository size, you’re talking about a 20x increase in repository size.
> That is a burden.
>
> To me, needing to run a script to do sparse checkouts is also a burden.
> Similarly I think that running a script to bisect a submodule repository
> (which is my proposal) is also a burden. I can’t judge which burden is more
> significant because I don’t know how many people bisect. What I can say is
> that it is my belief that I’m not the only person who works on runtime
> projects in isolation. As a potential example (because I don’t want to put
> words into anyone’s mouth), Marshall Clow and Eric Fiselier have made *a ton*
> of contributions to libcxx over the last year, but neither of them is a
> frequent contributor to LLVM or Clang.
>
> While it is true that Clang developers may want or need the runtime
> libraries, the runtime library developers frequently don’t need clang. I
> really don’t want a solution that makes the lives of Clang developers easier
> at the expense of other subprojects unless it is strictly necessary and for
> a common “greater good”.
>
> -Chris
>
>
> Thanks again for all your time here.
>
> -Justin
>
> On Thu, Jul 28, 2016 at 11:28 AM, Chris Bieneman <beanz at apple.com> wrote:
>
>
> On Jul 28, 2016, at 10:53 AM, Justin Lebar <jlebar at google.com> wrote:
>
> Thanks again for your thoughts, Chris.
>
> As a straw man I would suggest the following criteria for inclusion into the
> mono-repo:
>
> (1) Projects in the mono-repo must be tightly coupled to specific versions
> or commits of other projects in the mono-repo
>
>
> I'm fine with that, fwiw.  That was in fact the original proposal.
>
>
> That is the wording of the original proposal, but I disagree that it is the
> content of the original proposal. I don’t believe that Compiler-RT is
> tightly coupled to LLVM at all, which is a big source of my disagreement
> here.
>
> I'm also fine if we decide to put everything inside the monorepo.  I
> think Richard Smith had some good arguments for why they belong
> together.
>
> But I am really surprised that you think this is such a big deal that
> you would object to the whole monorepo if this decision doesn't go
> your way.
>
>
> I really hate your phrasing on this. I’m not objecting to this proposal just
> because some minor decision doesn’t go my way. I think this is a very
> crucial point of whether or not the monorepo solution’s benefit outweighs
> its cost.
>
> The decision of whether or not to include these projects
> affects only read-write consumers of these projects -- of which there
> are relatively few people.
>
>
> Maybe there are few, but the impact is not insignificant. Also I think the
> opinions of the read-write consumers of the sub-projects being included
> should count for a lot, and as a read-write consumer I don’t like this
> proposal if it includes the runtime libraries.
>
> Read-only consumers *are entirely
> unaffected by the decision*, as they can continue to use the read-only
> subproject mirrors exactly as today.
>
>
> The existence of subproject mirrors requires someone to write and maintain
> the tooling to keep those mirrors updated, and those mirrors will have all
> the technical hurdles and drawbacks that a submodule repository would have.
>
> The question here is: Do you make downstream single project users work off
> potentially unreliable mirrors, or do you make the people who need a
> mono-repo experience work off a potentially unreliable submodule repo?
>
> I think the only answer anyone can reasonably give to this is that we don’t
> have enough information to make a reasonable decision that maximizes the
> benefits to most users while minimizing the adverse impacts. Hence why I
> keep saying we need a survey to understand how *people* interact with the
> project and what kinds of workflows are important. I emphasize the word
> “people” in that last sentence because this decision impacts the
> contributors to the community, and downstream users. We need to take all
> perspectives into account when making this kind of infrastructure decision.
>
>
> (2) The projects in the mono-repo must provide wide benefit to the community
> such that the overall community benefit outweighs the impacts of the project
> being in the repo
> (3) Projects in the mono-repo must conform to some defined set of standards.
> LLVM’s coding standards might be a bit much, but something along those
> lines.
>
>
> Would you mind explaining why you think the criteria for inclusion in
> the monorepo should be different than the criteria for inclusion as an
> LLVM subproject?
>
>
> For starters, including things as LLVM subprojects doesn’t require that they
> meet criterion #1 in my proposal. Simply put, they don’t need to be tightly
> coupled to LLVM. We have many examples of that.
>
>
> I think these are fine criteria -- for inclusion of code as an LLVM
> subproject.  But it seems to me -- and maybe I'm wrong -- that the
> reason you're proposing them is that there exist today LLVM
> subprojects that are version-locked to other projects but you think do
> not meet these criteria, and therefore you want to exclude them from
> the monorepo.  Is that right?  lldb comes to mind, as it wasn't in
> your list above.
>
> I understand that lldb is persona non grata in some circles.  But.
> It's not right to use the source code migration as a tool to revisit
> an old decision like this.  That is procedurally unjust.  The relevant
> decision should be, "is LLDB an LLVM subproject that is version-locked
> to other subprojects, or not?"
>
>
> I really don’t want to debate LLDB. It is a hot issue for a lot of people,
> and I’d really prefer if we didn’t start a “let’s all rag on lldb” thread.
>
> Instead, let’s talk about DragonEgg. The DragonEgg project is, as far as I
> can tell, abandoned, but it is still an LLVM project that is tightly coupled
> to LLVM versions. So it meets criterion #1. I think it fails to meet criterion
> #2 because DragonEgg is basically abandoned and provides no real value to
> the community. Even though the burden of a dead project on the mono-repo is
> minuscule, I think there is no good reason to include DragonEgg.
>
> Do you disagree?
>
>
> If you feel strongly that we should reevaluate every project on the
> basis of these last two criteria before including them in the
> monorepo, would you mind elaborating on what exactly are the harms of
> including a project that isn't up to snuff?
>
>
> Every project that is added to the mono-repo will incur a small cost to
> developers in terms of the size it adds to the repository, and the tooling
> or workflow adjustments to handle the change. In most cases this will be
> minimal, even negligible. However I think the burden on runtime developers
> is significant.
>
> If you are aesthetically
> displeased by a project, you can hide it using sparse checkouts.  And
> nobody is going to make you build it.  At that point, the only cost I
> can think of from including a project is the bytes on disk.  But since
> the full history of all LLVM subprojects (excluding test-suite) is
> 500mb (*), surely you're not going to argue for the exclusion of (say)
> lldb on the grounds of saving 25mb (or whatever)?
>
>
> I won’t argue over lldb at all. My arguments are from the perspective of
> someone working on the runtime library projects: for us, being included in
> the llvm mono-repo is a significant burden. While the full history of LLVM is
> around 500MB, the full history of *all* the runtime projects is less than
> 100MB. Developers working on libcxx or compiler-rt should not need to clone
> LLVM, and run commands to do sparse checkouts. That is more burden than we
> should incur. Further the setup cost of doing multiple sparse checkouts in
> order to approximate the workflows we have today with decoupled projects is,
> IMO, unnecessary and unreasonable.
>
> Those arguments go away if you follow criteria that exclude runtime projects
> from the mono-repo.
>
> -Chris
>
>
> -Justin
>
> (*) I'd called it 1.2gb before, but Bruce Hoult set me straight.
>
> On Thu, Jul 28, 2016 at 10:21 AM, Chris Bieneman via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>
>
> On Jul 28, 2016, at 12:59 AM, Renato Golin via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>
> On 28 Jul 2016 8:36 a.m., "David Chisnall via llvm-dev"
> <llvm-dev at lists.llvm.org> wrote:
>
> This does not apply to libc++.  We support building the entire LLVM suite
> with other C++ standard library implementations (at least libstdc++, and I
> think also with Visual Studio’s implementation), so there is no dependency
> of anything on libc++.  Similarly, we support building libc++ with other
> compilers (in FreeBSD, we currently build it with gcc 6.1 for RISC-V, for
> example, where the LLVM toolchain is not quite useable).
>
> The same applies to libunwind, to an even greater degree (where libc++
> implements a standard API, libunwind implements a standard ABI).
>
>
> I think the dependencies of lib* in LLVM are more conceptual than version
> lock, but they're still there.
>
> I agree with you in all other points, mind you, but RT needs an unwind
> library as much as it needs clang. Without them, RT "can" (and indeed does)
> work, but we're not providing a complete solution.
>
> I won't *push* to bundle libunwind, libcxxabi (and ultimately libcxx) on
> those merits alone, but my opinion is that we should. I can't see much use
> in RT without them. That's why we're still defaulting to libgcc on Linux.
>
> Renato, I just want to point out that the Compiler-RT story is *WAY* more
> complicated than it might seem from your comments here. Compiler-RT is
> really two or three conceptually different things that happen to be in the
> same project, and parts of it are very useful without libunwind, libcxxabi,
> and libcxx.
>
> For example, the Compiler-RT sanitizers are used with GCC and libgcc. They
> can be built to be used with libstdc++ as well as libc++ (although I do
> think that loses some features).
>
> I would not object to a mono-repo that included LLVM, Clang, LLD, and
> Clang-Tools-Extra. I strongly object to any mono-repo that includes any of
> the runtime library projects. I also think that once you move away from the
> “mono-repo including all” you need to identify criteria for how you
> determine which projects get included, and potentially how you evaluate
> adding projects to the mono-repo.
>
> As a straw man I would suggest the following criteria for inclusion into the
> mono-repo:
>
> (1) Projects in the mono-repo must be tightly coupled to specific versions
> or commits of other projects in the mono-repo
> (2) The projects in the mono-repo must provide wide benefit to the community
> such that the overall community benefit outweighs the impacts of the project
> being in the repo
> (3) Projects in the mono-repo must conform to some defined set of standards.
> LLVM’s coding standards might be a bit much, but something along those
> lines.
>
> Thoughts?
>
> -Chris
>
> My tuppence.
>
> Cheers,
> Renato
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>