[llvm-dev] [RFC] One or many git repositories?

Mon Jul 25 17:31:19 PDT 2016

> 2. Those working on projects *outside* the monolithic repo will get the downsides of both: a monolithic repo that they are only using parts of, and multiple repos that are somehow version-locked.

We've addressed the first downside -- that you have to download a
bigger repo -- extensively above.  The gist is that the cost is minimal.  The
additional disk space is very small compared to e.g. the size of an
llvm objdir.  If you don't want to look at the files from the projects
you don't hack on, you can hide them using git sparse checkouts.  And
obviously we're not going to make everybody build every project --
we'll change the build system so it doesn't use "directories present
in tree" to determine which projects to build by default.
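
As a rough sketch of the sparse checkout idea (not something from this
thread): assuming a flat layout with one top-level directory per project,
the patterns file could be generated along these lines.  The helper name
and the project directory names are hypothetical.

```python
from typing import Iterable, List


def sparse_checkout_patterns(projects: Iterable[str]) -> List[str]:
    """Build the lines for .git/info/sparse-checkout.

    Assumption: each project lives in its own top-level directory, so one
    anchored directory pattern ("/name/") per project is enough.
    """
    return [f"/{name.strip('/')}/" for name in projects]


# Hypothetical example: only keep llvm and clang in the working tree.
patterns = sparse_checkout_patterns(["llvm", "clang"])
```

The resulting lines would be written to .git/info/sparse-checkout after
running "git config core.sparseCheckout true"; "git read-tree -mu HEAD"
then updates the working tree to match.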

With respect to the second part, I agree that the monolithic
repository wouldn't be a big help for people who are consuming llvm
externally, and further that the extra bits consumed by our new
repository would have a nonzero cost.  But the downsides are so small
that I think it would be a serious mistake to weigh them heavily.

> I'd be interested in hearing via the survey which path (separate repos vs. monolithic) causes the most workflow disruption.

As phrased, this is begging the question.

The question is, what choice is best?  One dimension of "best"
certainly is "minimizes workflow disruption."  But that's not the only
one, nor even (necessarily) the most important one.  Certainly we
don't have to send out a survey to conclude that the non-monolithic
repository would change workflows the least.  :)

If we really are going to use a survey as an important signal in the
decision here, I would want us to be thoughtful about how we design
it, so that we have a chance of getting informed -- rather than gut --
opinions.  At the very least I'd want every option to be accompanied
by pros/cons assembled by people who advocate for the position.

But personally I'd much prefer if we could just engage here as much as
possible.  We have changed a number of people's minds, and that's just
not possible with a survey.

On Mon, Jul 25, 2016 at 3:04 PM, Duncan P. N. Exon Smith
<dexonsmith at apple.com> wrote:
> A couple of points that I haven't seen raised yet (I'm mid-vacation so this is pretty much a fly-by; sorry if I missed these earlier in the thread).
>
> I haven't thought about this as deeply as the rest of you.  Maybe these are easily refuted?
>
> 1. If there's a move toward a monolithic repository, it's important to remember this is the LLVM project, not the Clang project.  A nested layout that optimizes for Clang developers at the expense of everyone else's workflow would be a disservice to the greater community, even if it's a temporary "while we figure out what really makes sense" kind of state.  For that reason, I'm against a monolithic layout that has "clang" living at "tools/clang"... or, really, having any sub-project live inside of the "llvm" directory.
>
> 2. Those working on projects *outside* the monolithic repo will get the downsides of both: a monolithic repo that they are only using parts of, and multiple repos that are somehow version-locked.
>
> 3. For many (most?) developers, changing to a monolithic git repo is a *bigger* workflow change than switching to separate git repos.  Many people (and at least some downstream infrastructure) use the git mirrors exclusively, aside from git-svn for committing.
>
> #1 and #2 don't negatively impact Clang developers really -- and we have the loudest voices -- but we should be intentional about any changes here.
>
> I mention #3 specifically to address Richard's claim that a monolithic, nested, git repo is a smaller change than separate git repos.  On the contrary, with separate git repos, I just need to update a couple of remotes and I'm finished.
> - If "minimize incremental change" is important, we should start with separate git repos (since only SVN users need to change their workflow).
> - If "minimize number of changes" is important, we should figure out a close approximation of the end goal and move directly there.
>
> Two specific replies below.
>
>> On 2016-Jul-25, at 12:54, Justin Lebar via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>
>> Hi, all.
>>
>> I feel like we've strayed pretty far from the question originally
>> posed in this thread.
>>
>> One of the pieces of feedback I got before I started this thread was
>> that many people felt that, the last time the question of multiple
>> repos vs. monorepo was discussed, it was interspersed with other
>> topics, making it difficult for some people to weigh in appropriately
>> (or even to be aware that the discussion was occurring).  I'm afraid
>> that the discussion of github workflows we're having here may cause
>> the same problem.
>>
>> Maybe we can move the discussion about github workflows into a
>> different thread?  Again, I don't mean to stop it, just move it.
>>
>> To re-focus this thread on its original topic: It sounds to me like,
>> broadly speaking, we have consensus on using a single repository.
>
> I'm not convinced.  I'd be interested in hearing via the survey which path (separate repos vs. monolithic) causes the most workflow disruption.
>
>> But
>> there are still some outstanding related questions.  Among these are:
>>
>> 1) Should the repository have "unified history"?  (Meaning, should I
>> be able to check out a single git revision from before the migration
>> and have it contain all of the llvm subprojects?)
>>
>> 2) Should the monorepo have a "nested" repository layout (e.g. clang
>> goes in /tools/clang) or a "flat" layout (clang goes in /clang)?
>>
>> 3) Assuming we want unified history, should the new canonical
>> repository's hashes be based on
>> https://github.com/llvm-project/llvm-project, or should it start
>> afresh?
>>
>> FWIW my answers to these are:
>>
>> 1) Yes to unified history.  The main advantage of non-unified history
>> is that it's easier for people to import old branches -- it's a matter
>> of "git merge" instead of running the git filter-branch script I
>> wrote.  But this is a relatively small (~20 minute) one-time cost to
>> some of us, whereas our repository history is borne by all of us
>> forever.  Moreover unified history also helps people with long-running
>> branches, as it lets them check out old versions of their branch and
>> get a compatible version of all of the other llvm subprojects.
>>
>> 2) Yes to nested layout.  I find Chandler and Richard Smith's
>> arguments compelling.
>
> I disagree with having "clang" nested inside "llvm".
>
>> 3) No to basing the new canonical repo on
>> https://github.com/llvm-project/llvm-project.  That repo's history is
>> missing svn revision numbers, and there are enough emails floating
>> around that reference svn revision numbers that I think we need them
>> in our canonical repo.  Also llvm-project/llvm-project has a flat
>> structure, and if we end up going with a nested layout, it would be
>> better to have that layout starting with the first commit.
>>
>> -Justin
>>
>> On Mon, Jul 25, 2016 at 8:10 AM, Bruce Hoult via llvm-dev
>> <llvm-dev at lists.llvm.org> wrote:
>>> git-imerge can run an arbitrary script to decide whether a commit is good or
>>> bad. Lack of textual merge conflicts is only the most basic test -- you can
>>> check that it compiles, run tests .. whatever you want and have time to
>>> execute.
>>>
>>> On Tue, Jul 26, 2016 at 2:12 AM, Robinson, Paul via llvm-dev
>>> <llvm-dev at lists.llvm.org> wrote:
>>>>
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Renato Golin [mailto:renato.golin at linaro.org]
>>>>> Sent: Monday, July 25, 2016 7:11 AM
>>>>> To: Daniel Sanders
>>>>> Cc: Robinson, Paul; llvm-dev at lists.llvm.org
>>>>> Subject: Re: [llvm-dev] [RFC] One or many git repositories?
>>>>>
>>>>> On 25 July 2016 at 14:55, Daniel Sanders <Daniel.Sanders at imgtec.com>
>>>>> wrote:
>>>>>> I know of a way but it's not very nice. The gist of it is to checkout
>>>>>> the downstream branch just before the bad merge and then merge the
>>>>>> first 100 commits from upstream. If the result is good then merge the
>>>>>> next 100, but if it's bad then 'git reset --hard' and merge 10 instead.
>>>>>> You'll eventually find the commit that made it bad. Essentially, the
>>>>>> idea is to make a throwaway branch that merges more frequently. I do
>>>>>> something similar to rebase my work to master since gradually rebasing
>>>>>> often causes all the conflicts to go away.
>>>>>
>>>>> This is essentially what git-imerge does, you only need to define
>>>>> "good merge" in the form of a script or CI job.
>>>>>
>>>>> cheers,
>>>>> -renato
>>>>
>>>> Except I understood git-imerge to be looking for physical conflicts,
>>>> not "when did this test start failing."  If it does the latter also,
>>>> that would be awesome.
>>>> --paulr
>>>>
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
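
For reference, the batched merge search that Daniel Sanders describes in
the quoted text above could be sketched roughly as follows.  This is only
an illustration, not anyone's actual tooling.  The function name, the
is_good callback (standing in for "merge the first n upstream commits onto
a throwaway branch, then build and test"), and the final batch size of 1
are assumptions; only the 100 and 10 come from the thread.  It also assumes
that once the merged result turns bad, it stays bad.

```python
from typing import Callable, Optional, Sequence


def find_first_bad_commit(
    commits: Sequence[str],
    is_good: Callable[[int], bool],
    batch_sizes: Sequence[int] = (100, 10, 1),
) -> Optional[str]:
    """Return the first upstream commit whose merge gives a bad result.

    commits     -- upstream commits, oldest first.
    is_good(n)  -- hypothetical check: True if merging the first n commits
                   yields a good result (e.g. it builds and tests pass).
    batch_sizes -- decreasing batch sizes; must end in 1 so the search can
                   pin down a single commit.
    Returns None if every commit merges with a good result.
    """
    if not batch_sizes or batch_sizes[-1] != 1:
        raise ValueError("batch_sizes must end with 1")

    merged = 0  # number of upstream commits known to merge with a good result
    level = 0   # index into batch_sizes
    while merged < len(commits):
        step = min(batch_sizes[level], len(commits) - merged)
        if is_good(merged + step):
            merged += step           # keep this merge and move on
        elif step == 1:
            return commits[merged]   # this single commit made it bad
        else:
            # Equivalent of 'git reset --hard': drop the bad merge and
            # retry from the same point with a smaller batch.
            level = min(level + 1, len(batch_sizes) - 1)
    return None
```

In practice is_good would wrap whatever script or CI job defines a "good
merge", which is the same hook Renato and Bruce describe for git-imerge.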