[PATCH] D24167: Moving to GitHub - Unified Proposal

Mon Oct 10 11:02:21 PDT 2016

> On 2016-Oct-09, at 23:02, Mehdi Amini <mehdi.amini at apple.com> wrote:
> 
>> 
>> On Oct 6, 2016, at 5:05 PM, Duncan P. N. Exon Smith <dexonsmith at apple.com> wrote:
>> 
>>> 
>>> On 2016-Oct-06, at 15:16, Mehdi Amini <mehdi.amini at apple.com> wrote:
>>> 
>>> Hi Duncan
>>> 
>>> Thanks for your great feedback and suggestions!
>>> I integrated all the inline comments I haven’t answered below, and updated the revision on Phab.
>>> 
>>>> On Oct 5, 2016, at 11:26 PM, Duncan P. N. Exon Smith <dexonsmith at apple.com> wrote:
>>>> 
>>>> I think this proposal is getting close.  The content is great, and the
>>>> workflow examples are really useful.
>>>> 
>>>> The current high-level structure, which interleaves the two main
>>>> competing variants, makes it tough to evaluate the variants in
>>>> isolation.  The current structure is something like this:
>>>> 
>>>> - Introduction
>>>> - What This Proposal is *Not* About
>>>> - Why Git, and Why GitHub?
>>>>    - Why Move At All?
>>>>    - Why Git?
>>>>    - Why GitHub?
>>>>    - On Managing Revision Numbers with Git
>>>>    - What About Branches and Merges?
>>>>    - What About Commit Emails?
>>>> - One or Multiple Repositories?
>>>>    - How Do We Handle A Single Revision Number Across Multiple
>>>>      Repositories?
>>>> - Workflow Before/After
>>>>    - Checkout/Clone a Single Project, without Commit Access
>>>>        - Currently
>>>>        - Multirepo Proposal
>>>>        - Monorepo Proposal
>>>>    - Checkout/Clone a Single Project, with Commit Access
>>>>        - Currently
>>>>        - Multirepo Proposal
>>>>        - Monorepo Proposal
>>>>    - Checkout/Clone Multiple Projects, with Commit Access
>>>>        - Currently
>>>>        - Multirepo Proposal
>>>>        - Monorepo Proposal
>>>>    - Commit an API Change in LLVM and Update the Sub-projects
>>>>        - Currently
>>>>        - Multirepo Proposal
>>>>        - Monorepo Proposal
>>>>    - Branching/Stashing/Updating for Local Development or Experiments
>>>>        - Currently
>>>>        - Multirepo Proposal
>>>>        - Monorepo Proposal
>>>>    - Bisecting
>>>>        - Currently
>>>>        - Multirepo Proposal
>>>>        - Monorepo Proposal
>>>>    - Living Downstream
>>>> - Monorepo Variant
>>>> - Previews
>>>> - Remaining Issues
>>>> - Straw-man Migration Plan
>>>> 
>>>> IMO, we can paint a clearer picture of the variants by restructuring
>>>> like this:
>>>> 
>>>> - Introduction
>>>> - What This Proposal is *Not* About
>>>> - Why Git, and Why GitHub?
>>>>    - Why Move At All?
>>>>    - Why Git?
>>>>    - Why GitHub?
>>>>    - On Managing Revision Numbers with Git
>>>>    - What About Branches and Merges?
>>>>    - What About Commit Emails?
>>>> - Straw-man Migration Plan
>>>> - Variant #1: Multirepo: One repository per subproject
>>>>    - Preview
>>>>    - Why is this great?
>>>>    - Some people are afraid because...
>>>>    - Workflow: Checkout/Clone a Single Project, without Commit Access
>>>>    - Workflow: Checkout/Clone a Single Project, with Commit Access
>>>>    - Workflow: Checkout/Clone Multiple Projects, with Commit Access
>>>>    - Workflow: Commit an API Change in LLVM and Update the
>>>>      Sub-projects
>>>>    - Workflow: Branching/Stashing/Updating for Local Development or
>>>>      Experiments
>>>>    - Workflow: Bisecting
>>>>    - Options for Living Downstream
>>>> - Variant #2: Monorepo: One repository, full stop
>>>>    - Preview
>>>>    - Why is this great?
>>>>    - Some people are afraid because...
>>>>    - Workflow: Checkout/Clone a Single Project, without Commit Access
>>>>    - Workflow: Checkout/Clone a Single Project, with Commit Access
>>>>    - Workflow: Checkout/Clone Multiple Projects, with Commit Access
>>>>    - Workflow: Commit an API Change in LLVM and Update the
>>>>      Sub-projects
>>>>    - Workflow: Branching/Stashing/Updating for Local Development or
>>>>      Experiments
>>>>    - Workflow: Bisecting
>>>>    - Options for Living Downstream
>>>> - Variant #3: Hybridrepo: One repository per non-revlocked subproject
>>>>    - Why is this great?
>>>>    - Some people are afraid because...
>>>> 
>>>> Some differences that I'd like to point out:
>>>> 
>>>> - Moved the migration plan up before the variants.  There's really
>>>>  no order dependency, and this gets the common elements out of the
>>>>  way before "the fork".
>>>> - Incorporated remaining issues into the migration plan.
>>>> - Variants #1 and #2 are described without referring to each other,
>>>>  so it's clear exactly how they're different from what we do now.
>>> 
>>> This is a tradeoff: we’re losing the clear difference between the two, which is a primary goal I had in mind when starting this document. I’ll prepare an alternative layout.
>> 
>> As we discussed in person, if you add links to jump back and forth between the multirepo and monorepo versions of each workflow, the reader can see exactly how they differ.  I think that gets the best of both worlds.
> 
> As we discussed in person, I believe the side-by-side comparison is important, and I don’t want to lose it.
> 
> I quickly tried an alternate layout here: http://htmlpreview.github.io/?https://raw.githubusercontent.com/joker-eph/llvm/githubmove/GitHubMove.html

This is quite hard to read.  With the horizontal scrolling, I can only see 1.5 of the options at a time, and it's not quite clear what I'm missing off-screen.

I imagine this would be even worse on a phone, but I haven't tried.

>>>>> +
>>>>> +As another example, some developers think that the division between e.g. clang
>>>>> +and clang-tools-extra is not useful. With the monorepo, we can move code around
>>>>> +as we wish and preserve history.
>>>>> +With the multirepo, refactoring some functions from clang to make it part of a
>>>>> +utility in one of the llvm/lib/Support file to share it across sub-projects
>>>> 
>>>> "...utility in libSupport to share it across sub-projects..."
>>>> 
>>>>> +wouldn't carry the history of the code in the llvm repo, breaking recursively
>>>>> +applying `git blame` for instance.
>>>> 
>>>> "...repo, complicating `git blame`.”
>>> 
>>> Here we can’t blame in the repo where the code is. It is not complicated, but impossible (which is why I use “breaking”).
>>> Now, you can blame by pulling the original repo, finding where the code was removed, and starting the blame from there.
>> 
>> And you could add tooling to do this, which uses `git blame`.  And, we could have a policy of notating things in commit messages in some way to make the tooling more powerful or efficient.  It's not impossible.
> 
> /me cough cough…

??

>> 
>> It's also not a regression from the current workflow, so it'll be hard to convince me that it breaks anything.
> 
> You can’t recursively apply `git blame` from the repo where you’re looking at the code; that’s what I believe is written. Are you disagreeing with this?
> The sentence does not compare to the current situation: it does not say it breaks someone’s current workflow, it says exactly what it breaks. I can write “prevent” as well if that makes you feel more comfortable.

What I'm disagreeing with, on the contrary, is the habit of talking about the benefits of monorepo in both the monorepo section (as a benefit!) and in the multirepo section (as a downside!).

Assuming the monorepo does a great job on something:

1. If the current system does well and the multirepo doesn't, you should mention it (somewhere) as a downside of multirepo.  It's not relevant for monorepo at all.

2. If the current system does poorly and the multirepo is the same, you should mention it (somewhere) as a benefit of monorepo.  It's not relevant for multirepo at all.

This is a case of #2.
- I strongly disagree with talking about it in multirepo.  That gives an initial, misleading impression that it's a regression from the current system.  It also duplicates an argument, which makes it hard to understand the full set of arguments in the document.
- I also disagree with talking about it as a downside of the current system.  Compared to the current text, at least the first impression wouldn't be misleading.  However, it's still duplicating the argument.  I'd like to clarify this document by repeating things as little as possible.

>>> “complicating” on the other hand is a bit subjective, and also does not totally capture this. Do you have better?
>> 
>> The word "complicating" may not be precise, but it's accurate and impossible to contradict: 
> 
> I don’t understand how something can be subjective and “accurate” or “impossible to contradict”. That by itself seems contradictory to me.

I disagree that it's subjective whether blaming code with git blame is more complicated with multiple repos vs a single repo.  Yes, you can use tooling to make it better.  No, you can't just run git blame.
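
As a rough sketch of what such tooling would have to do, here are the two manual steps described earlier in the thread (find the commit that removed the code from the original repo, then blame from its parent) written as command builders.  The function names, the repo directory, and the file path are hypothetical; only `git -C`, `git log -S`, and `git blame <rev> -- <path>` are existing git features, and the sketch assumes the snippet was not re-added after it was removed.

```python
def find_removal_cmd(repo, snippet, path):
    """Build the git command that prints the newest commit in `repo` that
    changed the number of occurrences of `snippet` in `path`.
    Assumption: that newest commit is the one that moved the code out."""
    return ["git", "-C", repo, "log", "-n", "1", "--format=%H",
            "-S" + snippet, "--", path]


def blame_before_cmd(repo, removal_commit, path):
    """Build the git command that blames `path` as of the parent of the
    removing commit, i.e. the last revision that still contained the code."""
    return ["git", "-C", repo, "blame", removal_commit + "^", "--", path]


# Hypothetical usage: the code used to live in a clang checkout.
step1 = find_removal_cmd("clang", "someHelper(", "lib/Basic/Example.cpp")
step2 = blame_before_cmd("clang", "abc123", "lib/Basic/Example.cpp")
print(" ".join(step1))
print(" ".join(step2))
```

Two commands in a second checkout instead of one `git blame` in the repo you're looking at: that's the sense in which I mean "complicating".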

> Since you mention earlier that some tooling can be made, one can argue that it is not more “complicated” for the user since the tooling handles it for you.

It feels like you're twisting my language.  Are you actually arguing that?

>> it's more complicated to track code using `git blame` when it moves between repositories.
>> 
>>>>> +For example, a given version of clang would be
>>>>> +*<LLVM-12345, clang-5432, libcxx-123, etc.>*.
>>>>> +
>>>>> +To make this more convenient, a separate *umbrella* repository would be
>>>>> +provided. This repository would be used for the sole purpose of understanding
>>>>> +the approximate sequence (commits from different repositories pushed around the
>>>>> +same time can appear in different orders) in which commits were pushed to the
>>>> 
>>>> I assume tooling could make this exact via timestamps, if we cared to
>>>> make it exact.  I suggest simplifying to: "...understanding the sequence
>>>> in which commits were pushed…".
>>> 
>>> Unfortunately, tooling can’t without infinite look ahead, because one can push two commits into two repos in a different order than their timestamps. The tooling would see the first push, integrate it, and only then see the other push with an older timestamp.
>>> 
>>> Also, even with an infinite look ahead, you can’t handle:
>>> 
>>> 1) commit API change into LLVM
>>> 2) commit API change into clang
>>> 3) push to LLVM, fail because non-FF. 
>>> 4) pull/rebase LLVM -> timestamp change.
>>> 5) push LLVM
>>> 6) push clang, FF. (with a timestamp that is now older than the LLVM one).
>>> 
>>> The tooling pulls both Clang and LLVM, orders the commits by timestamp, and integrates Clang before LLVM.
>>> 
>>> Consider now the issue with revision numbers. For example when I commit an API change in LLVM and the fix in clang right after, I know at the time I push the revision the fix is in. We can still provide this with the monorepo since everything is in sync. But consider the story with the multi-repo: you need to keep track of a *tuple*. You pushed revisions <llvm-13452 , clang-2432> and the bot may fail with <llvm-13451, clang-2432> because the integration is done the other way.
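
The six-step scenario quoted above can be reproduced with a small simulation.  This is only a sketch: the timestamps are made-up integers, and `Push`, `integrate_by_timestamp`, and `integrate_by_push_order` are hypothetical names, not part of any existing umbrella tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Push:
    repo: str         # repository the commit landed in
    commit_time: int  # timestamp stored in the commit (made-up units)
    push_order: int   # order in which the server received the push


# LLVM is committed at t=10 and clang at t=20.  The LLVM push is rejected
# (non-FF), so LLVM is rebased at t=30, which gives it a newer timestamp.
# LLVM is then pushed first, and clang second with its original timestamp.
pushes = [
    Push(repo="llvm", commit_time=30, push_order=1),
    Push(repo="clang", commit_time=20, push_order=2),
]


def integrate_by_timestamp(pushes):
    """Order that tooling sorting on commit timestamps would record."""
    return [p.repo for p in sorted(pushes, key=lambda p: p.commit_time)]


def integrate_by_push_order(pushes):
    """Order in which the pushes actually reached the server."""
    return [p.repo for p in sorted(pushes, key=lambda p: p.push_order)]


by_time = integrate_by_timestamp(pushes)
by_push = integrate_by_push_order(pushes)
print(by_time)  # ['clang', 'llvm']
print(by_push)  # ['llvm', 'clang']
```

The two orders disagree, which is the inversion being described: sorting on timestamps puts the clang fix ahead of the LLVM change it depends on.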
>> 
>> 1. I'm not convinced this couldn't be solved *somehow* with tooling.
> 
> That sounds quite hand-wavy to me.
> 
> I believe we need server-side support to implement this properly (git push hooks), and unfortunately GitHub does not offer this possibility. With another provider this wouldn’t be an issue of course.
> 
> Did you write this *before* realizing the issue with the timestamp and the issue with the lack of server-side hooks?

Of course it's hand-wavy.  I haven't written the tooling.

Nevertheless, I strongly believe there's a way to make the umbrella repo, in practice, roughly as good for bisecting as the current system.

>>  Maybe it would be hard, even hard enough we wouldn't do it; in that case, call that out in a contentious-issues-with-multirepo section.
>> 2. I doubt this would be a measurable regression vs. the current practice.
> 
> I don’t really understand this sentence. We have a “perfect” ordering today with the SVN numbering.

In practice people frequently break the build with a commit in one repo, and fix it after with a commit to another repo.  The ordering is maintained by the SVN revisions, but in practice that ordering is only useful outside of a 2(?) minute window.

The bisect script needs a does-this-build? check in the current system as much as for multirepo.  I don't see a difference in practice.
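
A does-this-build? check of that kind could look roughly like the sketch below, meant to be driven by `git bisect run`, which treats exit status 125 as "this revision can't be tested, skip it".  The function names are hypothetical, and the build and test commands in the trailing comment are placeholders rather than our actual scripts.

```python
import subprocess
import sys

GOOD = 0    # revision builds and the test passes
BAD = 1     # revision builds but the test fails
SKIP = 125  # `git bisect run` skips revisions that exit with 125


def classify(build_ok, test_ok):
    """Map build/test results to the exit status `git bisect run` expects."""
    if not build_ok:
        return SKIP  # broken build: says nothing about the bug being bisected
    return GOOD if test_ok else BAD


def succeeds(argv):
    """Run a command and report whether it exited with status 0."""
    return subprocess.run(argv).returncode == 0


def bisect_status(build_cmd, test_cmd):
    """Build first; run the test only if the build succeeded."""
    build_ok = succeeds(build_cmd)
    test_ok = build_ok and succeeds(test_cmd)
    return classify(build_ok, test_ok)


def run_as_bisect_script(build_cmd, test_cmd):
    """Entry point for a script passed to `git bisect run`."""
    sys.exit(bisect_status(build_cmd, test_cmd))


# Placeholder invocation (hypothetical commands):
# run_as_bisect_script(["ninja", "-C", "build"], ["./my-regression-test.sh"])
```

The same script works whether the revisions come from SVN numbering or from an umbrella repo; in both cases the skip path covers the window where one repo has been updated and the other hasn't.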

> — 
> Mehdi
> 
> 
>>  If monorepo brings a huge improvement, call that out in the monorepo section.
>> 
>> More on this below.
>> 
>>>>> +**Multirepo Proposal**
>>>>> +
>>>>> +The multirepo works the same as the current Git workflow: every command needs
>>>>> +to be applied to each of the individual repositories. However, in case the
>>>>> +umbrella repository is checked out, `git submodule foreach` allows to replicate
>>>>> +a command on all the individual repositories (or submodules in this case):
>>>> 
>>>> This seems simpler: "The multirepo is similar to the current Git
>>>> workflow.  However, the umbrella repository makes this easy using `git
>>>> submodule foreach`.”
>>> 
>>> The “However…” without spelling out “every command needs to be applied…” sounds curious to me. I’d rather leave this explicit; this seems to read better to me:
>>> 
>>> The multirepo works the same as the current Git workflow: every command needs
>>> to be applied to each of the individual repositories. 
>>> However, the umbrella repository makes this easy using `git submodule foreach`
>>> to replicate a command on all the individual repositories (or submodules
>>> in this case):
>> 
>> I think the multi-repo proposal is stronger if we argue for people to use `git submodule` commands as the default workflow.  In this world, we rely on the `submodule` porcelain; the umbrella repo is the "real" repo and most commands are run on it.  As far as this particular workflow is concerned, the presence of multiple repos under the hood (each receiving the same command magically from the `submodule` porcelain) seems like an implementation detail.
> 
> How do you commit? Change clang and llvm and run `git submodule foreach git commit`: that’ll pop up two commit windows in a row.
> Another example is `git submodule foreach git pull`: now imagine that there are merge conflicts in the various sub-repos. I don’t know how the sequence would go, but I would find it confusing to handle.
> 
> I’d need to practice a bit more with submodules, and see how the umbrella behaves during development (i.e. when the hashes in the subrepos change), to be able to phrase something.

This is another place where you seem to be citing a benefit of the monorepo variant when describing the multirepo variant.  In practice, for most workflows, the umbrella repo is a bonus that simplifies dealing with multiple repos, making multirepo better/easier than the current system.  In some cases, it's worse than the SVN backing that it replaces.  That's the main thing the document should be explaining here: how this variant is different from the current system.