[PATCH] D24167: Moving to GitHub - Unified Proposal

Mehdi Amini via llvm-commits llvm-commits at lists.llvm.org
Sun Oct 9 23:02:45 PDT 2016


> On Oct 6, 2016, at 5:05 PM, Duncan P. N. Exon Smith <dexonsmith at apple.com> wrote:
> 
>> 
>> On 2016-Oct-06, at 15:16, Mehdi Amini <mehdi.amini at apple.com> wrote:
>> 
>> Hi Duncan
>> 
>> Thanks for you great feedback and suggestions!
>> I integrated all the inline comments I haven’t answered below, and updated the revision on Phab.
>> 
>>> On Oct 5, 2016, at 11:26 PM, Duncan P. N. Exon Smith <dexonsmith at apple.com> wrote:
>>> 
>>> I think this proposal is getting close.  The content is great, and the
>>> workflow examples are really useful.
>>> 
>>> The current high-level structure, which interleaves the two main
>>> competing variants, makes it tough to evaluate the variants in
>>> isolation.  The current structure is something like this:
>>> 
>>> - Introduction
>>> - What This Proposal is *Not* About
>>> - Why Git, and Why GitHub?
>>>    - Why Move At All?
>>>    - Why Git?
>>>    - Why GitHub?
>>>    - On Managing Revision Numbers with Git
>>>    - What About Branches and Merges?
>>>    - What About Commit Emails?
>>> - One or Multiple Repositories?
>>>    - How Do We Handle A Single Revision Number Across Multiple
>>>      Repositories?
>>> - Workflow Before/After
>>>    - Checkout/Clone a Single Project, without Commit Access
>>>        - Currently
>>>        - Multirepo Proposal
>>>        - Monorepo Proposal
>>>    - Checkout/Clone a Single Project, with Commit Access
>>>        - Currently
>>>        - Multirepo Proposal
>>>        - Monorepo Proposal
>>>    - Checkout/Clone Multiple Projects, with Commit Access
>>>        - Currently
>>>        - Multirepo Proposal
>>>        - Monorepo Proposal
>>>    - Commit an API Change in LLVM and Update the Sub-projects
>>>        - Currently
>>>        - Multirepo Proposal
>>>        - Monorepo Proposal
>>>    - Branching/Stashing/Updating for Local Development or Experiments
>>>        - Currently
>>>        - Multirepo Proposal
>>>        - Monorepo Proposal
>>>    - Bisecting
>>>        - Currently
>>>        - Multirepo Proposal
>>>        - Monorepo Proposal
>>>    - Living Downstream
>>> - Monorepo Variant
>>> - Previews
>>> - Remaining Issues
>>> - Straw-man Migration Plan
>>> 
>>> IMO, we can paint a clearer picture of the variants by restructuring
>>> like this:
>>> 
>>> - Introduction
>>> - What This Proposal is *Not* About
>>> - Why Git, and Why GitHub?
>>>    - Why Move At All?
>>>    - Why Git?
>>>    - Why GitHub?
>>>    - On Managing Revision Numbers with Git
>>>    - What About Branches and Merges?
>>>    - What About Commit Emails?
>>> - Straw-man Migration Plan
>>> - Variant #1: Multirepo: One repository per subproject
>>>    - Preview
>>>    - Why is this great?
>>>    - Some people are afraid because...
>>>    - Workflow: Checkout/Clone a Single Project, without Commit Access
>>>    - Workflow: Checkout/Clone a Single Project, with Commit Access
>>>    - Workflow: Checkout/Clone Multiple Projects, with Commit Access
>>>    - Workflow: Commit an API Change in LLVM and Update the
>>>      Sub-projects
>>>    - Workflow: Branching/Stashing/Updating for Local Development or
>>>      Experiments
>>>    - Workflow: Bisecting
>>>    - Options for Living Downstream
>>> - Variant #2: Monorepo: One repository, full stop
>>>    - Preview
>>>    - Why is this great?
>>>    - Some people are afraid because...
>>>    - Workflow: Checkout/Clone a Single Project, without Commit Access
>>>    - Workflow: Checkout/Clone a Single Project, with Commit Access
>>>    - Workflow: Checkout/Clone Multiple Projects, with Commit Access
>>>    - Workflow: Commit an API Change in LLVM and Update the
>>>      Sub-projects
>>>    - Workflow: Branching/Stashing/Updating for Local Development or
>>>      Experiments
>>>    - Workflow: Bisecting
>>>    - Options for Living Downstream
>>> - Variant #3: Hybridrepo: One repository per non-revlocked subproject
>>>    - Why is this great?
>>>    - Some people are afraid because...
>>> 
>>> Some differences that I'd like to point out:
>>> 
>>> - Moved the migration plan up before the variants.  There's really
>>>  no order dependency, and this gets the common elements out of the
>>>  way before "the fork".
>>> - Incorporated remaining issues into the migration plan.
>>> - Variants #1 and #2 are described without referring to each other,
>>>  so it's clear exactly how they're different from what we do now.
>> 
>> This is tradeoff, we’re loosing the clear difference between the two, which is a primary goal I had in mind when starting this document. I’ll prepare an alternative layout.
> 
> As we discussed in person, if you add links to jump back-and-forth the multirepo and monorepo versions of each workflow, the reader can exactly see how they differ.  I think that gets the best of both worlds.

As we discussed in person, I believe the side-by-side comparison is important, and I don’t want to lose it.

I quickly tried an alternate layout here: http://htmlpreview.github.io/?https://raw.githubusercontent.com/joker-eph/llvm/githubmove/GitHubMove.html


> 
>> 
>>>  Moreover, each variant makes sense in isolation, independently of
>>>  the other, and can be easily considered in its entirety.
>>> - Added explicit advocacy section, "Why is this great?", to give a
>>>  place to get excited about the variant.
>>> - Added explicit FUD section, "Some people are afraid because...", to
>>>  list concerns that have been raised (and might remain contentious).
>> 
>> I want to keep the document fact-based: FUD makes me think about “rumors” and misleading people. I don’t believe it is what you had mind but I’m worried of slippery slope if we start such a section. I’d like to keep the document as much as possible “fact-based”, and I’ll be careful against adding any mention of "irrational fears".
> 
> The use of "FUD" was casual and misleading on my part.  What I really mean is, "Problems people are worried about, including those that the advocates think can be solved trivially."  Unresolved points of contention.
> 
> Moreover, I'm not convinced a forward-looking proposal can be entirely fact-based; it has to be optimistic somehow about the details.  We're all using unconscious heuristics to skip/gloss over certain things, and their existence and impact are not obvious.
> 
> It's expedient to list the issues-that-advocates-haven't-been-convincing-enough-about, even if the advocates don't understand why their argument is unconvincing.  This gets more people looking at the contentious parts and thinking through the details.  If there are holes in the advocates' arguments we'll be more likely to find them; if not, there will be more voices to convince the naysayers.
> 
>>> - Skipped workflows for Hybridrepo since it's a combination of the
>>>  others.
>> 
>> This point in the list does not seem like a difference with the current one to me? 
>> Either I misunderstand what you’re referring to here, or it just slipped as part of this list without really being a “difference” with the current document but more a property of the document that you want to preserve?
> 
> Just pointing it out.  In the current document, it's fairly hidden.  I think it should be more obvious as a true variant, but I don't think we should go so far as to add workflows.
> 
>>>> +
>>>> +This proposal relates only to moving the hosting of our source-code repository
>>>> +from SVN hosted on our own servers to Git hosted on GitHub. We are not proposing
>>>> +other workflow changes here.  That is, it should not be assumed that moving to
>>> 
>>> Not strictly true.  Both versions of the proposal have workflow changes.
>>> There just aren't review/commit/bug changes.
>> 
>> Well, strictly speaking, the SVN bridge to the monorepo should a zero change in term of workflow.
> 
> The revision numbers will change, invalidating (or making difficult to parse) thousands of comments in PRs and vendor-specific bug reporting tools.  That will require a major workflow change.

Well we have different idea of what’s behind the word “workflow”…

I agree there is a disruption with respect to the history though. We may ought to make it more clear.

> 
> (Also, monorepo-specific benefits should be listed in the monorepo variant section.)
> 
>>> This is hard to articulate precisely.
>>> 
>>> I suggest skipping it, making the two sentences read as follows.  "We
>>> are not proposing using GitHub's issue tracker, pull-requests, or
>>> code-review."
> 
> 
> (I still prefer this wording; haven't looked at your new patch yet, so maybe you did that already?)

(Yes)

> 
>>>> +BitBucket among others) that offer that same service (24/7 stability, disk
>>>> +space, Git server, code browsing, forking facilities, etc) for free.
>>>> +
>>>> +Why Git?
>>>> +--------
>>>> +
>>>> +Many new coders nowadays start with Git, and a lot of people have never used
>>>> +SVN, CVS, or anything else. Websites like GitHub have changed the landscape
>>>> +of open source contributions, reducing the cost of first contribution and
>>>> +fostering collaboration.
>>> 
>>> Is this relevant?  We're not proposing pull requests.
>> 
>> PR is an important part of the GitHub ecosystem, but the online browser and the ability to fork and have visual diff across branches in different fork are other examples of great features.
> 
> Ah, I see.  SGTM.
> 
>>> 
>>>> +
>>>> +What About Branches and Merges?
>>>> +-------------------------------
>>>> +
>>>> +In contrast to SVN, Git makes branching easy. Git's commit history is represented
>>>> +as a DAG, a departure from SVN's linear history.
>>>> +
>>>> +However, we propose to mandate avoiding merge commits in our canonical Git
>>>> +repository.
>>> 
>>> I suggest: "However, this proposal mandates a linear history, making
>>> merge commits illegal in our canonical Git repository.”
>> 
>> 
>>> Should we explain why?
>> 
>> Is “because it is a controversial point” a good enough explanation? ;)
> 
> Yup.  Perhaps: "However, to avoid controversy, the scope of this proposal excludes merge commits and mandates a linear history in our canonical Git repository.”

(Done)

> 
>>>> +
>>>> +As another example, some developers think that the division between e.g. clang
>>>> +and clang-tools-extra is not useful. With the monorepo, we can move code around
>>>> +as we wish and preserve history.
>>>> +With the multirepo, refactoring some functions from clang to make it part of a
>>>> +utility in one of the llvm/lib/Support file to share it across sub-projects
>>> 
>>> "...utility in libSupport to share it across sub-projects..."
>>> 
>>>> +wouldn't carry the history of the code in the llvm repo, breaking recursively
>>>> +applying `git blame` for instance.
>>> 
>>> "...repo, complicating `git blame`.”
>> 
>> Here we can’t blame in the repo where the code is. It is not complicated, but impossible (and why I use “breaking”).
>> Now, you can blame by pulling the original repo, finding where the code was removed, and start the blame from there.
> 
> And you could add tooling to do this, which uses `git blame`.  And, we could have a policy of notating things in commit messages in some way to make the tooling more powerful or efficient.  It's not impossible.

/me cough cough…


> 
> It's also not a regression from the current workflow, so it'll be hard to convince me that it breaks anything.

You can’t recursively apply `git blame` from the repo where you’re looking at the code, that’s what I believe is written, are you disagreeing with this?
The sentence does not compare to the current situation: it does not say it breaks someone’s current workflow, it says exactly what it breaks. I can write “prevent” as well if that makes you feel more comfortable.


> 
>> “complicating” on the other hand is a bit subjective, and also does not totally capture this. Do you have better?
> 
> The word "complicating" may not be precise, but it's accurate and impossible to contradict:

I don’t understand how something can be subjective and “accurate” or “impossible to contradict”. That by itself seems contradictory to me.
Since you mention earlier that some tooling can be made, one can argue that it is not more “complicated” for the user since the tooling handles it for you.


> it's more complicated to track code using `git blame` when it moves between repositories.
> 
>>>> +For example, a given version of clang would be
>>>> +*<LLVM-12345, clang-5432, libcxx-123, etc.>*.
>>>> +
>>>> +To make this more convenient, a separate *umbrella* repository would be
>>>> +provided. This repository would be used for the sole purpose of understanding
>>>> +the approximate sequence (commits from different repositories pushed around the
>>>> +same time can appear in different orders) in which commits were pushed to the
>>> 
>>> I assume tooling could make this exact via timestamps, if we cared to
>>> make it exact.  I suggest simplifying to: "...understanding the sequence
>>> in which commits were pushed…".
>> 
>> Unfortunately, tooling can’t without infinite look ahead, because one can push two commits into two repos in different order than their timestamp. The tooling would see the first push, integrate it, and only then see the other push with an older timestamp.
>> 
>> Also, even with an infinite look ahead, you can’t handle:
>> 
>> 1) commit API change into LLVM
>> 2) commit API change into clang
>> 3) push to LLVM, fail because non-FF. 
>> 4) pull/rebase LLVM -> timestamp change.
>> 5) push LLVM
>> 6) push clang, FF. (with a timestamp that is now older than the LLVM one).
>> 
>> The tooling pull both Clang and LLVM, order the commits per timestamps, and integrate Clang before LLVM.
>> 
>> Consider now the issue with revision numbers. For example when I commit an API change in LLVM and the fix in clang right after, I know at the time I push the revision the fix is in. We can still provide this with the monorepo since everything is in sync. But consider the story with the multi-repo: you need to keep track of a *tuple*. You pushed revisions <llvm-13452 , clang-2432> and the bot may fail with <llvm-13451, clang-2432> because the integration is done the other way.
> 
> 1. I'm not convinced this couldn't be solved *somehow* with tooling.

That sounds quite hand-wavy to me.

I believe we need server-side support to implement this properly (git push hooks), and unfortunately GitHub does not offer this possibility. With another provider this wouldn’t be an issue of course.

Did you write this *before* realizing the issue with the timestamp and the issue with the lack of server-side hooks?


>  Maybe it would be hard, even hard enough we wouldn't do it; in that case, call that out in a contentious-issues-with-multirepo section.
> 2. I doubt this would be a measurable regression vs. the current practice.

I don’t really understand this sentence. We have a “perfect” ordering today with the SVN numbering.

— 
Mehdi


>  If monorepo brings a huge improvement, call that out in the monorepo section.
> 
> More on this below.
> 
>>>> +**Multirepo Proposal**
>>>> +
>>>> +The multirepo works the same as the current Git workflow: every command needs
>>>> +to be applied to each of the individual repositories. However, in case the
>>>> +umbrella repository is checked out, `git submodule foreach` allows to replicate
>>>> +a command on all the individual repositories (or submodules in this case):
>>> 
>>> This seems simpler: "The multirepo is similar to the current Git
>>> workflow.  However, the umbrella repository makes this easy using `git
>>> submodule foreach`.”
>> 
>> The “However…”  without spelling out “every command needs to be applied…” sound curious to me, I rather leave this explicit, that seems to read better to me:
>> 
>> The multirepo works the same as the current Git workflow: every command needs
>> to be applied to each of the individual repositories. 
>> However, the umbrella repository makes this easy using `git submodule foreach`
>> to replicate a command on all the individual repositories (or submodules
>> in this case):
> 
> I think the multi-repo proposal is stronger if we argue for people to use `git submodule` commands as the default workflow.  In this world, we rely on the `submodule` porcelain; the umbrella repo is the "real" repo and most commands are run on it.  As far as this particular workflow is concerned, the presence of multiple repos under the hood (each receiving the same command magically from the `submodule` porcelain) seems like an implementation detail.

How do you commit? Change clang and llvm and run `git submodule foreach git commit`: that’ll popup two commits windows in a row
Another example is: `git submodule foreach git pull` ; now imagine that there are conflict on the merge in the various sub-repo? I don’t know how the sequence would go but I would find it confusing to handle.

I’d need to practice a bit more with the submodules, and see how the umbrella behaves during the development (i.e. the hashed in the subrepo changes) to be able to phrase something.


>>>> +With the `bisect_script.sh` script being::
>>>> +
>>>> +  #!/bin/sh
>>>> +  cd $UMBRELLA_DIRECTORY
>>>> +  git submodule update llvm clang libcxx #....
>>>> +  cd $BUILD_DIR
>>>> +
>>>> +  ninja clang || exit 125   # an exit code of 125 asks "git bisect"
>>>> +                            # to "skip" the current commit
>>>> +
>>>> +  ./bin/clang some_crash_test.cpp
>>>> +
>>>> +When the `git bisect run` command returns, the umbrella repository is set to
>>>> +the state where the regression is introduced, one can inspect the history on
>>>> +every sub-project compared to the previous revision in the umbrella (it is
>>>> +possible that one commit in the umbrella repository includes multiple commits
>>>> +in the sub-projects though it should be occasional in practice).
>>> 
>>> Is the parenthetical note important?
>> 
>> It is especially incorrect at this point: Chris insisted to go with a single-commit umbrella description and I missed this occurence when switching the rest of the document.
>> So now the parenthesis should read "(commits pushed around the same time can appears in undefined order in the umbrella repository)".
> 
> This highlights a wrinkle of switching to Git that I hadn't thought through myself, and maybe it should be discussed above (in the general "switching to Git" sections): the timestamps are client-side, not server-side.  Interesting!  (But probably not a showstopper for anyone.)
> 
> The exact details of the umbrella repository -- perfect push-notifications to force correct ordering vs. infinite timescale based on timestamps vs. aggregated commits -- seem contentious.
> 
> I'm not convinced it's important to nail that down exactly in this proposal.  Moreover, I doubt, whichever tooling we set up, that it will be a true regression vs. our current bisection in practice; and so I don't think it belongs in the discussion of the multirepo bisection workflow.  I argue this is a forum to make clear (a) how multirepo works for bisection, and (b) what's different, if anything, from current practice.
> 
> However, this is worth highlighting elsewhere:
> - as a contentious issue in multirepo: that there are tradeoffs and limitations in the three ways of tooling the umbrella;
> - as a benefit of monorepo: that we have commit merging between sub-projects; and
> - as a note in the monorepo bisection workflow: that the commit merging makes bisection better than current practice.
> 
>> The importance is subjective, but for anyone working on cross-project changes, I believe this is highly visible (cf the example above for instance): spending time tracking why a bot failed when I commit is not a productive way of spending my time.
> 
> I agree that the benefits of monorepo are worth highlighting, but I don't think this is the right place to do it.



-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20161009/0a88ff61/attachment-0001.html>


More information about the llvm-commits mailing list