[llvm-dev] [RFC] One or many git repositories?

Thu Sep 8 11:08:03 PDT 2016

Mehdi Amini via llvm-dev <llvm-dev at lists.llvm.org> writes:

> First, have you read this document: https://reviews.llvm.org/D24167 ?
>
> TLDR: The answer is no: you have to see it as it is today, i.e. a
> single SVN repo containing all the sub-projects, and “exports” in
> individual repositories.

> The same thing after: a single git repo containing all the subprojects
> side-by-side and the *same* “exports” in individual repositories.

Sorry, I sent my earlier reply today before I intended to.

After going back and reading the proposal again, I think I understand
the plan.  I haven't used the SVN repository for years so I was thinking
in terms of git, that you'd take the existing git mirrors and combine
them (via submodule or some other mechanism).  I understand now the
proposal is to take the SVN root and export all of that as one giant git
repository.  Is that correct?

If so, that raises a number of questions for me that aren't directly
addressed in the document as far as I can see:

1. How are the individual component git mirrors going to be maintained?

If a commit goes to the monorepository, what is going to extract the
relevant bits and commit them to the individual mirrors?  The document
notes that with a monorepository a single commit can touch multiple
projects (that's good!) but something has to extract the parts of that
commit that are relevant to each subproject and then send those parts to
the subproject repository.  There are tools to do this and I think
git-subtree is a good candidate [disclosure: I am the git-subtree
maintainer] but I'm just curious what's being considered as a solution.
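
To make the extraction step concrete, here is a toy sketch of the
path-splitting part only.  It is not how git-subtree or any proposed
tool works internally; the subproject names and the example commit are
assumptions for illustration, and a real exporter would also have to
carry over authorship, messages and parent links.

```python
from collections import defaultdict

# Hypothetical top-level directories, one per subproject.
SUBPROJECTS = ("llvm", "clang", "libcxx")


def split_commit(paths):
    """Group the paths touched by one monorepo commit by subproject.

    Each path is rewritten relative to its subproject root, which is
    the form a per-project mirror would expect.  Paths outside any
    known subproject (e.g. a top-level README) are ignored.
    """
    parts = defaultdict(list)
    for path in paths:
        top, _, rest = path.partition("/")
        if top in SUBPROJECTS and rest:
            parts[top].append(rest)
    return dict(parts)


# A made-up commit touching two subprojects plus a top-level file.
commit = [
    "llvm/lib/IR/Type.cpp",
    "clang/lib/Sema/Sema.cpp",
    "clang/test/Sema/x.c",
    "README.md",
]

for project, files in split_commit(commit).items():
    print(project, files)
```

Each key in the result would correspond to one commit on one mirror, so
a single cross-project monorepo commit fans out into several mirror
commits.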

2. Is there any consideration for restructuring the directory layout?

The document has this to say about checking out multiple components:

> **Monorepo Proposal**
> 
> The repository contains natively the source for every sub-projects at the right
> revision, which makes this straightforward::
> 
>   git clone https://github.com/llvm/llvm-projects.git llvm
>   cd llvm
>   git checkout $REVISION
> 
> As before, at this point clang, llvm, and libcxx are stored in directories
> alongside each other.

The problem here is that for the build, clang wants to be in llvm/tools
and other components want to be in other places.  Should the
monorepository just be structured to have everything in its correct
place for building?  My inclination is to say "no" because it reduces
the visibility of the subprojects, but what are the alternatives?  There
are two that come to mind off the top of my head, 1) include symlinks in
the repository or 2) change the build so all components can live at the
top level.
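
As a rough illustration of alternative 1, the sketch below builds a
throwaway directory tree and adds a relative symlink so that a
top-level clang directory also shows up under llvm/tools.  The helper
name and the temporary layout are my own assumptions, not part of the
proposal, and this says nothing about how well symlinks behave on
every platform.

```python
import os
import tempfile


def link_into_build_layout(root, links):
    """Create relative symlinks inside `root`.

    `links` maps the symlink location to its target; both are given
    relative to `root`.  Relative targets keep the tree relocatable.
    """
    for link_rel, target_rel in links.items():
        link_path = os.path.join(root, link_rel)
        link_dir = os.path.dirname(link_path)
        os.makedirs(link_dir, exist_ok=True)
        target = os.path.relpath(os.path.join(root, target_rel), link_dir)
        os.symlink(target, link_path)


# Throwaway stand-in for a monorepo checkout with side-by-side projects.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "llvm", "tools"))
os.makedirs(os.path.join(root, "clang"))

# Make the top-level clang directory visible where the build looks for it.
link_into_build_layout(root, {os.path.join("llvm", "tools", "clang"): "clang"})

print(os.readlink(os.path.join(root, "llvm", "tools", "clang")))
```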

I think it's important to think about these kinds of questions because
once a repository layout has been settled on, it's hard to change.  Yes,
it is relatively easy to move entire directories to new places in git,
but that would not only require changes to whatever entity updates the
subproject repositories, it's potentially a huge social issue, and
social issues are typically the most difficult problems to address.  :)

3. How are the subproject repositories going to be created/migrated?

The individual subproject repositories will have to be created from
scratch after the monorepository is created, right?  We can't just
transition the existing git mirrors to the new setup, correct?  A
subproject repository reboot would involve some not insignificant pain
for downstream users because their git histories are suddenly invalid.
They would have to fetch a completely different repository and integrate
it into whatever they have.

If there is some way to maintain the existing git mirrors and layer new
monorepository commits on top of the existing history that would be
fantastic.  I believe it is technically possible (I might need to add
some enhancements to git-subtree :)) but I don't know if anyone has
explored this.  I would love to be told you all have the answers
already.  :)

Bisecting

For the multirepository proposal, the document talks about having the
git-bisect run script update each submodule during bisection.  I suppose
that will work but the bisection would only report that the failure
exists at a particular commit in the umbrella repository, implying a
bunch of different commits, one for each subproject.  It wouldn't really
point to a particular subproject as being the culprit, correct?  The
document even hints at this: "it is possible that one commit in the
umbrella repository includes multiple commits in the sub-projects"

That's what I was getting at with my submodule bisect question.  It can
only bisect to a granularity of "one of these subprojects at their
respective commits caused the problem."  With a true monorepository,
bisect can drill down to the exact commit within a subproject or across
multiple subprojects if the commit touched multiple subprojects.  To me
this is a giant advantage of a non-submodule-based monorepository, which
I think is what the monorepository proposal is.
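
The granularity difference can be shown with a toy model.  This is
only a binary search over made-up histories, not git-bisect itself;
the commit names and the way umbrella commits bundle subproject
commits are assumptions for illustration.

```python
def first_bad(history, is_bad):
    """Return the first bad entry in `history`.

    Assumes the history is good up to some point and bad from then
    on, and that at least one entry is bad.
    """
    lo, hi = 0, len(history) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(history[mid]):
            hi = mid
        else:
            lo = mid + 1
    return history[lo]


# Monorepo model: one linear history, one entry per commit.
mono = ["llvm:a", "clang:b", "llvm:c", "clang:d", "libcxx:e", "llvm:f"]
culprit = "clang:d"
bad_from = mono.index(culprit)


def commit_is_bad(commit):
    # The tree is broken at the culprit and at everything after it.
    return mono.index(commit) >= bad_from


# Umbrella model: each entry bumps several submodules at once.
umbrella = [
    ("llvm:a", "clang:b"),
    ("llvm:c", "clang:d", "libcxx:e"),
    ("llvm:f",),
]


def bundle_is_bad(bundle):
    # A bundle is bad once it includes the culprit or anything later.
    return any(commit_is_bad(c) for c in bundle)


print(first_bad(mono, commit_is_bad))
print(first_bad(umbrella, bundle_is_bad))
```

In this model the monorepo search lands on the single culprit commit,
while the umbrella search can only name the bundle that contains it,
leaving a second round of bisection inside that bundle.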

If everything I've written here is generally correct, I think the
monorepository will work for us, as long as each subproject repository
is maintained at a granularity of one subproject commit per commit to
the corresponding directory in the monorepository (i.e. full history is
maintained).

Thanks for your work on this.  This kind of work is crucially important
but often unrecognized and underappreciated.

                                 -David