<div dir="ltr">I don't think we should do any of that. It's too complicated -- and I don't see the reason to even do it.<div><br></div><div>There's a need for the "llvm-project" repository -- that's been discussed plenty -- but where does the need for a separate "id" that must be pushed into all of the sub-projects come from? This is the first I've heard of that as a thing that needs to be done.</div><div><br></div><div>There was a previous discussion about putting a sequential ID in the "llvm-project" repo commit messages (although, even that I'd say is unnecessary), but not anywhere else.</div><div><br></div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Jun 30, 2016 at 7:42 AM, Renato Golin via llvm-dev <span dir="ltr"><<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Now that we seem to be converging to an acceptable Git model, there<br>

was only one remaining doubt, and that's how the trigger to update a<br>

sequential ID will work. I've been in contact with GitHub folks, and<br>

this is in line with their suggestions...<br>

<br>

Given the nature of our project's repository structure, triggers in<br>

each repository can't just update their own sequential ID (like<br>

Gerrit) because we want a sequence in order for the whole project, not<br>

just each component. But it's clear to me that we have to do something<br>

similar to Gerrit, as this has been proven to work on a larger<br>

infrastructure.<br>

<br>

Adding an incremental "Change-ID" to the commit message should<br>

suffice, in the same way we have for SVN revisions now, if we can<br>

guarantee that:<br>

<br>

 1. The ID will be unique across *all* projects<br>

 2. Earlier pushes will get lower IDs than later ones<br>

<br>

Other things are not important:<br>

<br>

 3. We don't need the ID space to be complete (i.e., we can jump from<br>

123 to 125 if some error happens)<br>

 4. We don't need an ID for every "commit", but for every push. A<br>

multi-commit push is a single feature, and doing so will help<br>

buildbots build the whole set as one change. Reverts should also be<br>

done in one go.<br>

<br>

What's left for the near future:<br>

<br>

 5. We don't yet handle multi-repository patch-sets. A way to<br>

implement this is via manual Change-ID manipulation (explained below).<br>

Not hard, but not a priority.<br>

<br>

<br>

  Design decisions<br>

<br>

This could be a pre/post-commit trigger on each repository that<br>

receives an ID from somewhere (TBD) and updates the commit message.<br>

When the umbrella project synchronises, it'll already have the<br>

sequential number in. In this case, the umbrella project is not<br>

necessary for anything other than bisect, buildbots and releases.<br>

<br>

I personally believe that having the trigger in the umbrella project<br>

will be harder to implement and more error prone.<br>

<br>

The server has to have some kind of locking mechanism. Web services<br>

normally spawn dozens of "listeners", meaning multiple pushes won't<br>

fail to get a response, since the lock will be further down, after the<br>

web server.<br>

<br>

Therefore, the lock for the unique increment ID has to be elsewhere.<br>

The easiest thing I can think of is a SQL database with auto-increment<br>

ID. Example:<br>

<br>

Initially:<br>

sql> create table LLVM_ID ( id int not null primary key<br>

auto_increment, repository varchar(255) not null, hash varchar(255) not null );<br>

sql> alter table LLVM_ID auto_increment = 300000;<br>

<br>

On every request:<br>

sql> insert into LLVM_ID (repository, hash) values ("$repo_name", "$hash");<br>

sql> select last_insert_id(); -> return<br>

<br>

and then print the "last insert id" back to the user in the body of<br>

the page, so the hook can update the Change-id on the commit message.<br>

The repo/hash info is more for logging, debugging and conflict<br>

resolution purposes.<br>
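
<br>

As a rough illustration of the allocation step described above, here is a minimal Python sketch.<br>
It assumes SQLite (via the standard sqlite3 module) as a stand-in for the SQL server; the function<br>
name allocate_id and the lower-case table name are hypothetical, not part of the proposal.<br>
<pre>
```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the real SQL server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE llvm_id ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " repository TEXT NOT NULL,"
    " hash TEXT NOT NULL)"
)
# SQLite has no "ALTER TABLE ... AUTO_INCREMENT"; seeding its bookkeeping
# table sqlite_sequence makes the next generated ID 300000.
conn.execute("INSERT INTO sqlite_sequence (name, seq) VALUES ('llvm_id', 299999)")
conn.commit()


def allocate_id(repo_name, commit_hash):
    """Record one push and return its project-wide sequential ID (hypothetical helper)."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO llvm_id (repository, hash) VALUES (?, ?)",
            (repo_name, commit_hash),
        )
        return cur.lastrowid


first = allocate_id("llvm", "abc123")
second = allocate_id("clang", "def456")
print(first, second)
```
</pre>
In this sketch the database hands out the number, so the locking stays inside the SQL engine rather than in the script.<br>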

<br>

We also must limit the web server to only accept connections from<br>

GitHub's servers, to avoid abuse. Other repos in GitHub could still<br>

abuse, and we can go further if it becomes a problem, but given point<br>

(3) above, we may fix that only if it does happen.<br>

<br>

This solution doesn't scale to multiple servers, nor does it help BPC<br>

planning. Given the size of our needs, that is not relevant.<br>

<br>

<br>

  Problems<br>

<br>

If the server goes down, given point (3), we may not be able to<br>

reproduce locally the same sequence as the server would. Meaning<br>

SVN-based bisects and releases would not be possible during down<br>

times. But Git bisect and everything else would.<br>

<br>

Furthermore, even if a local script can't reproduce exactly what the<br>

server would do, it still can make it linear for bisect purposes,<br>

fixing the local problem. I can't see a situation in which we need the<br>

sequence for any other purpose.<br>

<br>

Upstream and downstream releases can easily wait a day or two in the<br>

unlucky situation that the server goes down at the exact time the<br>

release will be branched.<br>

<br>

Migrations and backups also work well, and if we use some cloud<br>

server, we can easily take snapshots every week or so, migrate images<br>

across the world, etc. We don't need duplication, read-only scaling,<br>

multi-master, etc., since only the web service will be writing/reading<br>

from it.<br>

<br>

All in all, a "robust enough" solution for our needs.<br>

<br>

<br>

  Bundle commits<br>

<br>

Just FYI, here's a proposal that appeared in the "commit message<br>

format" round of emails a few months ago, and that can work well for<br>

bundling commits together, but will need more complicated SQL<br>

handling.<br>

<br>

The current proposal is to have one ID per push. This is easy by using<br>

auto_increment. But if we want to have one ID per multiple pushes, on<br>

different repositories, we'll need to have the same ID on two or more<br>

"repo/hash" pairs.<br>

<br>

On the commit level, the developer adds a temporary hash, possibly<br>

generated by a local script in 'utils'. Example:<br>

<br>

  Commit-ID: 68bd83f69b0609942a0c7dc409fd3428<br>

<br>

This ID will have to be the same on both (say) LLVM and Clang commits.<br>
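
<br>

A local script for producing such a temporary hash could look roughly like the Python sketch below.<br>
The function name new_temporary_commit_id is hypothetical, and using a random UUID is an assumption:<br>
the proposal does not say how the hash is generated, only that it has this shape.<br>
<pre>
```python
import uuid


def new_temporary_commit_id():
    """Return a 'Commit-ID:' line carrying 32 random hex characters (hypothetical helper)."""
    # uuid4().hex is a random 128-bit value rendered as 32 lowercase hex digits.
    return "Commit-ID: " + uuid.uuid4().hex


line = new_temporary_commit_id()
print(line)
```
</pre>
The developer would paste the same line into the commit messages of every repository in the bundle.<br>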

<br>

The script will then take that hash, generate an ID, and then if it<br>

receives two or more pushes with such hashes, it'll return the *same*<br>

ID, say 123456, in which case the Git hooks on all projects will<br>

update the commit message by replacing the original Commit-ID with:<br>

<br>

  Commit-ID: 123456<br>

<br>

To avoid hash clashes in the future, the server script can refuse<br>

existing hashes that are a few hours old and return an error, in which<br>

case the developer generates a new hash, updates all commit messages<br>

and re-pushes.<br>

<br>

If there is no Commit-ID, or if it's empty, we just insert a new row<br>

with an empty key, get the auto increment ID and return. Meaning, empty Commit-IDs<br>

won't "match" any other.<br>
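
<br>

The hook-side rewrite described above might be sketched in Python as follows. The name set_commit_id<br>
is hypothetical, and appending a new line when no Commit-ID is present is an assumption of this<br>
sketch, not something the proposal spells out.<br>
<pre>
```python
import re

# Matches a whole "Commit-ID:" line, whether it holds a temporary hash or is empty.
COMMIT_ID_RE = re.compile(r"^Commit-ID:.*$", re.MULTILINE)


def set_commit_id(message, sequential_id):
    """Replace (or, by assumption, append) the Commit-ID line with the server's ID."""
    trailer = "Commit-ID: %d" % sequential_id
    if COMMIT_ID_RE.search(message):
        return COMMIT_ID_RE.sub(trailer, message, count=1)
    return message.rstrip("\n") + "\n\n" + trailer + "\n"


before = "Fix the frobnicator\n\nCommit-ID: 68bd83f69b0609942a0c7dc409fd3428\n"
after = set_commit_id(before, 123456)
print(after)
```
</pre>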

<br>

To solve this on the server side, a few ways are possible:<br>

<br>

A. We stop using primary_key auto_increment, handle the increment in<br>

the script and use SQL transactions.<br>

<br>

This would be feasible, but more complex and error prone. I suggest we<br>

go down that route only if keeping the repo/hash information is really<br>

important.<br>

<br>

B. We ditch keeping record of repo/hash and just re-use the ID, but<br>

record the original string, so we can match later.<br>

<br>

This keeps it simple and will work for our purposes, but we'll lose<br>

the ability to debug problems if they happen in the future.<br>

<br>

C. We improve the SQL design to have two tables:<br>

<br>

LLVM_ID:<br>

   * ID: int PK auto<br>

   * Key: varchar null<br>

<br>

LLVM_PUSH:<br>

   * LLVM_ID: int FK (LLVM_ID:ID)<br>

   * Repo: varchar not null<br>

   * Push: varchar not null<br>

<br>

Every new push updates both tables, returns the ID. Pushes with the<br>

same Key re-use the ID and update only LLVM_PUSH, returns the same ID.<br>
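
<br>

To make option C more concrete, here is a minimal Python sketch of the two-table design, again<br>
assuming SQLite as a stand-in for the SQL server. The function record_push and the column name<br>
push_key (standing for "Key" above) are hypothetical. A real server would also need the lookup<br>
and the insert to be safe against concurrent requests, which this single-connection sketch ignores.<br>
<pre>
```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the real SQL server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE llvm_id (
        id       INTEGER PRIMARY KEY AUTOINCREMENT,
        push_key TEXT NULL
    );
    CREATE TABLE llvm_push (
        llvm_id INTEGER NOT NULL REFERENCES llvm_id (id),
        repo    TEXT NOT NULL,
        push    TEXT NOT NULL
    );
""")


def record_push(repo, push, push_key=None):
    """Return the ID for this push, re-using the ID of an earlier push with the same key."""
    with conn:  # one transaction per request
        row = None
        if push_key:  # empty or missing keys never match anything
            row = conn.execute(
                "SELECT id FROM llvm_id WHERE push_key = ?", (push_key,)
            ).fetchone()
        if row is None:
            cur = conn.execute(
                "INSERT INTO llvm_id (push_key) VALUES (?)", (push_key or None,)
            )
            llvm_id = cur.lastrowid
        else:
            llvm_id = row[0]
        # Every push is logged, whether or not it received a fresh ID.
        conn.execute(
            "INSERT INTO llvm_push (llvm_id, repo, push) VALUES (?, ?, ?)",
            (llvm_id, repo, push),
        )
        return llvm_id


a = record_push("llvm", "hash-1", "68bd83f69b0609942a0c7dc409fd3428")
b = record_push("clang", "hash-2", "68bd83f69b0609942a0c7dc409fd3428")
c = record_push("lld", "hash-3")
d = record_push("lldb", "hash-4")
print(a, b, c, d)
```
</pre>
Here the two pushes that share a key get the same ID, while the two pushes without a key each get their own.<br>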

<br>

This is slightly more complicated, and we will need to write scripts to gather<br>

information (for logging, debug), but it gives us both benefits<br>

(debug+auto_increment) in one package. As a start, I'd recommend we<br>

take this route even before the script supports it. But it could be<br>

simple enough that we add support for it right from the beginning.<br>

<br>

I vote for option C.<br>

<br>

<br>

  Deployment<br>

<br>

I recommend we code this, set up a server, and let it run for a while<br>

on our current mirrors *before* we do the move. A simple plan is to:<br>

<br>

* Develop the server, hooks and set it running without updating the<br>

commit message.<br>

* We follow the logs, make sure everything is sane<br>

* Change the hook to start updating the commit message<br>

* We follow the commit messages, move some buildbots to track GitHub<br>

(SVN still master)<br>

* When all bots are live tracking GitHub and all developers have moved, we flip.<br>

<br>

Sounds good?<br>

<br>

cheers,<br>

--renato<br>


</blockquote></div><br></div>