[cfe-dev] Sequential ID Git hook

Thu Jun 30 04:42:56 PDT 2016

Now that we seem to be converging on an acceptable Git model, there
is only one remaining doubt, and that's how the trigger to update a
sequential ID will work. I've been in contact with GitHub folks, and
this is in line with their suggestions...

Given the nature of our project's repository structure, triggers in
each repository can't just update their own sequential ID (like
Gerrit) because we want a sequence in order for the whole project, not
just each component. But it's clear to me that we have to do something
similar to Gerrit, as this has been proven to work on a larger
infrastructure.

Adding an incremental "Change-ID" to the commit message should
suffice, in the same way we have for SVN revisions now, if we can
guarantee that:

 1. The ID will be unique across *all* projects
 2. Earlier pushes will get lower IDs than later ones

Other things are not important:

 3. We don't need the ID space to be complete (i.e., we can jump from
123 to 125 if some error happens)
 4. We don't need an ID for every "commit", but for every push. A
multi-commit push is a single feature, and doing so will help
buildbots build the whole set as one change. Reverts should also be
done in one go.

What's left for the near future:

 5. We don't yet handle multi-repository patch-sets. A way to
implement this is via manual Change-ID manipulation (explained below).
Not hard, but not a priority.

  Design decisions

This could be a pre/post-commit trigger on each repository that
receives an ID from somewhere (TBD) and updates the commit message.
When the umbrella project synchronises, it'll already have the
sequential number in. In this case, the umbrella project is not
necessary for anything other than bisect, buildbots and releases.

I personally believe that having the trigger in the umbrella project
will be harder to implement and more error prone.

The server has to have some kind of locking mechanism. Web services
normally spawn dozens of "listeners", meaning multiple pushes won't
fail to get a response, since the lock will be further down, after the
web server.

Therefore, the lock for the unique increment ID has to be elsewhere.
The easiest thing I can think of is a SQL database with auto-increment
ID. Example:

Initially:
sql> create table LLVM_ID ( id int not null primary key
auto_increment, repository varchar(255) not null, hash varchar(255) not null );
sql> alter table LLVM_ID auto_increment = 300000;

On every request:
sql> insert into LLVM_ID (repository, hash) values ("$repo_name", "$hash");
sql> select last_insert_id(); -> return

and then print the "last insert id" back to the user in the body of
the page, so the hook can update the Change-ID in the commit message.
The repo/hash info is more for logging, debugging and conflict
resolution purposes.
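
To make the request flow above concrete, here is a minimal sketch in
Python. It uses the standard-library sqlite3 module purely as a
stand-in for whatever SQL server we end up with. The names open_db and
allocate_id, and the SQLite-specific way of seeding the counter, are
my assumptions for illustration, not part of the proposal.

```python
import sqlite3

def open_db(path=":memory:", first_id=300000):
    """Create the LLVM_ID table and seed the counter (SQLite-specific)."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS LLVM_ID ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " repository TEXT NOT NULL,"
        " hash TEXT NOT NULL)"
    )
    # SQLite counterpart of "alter table ... auto_increment = N":
    # the next generated id is seq + 1.
    seeded = db.execute(
        "SELECT 1 FROM sqlite_sequence WHERE name = 'LLVM_ID'"
    ).fetchone()
    if seeded is None:
        db.execute(
            "INSERT INTO sqlite_sequence (name, seq) VALUES ('LLVM_ID', ?)",
            (first_id - 1,),
        )
    db.commit()
    return db

def allocate_id(db, repo_name, commit_hash):
    """Record one push and return its sequential ID."""
    with db:  # one transaction per request; the database does the locking
        cur = db.execute(
            "INSERT INTO LLVM_ID (repository, hash) VALUES (?, ?)",
            (repo_name, commit_hash),
        )
        return cur.lastrowid
```

The web service would call allocate_id once per push and print the
result in the response body. The script holds no lock of its own: the
ordering comes entirely from the database's auto-increment.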

We also must limit the web server to only accept connections from
GitHub's servers, to avoid abuse. Other repos in GitHub could still
abuse, and we can go further if it becomes a problem, but given point
(3) above, we may fix that only if it does happen.
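
The connection filter could be as simple as an address allow-list
checked before touching the database. A rough sketch in Python
follows. The network ranges below are reserved documentation blocks
that I picked as placeholders, NOT GitHub's real addresses, and
is_allowed is a hypothetical helper name.

```python
import ipaddress

# Placeholder ranges (reserved documentation blocks), NOT GitHub's real ones.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]

def is_allowed(remote_addr, networks=ALLOWED_NETWORKS):
    """Return True if the caller's address falls inside an allowed network."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        return False  # malformed address: reject
    return any(addr in net for net in networks)
```

In practice this would more likely live in the web server or firewall
configuration than in the script itself.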

This solution doesn't scale to multiple servers, nor does it help BPC
planning. Given the size of our needs, that's not relevant.

  Problems

If the server goes down, given point (3), we may not be able to
reproduce locally the same sequence as the server would. Meaning
SVN-based bisects and releases would not be possible during
downtime. But Git bisect and everything else would.

Furthermore, even if a local script can't reproduce exactly what the
server would do, it still can make it linear for bisect purposes,
fixing the local problem. I can't see a situation in which we need the
sequence for any other purpose.

Upstream and downstream releases can easily wait a day or two in the
unlucky situation that the server goes down at the exact time the
release is to be branched.

Migrations and backups also work well, and if we use some cloud
server, we can easily take snapshots every week or so, migrate images
across the world, etc. We don't need duplication, read-only scaling,
multi-master, etc., since only the web service will be writing/reading
from it.

All in all, a "robust enough" solution for our needs.

  Bundle commits

Just FYI, here's a proposal that appeared in the "commit message
format" round of emails a few months ago, and that can work well for
bundling commits together, but will need more complicated SQL
handling.

The current proposal is to have one ID per push. This is easy by using
auto_increment. But if we want to have one ID per multiple pushes, on
different repositories, we'll need to have the same ID on two or more
"repo/hash" pairs.

On the commit level, the developer adds a temporary hash, possibly
generated by a local script in 'utils'. Example:

  Commit-ID: 68bd83f69b0609942a0c7dc409fd3428

This ID will have to be the same on both (say) LLVM and Clang commits.
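
The local script could be tiny. The sketch below assumes the temporary
hash only needs to be random and 32 hex characters long, like the
example above. How it should actually be derived was not settled in
that thread, and new_temporary_commit_id is a name I made up.

```python
import secrets

def new_temporary_commit_id():
    """Return a 'Commit-ID:' line carrying 32 random hex characters."""
    return "Commit-ID: " + secrets.token_hex(16)
```

The developer would paste the same generated line into every commit
message that belongs to the bundle.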

The script will then take that hash, generate an ID, and then if it
receives two or more pushes with such hashes, it'll return the *same*
ID, say 123456, in which case the Git hooks on all projects will
update the commit message by replacing the original Commit-ID with:

  Commit-ID: 123456

To avoid hash clashes in the future, the server script can refuse
existing hashes that are a few hours old and return an error, in which
case the developer generates a new hash, updates all commit messages
and re-pushes.

If there is no Commit-ID, or if it's empty, we just insert a new empty
line, get the auto increment ID and return. Meaning, empty Commit-IDs
won't "match" any other.
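
On the hook side, the message handling described above might look
like the following Python sketch. The helper names temporary_key and
stamp_commit_id are hypothetical, and I'm assuming the Commit-ID sits
on a line of its own, starting at column zero.

```python
import re

# Matches a whole "Commit-ID:" line, with or without a value.
_COMMIT_ID_LINE = re.compile(r"^Commit-ID:[ \t]*(\S*)[ \t]*$", re.MULTILINE)

def temporary_key(message):
    """Return the developer-supplied hash, or None if absent or empty."""
    match = _COMMIT_ID_LINE.search(message)
    if match and match.group(1):
        return match.group(1)
    return None

def stamp_commit_id(message, new_id):
    """Return the message with its Commit-ID set to the server-issued ID."""
    line = "Commit-ID: %d" % new_id
    if _COMMIT_ID_LINE.search(message):
        # Replace the temporary (or empty) value in place.
        return _COMMIT_ID_LINE.sub(line, message, count=1)
    # No Commit-ID line at all: add an empty line, then the ID.
    return message.rstrip("\n") + "\n\n" + line + "\n"
```

The hook would send temporary_key(message) to the server along with
the repo and hash, then call stamp_commit_id with whatever number
comes back.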

To solve this on the server side, a few ways are possible:

A. We stop using primary_key auto_increment, handle the increment in
the script and use SQL transactions.

This would be feasible, but more complex and error prone. I suggest we
go down that route only if keeping the repo/hash information is really
important.

B. We ditch keeping record of repo/hash and just re-use the ID, but
record the original string, so we can match later.

This keeps it simple and will work for our purposes, but we'll lose
the ability to debug problems if they happen in the future.

C. We improve the SQL design to have two tables:

LLVM_ID:
   * ID: int PK auto
   * Key: varchar null

LLVM_PUSH:
   * LLVM_ID: int FK (LLVM_ID:ID)
   * Repo: varchar not null
   * Push: varchar not null

Every new push updates both tables and returns the ID. Pushes with the
same Key re-use the ID, update only LLVM_PUSH, and return the same ID.

This is slightly more complicated and will need scripts to gather
information (for logging and debugging), but it gives us both benefits
(debug+auto_increment) in one package. As a start, I'd recommend we
take this route even before the script supports it. But it could be
simple enough that we add support for it right from the beginning.

I vote for option C.
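
For what it's worth, here is a rough Python sketch of option C, again
with sqlite3 standing in for the real database. The Created column,
the MAX_KEY_AGE value, the StaleKey exception and the record_push name
are my additions, there to show how refusing old hashes could fit in.
None of them are part of the table design above.

```python
import sqlite3
import time

MAX_KEY_AGE = 6 * 3600  # seconds; "a few hours", the exact value is assumed

class StaleKey(Exception):
    """The key was first seen too long ago; the developer must regenerate it."""

def open_db(path=":memory:"):
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS LLVM_ID (
            ID INTEGER PRIMARY KEY AUTOINCREMENT,
            "Key" TEXT NULL,
            Created REAL NOT NULL  -- assumed extra column, used to expire keys
        );
        CREATE TABLE IF NOT EXISTS LLVM_PUSH (
            LLVM_ID INTEGER NOT NULL REFERENCES LLVM_ID(ID),
            Repo TEXT NOT NULL,
            Push TEXT NOT NULL
        );
    """)
    return db

def record_push(db, repo, push, key=None, now=None):
    """Log one push and return its ID, re-using the ID of a matching key."""
    now = time.time() if now is None else now
    with db:  # one transaction per request
        row = None
        if key:  # empty or missing keys never match anything
            row = db.execute(
                'SELECT ID, Created FROM LLVM_ID WHERE "Key" = ?', (key,)
            ).fetchone()
        if row is None:
            cur = db.execute(
                'INSERT INTO LLVM_ID ("Key", Created) VALUES (?, ?)',
                (key or None, now),
            )
            change_id = cur.lastrowid
        else:
            change_id, created = row
            if now - created > MAX_KEY_AGE:
                raise StaleKey(key)
        db.execute(
            "INSERT INTO LLVM_PUSH (LLVM_ID, Repo, Push) VALUES (?, ?, ?)",
            (change_id, repo, push),
        )
        return change_id
```

Pushes without a key always take the insert path, so the simple
one-ID-per-push behaviour stays the default and the bundling support
can be switched on later without changing the schema.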

  Deployment

I recommend we code this, set up a server, and let it run for a while
on our current mirrors *before* we do the move. A simple plan is to:

* Develop the server and hooks, and set them running without updating
the commit message.
* Follow the logs, make sure everything is sane.
* Change the hook to start updating the commit message.
* Follow the commit messages, move some buildbots to track GitHub
(SVN still master).
* When all bots are live tracking GitHub and all developers have
moved, we flip.

Sounds good?

cheers,
--renato