[llvm-dev] Program Repository

Tue Nov 20 06:50:06 PST 2018

Hi all,

TL;DR: I’ve previously talked about the idea of a “program repository” which replaces object files with a database and then uses its capabilities to improve turn-around times and add new workflow possibilities. The project has now reached the stage where it’s starting to work and is producing some decent compile-time speed-ups, so we’d like to hear the thoughts of the wider community.

I gave a talk at the 2016 US LLVM Developers’ Meeting titled “A repository for statically compiled programs” (https://youtu.be/-pL94rqyQ6c) which was about reducing turnaround times (the time between a developer starting a build and having that code running). It did this by introducing a database as the intermediate format used for communication between the tools and then taking advantage of it to enable coarsely incremental compilation, faster links, and reduced debug-info load times.

My sense was that the reception to this idea was “cautiously positive”, so we’ve been working on fleshing out a real implementation. This is now beginning to bear fruit with a somewhat working toolchain.

Project background
==================

It has become clear that the object files used as the post-compiler intermediate format represent a significant performance bottleneck. There are a number of reasons for this:

- Large, modern C++ programs in particular create immense amounts of duplicate information. This can be duplicated strings, objects with “vague” linkage, debug metadata and so on. In some cases more than 99% of this data is simply bloat! The compiler spends time generating it, and downstream tools must then either spend time discarding the duplicates or process them unnecessarily. Either choice hurts the performance of these tools. This could be eliminated if the compiler were able to discover (and potentially reuse) data that it emitted for earlier compilations.

- Compilation parallelizes nicely both locally and through distribution to remote machines. The link stage, on the other hand, tends to act as a “join” in the build that depends on those compilations having completed. It’s therefore beneficial to move work upstream to the compiler whenever possible to make the linker’s work as straightforward as possible. One way in which the program repository does this is by having the compiler perform string de-duplication (which can consume as much as 30% of the link time).

- Debugging metadata in large C++ programs can extend to many gigabytes. Completely eliminating copying or processing of this data by the linker should result in significant link-time reductions.

- The metadata overhead imposed by file formats such as ELF (for additional sections and the like) can be considerable. This has a knock-on effect on build times.
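
To make the string de-duplication point a little more concrete, here is a minimal Python sketch. It is not code from the project: StringStore and compile_unit are hypothetical names, and an in-memory table stands in for the database. Each pretend compilation interns its strings in the shared table, so what it emits is a list of small ids and there are no duplicate strings left for a later link step to merge.

```python
class StringStore:
    """Hypothetical stand-in for a string table shared by all compilations."""

    def __init__(self):
        self._index = {}    # string -> id
        self._strings = []  # id -> string

    def intern(self, s):
        """Return the id for s, storing the string only on first sight."""
        ident = self._index.get(s)
        if ident is None:
            ident = len(self._strings)
            self._index[s] = ident
            self._strings.append(s)
        return ident

    def __len__(self):
        return len(self._strings)


def compile_unit(store, names):
    """Pretend compilation: emit ids rather than private copies of strings."""
    return [store.intern(n) for n in names]


store = StringStore()
unit_a = compile_unit(store, ["main", "std::vector<int>::size", "printf"])
unit_b = compile_unit(store, ["helper", "std::vector<int>::size", "printf"])

# Six strings were referenced, but only the distinct ones were stored.
print(len(store), unit_a, unit_b)
```

The second unit pays only for the one string the store has not seen before; the two shared names resolve to ids that already exist.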

Components
==========
The work currently lives in a set of three public GitHub repositories. They are:

- llvm-prepo (https://github.com/SNSystems/llvm-prepo): The modified LLVM. We’ve tried to keep the changes as isolated as possible, so they can be thought of in 4 major pieces:

   1. A new pass which generates hash digests for all of the IR objects.
   2. A pass which eliminates further processing for objects whose definitions are found in the PR. It’s intended that these passes run very early in the LLVM pipeline so as to minimize the unnecessary work.
   3. A new MC object file format (“repo”) with associated classes in the assembler.
   4. A “repo2obj” utility which generates an object file (currently just ELF) from a repository “ticket”. We’re currently using this to generate inputs for the existing static linker, which is clumsy but useful for testing. (I’d like to move this out of the main llvm-prepo repository into a “prepo-extras” repository similar to “clang-extras”.)

- clang-prepo (https://github.com/SNSystems/clang-prepo): There are some small tweaks to clang to teach it about the new object-file format and to disable generation of aliases (which haven’t been implemented yet).

- pstore (https://github.com/SNSystems/pstore): At the moment, this is the storage engine for the project and there’s a hard dependency on it. We envisage providing an abstraction layer to decouple the compiler and allow the database to be replaced. Perhaps there’ll be a minimal implementation for running the tests. (It seems inevitable that no database will be ideal for every possible use-case, and proposing to add one to LLVM doesn’t seem like the right thing to do.)
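
As a rough illustration of how the hashing and pruning passes in llvm-prepo fit together, here is a minimal Python sketch under some loud assumptions. A plain dict stands in for the repository, a string of text stands in for an IR object, SHA-256 is picked only for the example, and the "ticket" is modelled as a simple name-to-digest mapping. The names digest, compile_module and fake_codegen are hypothetical and do not correspond to anything in the real code.

```python
import hashlib


def digest(ir_text):
    """Hash one definition (SHA-256 is an assumption made for this sketch)."""
    return hashlib.sha256(ir_text.encode("utf-8")).hexdigest()


def compile_module(repo, definitions, codegen):
    """Compile a module against the repository.

    repo:        dict mapping digest -> generated code (stand-in database)
    definitions: dict mapping symbol name -> IR text
    codegen:     callable doing the expensive back-end work
    Returns a ticket-like dict mapping symbol name -> digest.
    """
    ticket = {}
    for name, ir_text in definitions.items():
        key = digest(ir_text)
        if key not in repo:
            # Only definitions the repository has never seen are compiled.
            repo[key] = codegen(ir_text)
        ticket[name] = key
    return ticket


codegen_calls = []


def fake_codegen(ir_text):
    """Hypothetical back end that just records how often it is invoked."""
    codegen_calls.append(ir_text)
    return "obj:" + ir_text


repo = {}
first = compile_module(repo, {"f": "ret i32 1", "g": "ret i32 2"}, fake_codegen)
second = compile_module(repo, {"f": "ret i32 1", "h": "ret i32 3"}, fake_codegen)

# "f" was identical in both modules, so it was generated only once.
print(len(codegen_calls), first["f"] == second["f"])
```

The sketch ignores everything that makes this hard in practice (for example, a definition's digest would also need to reflect the things it depends on), but it shows why running the check early saves work: anything already in the repository skips the rest of the pipeline.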

Note that, pstore apart, I consider the implementation to be a prototype and that a formal RFC/review process may result in significant change. Also, it’s important to note that this new functionality is in addition to all of the existing behavior: it’s opt-in and nothing is removed!

Performance
===========
We’ve been gradually increasing the size and complexity of the code being pushed through the system (starting small and working up). It’s still early and no attempt has been made to profile or optimise the code. 

The largest project tried so far is pstore itself (~37k SLOC) which passes its unit and system tests when compiled with the llvm-prepo compiler at both -O0 and -O3. Our measurements show that the compile-time is reduced by ~40% (https://github.com/SNSystems/llvm-prepo/wiki/Performance). This code is very small by comparison with our real target, where correspondingly greater redundancy may show greater improvement.

Summary
=======
There’s a lot of work still to do before we have a usable workflow but I’m hopeful that there’s enough from which you can build an impression of the concept. There’s some documentation, background, and data to support some of the claims I’ve made here on the llvm-prepo wiki (https://github.com/SNSystems/llvm-prepo/wiki). Please share any thoughts or questions you have…

Thanks.
Paul