[llvm-dev] [RFC] Placing profile name data, and coverage data, outside of object files

Fri Jun 30 17:54:19 PDT 2017

Problem
-------

Instrumentation for PGO and frontend-based coverage places a large amount of
data in object files, even though the majority of this data is not needed at
run-time. All the data is needlessly duplicated while generating archives, and
again while linking. PGO name data is written out into raw profiles by
instrumented programs, slowing down the training and code coverage workflows.

Here are some numbers from a coverage + RA (Release+Asserts) build of ToT clang:

  * Size of the build directory: 4.3 GB

  * Wall time needed to run "clang -help" with an SSD: 0.5 seconds

  * Size of the clang binary: 725.24 MB

  * Space wasted on duplicate name/coverage data (*.o + *.a): 923.49 MB
    - Size contributed by __llvm_covmap sections: 1.02 GB
      \_ Just within clang: 340.48 MB

    - Size contributed by __llvm_prf_names sections: 327.46 MB
      \_ Just within clang: 106.76 MB

    => Space wasted within the clang binary: 447.24 MB

Running an instrumented clang binary triggers a 143 MB raw profile write, which
is slow even with an SSD. This problem is particularly bad for frontend-based
coverage because it generates a lot of extra name data; however, the situation
can also be improved for PGO instrumentation.
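
As a quick sanity check on the size figures listed above, the waste inside the
clang binary is just the sum of the two "just within clang" entries. The
snippet below redoes that arithmetic and also derives the share of the binary
it represents; the variable names are only for illustration.

```python
from decimal import Decimal

covmap_in_clang = Decimal("340.48")  # MB, __llvm_covmap within clang
names_in_clang = Decimal("106.76")   # MB, __llvm_prf_names within clang
clang_size = Decimal("725.24")       # MB, size of the clang binary

# Space wasted within the clang binary.
wasted = covmap_in_clang + names_in_clang
# Fraction of the binary taken up by name/coverage data.
share = wasted / clang_size

print(wasted)                 # 447.24
print(round(share * 100, 1))  # 61.7 (percent of the binary)
```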

Proposal
--------

Place PGO name data and coverage data outside of object files. This would
eliminate data duplication in *.a/*.o files, shrink binaries, shrink raw
profiles, and speed up instrumented programs.

In more detail:

1. The frontends get a new `-fprofile-metadata-dir=<path>` option. This lets
users specify where LLVM will store profile metadata. If the metadata starts to
take up too much space, there's just one directory to clean.

2. The frontends continue emitting PGO name data and coverage data in the same
llvm::Module. So does LLVM's IR-based PGO implementation. No change here.

3. If the InstrProf lowering pass sees that a metadata directory is available,
it constructs a new module, copies the name/coverage data into it, hashes the
module, and attempts to write that module to:

  <metadata-dir>/<module-hash>.bc   (the metadata module)

If this write operation fails, it scraps the new module: it keeps all the
metadata in the original module, and there are no changes from the current
process. I.e., this proposal preserves backwards compatibility.
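
The write-or-fall-back behavior in step 3 can be sketched roughly as follows.
This is a minimal illustration in Python, not the actual InstrProf lowering
code: the function name `write_metadata_module`, the choice of SHA-256, the
write-then-rename step, and treating the metadata module as an opaque byte
string are all assumptions made for the example.

```python
import hashlib
import os
import tempfile


def write_metadata_module(metadata_dir, module_bytes):
    """Try to write module_bytes to <metadata_dir>/<hash>.bc.

    Returns the path on success, or None if the write fails, in which
    case the caller keeps the metadata in the original module.
    """
    # Hypothetical choice of hash; the proposal does not name one.
    digest = hashlib.sha256(module_bytes).hexdigest()
    path = os.path.join(metadata_dir, digest + ".bc")
    try:
        # Write to a temporary file first, then rename it into place, so
        # a reader never sees a partially written module.
        fd, tmp = tempfile.mkstemp(dir=metadata_dir, suffix=".tmp")
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(module_bytes)
            os.replace(tmp, path)
        except OSError:
            os.unlink(tmp)
            raise
    except OSError:
        # Any failure (missing directory, no space, no permission) means
        # "fall back to the current behavior".
        return None
    return path
```

Because the file name is derived from the contents, writing the same metadata
twice yields the same path, and a rebuilt object with different metadata gets a
different path.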

4. Once the metadata module is written, the name/coverage data are entirely
stripped out of the original module. They are replaced by a path to the
metadata module:

  @__llvm_profiling_metadata = "<metadata-dir>/<module-hash>.bc",
                               section "__llvm_prf_link"

This allows incremental builds to work properly, which is an important use case
for code coverage users. When an object is rebuilt, it gets a fresh link to a
fresh profiling metadata file. Although stale files can accumulate in the
metadata directory, the stale files cannot ever be used.

In an IDE like Xcode, since there's just one target binary per scheme, it's
possible to clean the metadata directory by removing the modules which aren't
referenced by the target binary.
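
A cleanup pass of this kind could look something like the sketch below. It
assumes the links have already been extracted from the target binary's
__llvm_prf_link section (that part is not shown) and that they all point into
the one metadata directory; `clean_metadata_dir` is a hypothetical helper, not
an existing tool.

```python
import os


def clean_metadata_dir(metadata_dir, referenced_paths):
    """Delete *.bc files in metadata_dir that no link refers to.

    referenced_paths: links read from the target binary (assumed input).
    Returns the sorted list of file names that were removed.
    """
    # Compare by file name; the links are assumed to point into
    # metadata_dir, so the <module-hash>.bc name identifies the file.
    keep = {os.path.basename(p) for p in referenced_paths}
    removed = []
    for name in sorted(os.listdir(metadata_dir)):
        # Only touch metadata modules; leave any other files alone.
        if name.endswith(".bc") and name not in keep:
            os.remove(os.path.join(metadata_dir, name))
            removed.append(name)
    return removed
```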

5. The raw profile format is updated so that links to metadata files are written
out in each profile. This makes it possible for all existing llvm-profdata and
llvm-cov commands to keep working seamlessly.

The indexed profile format will *not* be updated: i.e., it will still contain a
full symbol table, and no links. This simplifies the coverage mapping reader,
because a full symbol table is guaranteed to exist before any function records
are parsed. It also reduces the amount of coding, and makes it easier to
preserve backwards compatibility :).

6. The raw profile reader will learn how to read links, open up the metadata
modules it finds links to, and collect name data from those modules.

7. The coverage reader will learn how to read the __llvm_prf_link section, open
up metadata modules, and lazily read coverage mapping data.
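
To make steps 6 and 7 a bit more concrete, here is a rough sketch of a reader
that follows links on demand. It is illustrative only: `LazyMetadataReader` and
its methods are made-up names, and decoding the bitcode is delegated to a
caller-supplied `parse` function, since that part is out of scope here.

```python
class LazyMetadataReader:
    """Resolve metadata links on demand and cache the results.

    parse: hypothetical caller-supplied function that maps the bytes of
    a metadata module to a list of records (names, coverage mappings).
    """

    def __init__(self, links, parse):
        # The same link could show up more than once; keep each link
        # once, preserving first-seen order.
        self.links = list(dict.fromkeys(links))
        self._parse = parse
        self._cache = {}

    def load(self, link):
        # Open a metadata module only the first time it is requested.
        if link not in self._cache:
            with open(link, "rb") as f:
                self._cache[link] = self._parse(f.read())
        return self._cache[link]

    def all_records(self):
        # Gather records from every linked module, e.g. to build the
        # full symbol table for an indexed profile.
        records = []
        for link in self.links:
            records.extend(self.load(link))
        return records
```

A coverage reader would call `load` only for the modules it needs, while a raw
profile reader would call `all_records` to collect every name up front.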

Alternate Solutions
-------------------

1. Instead of copying name data into an external metadata module, just copy the
coverage mapping data.

I've actually prototyped this. This might be a good way to split up patches,
although I don't see why we wouldn't want to tackle the name data problem
eventually.

2. Instead of emitting links to external metadata modules, modify llvm-cov and
llvm-profdata so that they require a path to the metadata directory.

The issue with this is that it's way too easy to read stale metadata. It's also
less user-friendly, which hurts adoption.

3. Use something other than llvm bitcode for the metadata module format.

Since we're mostly writing large binary blobs (compressed name data or
pre-encoded source range mapping info), using bitcode shouldn't be too slow, and
we're not likely to get better compression with a different format.

Bitcode is also convenient, and is nice for backwards compatibility.

--------------------------------------------------------------------------------

If you've made it this far, thanks for taking a look! I'd appreciate any
feedback.

vedant