[LLVMdev] RFC: Binary format for instrumentation based profiling data

Wed Mar 12 18:09:44 PDT 2014

Instrumentation-based profiling data is generated by instrumented
binaries through library functions in compiler-rt, and read by the clang
frontend to feed PGO.

The current data format is a naive textual format.

The files consist of a number of counters for each function in a
program.

Use cases
=========

There are a few use cases that drive our requirements:

A. Looking up the counters for a particular function needs to be
  efficient, so as to not slow down compilation too much when using the
  data for PGO.  This is also presumably how this data would be
  accessed if using it for coverage.

B. The file should be relatively quick to generate, and shouldn't
  require a large amount of memory overhead to write out.  Instrumented
  programs will write this file out on termination or on demand.

C. Iterating over all of the data must be reasonably simple, to support
  tools that summarize or merge data files.

D. The format should be relatively compact.  Programs with a large
  number of functions may be instrumented, and we don't want to create
  unnecessarily large files because of it.

E. The format should be portable between systems.  Ie, if Alice
  generates profile data for some program P using Machine A, Bob should
  be able to use that same profile data to instrument a copy of that
  same program on machine B.

Partially updating the file is not a goal.  Instrumented programs write
out a profile for a single run, and multiple runs can be merged by a
separate tool.

Semantics
=========

Semantically, we have a collection of functions, each with a number of
counters associated with them.

Functions are represented by strings, determined by the part of the
frontend that both generates and uses this data.  In our case, these are
generally whatever clang thinks of as the function's name, with minor
details added to disambiguate names that aren't necessarily unique.

The counters are a list of unsigned 64 bit values.  The frontend decides
what constructs these belong to, but the first counter is for the
function itself, and always exists.  In order to validate that the
mapping is correct, we need a hash for the frontend to compare against.

Initially, we’re using the number of counters as the hash, but that
won't be sufficient as time goes on.

Format
======

The file should be a binary format in order to satisfy use case (D).  As
such, we need to choose an endianness in order to satisfy (E).  A fixed
endianness is simpler than writing in native-endian and byte-swapping if
necessary on read, so we arbitrarily use little-endian.

The file consists of three parts:

1. Header
2. Index
3. Counter data

Header
------

The header contains metadata for parsing the file, and summarizing data
that's expensive to recompute.  Specifically:

- A magic number to identify the file format.  The character string
 "LPRF" has a nice ring to it.

- A version number, to accomodate future changes to the file format.

- An offset to the end of the index.  This could alternatively be a size
 of the index, to which the header size would need to be added to get
 the offset, but that's slightly less convenient.

- The maximum value of any function counter.  This is easy to calculate
 when writing the file, but would require reading the entire file to
 calculate later.

Index
-----

An index is necessary in order to get fast look up for particular
functions (use case A), and so that we can avoid paging in the entire
data file when we're only interested in a small subset of the data.  The
index is used as an associative array from the function name to the data
offset where the function's counter data can be found.

The first implementation uses the naive approach of an unordered list
of names and offsets.

 Each element will consist of:

   <string length> <character data> <data offset>

This obviously has an obvious drawback in the "startup" cost of reading
these into an associative data structure, but it's relatively compact
and is an reasonable baseline to evaluate better solutions.

Future directions may include:

1. An on-disk hash table.  Questions like whether to use separate
   chaining or open addressing would need to be addressed.  This
   solution takes at least as much space as the naive approach, but
   avoids the startup cost.

1a. A format like LLVM's dwarf accelerator tables.  This is optimized
    for a different case than we're interested in (failed lookups rather
    than successful), so it's probably not appropriate.

2. An on-disk trie.  This should save a fair amount of space due to the
   fact that C language names and C++ mangling tend to lead to a large
   number of names with common prefixes. This format also allows
   iterating over the functions in alphabetical order.

Counter data
------------

The counter data is simply an array of unsigned 64 bit values.  Given an
offset found in the index, a sequence follows:

   <function hash> <number of counters> <counters...>

This is all of the data needed for a given function.

A potential future direction would be to represent counters with a
variable length encoding.  Many counters in a program are likely to be 0
or 1 for a particular run.  Using 8B for each is expensive.  However,
there’s a natural tradeoff against use case B, as the size of the
file is harder to predict when writing it out.