[LLVMdev] Proposal: add support for profile instrumentation based code coverage

Mon Jun 16 17:08:34 PDT 2014

I've been investigating better support for code coverage in Clang/LLVM.
I've spent some time experimenting with several prototypes and have arrived
at this proposal. I'd like to get some feedback on the overall approach
before I start submitting patches.

Background:
-----------
Right now LLVM supports code coverage testing using llvm-cov, which is
currently intended to work like gcov and reuses the same data files.
However, the recent addition of instrumentation based profiling to clang
(the -fprofile-instr-generate command line option) enabled us to do a much
more accurate code coverage for the languages supported by clang. The
counters are emitted in the frontend, and therefore we can produce an
accurate mapping from the source ranges extracted out of the AST to the
corresponding profiling counters generated by the frontend.

Proposal:
---------
I would like to add a language independent coverage mapping data format to
LLVM. This mapping data would be generated by a frontend such as clang, and
embedded into LLVM IR in the form of a binary blob which will get a
separate object file section. Then, a tool like llvm-cov would extract this
data from the object file, load the gathered profiling counter values
obtained after a doing an instrumented run, and display accurate code
coverage information for the desired portion of the program's source code.

Here are my main design goals:

* The format should strive to be as language independent as possible so
that it could be reused by frontends other than clang.

* The format should enable the frontend to provide accurate source ranges
for the source constructs with different execution counts.

* The format should enable the frontend to minimize the size of the
coverage mapping data (crucial for clang), but it shouldn't restrict it to
just size minimization - very accurate (thus somewhat redundant) per
statement or per expression coverage should be possible for the frontends
that desire it.

* The format should enable accurate code coverage for C's macros and
includes embedded inside the source code of a function.

* The format should have enough data so that the tools that will work with
it will be able to show accurate locations for the different execution
counts, highlight the various source ranges based on their execution
counts, and show expansions of macros and embedded includes with accurate
coverage for the source code inside the macro/embedded include.

In addition to the mapping format, I would like to add support for the
generation of the coverage mapping data to clang and support for the
analysis of this data to llvm-cov.

Summary of the Coverage Mapping Format:
---------------------------------------
Each function instrumented by a frontend like clang will emit mapping
information that the coverage tool can use to map between the source code's
regions and counter values, which are obtained during an instrumented run.
The mapping information for an instrumented function contains the following
arrays:

* Counter Expression Array
* Mapping Region Array
* Optional Removed Region Array
* File Name Array + Optional Virtual File Mapping Array.
* Optional Expansion Region Array

The mapping information is composed of the following data structures:

* Counter:

  A counter is an abstract value that on its own, or as a part of a counter
expression is able to produce an integer value which represents the number
of times that a particular region of code has been executed.

  There are three types of counters: reference to the actual counter value
from the instrumented run, reference to a counter expression, and zero.

  A reference to the counter value stores an integer index into the array
with the gathered counter values.

  A reference to the expression stores an integer index into the expression
array.

  Because the coverage tool should be able to resolve counter value
references to the actual counter values from the instrumented run, we also
need to know the function’s name, as it acts as a key when looking up the
values in a file. In clang, it is possible to reuse the function name that
is already stored by the profile instrumentation (i.e.
__llvm_profile_name_???).

* Counter Expression:

  An expression is a special kind of counter that represents an arithmetic
operation on two counters (expressions as well, as counters can be
references to expressions).

  There are two types of expressions: integer addition and integer
subtraction.

  Subtraction expressions represent constructs like the else part of
clang's if construct (i.e. Execution count of the else region = Execution
count of the if's parent region - Execution count of the if's then region).

  Addition expressions represent constructs that fall through to other
constructs, like multiple case statements in a switch that follow each
other, or don't have a break statement.

  Expressions are distinct, so the frontend won't emit redundant
expressions that duplicate each other. The coverage mapping library will
provide a builder class that will fold the same expressions into one.

* Mapping Region:

  The primary goal of a mapping region is to associate a source range with
a counter. The source range includes a file id, the starting and ending
locations (line and column), and the counter.

  A mapping region that is unreachable can be represented by a counter zero.

  The mapping regions are stored in a sorted order (by file id and starting
location) to achieve more compact binary representation (discussed later).

* Removed Region

  A removed region represents a region of code that doesn't have an
execution count. It is needed so that the coverage tool can show that it's
not executed. It includes a file id, and the starting and ending locations.
An example of a removed region is a block of code that is removed by C's
preprocessor in an #if or similar preprocessor constructs. For more details
on why we need removed regions, please see the Clang Specific Details
section.

* File Name + Virtual File Mapping

  A mapping region is useless to a coverage tool unless we know what file
it is in. As some frontends like clang can have mapping regions that come
from different files but are in the same function, we need to store a file
id with each mapping/removed/expansion region.

  The file id is an index into the virtual file mapping array. Thus the
file id is a actually a virtual file id, as it might not correspond to a
proper source code file, because the file ids can also represent a
particular expansion of a C macro or an embedded include. The virtual file
mapping array contains indices to the file name array that stores unique
absolute file names of the program's source files.

  The file name array contains only the file names that have actual source
code for this function. The file name and the virtual file mapping arrays
are stored for each function.

  In order to achieve more compact representation the file name array might
be implemented as a per TU/LLVM module array instead of a per function
array, or it might somehow store references to the file names in the debug
information instead of the strings containing the actual file names.

  The virtual file mapping array is optional, because when it represents an
identity mapping from the file id's to the file names, it is not stored in
the mapping data.

* Expansion Region

 An expansion region represents an expansion of a virtual file (i.e. the
use of a macro or an embedded textual include). It includes an expanded
file id, a file id, and the starting and ending locations. It's primarily
intended to be used for C's macros and embedded includes, but can be used
by the languages which have similar features (e.g. Fortran with its INCLUDE
statement). The expanded file id can used by the code coverage tool to
gather and display the mapping regions which are produced for this
particular expansion of a macro or an embedded include.

Compact Binary Representation:
------------------------------
The size of the coverage mapping data is one of our biggest concerns. For
that reason I propose the following compacting measures:

* All integer values are stored as unsigned integers using DWARF's LEB128
encoding. This is a variable length encoding format, which is very space
efficient for small numbers (1 byte for integers less than 128).

* The counter is stored as one unsigned integer. The 2 lowest bits are
taken for the counter tag, which distinguishes between counter references,
addition expression references, subtraction expression references, and
zero. The other bits are used for the counter index or the expression index.

* The counter expressions are stored as just two counters.

* Because the mapping region array is sorted by file id and by starting
location, we can store the mapping region array as a sequence of subarrays
without any file id information at all (file id is inferred from the index
of the subarray in the sequence).

* The source locations for the subarrays of mapping regions are stored in
the following way:
  Starting line is stored as an unsigned integer representing the
difference between the current starting line and the starting line of the
previous mapping region.
  Ending line is stored as an unsigned integer representing the difference
between the current ending line and the current starting line.
  Starting and ending columns are stored as unsigned integers.
  The mapping regions are sorted, thus most of the starting line deltas
will be hopefully small enough (<=127) to fit in one byte.

* The virtual file mapping isn't stored if the mapping represents an
identity mapping from the file id's to the filenames.

Clang Specific Details:
-----------------------
Clang doesn't emit and therefore doesn't instrument some functions - in
particular the unused static functions and the unused class methods defined
inside class definition or defined outside of the class definition with the
inline attribute. For those function, we can emit a special coverage
mapping information which will contain a mapping region containing the full
source range of the function and the counter zero. It might be also
possible to use the same approach for the template functions without any
specializations.

Clang will try to minimize the size of the coverage data as much as
possible. For this reason, instead of producing a mapping region for each
statement/expression, clang will produce nested regions which correspond to
the starting and ending locations of source constructs like the body of a
function or the body of a loop. For this reasons, the coverage tool will
show an execution counter even for empty/comment lines contained inside
those regions. This is also the reason why clang needs the removed regions,
because without them, the nested regions containing the removed blocks of
code will tell the coverage tool that this code was actually executed.

Summary of the LLVM Coverage Tool:
----------------------------------
At the moment llvm-cov is intended to work like gcov. With the new approach
to code coverage, a command line flag or an instruction (the first command
line argument) will have to be added to distinguish between the code
coverage with the new approach and the old one. Alternatively, a new tool
can be created.

The coverage tool will require to know the name of the object file and the
name of the file with the counter values in order to show code coverage
details for a portion of the program's source code. Those two names will
typically be passed as command line arguments.

The coverage tool will have several ways to determine for which portion of
the program's source code should code coverage details be displayed. The
user of the tool can supply the source code filenames so that it will show
the code coverage statistics for those files. If no filenames are supplied,
the tool will show code
coverage details for all the known source code files in that program.
Alternatively, a user can supply a name of a function or a regular
expression that will display code coverage for the matching function(s)
only.

The tool will display code coverage details using several approaches which
can be combined as desired:

* By default, a per-line approach will be used, where each line that is
covered by a mapping region will display the number of times that this line
was executed. In the cases where there are multiple mapping regions on one
line, the execution count of the last one will be displayed.

* Alternatively, instead of showing per-line execution counts, each line
that has a start of a mapping region will show the number of times that
this region was executed underneath this line. This will allow us to show
execution counts for multiple mapping regions on the same line.

* The regions of code that have the execution count of zero will be
highlighted with a red background colour.

* The expansion regions can be expanded to show the coverage details for
the corresponding source code inside macros/embedded includes.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140616/66f06dba/attachment.html>