[cfe-dev] RFC: Clang Automatic Bug Reporting

Mon Jul 19 08:44:45 PDT 2010

Hi all,

It would be great if Clang provided better support for users who want to file
bugs. We frequently have to go through multiple iterations to get all the data
we need, and we also sometimes run into cases where it is very hard / impossible
to reproduce a problem locally. The latter problems tend to show up frequently
with precompiled headers or supplemental files like header maps, where they
depend very much on the build system and the layout of source on the user's
system -- such cases can be a real pain to analyze right now.

The following proposal is for new feature work to let Clang automatically
generate bug reports.

Goals
=====

Frontend / Single-File Focused:

  The goal of this work is to support generating bug reports for parse failures,
  crashes, and trivial miscompiles. It is not designed to support generating
  test cases where a large application is miscompiled, for example. Generally,
  it is designed to support the case where the user runs a single Clang command
  on their system, it doesn't work (crashes, produces obviously invalid output,
  etc.), and they want a Clang developer to be able to reproduce the problem.

Easy-to-use:

  We want people to use it, so it has to be simple and it has to work almost all
  the time.

Near-perfect Bug Reproduction:

  We want it to be almost guaranteed that the generated bug report reproduces
  the problem. It isn't possible to be perfect, but we would like to get very
  close.

Report Non-Compiler API Bugs:

  Currently, bugs in the compiler are usually easy to reproduce for users who
  know how to generate preprocessed or LLVM IR files. However, bugs in
  other areas of Clang like the libclang interfaces are much harder to
  reproduce. Any solution should address (or help address) this problem.

Support auto-minimizing / anonymizing test cases:

  This won't happen soon, but I would like any solution to support this in some
  reasonable fashion. This is primarily a nice to have, but it is also important
  because it makes it more likely users will actually bother to submit a test
  case in situations where they are worried about disclosing their source code.

User Interface
==============

The Clang driver will get two new options:

 '--create-test-case PATH'

   This will cause the driver to create a self-contained test case at PATH,
   which contains as much information as possible for reproducing all the
   actions the compiler takes.

 '--replay-test-case PATH'

   This will cause the driver to replay the test case as best as possible. The
   driver will still support additional command line options, so the usual use
   model would be to run '--replay-test-case' to verify the problem reproduces,
   then either fix the problem directly or use additional command line options
   (-E, -###, -emit-llvm, etc.) to isolate / minimize the problem.

Implementation
==============

Conceptually, what we want to capture in the test case is as much of the user's
environment as is required to reproduce the problem. The environment consists of
a lot of things which might change the compiler's behavior: the OS, the hardware,
the file system, the environment variables, the command line options, the
behavior of external programs, etc. We obviously cannot package up all of these
things, but Clang is portable and always a cross compiler, and most bugs can be
reproduced on different hardware or a different OS (with the right options).

The implementation is to try to capture each piece of the environment as best
we can:

 - For the OS and hardware, we will just record the OS and CPU information, and
   when replaying the test case we will use that information instead of the host
   information. This will require a few additional hooks, but should be
   straightforward.

 - Command line arguments and environment variables can just be saved to the
   test case and restored on replay.

 - For external programs the driver calls like 'as' and 'ld', all we can expect
   to do in general is store the version information for the program, so that
   developers can at least try to replicate the host environment if necessary
   (and if the failure actually depends on the particular version of one of
   those tools, which it usually doesn't).

 - The file system is the main piece we cannot currently deal with. Usually we
   have users give us preprocessed files to avoid depending on the user's file
   system, but this does not always suffice to reproduce problems.

   My plan here is to rework parts of Clang to add support for a "virtual file
   system" which would live under the FileManager API layer.  When the driver is
   generating test cases, it would use this interface to keep track of all the
   directories and files that are accessed as part of the compilation, and it
   would serialize all this information (the file metadata and contents) into
   the bug report. When the driver is replaying a test case, it would construct
   a new virtual file system (like a private chroot, essentially) from the bug
   report. This is the main implementation work, described below.
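
As a rough illustration of the pieces listed above, a test case could be modeled
as a simple record. This is only a sketch under stated assumptions: TestCase,
replayArgs, and effectiveTriple are hypothetical names that do not exist in
Clang, and standard C++ containers stand in for whatever the driver would
actually use.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical container for what a test case records. The field names are
// illustrative only and are not part of any existing Clang interface.
struct TestCase {
  std::string TargetTriple;                        // recorded OS and CPU
  std::vector<std::string> Args;                   // original command line
  std::map<std::string, std::string> Env;          // environment variables
  std::map<std::string, std::string> ToolVersions; // e.g. 'as', 'ld'
  std::map<std::string, std::string> Files;        // path -> file contents
};

// On replay, the recorded arguments come first and any extra options the
// developer passes (-E, -###, -emit-llvm, ...) are appended after them.
std::vector<std::string> replayArgs(const TestCase &TC,
                                    const std::vector<std::string> &Extra) {
  std::vector<std::string> Result = TC.Args;
  Result.insert(Result.end(), Extra.begin(), Extra.end());
  return Result;
}

// On replay, the recorded target is used instead of the host's. The host
// value is only a fallback for a test case that recorded nothing.
std::string effectiveTriple(const TestCase &TC, const std::string &HostTriple) {
  return TC.TargetTriple.empty() ? HostTriple : TC.TargetTriple;
}
```

Appending the extra options after the recorded ones is one possible choice; it
assumes the driver lets later options refine earlier ones, which would need to
be checked option by option.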

Virtual File System
===================

My plan is to add a new LLVM interface which abstracts high-level access to the
file system. This interface will live at the llvm/Support level, and is designed
to have a thin API -- it won't be a full VFS layer, but rather it will support
the higher level LLVM file operations, i.e. getting a MemoryBuffer or
raw_ostream. We will also need additional interfaces to support things like
stat() or quickly testing file existence. The Support library will provide a
default implementation of the VFS interface which uses the normal file system. I
don't have a sketch of the API yet, but I'm confident we can achieve something
clean.
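
To make the idea concrete, here is a minimal sketch of what such a thin
interface might look like, together with a recording wrapper (for creating a
test case) and a replay implementation (for reproducing one). Everything here
is an assumption for illustration: the names FileSystem, RealFileSystem,
RecordingFileSystem, ReplayFileSystem, and Snapshot are hypothetical, and
std::string stands in for a MemoryBuffer so the example does not depend on
LLVM.

```cpp
#include <fstream>
#include <map>
#include <optional>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical thin interface; the names are illustrative, not an LLVM API.
class FileSystem {
public:
  virtual ~FileSystem() = default;
  // Returns the file contents, or std::nullopt if the file cannot be read.
  virtual std::optional<std::string> readFile(const std::string &Path) = 0;
};

// Default implementation: forwards to the host file system.
class RealFileSystem : public FileSystem {
public:
  std::optional<std::string> readFile(const std::string &Path) override {
    std::ifstream In(Path, std::ios::binary);
    if (!In)
      return std::nullopt;
    std::ostringstream Contents;
    Contents << In.rdbuf();
    return Contents.str();
  }
};

// Result of every lookup made during a compilation. Failed lookups are kept
// as well (an assumption of this sketch), so that replay also reports them as
// missing.
using Snapshot = std::map<std::string, std::optional<std::string>>;

// Used when creating a test case: forwards to another file system and
// remembers the result of each access.
class RecordingFileSystem : public FileSystem {
  FileSystem &Base;
  Snapshot Seen;

public:
  explicit RecordingFileSystem(FileSystem &Underlying) : Base(Underlying) {}

  std::optional<std::string> readFile(const std::string &Path) override {
    std::optional<std::string> Result = Base.readFile(Path);
    Seen[Path] = Result;
    return Result;
  }

  const Snapshot &snapshot() const { return Seen; }
};

// Used when replaying: answers only from the snapshot, never from the host.
class ReplayFileSystem : public FileSystem {
  Snapshot Files;

public:
  explicit ReplayFileSystem(Snapshot Recorded) : Files(std::move(Recorded)) {}

  std::optional<std::string> readFile(const std::string &Path) override {
    auto It = Files.find(Path);
    if (It == Files.end())
      return std::nullopt;
    return It->second;
  }
};
```

A real interface would also need the stat()-like and output-stream operations
mentioned above; they are left out here to keep the sketch short.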

Once the llvm/Support level VFS interface is in place, the Clang
CompilerInstance object will get a VFS object. I will then refactor all the
Clang IO access to go through this object. The main piece here is changing the
FileManager to live on top of the VFS object, but all the other places the
driver & frontend access files will need to move as well.

There is also some possibility that once this work is done we can simplify some
existing interfaces, for example the current file remapping APIs or PTH's stat
cache.

Discussion
==========

The VFS-based approach may seem over-the-top, but there are a couple of reasons
I like this approach as opposed to others:

 - The only real alternative is to try to make the driver smart enough to
   rewrite and repackage various local paths when making the test cases
   (preprocessed inputs are not a viable alternative), then use the remapping
   APIs to mimic the user's environment. This would be very hard to implement
   correctly, and would be brittle in the face of changes to the frontend.

 - I think this approach is fairly simple to implement. We will need to spend a
   fair amount of time getting the VFS interface right, so that it is performant
   and clean, but otherwise each implementation step should be straightforward.

 - This approach should be very robust at reproducing bugs. Although memory
   layout and other very low-level details will change, the intent is that
   everything in the compiler above the VFS layer should behave exactly as it
   would on the user's system, assuming an accurate bug report.

 - The downside of this approach is that bug reports will by default include a
   substantial amount of information. I am ok with this tradeoff, because my
   number one priority is to be able to reproduce the bug. I eventually hope to
   solve this problem by having a tool which, once it has a reproducible bug
   report, will try to weed out the non-essential parts (for example, trying to
   switch to a preprocessed input).

Future Work
===========

Once the basic Clang driver features are in place, we should be able to use the
same infrastructure to generate bug reports from other API entry points (like
libclang). Most of these will involve just packaging up the basic bug report as
the driver would, and adding whatever extra metadata is needed to identify the
API call and its context (file remapping information, for example).

At some point, we can also think about writing an independent tool which would
take a Clang-generated bug report and attempt to minimize it. For example, it
would try to simplify the test steps (e.g., see if it can reproduce with a
preprocessed input), and eventually could use a Clang-based delta tool to
minimize the input source.

---

There will be lots more details to be sorted out, but I wanted to give a heads
up on the basic approach I am planning on taking, assuming I can find the time
to work on this. Comments appreciated!

 - Daniel