[cfe-dev] RFC: Clang Automatic Bug Reporting

Wed Jul 21 03:41:13 PDT 2010

On 19 Jul 2010, at 16:44, Daniel Dunbar wrote:

> Hi all,
> 
> It would be great if Clang provided better support for users who want to file
> bugs. We frequently have to go through multiple iterations to get all the data
> we need, and we also sometimes run into cases where it is very hard / impossible
> to reproduce a problem locally. The latter problems tend to show up frequently
> with precompiled headers or supplemental files like header maps, where they
> depend very much on the build system and the layout of source on the user's
> system -- such tests can be a real pain to analyze right now.

I have done a fair amount of work on generating and reducing clang bugs. As time goes on, the bugs found involve increasingly complex code. I think a system like the one you describe would be a good idea, and the VFS seems like a sensible approach. I will just say a few words on automatic reduction. The short version is "I wouldn't try it, at least as a first pass", although anonymisation is a good idea.

Automatic reduction can be very expensive. A bug involving boost can often lead to megabytes of pre-processed code and compile times of more than 10 seconds. In this situation, any kind of testcase reduction or bisection can take a day or longer. We should certainly warn users before we start running such a process!
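
As a rough illustration of that cost, here is a back-of-the-envelope sketch. The function names and the 20,000-candidate figure are made up for illustration, not measurements; only the 10 seconds per compile comes from the paragraph above.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def compiles_per_day(seconds_per_compile):
    """How many sequential compiler runs fit in 24 hours."""
    return SECONDS_PER_DAY // seconds_per_compile

def days_needed(candidate_tests, seconds_per_compile):
    """Wall-clock days to test `candidate_tests` candidates one at a time."""
    return candidate_tests * seconds_per_compile / SECONDS_PER_DAY

print(compiles_per_day(10))     # 8640
print(days_needed(20_000, 10))  # a little over 2.3 days
```

So a reducer that has to try more than a few thousand candidate inputs will run past a day at these compile times.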

I don't know how much better a clang-based system could do. I would suggest concentrating on anonymizing rather than auto-minimizing test cases, although anonymization would probably involve some degree of reduction. I think the important thing for users is that it is fast and removes as many details of their code as possible, rather than that the result is particularly small.

I imagine that quite a lot of progress could be made on anonymisation by a certain degree of random shuffling, for example randomly swapping < for > or == (allowing for overloaded operators of course), adding and removing constants, and changing names. This wouldn't always work, but would be quick, and would likely make the code extremely hard to reconstruct.
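
To make the "changing names" part concrete, here is a minimal token-level sketch. It is an assumption-laden toy, not a proposal for how Clang would do it: the keyword set is deliberately incomplete, the helper name `anonymise_identifiers` is hypothetical, and string literals, comments, and preprocessor directives are not handled. A real tool would work on the AST.

```python
import re

# Deliberately incomplete set of C++ keywords that must not be renamed.
CXX_KEYWORDS = {
    "bool", "char", "class", "const", "double", "else", "float", "for",
    "if", "int", "namespace", "return", "struct", "template", "typename",
    "void", "while",
}

# An identifier that does not start in the middle of another token,
# so the "x1F" in "0x1F" is left alone.
IDENT = re.compile(r"\b[A-Za-z_]\w*")

def anonymise_identifiers(source, keep=frozenset()):
    """Consistently rename every non-keyword identifier to id0, id1, ..."""
    mapping = {}

    def rename(match):
        name = match.group(0)
        if name in CXX_KEYWORDS or name in keep:
            return name
        if name not in mapping:
            mapping[name] = f"id{len(mapping)}"
        return mapping[name]

    return IDENT.sub(rename, source), mapping

code = "struct Widget { int size; bool empty() const { return size < 1; } };"
anonymised, names = anonymise_identifiers(code)
print(anonymised)
```

Names that have to survive (for example `std`, if the bug depends on a library header) can be passed in `keep`.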

For reduction I use a cobbled-together bunch of scripts. These are fairly successful, if fairly expensive. One problem with automatic testcase reduction is keeping the code valid: if the code was valid originally, keeping it valid is usually a good idea. I do this by using g++, which at least produces an approximation of "valid" ;)

The main things I find my code cannot reduce well (as it does not understand C++) are:

1) Constructions involving things like enable_if.
2) Removing parameters to functions and templates which are not used (which can be a big help if those parameters are some very complex type), as they have to be removed at both the definition and call sites at the same time.
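
For reference, the general shape of a line-based reducer looks roughly like the sketch below. This is a generic greedy chunk-removal loop, not my actual scripts; `reduce_lines` and `is_interesting` are hypothetical names. In practice the predicate would write the candidate to a file and check that g++ still accepts it and that clang still fails; here it is a toy so the example is self-contained.

```python
def reduce_lines(lines, is_interesting):
    """Greedily remove chunks of lines while `is_interesting` stays true."""
    assert is_interesting(lines), "the starting input must reproduce the problem"
    chunk = max(len(lines) // 2, 1)
    while True:
        i, removed_any = 0, False
        while i < len(lines):
            candidate = lines[:i] + lines[i + chunk:]
            if candidate and is_interesting(candidate):
                # Smaller input still reproduces: keep it, retry same offset.
                lines, removed_any = candidate, True
            else:
                # This chunk is needed (for now); move past it.
                i += chunk
        if chunk > 1:
            chunk //= 2          # try finer-grained removals next
        elif not removed_any:
            return lines         # no single line can be removed any more

# Toy predicate standing in for "still valid AND still triggers the bug".
def toy_predicate(candidate):
    return "BUG" in candidate and "NEED" in candidate

print(reduce_lines(["a", "b", "BUG", "c", "d", "NEED", "e"], toy_predicate))
```

Because it only deletes whole lines, a loop like this cannot perform either of the two transformations listed above, which both require coordinated edits in several places.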

This of course ignores one big class of bugs, "wrong result" bugs (as opposed to the compiler misbehaving). Reducing such bugs is extremely difficult. The best I have done is simultaneously running the reduced code through g++ and clang++, and running the resulting executables under valgrind (to try to avoid the code happening to produce the correct result while reading from uninitialized memory). Even with this, such reductions usually end up in an incorrect local minimum if attempted automatically.
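
The check for a "wrong result" candidate can be sketched as follows. All names here (`RunResult`, `still_miscompiled`, the two runner callbacks) are hypothetical; the runners are assumed to compile the source with the reference and the tested compiler, execute the result under a memory checker, and report what happened. The toy runners below just return canned results.

```python
from dataclasses import dataclass

@dataclass
class RunResult:
    output: str
    memory_clean: bool  # True if the memory checker reported no errors

def still_miscompiled(source, run_reference, run_under_test):
    """A candidate is kept only if both runs are clean and the outputs differ."""
    ref = run_reference(source)
    new = run_under_test(source)
    if not (ref.memory_clean and new.memory_clean):
        # The reduction probably introduced undefined behaviour; any
        # difference in output is then meaningless, so reject it.
        return False
    return ref.output != new.output

# Canned runners for illustration only.
good = lambda src: RunResult(output="42", memory_clean=True)
bad = lambda src: RunResult(output="41", memory_clean=True)
dirty = lambda src: RunResult(output="41", memory_clean=False)

print(still_miscompiled("int main() {}", good, bad))    # True
print(still_miscompiled("int main() {}", good, dirty))  # False
```

A predicate like this only filters out some bad candidates; it does not stop the reduction drifting to a different discrepancy than the one originally reported.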

> 
> The following proposal is for new feature work to let Clang automatically
> generate bug reports.
> 
> 
> Goals
> =====
> 
> Frontend / Single-File Focused:
> 
>  The goal of this work is to support generating bug reports for parse failures,
>  crashes, and trivial miscompiles. It is not designed to support generating
>  test cases where a large application is miscompiled, for example. Generally,
>  it is designed to support the case where the user runs a single Clang command
>  on their system, it doesn't work (crashes, produces obviously invalid output,
>  etc.), and they want a Clang developer to be able to reproduce the problem.
> 
> Easy-to-use:
> 
>  We want people to use it, so it has to be simple and it has to work almost all
>  the time.
> 
> Near-perfect Bug Reproduction:
> 
>  We want it to be almost guaranteed that the generated bug report reproduces
>  the problem. It isn't possible to be perfect, but we would like to get very
>  close.
> 
> Report Non-Compiler API Bugs:
> 
>  Currently, bugs in the compiler are usually easy to reproduce for users who
>  know how to generate preprocessed or LLVM IR files. However, bugs in
>  other areas of Clang like the libclang interfaces are much harder to
>  reproduce. Any solution should address (or help address) this problem.
> 
> Support auto-minimizing / anonymizing test cases:
> 
>  This won't happen soon, but I would like any solution to support this in some
>  reasonable fashion. This is primarily a nice to have, but it is also important
>  because it makes it more likely users will actually bother to submit a test
>  case in situations where they are worried about disclosing their source code.
> 
> 
> User Interface
> ==============
> 
> The Clang driver will get two new options:
> 
> '--create-test-case PATH'
> 
>   This will cause the driver to create a self-contained test case at PATH,
>   which contains enough information to reproduce all the actions the compiler
>   is taking as much as possible.
> 
> '--replay-test-case PATH'
> 
>   This will cause the driver to replay the test case as best as possible. The
>   driver will still support additional command line options, so the usual use
>   model would be to run '--replay-test-case' to verify the problem reproduces,
>   then either fix the problem directly or use additional command line options
>   (-E, -###, -emit-llvm, etc.) to isolate / minimize the problem.
> 
> 
> Implementation
> ==============
> 
> Conceptually, what we want to capture in the test case is as much of the user's
> environment as is required to reproduce the problem. The environment consists of
> a lot of things which might change the compiler's behavior: the OS, the hardware,
> the file system, the environment variables, the command line options, the
> behavior of external programs, etc. We obviously cannot package up all of these
> things, but Clang is portable and always a cross compiler, and most bugs can be
> reproduced on different hardware or a different OS (with the right options).
> 
> The implementation is to try and capture each piece of the environment as best
> we can:
> 
> - For the OS and hardware, we will just record the OS and CPU information, and
>   when replaying the test case we will use that information instead of the host
>   information. This will require a few additional hooks, but should be
>   straightforward.
> 
> - Command line arguments and environment variables can just be saved to the
>   test case and restored on replay.
> 
> - For external programs the driver calls like 'as' and 'ld', all we can expect
>   to do in general is store the version information for the program, so that
>   developers can at least try to replicate the host environment if necessary
>   (and if the failure actually depends on the particular version of one of
>   those tools, which it usually doesn't).
> 
> - The file system is the main piece we cannot currently deal with. Usually we
>   have users give us preprocessed files to avoid depending on the user's file
>   system, but this does not always suffice to reproduce problems.
> 
>   My plan here is to rework parts of Clang to add support for a "virtual file
>   system" which would live under the FileManager API layer.  When the driver is
>   generating test cases, it would use this interface to keep track of all the
>   directories and files that are accessed as part of the compilation, and it
>   would serialize all this information (the file metadata and contents) into
>   the bug report. When the driver is replaying a test case, it would construct
>   a new virtual file system (like a private chroot, essentially) from the bug
>   report. This is the main implementation work, described below.
> 
> 
> Virtual File System
> ===================
> 
> My plan is to add a new LLVM interface which abstracts high-level access to the
> file system. This interface will live at the llvm/Support level, and is designed
> to have a thin API -- it won't be a full VFS layer, but rather it will support
> the higher level LLVM file operations, i.e. getting a MemoryBuffer or
> raw_ostream. We will also need additional interfaces to support things like
> stat() or quickly testing file existence. The Support library will provide a
> default implementation of the VFS interface which uses the normal file system. I
> don't have a sketch of the API yet, but I'm confident we can achieve something
> clean.
> 
> Once the llvm/Support level VFS interface is in place, the Clang
> CompilerInstance object will get a VFS object. I will then refactor all the
> Clang IO access to go through this object. The main piece here is changing the
> FileManager to live on top of the VFS object, but all the other places the
> driver & frontend access files will need to move as well.
> 
> There is also some possibility that once this work is done we can simplify some
> existing interfaces, for example the current file remapping APIs or PTH's stat
> cache.
> 
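
To check my understanding of the record-and-replay idea, here is a toy sketch in Python. It is not the proposed llvm/Support API (which has not been sketched yet); every class and method name below is hypothetical, and it only covers reading files and testing existence.

```python
import os

class RealFileSystem:
    """Default backend: forwards to the host file system."""
    def exists(self, path):
        return os.path.exists(path)

    def read(self, path):
        with open(path, "rb") as f:
            return f.read()

class RecordingFileSystem:
    """Wraps a backend and remembers every lookup made through it."""
    def __init__(self, backend):
        self.backend = backend
        self.files = {}       # path -> contents of files that were read
        self.existing = set() # paths that were probed and found
        self.missing = set()  # paths that were probed and not found

    def exists(self, path):
        found = self.backend.exists(path)
        (self.existing if found else self.missing).add(path)
        return found

    def read(self, path):
        data = self.backend.read(path)
        self.files[path] = data
        self.existing.add(path)
        return data

class ReplayFileSystem:
    """Serves only what was recorded; nothing else is visible."""
    def __init__(self, files, existing=()):
        self.files = dict(files)
        self.existing = set(existing) | set(self.files)

    def exists(self, path):
        return path in self.existing

    def read(self, path):
        return self.files[path]

# Use an in-memory "host" so the example does not touch the real disk.
host = ReplayFileSystem({"/src/a.h": b"int x;"})
recorder = RecordingFileSystem(host)
recorder.exists("/src/missing.h")
recorder.read("/src/a.h")

replay = ReplayFileSystem(recorder.files, recorder.existing)
print(replay.read("/src/a.h"))
```

Recording failed lookups as well as successful ones seems important for header search, since which directories did not contain a header affects which file is eventually picked.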
> 
> Discussion
> ==========
> 
> The VFS based approach may seem over-the-top, but there are a couple reasons I
> like this approach as opposed to others:
> 
> - The only real other alternative is to try to make the driver smart enough to
>   rewrite and repackage up various local paths when making the test cases
>   (preprocessed inputs are not a viable alternative), then use the remapping
>   APIs to mimic the user's environment. This would be very hard to implement
>   correctly, and would be brittle in the face of changes to the frontend.
> 
> - I think this approach is fairly simple to implement. We will need to spend a
>   fair amount of time getting the VFS interface right, so that it is performant
>   and clean, but otherwise each implementation step should be straightforward.
> 
> - This approach should be very robust in so far as reproducing bugs. Although
>   memory layout and other very low-level details will change, the intent is
>   that everything in the compiler above the VFS layer should behave exactly as
>   it would on the user's system, assuming an accurate bug report.
> 
> - The downside of this approach is that bug reports will by default include a
>   substantial amount of information. I am ok with this tradeoff, because my
>   number one priority is to be able to reproduce the bug. I eventually hope to
>   solve this problem by having a tool which, once it has a reproducible bug
>   report, will try to weed out the non-essential parts (for example, trying to
>   switch to a preprocessed input).
> 
> 
> Future Work
> ===========
> 
> Once the basic Clang driver features are in place, we should be able to use the
> same infrastructure to generate bug reports from other API entry points (like
> libclang). Most of these will involve just packaging up the basic bug report as
> the driver would, and adding whatever additional metadata is needed to identify
> the API call and any extra metadata (file remapping information, for example).
> 
> At some point, we can also think about writing an independent tool which would
> take a Clang generated bug report and attempt to minimize it. For example, it
> would try to simplify the test steps (i.e., see if it can reproduce with a
> preprocessed input), and eventually could use a Clang based delta tool to
> minimize the input source.
> 
> ---
> 
> There will be lots more details to be sorted out, but I wanted to give a heads
> up on the basic approach I am planning on taking, assuming I can find the time
> to work on this. Comments appreciated!
> 
> - Daniel