[cfe-dev] RFC: Clang Automatic Bug Reporting

Mon Jul 19 10:51:30 PDT 2010

Sounds great to me!

On Jul 19, 2010, at 8:44 AM, Daniel Dunbar wrote:

> Hi all,
> 
> It would be great if Clang provided better support for users who want to file
> bugs. We frequently have to go through multiple iterations to get all the data
> we need, and we also sometimes run into cases where it is very hard / impossible
> to reproduce a problem locally. The latter problems tend to show up frequently
> with precompiled headers or supplemental files like header maps, where they
> depend very much on the build system and the layout of source on the user's
> system -- such tests can be a real pain to analyze right now.
> 
> The following proposal is for new feature work to let Clang automatically
> generate bug reports.
> 
> 
> Goals
> =====
> 
> Frontend / Single-File Focused:
> 
>  The goal of this work is to support generating bug reports for parse failures,
>  crashes, and trivial miscompiles. It is not designed to support generating
>  test cases where a large application is miscompiled, for example. Generally,
>  it is designed to support the case where the user runs a single Clang command
>  on their system, it doesn't work (crashes, produces obviously invalid output,
>  etc.), and they want a Clang developer to be able to reproduce the problem.
> 
> Easy-to-use:
> 
>  We want people to use it, so it has to be simple and it has to work almost all
>  the time.
> 
> Near-perfect Bug Reproduction:
> 
>  We want it to be almost guaranteed that the generated bug report reproduces
>  the problem. It isn't possible to be perfect, but we would like to get very
>  close.
> 
> Report Non-Compiler API Bugs:
> 
>  Currently, bugs in the compiler are usually easy to reproduce for users who
>  know how to generate preprocessed or LLVM IR files. However, bugs in
>  other areas of Clang like the libclang interfaces are much harder to
>  reproduce. Any solution should address (or help address) this problem.
> 
> Support auto-minimizing / anonymizing test cases:
> 
>  This won't happen soon, but I would like any solution to support this in some
>  reasonable fashion. This is primarily a nice-to-have, but it is also important
>  because it makes it more likely users will actually bother to submit a test
>  case in situations where they are worried about disclosing their source code.
> 
> 
> User Interface
> ==============
> 
> The Clang driver will get two new options:
> 
> '--create-test-case PATH'
> 
>   This will cause the driver to create a self-contained test case at PATH,
>   containing enough information to reproduce, as far as possible, all the
>   actions the compiler takes.
> 
> '--replay-test-case PATH'
> 
>   This will cause the driver to replay the test case as faithfully as
>   possible. The driver will still support additional command line options, so
>   the usual use model would be to run '--replay-test-case' to verify the
>   problem reproduces, then either fix the problem directly or use additional
>   command line options (-E, -###, -emit-llvm, etc.) to isolate / minimize the
>   problem.
> 
> 
> Implementation
> ==============
> 
> Conceptually, what we want to capture in the test case is as much of the user's
> environment as is required to reproduce the problem. The environment consists of
> a lot of things which might change the compiler's behavior: the OS, the hardware,
> the file system, the environment variables, the command line options, the
> behavior of external programs, etc. We obviously cannot package up all of these
> things, but Clang is portable and always a cross compiler, and most bugs can be
> reproduced on different hardware or a different OS (with the right options).
> 
> The implementation is to try to capture each piece of the environment as best
> we can:
> 
> - For the OS and hardware, we will just record the OS and CPU information, and
>   when replaying the test case we will use that information instead of the host
>   information. This will require a few additional hooks, but should be
>   straightforward.
> 
> - Command line arguments and environment variables can just be saved to the
>   test case and restored on replay.
> 
> - For external programs the driver calls like 'as' and 'ld', all we can expect
>   to do in general is store the version information for the program, so that
>   developers can at least try to replicate the host environment if necessary
>   (and if the failure actually depends on the particular version of one of
>   those tools, which it usually doesn't).
> 
> - The file system is the main piece we cannot currently deal with. Usually we
>   have users give us preprocessed files to avoid depending on the user's file
>   system, but this does not always suffice to reproduce problems.
> 
>   My plan here is to rework parts of Clang to add support for a "virtual file
>   system" which would live under the FileManager API layer.  When the driver is
>   generating test cases, it would use this interface to keep track of all the
>   directories and files that are accessed as part of the compilation, and it
>   would serialize all this information (the file metadata and contents) into
>   the bug report. When the driver is replaying a test case, it would construct
>   a new virtual file system (like a private chroot, essentially) from the bug
>   report. This is the main implementation work, described below.
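
To make the list above concrete, here is a minimal sketch of what a generated
test case might record and how it could round-trip through a file. Everything
in it is an assumption for illustration: TestCaseRecord, CapturedFile,
serialize, deserialize and the length-prefixed text format are hypothetical
names and choices, not part of the proposal or of Clang.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record mirroring the pieces of the environment listed above.
struct CapturedFile {
  std::string Path;
  std::string Contents;
};

struct TestCaseRecord {
  std::string HostOS;   // recorded; would be used instead of the host on replay
  std::string HostCPU;
  std::vector<std::string> Args;                    // driver command line
  std::map<std::string, std::string> Env;           // environment variables
  std::map<std::string, std::string> ToolVersions;  // e.g. "as", "ld"
  std::vector<CapturedFile> Files;                  // file system snapshot
};

using StringMap = std::map<std::string, std::string>;

// Length-prefixed strings ("5:hello") are safe for arbitrary file contents.
static void writeString(std::ostream &OS, const std::string &S) {
  OS << S.size() << ':' << S;
}

static std::string readString(std::istream &IS) {
  std::size_t N = 0;
  char Colon = 0;
  IS >> N;
  IS.get(Colon);
  assert(IS && Colon == ':' && "malformed record");
  std::string S(N, '\0');
  IS.read(&S[0], static_cast<std::streamsize>(N));
  assert(IS && "truncated record");
  return S;
}

static void writeCount(std::ostream &OS, std::size_t N) {
  writeString(OS, std::to_string(N));
}

static std::size_t readCount(std::istream &IS) {
  return std::stoul(readString(IS));
}

static void writeMap(std::ostream &OS, const StringMap &M) {
  writeCount(OS, M.size());
  for (const auto &KV : M) {
    writeString(OS, KV.first);
    writeString(OS, KV.second);
  }
}

static StringMap readMap(std::istream &IS) {
  StringMap M;
  for (std::size_t N = readCount(IS); N > 0; --N) {
    std::string Key = readString(IS);  // read the key first, then the value
    M[Key] = readString(IS);
  }
  return M;
}

std::string serialize(const TestCaseRecord &R) {
  std::ostringstream OS;
  writeString(OS, R.HostOS);
  writeString(OS, R.HostCPU);
  writeCount(OS, R.Args.size());
  for (const auto &A : R.Args)
    writeString(OS, A);
  writeMap(OS, R.Env);
  writeMap(OS, R.ToolVersions);
  writeCount(OS, R.Files.size());
  for (const auto &F : R.Files) {
    writeString(OS, F.Path);
    writeString(OS, F.Contents);
  }
  return OS.str();
}

TestCaseRecord deserialize(const std::string &Data) {
  std::istringstream IS(Data);
  TestCaseRecord R;
  R.HostOS = readString(IS);
  R.HostCPU = readString(IS);
  for (std::size_t N = readCount(IS); N > 0; --N)
    R.Args.push_back(readString(IS));
  R.Env = readMap(IS);
  R.ToolVersions = readMap(IS);
  for (std::size_t N = readCount(IS); N > 0; --N) {
    CapturedFile F;
    F.Path = readString(IS);
    F.Contents = readString(IS);
    R.Files.push_back(F);
  }
  return R;
}
```

A real format would also need versioning and the file metadata mentioned
above; the sketch only shows that the captured pieces fit in one
self-contained blob.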
> 
> 
> Virtual File System
> ===================
> 
> My plan is to add a new LLVM interface which abstracts high-level access to the
> file system. This interface will live at the llvm/Support level, and is designed
> to have a thin API -- it won't be a full VFS layer, but rather it will support
> the higher level LLVM file operations, i.e. getting a MemoryBuffer or
> raw_ostream. We will also need additional interfaces to support things like
> stat() or quickly testing file existence. The Support library will provide a
> default implementation of the VFS interface which uses the normal file system. I
> don't have a sketch of the API yet, but I'm confident we can achieve something
> clean.
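
Since there is no API sketch yet, the following is only a guess at the shape
such a thin interface could take, written against the standard library so it
stands alone. FileSystemInterface, RealFileSystem, InMemoryFileSystem and
RecordingFileSystem are hypothetical names, and plain std::string stands in
for MemoryBuffer. It shows a default implementation on the normal file
system, a recording wrapper for generating test cases, and an in-memory
implementation for replay.

```cpp
#include <cassert>
#include <fstream>
#include <map>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical thin interface; not an existing LLVM API.
class FileSystemInterface {
public:
  virtual ~FileSystemInterface() = default;
  // Quick existence test (stands in for stat()-like queries).
  virtual bool exists(const std::string &Path) = 0;
  // Whole-file read; returns false if the file cannot be read.
  virtual bool readFile(const std::string &Path, std::string &Out) = 0;
};

// Default implementation backed by the normal file system.
class RealFileSystem : public FileSystemInterface {
public:
  bool exists(const std::string &Path) override {
    // Approximation: anything that can be opened for reading "exists".
    return std::ifstream(Path).good();
  }
  bool readFile(const std::string &Path, std::string &Out) override {
    std::ifstream In(Path, std::ios::binary);
    if (!In)
      return false;
    std::ostringstream Buf;
    Buf << In.rdbuf();
    Out = Buf.str();
    return true;
  }
};

// Replay side: a private file system built only from captured contents.
class InMemoryFileSystem : public FileSystemInterface {
  std::map<std::string, std::string> Files;

public:
  explicit InMemoryFileSystem(std::map<std::string, std::string> F)
      : Files(std::move(F)) {}
  bool exists(const std::string &Path) override {
    return Files.count(Path) != 0;
  }
  bool readFile(const std::string &Path, std::string &Out) override {
    auto It = Files.find(Path);
    if (It == Files.end())
      return false;
    Out = It->second;
    return true;
  }
};

// Capture side: forwards to another file system and remembers what was seen.
class RecordingFileSystem : public FileSystemInterface {
  FileSystemInterface &Base;
  std::map<std::string, std::string> Captured;

public:
  explicit RecordingFileSystem(FileSystemInterface &B) : Base(B) {}
  bool exists(const std::string &Path) override {
    bool Found = Base.exists(Path);
    std::string Contents;
    // Capture eagerly so that replay answers exists() the same way.
    if (Found && !Captured.count(Path) && Base.readFile(Path, Contents))
      Captured[Path] = Contents;
    return Found;
  }
  bool readFile(const std::string &Path, std::string &Out) override {
    if (!Base.readFile(Path, Out))
      return false;
    Captured[Path] = Out;
    return true;
  }
  const std::map<std::string, std::string> &captured() const {
    return Captured;
  }
};
```

Files that were never touched are simply absent from the replay file system,
so failed lookups (a header searched for in the wrong directory, say) fail the
same way on replay without being recorded explicitly.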
> 
> Once the llvm/Support level VFS interface is in place, the Clang
> CompilerInstance object will get a VFS object. I will then refactor all the
> Clang IO access to go through this object. The main piece here is changing the
> FileManager to live on top of the VFS object, but all the other places the
> driver & frontend access files will need to move as well.
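
One way to picture this layering, as a rough sketch only: the instance owns a
single file system object, and a FileManager-like cache sits on top of it so
that repeated lookups never reach the underlying file system twice. The names
FileSystem, CountingMemoryFS, CachingFileManager and Instance are made up for
this example and do not correspond to the real Clang classes.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical minimal file system interface (not an LLVM API).
struct FileSystem {
  virtual ~FileSystem() = default;
  virtual bool readFile(const std::string &Path, std::string &Out) = 0;
};

// In-memory stand-in that counts how often it is consulted.
struct CountingMemoryFS : FileSystem {
  std::map<std::string, std::string> Files;
  unsigned Reads = 0;
  bool readFile(const std::string &Path, std::string &Out) override {
    ++Reads;
    auto It = Files.find(Path);
    if (It == Files.end())
      return false;
    Out = It->second;
    return true;
  }
};

// Stand-in for a FileManager-like cache living on top of the file system.
class CachingFileManager {
  std::shared_ptr<FileSystem> FS;
  // Path -> (found?, contents); failed lookups are cached as well.
  std::map<std::string, std::pair<bool, std::string>> Cache;

public:
  explicit CachingFileManager(std::shared_ptr<FileSystem> TheFS)
      : FS(std::move(TheFS)) {}

  // Returns null if the file is missing. The pointer stays valid for the
  // lifetime of the manager because std::map nodes do not move.
  const std::string *getFile(const std::string &Path) {
    auto It = Cache.find(Path);
    if (It == Cache.end()) {
      std::string Contents;
      bool Found = FS->readFile(Path, Contents);
      It = Cache.emplace(Path, std::make_pair(Found, std::move(Contents)))
               .first;
    }
    return It->second.first ? &It->second.second : nullptr;
  }
};

// Stand-in for a CompilerInstance-like owner: every component that does IO
// is handed the same file system object.
struct Instance {
  std::shared_ptr<FileSystem> FS;
  CachingFileManager Files;
  explicit Instance(std::shared_ptr<FileSystem> TheFS)
      : FS(TheFS), Files(TheFS) {}
};
```

Swapping the object passed to Instance is then the only change needed to move
between the normal file system and a replayed one.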
> 
> There is also some possibility that once this work is done we can simplify some
> existing interfaces, for example the current file remapping APIs or PTH's stat
> cache.
> 
> 
> Discussion
> ==========
> 
> The VFS-based approach may seem over-the-top, but there are a couple of reasons I
> like this approach as opposed to others:
> 
> - The only other real alternative is to try to make the driver smart enough to
>   rewrite and repackage various local paths when making the test cases
>   (preprocessed inputs are not a viable alternative), then use the remapping
>   APIs to mimic the user's environment. This would be very hard to implement
>   correctly, and would be brittle in the face of changes to the frontend.
> 
> - I think this approach is fairly simple to implement. We will need to spend a
>   fair amount of time getting the VFS interface right, so that it is performant
>   and clean, but otherwise each implementation step should be straightforward.
> 
> - This approach should be very robust at reproducing bugs. Although memory
>   layout and other very low-level details will change, the intent is that
>   everything in the compiler above the VFS layer should behave exactly as it
>   would on the user's system, assuming an accurate bug report.
> 
> - The downside of this approach is that bug reports will by default include a
>   substantial amount of information. I am ok with this tradeoff, because my
>   number one priority is to be able to reproduce the bug. I eventually hope to
>   solve this problem by having a tool which, once it has a reproducible bug
>   report, will try to weed out the non-essential parts (for example, trying to
>   switch to a preprocessed input).
> 
> 
> Future Work
> ===========
> 
> Once the basic Clang driver features are in place, we should be able to use the
> same infrastructure to generate bug reports from other API entry points (like
> libclang). Most of these will involve just packaging up the basic bug report as
> the driver would, and adding whatever additional metadata is needed to identify
> the API call, plus any other call-specific data (file remapping information,
> for example).
> 
> At some point, we can also think about writing an independent tool which would
> take a Clang-generated bug report and attempt to minimize it. For example, it
> would try to simplify the test steps (i.e., see if it can reproduce with a
> preprocessed input), and eventually could use a Clang-based delta tool to
> minimize the input source.
> 
> ---
> 
> There will be lots more details to be sorted out, but I wanted to give a heads
> up on the basic approach I am planning on taking, assuming I can find the time
> to work on this. Comments appreciated!
> 
> - Daniel
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev