[llvm-dev] A Prototype to Track Input Read for Sparse File Fuzzing
Kostya Serebryany via llvm-dev
llvm-dev at lists.llvm.org
Fri Jul 14 15:47:29 PDT 2017
On Fri, Jul 14, 2017 at 3:40 PM, Shuyang Wang <swang2 at apple.com> wrote:
> Hi Kostya,
> Thanks for getting back to me.
> My work is somewhat similar to dfsan from a high level, but is specifically
> designed only to identify the read regions of an input file.
.. which is probably good :)
> It uses the fundamental sanitizer infrastructure, so if dfsan can be
> integrated with libFuzzer, I think my work can be as well.
> Regarding dfsan with libfuzzer, can you refresh me why it was removed?
It was not used anywhere and prevented me from doing a large refactoring.
I do want to reinstate it, or something similar.
> Shuyang Wang
> Security Engineering & Architecture
> On Jul 13, 2017, at 2:04 PM, Kostya Serebryany <kcc at google.com> wrote:
> This topic pops up regularly when discussing fuzzers, and not only for
> sparse input formats.
> I hope to eventually have a reasonable solution in libFuzzer itself.
> One way is to couple libFuzzer with dfsan (I even had some code for this,
> but removed it later).
> In the meantime, contributions are very welcome in various forms:
> * add micro fuzzing tests (puzzles) to https://github.com/llvm-
> * add real-life examples to https://github.com/google/fuzzer-test-suite/
> * add standalone "custom mutator" (see LLVMFuzzerCustomMutator) that uses
> your tool to apply mutations only to the relevant parts of the input.
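[A minimal sketch of what such a custom mutator could look like. The effective-range list here is a hypothetical stand-in for a read map produced by a tool like the one discussed; the stub LLVMFuzzerMutate exists only to keep the sketch self-contained and must be dropped when actually linking against libFuzzer, which provides the real one.]

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for libFuzzer's built-in mutator so this sketch is
// self-contained; when linked against libFuzzer, its real
// LLVMFuzzerMutate is used instead and this stub must be removed.
extern "C" size_t LLVMFuzzerMutate(uint8_t *Data, size_t Size,
                                   size_t MaxSize) {
  if (Size) Data[0] ^= 0xFF;  // trivial byte flip, illustration only
  return Size;
}

// Hypothetical read map: (offset, length) ranges the parser actually
// reads, e.g. loaded from a map produced by the tracking tool.
struct Range { size_t Off, Len; };
static const std::vector<Range> EffectiveRanges = {{0, 16}, {512, 64}};

// libFuzzer calls this hook instead of its default mutation stage.
extern "C" size_t LLVMFuzzerCustomMutator(uint8_t *Data, size_t Size,
                                          size_t MaxSize, unsigned Seed) {
  if (EffectiveRanges.empty() || Size == 0)
    return LLVMFuzzerMutate(Data, Size, MaxSize);  // default behavior
  const Range &R = EffectiveRanges[Seed % EffectiveRanges.size()];
  if (R.Off >= Size)
    return LLVMFuzzerMutate(Data, Size, MaxSize);
  // Mutate only the selected slice, keeping its length fixed so the
  // "holes" between effective ranges stay untouched.
  size_t Len = std::min(R.Len, Size - R.Off);
  LLVMFuzzerMutate(Data + R.Off, Len, Len);
  return Size;
}
```

[In a real setup the range list would be (re)loaded per input rather than hard-coded, but the shape of the hook is the same.]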
> On Wed, Jul 12, 2017 at 1:54 PM, Shuyang Wang <swang2 at apple.com> wrote:
>> Hi everyone,
>> I wrote a prototype based on the LLVM sanitizer infrastructure to improve
>> fuzzing performance, especially over sparse file formats. I’d like to
>> upstream it if anyone thinks it is useful.
>> Sparse file formats are formats in which only a small portion of the file
>> data can affect the behavior of the program that parses it. Common
>> examples are archive files or file system images, where only the metadata
>> affects program behavior. When fuzzing those formats, a general fuzzer
>> randomly selects ranges to mutate. Because of the sparse nature of the
>> formats, random range selection has a high probability of hitting the
>> "holes" where the data has no influence on the parser. While trimming the
>> input can sometimes improve the effective-range hit rate, it does not
>> always work. For instance, some programs may impose a minimum file size
>> requirement that turns out to be fairly large for fuzzing, or the
>> effective ranges may be sparsely distributed over the entire file instead
>> of concentrated at the beginning.
>> The tool I wrote leverages the observation that a piece of data can
>> influence its parser's behavior only if the data is actually read by the
>> parser, and the read regions of a sparse file are usually quite small
>> compared to the entire file. By generating a read map for each input and
>> feeding the map to a modified fuzzer that prioritizes mutating those
>> ranges, we observed over 10X improvement in path discovery at bootstrap
>> time in our test. The modified fuzzer was also able to find crashes in
>> 0.5 hours that the original version could not find in 72 hours, when we
>> ended the test.
>> The high-level idea of how the tool works is that it uses an
>> instrumentation pass to record every memory read in shadow memory, while
>> a runtime tracks buffer propagation from a user-specified buffer (the
>> initial buffer the file is read into) and coalesces the shadow memory for
>> those buffers. A read map can then be generated for each input file with
>> the instrumented binary.
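[A toy model of the read-map idea described above. The names and per-byte granularity are made up for illustration; the real prototype records reads via compiler instrumentation and shadow memory rather than explicit calls.]

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Toy model of the read map: one bool of "shadow" per input byte,
// set whenever the parser reads that byte. (The prototype described
// above does this with an instrumentation pass writing to shadow
// memory, not with explicit OnRead calls.)
class ReadTracker {
  std::vector<bool> Shadow;

public:
  explicit ReadTracker(size_t InputSize) : Shadow(InputSize, false) {}

  // Called on every read of Input[Off .. Off+Len).
  void OnRead(size_t Off, size_t Len) {
    for (size_t I = Off; I < Off + Len && I < Shadow.size(); ++I)
      Shadow[I] = true;
  }

  // Coalesce the per-byte shadow into (offset, length) ranges:
  // the "read map" handed to the fuzzer for targeted mutation.
  std::vector<std::pair<size_t, size_t>> ReadMap() const {
    std::vector<std::pair<size_t, size_t>> Map;
    size_t I = 0;
    while (I < Shadow.size()) {
      if (!Shadow[I]) { ++I; continue; }
      size_t Start = I;
      while (I < Shadow.size() && Shadow[I]) ++I;
      Map.push_back({Start, I - Start});
    }
    return Map;
  }
};
```

[A parser that only touches a small header and one metadata block in a large image would yield a two-range map, and a mutator can then restrict itself to those ranges instead of the whole file.]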
>> I hope this is interesting to some people and I can provide more details.
>> The prototype is not ready to upstream yet, but I would like to work on it
>> if the community is interested.
>> Shuyang Wang