[PATCH] D66107: [libFuzzer] Improve -merge= process to account for REDUCED corpus units.

Max Moroz via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Aug 14 07:31:30 PDT 2019


Dor1s marked 3 inline comments as done.
Dor1s added a comment.

Guys, thanks a lot for the feedback! Some answers below, I'll get back to the code soon.

In D66107#1628236 <https://reviews.llvm.org/D66107#1628236>, @morehouse wrote:

> A few high-level thoughts/questions:
>
> - What do you plan to use the improved merge stats for?


In various ClusterFuzz instances we're tracking the stats for every fuzzer run. It's important because we juggle a bunch of different so-called strategies, and we now have automated logic that assigns probabilities to the strategies based on the stats. Basically, we must have trustworthy stats w.r.t. new coverage / corpus changes, and relying on libFuzzer's merge is our best option. Otherwise, we do a lot of hacky parsing and still have problems with stats in certain cases :(

Outside of CF, I think it's a reasonable feature: when a user merges two or more corpora, they can see whether that gave them any new coverage, etc.

> - With this change, `-merge` seems to no longer mean we're "merging" the input directories into the output directory, since existing files in the output can be deleted.

That's true... However, `-merge` is also supposed to be used for corpus minimization, and in that case it'll be doing an even better job.

> - The change seems a bit complex.  I wonder if there's a way to use the existing "greedy selection" approach to make things simpler.  Maybe if we combine the output corpus `SizedFile` vector into the input corpus `SizedFile` vector before sorting and writing the initial control file.

I didn't think it was possible, as my understanding of the `REDUCE` logic turned out to be wrong. Now, with your and Jonathan's comments, it feels like this should be achievable through the existing greedy selection. Even if I have to loop through some files twice (not sure if that'll be needed, just speculating), that should be feasible, as it all happens in memory and doesn't invoke the target function.



================
Comment at: lib/fuzzer/FuzzerDriver.cpp:499
+  for (auto &Pair : ReplacedFiles) {
+    F->WriteToOutputCorpus(FileToVector(Pair.second, Options.MaxLen));
+    RemoveFile(Pair.first);
----------------
metzman wrote:
> I'm a bit confused why WriteToOutputCorpus isn't going to truncate testcases (which I'm pretty sure we don't want).
> I know your CL didn't introduce this, but do you know if this is the case?
Not sure I understood what exactly you meant:
- If you're asking about the contents of the unit (in memory), then I think it's not getting truncated here, as `MaxLen` has already been applied earlier (e.g. when reading the inputs)
- If you're asking about truncating the actual corpus units on disk, I think that doesn't happen because this function derives the filename from the contents (via a hash), i.e. truncated file contents would be given a different filename




================
Comment at: lib/fuzzer/FuzzerFork.cpp:224
     }
+    // TODO(Dor1s): go through FilesToReplace here.
     Features.insert(NewFeatures.begin(), NewFeatures.end());
----------------
metzman wrote:
> How about doing the FilesToReplace loop inside of CrashResistantMerge instead so it doesn't need to be repeated and so that `CrashResistantMerge` doesn't need to be passed more arguments.
Hm, sounds like an option. Thanks!


================
Comment at: lib/fuzzer/FuzzerMerge.cpp:194
+
+    auto Signature = SignaturesMap.find(Files[i].Signature);
+    if (Signature != SignaturesMap.end()) {
----------------
metzman wrote:
> I'm not sure this approach with signatures works the same way reduction works within a fuzzing run (i.e., the way I think we probably want). I'll have to think about this more by tomorrow, since I haven't 100% thought this through and the code dealing with this during fuzzing is tricky.
> 
> As I understand it, this implementation only replaces initial corpus elements with new ones that are smaller and exactly the same in coverage.
> However, [[ https://cs.chromium.org/chromium/src/third_party/libFuzzer/src/FuzzerCorpus.h?g=0&l=220 | during fuzzing ]] corpus elements are replaced if a new corpus element is found that is smaller and covers the features for which the initial element is the smallest testcase, i.e. testcases can be replaced by smaller testcases that aren't exactly equivalent.
> 
> Example:
> Unit A is the smallest element/unit covering feature X. A only covers feature X and no other feature.
> Then we find Unit B which is smaller than Unit A. B covers feature X and is the only (or smallest) unit that covers Y.
> During fuzzing we will delete A and add B since B is now the smallest unit covering X.
> Here, if A were an initial testcase (primary/output corpus) and B a new one (secondary/input corpus), I think we don't delete A.
> 
> Is my example correct?
> 
> If so then there may be a problem if there is an initial Unit C that is already covering Y, I don't think B will get added to the merged corpus, since it isn't a pure reduction of any testcase.
Wow, that's a great point! I was thinking that the units should give exactly the same coverage. If that's not the case, I should be able to achieve what we need without introducing the signature stuff, and it'll likely look better :) Thanks for pointing this out!


Repository:
  rCRT Compiler Runtime

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D66107/new/

https://reviews.llvm.org/D66107




