[PATCH] D73776: Entropic: Boosting LibFuzzer Performance

Sat Feb 1 18:52:26 PST 2020

marcel marked 10 inline comments as done.
marcel added a subscriber: Valentin.
marcel added a comment.

Thanks so much for your feedback!

1. We maintain feature abundance counts globally (for the entire fuzzing campaign) and locally (for each `InputInfo`).
  - Whenever an input generated by fuzzing `II` executes feature `f`, we increment the global abundance count for `f` as well as the local abundance count for `f` @ `II`.
  - The //global abundance counts// are stored in `FeaturesFreqMap`, an array of size `kFeatureSetSize`.
    - It requires one read/write for *each* `UpdateFeatureFrequency` which is very hot (Line 331). To maximize efficiency, I decided to use a fixed-size array. There is only one global map. So, even though the map is fairly sparse, memory overhead might be okay.
    - Currently, each element is of `size_t` because a feature might be executed as often as new inputs are generated. The total number of inputs generated (`g_NumExecutedMutations`) will likely not fit into `uint32_t`.
    - However, if memory footprint is a concern over efficiency, we might even go down as much as `uint16_t` since only abundance counts below `MostAbundant_RareFeature` are relevant. This adds an overflow check to each (hot!) write (Line 331).
  - The //local abundance counts// are stored with each `InputInfo` in `RareFeaturesFreqMap`, and `unordered_map<size_t,size_t>`
    - It is written to in `UpdateFeatureFrequency` but only if the current feature is globally rare. In most cases `UpdateFeatureFrequency` returns in Line 337 before the local feature count is updated.
    - It is read when we iterate over all keys (and values) to compute the energy for an input in `UpdateEnergy`.
    - We can go faster with other hashmaps. Most efficient would be a fixed size array that is simply indexed with the feature ID (like the global abundance counts). However, the arrays are very sparse and the memory footprint should be too high.
  - There is also some reading of the global and local abundance counts in `AddRareFeature` (Line 244), but that should happen quite rarely.
2. How do we decide what is considered a globally rare feature?
  - Currently, we fixed this definition quite arbitrary as: {the top 100 least abundant features} UNION {all features with abundance below 0xFF}.
  - When a feature becomes abundant, it is removed globally and from each `II` locally. This is handled in `AddRareFeature` (which also updates `MostAbundant_RareFeature`)
  - If you allow too many features to be considered as rare, you will spend more time computing the energy for a seed.
  - If you allow too few features to considered as rare, your performance gain from Entropic might decrease.
  - Not sure whether we already hit the sweet spot. We can pull these out as command line options to see what works.

In addition, we will

- add a command line flag to enable entropic for libfuzzer,
- pull out the constants either as command line options or as global constants,
- turn most occurrences of `long double` to `double` (except in `UpdateEnergy` where we need the precision, I think).
- update variable names, add comments, and
- see how we can unit test our patch (@Valentin?)

Let me know what you think.

================
Comment at: compiler-rt/lib/fuzzer/FuzzerCorpus.h:46
+  long double Sum_Y;
+  std::unordered_map<size_t, size_t> RareFeaturesFreqMap;
 };
----------------
kcc wrote:
> Any chance to have a planer data structure here? (sorted array or some such)
> Also, why size_t? 
> The feature is 4 bytes and the frequency can probably be 4 or even 2 bytes. (2 bytes is hard for alignment). 
Whether or not unorder_map, see top-level comment. If unordered_map, then uint32_t for keys and values should work.

================
Comment at: compiler-rt/lib/fuzzer/FuzzerCorpus.h:53
+public:
+  size_t g_NumExecutedMutations = 0;
   InputCorpus(const std::string &OutputCorpus) : OutputCorpus(OutputCorpus) {
----------------
kcc wrote:
> why is this public? 
> Also, why the g_ prefix? 
`g_NumExecutedMutations` contains the total number of inputs generated. It is updated in FuzzerLoop.c in `MutateAndTestOne`. I can drop the `g_` prefix.

================
Comment at: compiler-rt/lib/fuzzer/FuzzerCorpus.h:250
+    // Firstly, remove most abundant feature until we are "under the line".
+    while (RareFeatures.size() > 100 && MostAbundant_RareFeature > 0xFF) {
+      size_t st_mostAbundant_RareFeature_Idx =
----------------
kcc wrote:
> try not to use constants like this. 
> At the very least, use
>   const size_t kSomething = 100;
>   ..
>   if (xxx.size() > kSomething)
> 
> if there is any reason to play with different value, you may want to use a flag instead
> 
> Similar for 0xFF. 
> 
> Probably, the best here is to replace these two constants with function parameters so that this function becomes unit-testable. 
> Which reminds me: please consider adding unit tests to the key functionality like this one. 
> 
We will pull these constants out as command line options.

================
Comment at: compiler-rt/lib/fuzzer/FuzzerCorpus.h:328

+  void UpdateFeatureFrequency(InputInfo *II, size_t Idx) {
+    Idx = Idx % kFeatureSetSize;
----------------
kcc wrote:
> This is the hotspot, right? 
> My guess is that using a hash map here causes most of the slowdown, need a faster data structure... 
> Need to think... 
> It will help if you have describe the algorithm in a comment. 
Yes. This is the hot spot. However, most executions should exit in Line 337.
FeaturesFreqMap is an array. Don't think you can get much faster here.

================
Comment at: compiler-rt/lib/fuzzer/FuzzerCorpus.h:500
+  bool DistributionNeedsUpdate = true;
+  size_t FeaturesFreqMap[kFeatureSetSize];
+  Vector<size_t> RareFeatures;
----------------
kcc wrote:
> does this have to be 8-byte per feature? 
Each entry is upper-bounded by the total number of generated inputs `g_NumExecutedMutations` which likely won't fit into `uint32_t`. That's why I chose size_t.

However, we really only need abundance information for features with an abundance below `MostAbundant_RareFeature`. If memory footprint is a concern, we can go down to `uint16_t` at the cost of an overflow check in the hot code in `UpdateFeatureFrequency`. See top-level comment for more details.

================
Comment at: compiler-rt/lib/fuzzer/FuzzerLoop.cpp:607

-void Fuzzer::ReportNewCoverage(InputInfo *II, const Unit &U) {
+void Fuzzer::ReportNewCoverage(bool reduced, InputInfo *II, const Unit &U) {
   II->NumSuccessfullMutations++;
----------------
kcc wrote:
> do you need this change? 
Unrelated. This is just fixing a problem where LibFuzzer prints REDUCE more often than it should.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D73776/new/

https://reviews.llvm.org/D73776