[PATCH] D73776: Entropic: Boosting LibFuzzer Performance
marcel via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Sat Feb 1 18:52:26 PST 2020
marcel marked 10 inline comments as done.
marcel added a subscriber: Valentin.
marcel added a comment.
Thanks so much for your feedback!
1. We maintain feature abundance counts globally (for the entire fuzzing campaign) and locally (for each `InputInfo`).
- Whenever an input generated by fuzzing `II` executes feature `f`, we increment the global abundance count for `f` as well as the local abundance count for `f` @ `II`.
- The //global abundance counts// are stored in `FeaturesFreqMap`, an array of size `kFeatureSetSize`.
- It requires one read/write for *each* `UpdateFeatureFrequency` which is very hot (Line 331). To maximize efficiency, I decided to use a fixed-size array. There is only one global map. So, even though the map is fairly sparse, memory overhead might be okay.
- Currently, each element is of `size_t` because a feature might be executed as often as new inputs are generated. The total number of inputs generated (`g_NumExecutedMutations`) will likely not fit into `uint32_t`.
- However, if memory footprint is a concern over efficiency, we might even go down as much as `uint16_t` since only abundance counts below `MostAbundant_RareFeature` are relevant. This adds an overflow check to each (hot!) write (Line 331).
- The //local abundance counts// are stored with each `InputInfo` in `RareFeaturesFreqMap`, and `unordered_map<size_t,size_t>`
- It is written to in `UpdateFeatureFrequency` but only if the current feature is globally rare. In most cases `UpdateFeatureFrequency` returns in Line 337 before the local feature count is updated.
- It is read when we iterate over all keys (and values) to compute the energy for an input in `UpdateEnergy`.
- We can go faster with other hashmaps. Most efficient would be a fixed size array that is simply indexed with the feature ID (like the global abundance counts). However, the arrays are very sparse and the memory footprint should be too high.
- There is also some reading of the global and local abundance counts in `AddRareFeature` (Line 244), but that should happen quite rarely.
2. How do we decide what is considered a globally rare feature?
- Currently, we fixed this definition quite arbitrary as: {the top 100 least abundant features} UNION {all features with abundance below 0xFF}.
- When a feature becomes abundant, it is removed globally and from each `II` locally. This is handled in `AddRareFeature` (which also updates `MostAbundant_RareFeature`)
- If you allow too many features to be considered as rare, you will spend more time computing the energy for a seed.
- If you allow too few features to considered as rare, your performance gain from Entropic might decrease.
- Not sure whether we already hit the sweet spot. We can pull these out as command line options to see what works.
In addition, we will
- add a command line flag to enable entropic for libfuzzer,
- pull out the constants either as command line options or as global constants,
- turn most occurrences of `long double` to `double` (except in `UpdateEnergy` where we need the precision, I think).
- update variable names, add comments, and
- see how we can unit test our patch (@Valentin?)
Let me know what you think.
================
Comment at: compiler-rt/lib/fuzzer/FuzzerCorpus.h:46
+ long double Sum_Y;
+ std::unordered_map<size_t, size_t> RareFeaturesFreqMap;
};
----------------
kcc wrote:
> Any chance to have a planer data structure here? (sorted array or some such)
> Also, why size_t?
> The feature is 4 bytes and the frequency can probably be 4 or even 2 bytes. (2 bytes is hard for alignment).
Whether or not unorder_map, see top-level comment. If unordered_map, then uint32_t for keys and values should work.
================
Comment at: compiler-rt/lib/fuzzer/FuzzerCorpus.h:53
+public:
+ size_t g_NumExecutedMutations = 0;
InputCorpus(const std::string &OutputCorpus) : OutputCorpus(OutputCorpus) {
----------------
kcc wrote:
> why is this public?
> Also, why the g_ prefix?
`g_NumExecutedMutations` contains the total number of inputs generated. It is updated in FuzzerLoop.c in `MutateAndTestOne`. I can drop the `g_` prefix.
================
Comment at: compiler-rt/lib/fuzzer/FuzzerCorpus.h:250
+ // Firstly, remove most abundant feature until we are "under the line".
+ while (RareFeatures.size() > 100 && MostAbundant_RareFeature > 0xFF) {
+ size_t st_mostAbundant_RareFeature_Idx =
----------------
kcc wrote:
> try not to use constants like this.
> At the very least, use
> const size_t kSomething = 100;
> ..
> if (xxx.size() > kSomething)
>
> if there is any reason to play with different value, you may want to use a flag instead
>
> Similar for 0xFF.
>
> Probably, the best here is to replace these two constants with function parameters so that this function becomes unit-testable.
> Which reminds me: please consider adding unit tests to the key functionality like this one.
>
We will pull these constants out as command line options.
================
Comment at: compiler-rt/lib/fuzzer/FuzzerCorpus.h:328
+ void UpdateFeatureFrequency(InputInfo *II, size_t Idx) {
+ Idx = Idx % kFeatureSetSize;
----------------
kcc wrote:
> This is the hotspot, right?
> My guess is that using a hash map here causes most of the slowdown, need a faster data structure...
> Need to think...
> It will help if you have describe the algorithm in a comment.
Yes. This is the hot spot. However, most executions should exit in Line 337.
FeaturesFreqMap is an array. Don't think you can get much faster here.
================
Comment at: compiler-rt/lib/fuzzer/FuzzerCorpus.h:500
+ bool DistributionNeedsUpdate = true;
+ size_t FeaturesFreqMap[kFeatureSetSize];
+ Vector<size_t> RareFeatures;
----------------
kcc wrote:
> does this have to be 8-byte per feature?
Each entry is upper-bounded by the total number of generated inputs `g_NumExecutedMutations` which likely won't fit into `uint32_t`. That's why I chose size_t.
However, we really only need abundance information for features with an abundance below `MostAbundant_RareFeature`. If memory footprint is a concern, we can go down to `uint16_t` at the cost of an overflow check in the hot code in `UpdateFeatureFrequency`. See top-level comment for more details.
================
Comment at: compiler-rt/lib/fuzzer/FuzzerLoop.cpp:607
-void Fuzzer::ReportNewCoverage(InputInfo *II, const Unit &U) {
+void Fuzzer::ReportNewCoverage(bool reduced, InputInfo *II, const Unit &U) {
II->NumSuccessfullMutations++;
----------------
kcc wrote:
> do you need this change?
Unrelated. This is just fixing a problem where LibFuzzer prints REDUCE more often than it should.
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D73776/new/
https://reviews.llvm.org/D73776
More information about the llvm-commits
mailing list