[PATCH] D27188: Use trigrams to speed up SpecialCaseList.

Wed Nov 30 17:59:53 PST 2016

pcc accepted this revision.
pcc added a comment.
This revision is now accepted and ready to land.

LGTM with nits

================
Comment at: include/llvm/Support/TrigramIndex.h:1
+//===-- TrigramIndex.h - optimization heuristic for SpecialCaseList ----*- C++ -*-===//
+//
----------------
80 cols

================
Comment at: include/llvm/Support/TrigramIndex.h:30-35
+#include <string>
+#include <unordered_map>
+#include <vector>
+
+#include "llvm/ADT/SmallVector.h"
+#include "llvm/ADT/StringMap.h"
----------------
LLVM headers come before system headers

================
Comment at: include/llvm/Support/TrigramIndex.h:42
+ public:
+  TrigramIndex(): Defeated(false), Counts(), Index(256) {}
+
----------------
There's no need for a constructor on this class; you can let `Counts` be default initialized and use in-class initializers for the other members.

================
Comment at: lib/Support/CMakeLists.txt:97-98
   ToolOutputFile.cpp
   Triple.cpp
+  TrigramIndex.cpp
   Twine.cpp
----------------
Sort alphabetically

================
Comment at: lib/Support/SpecialCaseList.cpp:67
+      Cnt++;
+      if (!Was.count(Tri)) {
+        Index[Tri].push_back(I);
----------------
krasin wrote:
> krasin wrote:
> > krasin wrote:
> > > pcc wrote:
> > > > krasin wrote:
> > > > > krasin wrote:
> > > > > > pcc wrote:
> > > > > > > krasin wrote:
> > > > > > > > pcc wrote:
> > > > > > > > > I wonder how important it is to test `Was` here. It looks like things would still work if you allow trigrams to have multiple identical indices, and as you mention in the comment above it isn't worth worrying about popular trigrams.
> > > > > > > > Actually, pretty important. Consider a test case like the following:
> > > > > > > > 
> > > > > > > > blablablablabla
> > > > > > > > blabla
> > > > > > > > 
> > > > > > > > If we don't use Was, the index will look the following:
> > > > > > > > 
> > > > > > > > "bla" -> [0, 0, 0, 0]
> > > > > > > > "lab" -> [0, 0, 0, 0]
> > > > > > > > "abl" -> [0, 0, 0, 0]
> > > > > > > > 
> > > > > > > > and counts: [13, 0]. One of the counts is zero => check defeated => all queries go through a regex => SLOW.
> > > > > > > > 
> > > > > > > > In the current scheme the index will be:
> > > > > > > > "bla" -> [0, 1]
> > > > > > > > "lab" -> [0, 1]
> > > > > > > > "abl" -> [0, 1]
> > > > > > > > and counts: [13, 4]. That gives a way to filter out all queries without these trigrams.
> > > > > > > > 
> > > > > > > > A second issue is that when we count trigrams at eval time, duplicates will increase per-rule counts (while we still have non-inflated limits) => regexps are called much more often => SLOW
> > > > > > > > 
> > > > > > > > Unfortunately, while I have a test case like that, it does not really have a way to know if the check is defeated or not. I only test for correctness. It means that speed regressions like I have explained above are still possible without breaking any tests.
> > > > > > > > 
> > > > > > > > And I will need a separate header for the TrigramIndex to make a more granular unit test possible. I would be willing to extract it, if you think that is a good idea (as usual, I don't feel the conventions in the LLVM projects).
> > > > > > > Okay, I understand why this is necessary now; I hadn't had the insight that duplicate indices would effectively square the per-rule count.
> > > > > > > 
> > > > > > > As for your first point: it would seem that the index would be defeated under the current implementation with something like this:
> > > > > > > 
> > > > > > > blabla
> > > > > > > blablabla
> > > > > > > blablablabla
> > > > > > > blablablablabla
> > > > > > > blablablablablabla
> > > > > > > 
> > > > > > > or perhaps more realistically, something like:
> > > > > > > 
> > > > > > > class1.h
> > > > > > > class2.h
> > > > > > > class3.h
> > > > > > > class4.h
> > > > > > > class.h
> > > > > > > 
> > > > > > > It isn't clear how likely this would be in practice, though, and I suspect that you could mitigate against the more realistic scenarios by sorting the patterns by length beforehand.
> > > > > > > 
> > > > > > > Regarding testing: yes, we generally tend to extract algorithmically tricky classes and write unit tests for them (see the tests for CFI and devirtualization internals in llvm/unittests/Transforms/IPO).
> > > > > > Actually, in 
> > > > > > 
> > > > > > class1.h
> > > > > > class2.h
> > > > > > class3.h
> > > > > > class4.h
> > > > > > 
> > > > > > we have trigrams ss1, ss2, ss3, ss4 which will help us to keep it working.
> > > > > > 
> > > > > > Realistically, the current implementation works for CFI blacklist and almost works for UBSanVptr blacklist in Chromium. The latter fails due to using \+ in the regex, and I hope to address escaping in the next CL.
> > > > > By "fails" I mean that the corresponding TrigramIndex is defeated. The blacklist still works correctly, just slower.
> > > > Unless I'm mistaken, after adding class[1-4].h, the trigrams would look like this:
> > > > 
> > > > "cla" -> [0, 1, 2, 3]
> > > > "las" -> [0, 1, 2, 3]
> > > > "ass" -> [0, 1, 2, 3]
> > > > "ss1" -> [0]
> > > > "ss2" -> [1]
> > > > "ss3" -> [2]
> > > > "ss4" -> [3]
> > > > 
> > > > When we add "class.h", we will try to add only the trigrams "cla", "las" and "ass", all of which would fail due to exceeding the bound, no?
> > > This is correct, and regexp will be executed for the query. But the check will still correctly predict that we don't need to do that for queries like "foo" or "bar". So, while the check is not exact, it's not defeated either.
> > actually, not. The check will not execute query for class.h. I will add a test case to demonstrate that.
> What happens during class.h matching is the following: it considers "cla", "las", "ass" and makes counts for all 4 rules equal to 3. But they need to be at least 4 to trigger a check.
> 
> An example of a query that will go through is claclaass.h. In this case, the count will become 4, and isDefinitelyOut will return false.
To be clear, I think it's fine if the implementation isn't perfect; this is certainly a strict improvement over the status quo. I just wanted to point out one somewhat realistic case where this might fail.

================
Comment at: lib/Support/SpecialCaseList.cpp:52
+  std::vector<unsigned> Counts;
+  // Index holds a list of rules indices for each trigram.
+  // If a trigram is too common (>4 rules with it), we stop tracking it,
----------------
krasin wrote:
> pcc wrote:
> > This should specify that these are indices into Counts.
> Sorry, I have lost the context here after moving the code around. Probably addressed, but I am not sure.
This pertained to the comment "// Index holds a list of rules indices for each trigram.". It doesn't look like you've changed this. Note that you can click the double left arrow in the top left to see comments in their original context.

================
Comment at: lib/Support/TrigramIndex.cpp:1-15
+//===-- SpecialCaseList.cpp - special case list for sanitizers ------------===//
+//
+//                     The LLVM Compiler Infrastructure
+//
+// This file is distributed under the University of Illinois Open Source
+// License. See LICENSE.TXT for details.
+//
----------------
This comment needs to be updated.

================
Comment at: lib/Support/TrigramIndex.cpp:17-22
+#include <unordered_map>
+#include <set>
+#include <string>
+
+#include "llvm/Support/TrigramIndex.h"
+#include "llvm/ADT/SmallVector.h"
----------------
LLVM headers come before system headers.

================
Comment at: lib/Support/TrigramIndex.cpp:24
+
+namespace llvm {
+
----------------
We normally write `using namespace llvm;` instead of enclosing the entire file in a namespace.

================
Comment at: unittests/Support/TrigramIndexTest.cpp:1
+//===- TrigramIndexTest.cpp - Unit tests for TrigramIndex -----------===//
+//
----------------
80 cols

================
Comment at: unittests/Support/TrigramIndexTest.cpp:10-15
+#include <string>
+#include <vector>
+
+#include "llvm/ADT/STLExtras.h"
+#include "llvm/Support/TrigramIndex.h"
+#include "gtest/gtest.h"
----------------
LLVM headers come before system headers

https://reviews.llvm.org/D27188