[PATCH] D27188: Use trigrams to speed up SpecialCaseList.

Wed Nov 30 16:49:27 PST 2016

krasin marked an inline comment as done.
krasin added a comment.

Please, take another look.

================
Comment at: lib/Support/Regex.cpp:199
+  // Check for regex metacharacters other than '*' and '.'.
+  return Str.find_first_of(RegexAdvancedMetachars) == StringRef::npos;
+}
----------------
pcc wrote:
> krasin wrote:
> > pcc wrote:
> > > Does this belong in `Regex`? It feels more like an implementation detail of `TrigramIndex`.
> > It belongs to Regex, because it uses RegexAdvancedMetachars constant that is very similar to RegexMetachars. If one is updated, another must be updated as well.
> > 
> > Also, isLiteralERE and isSimpleWildcard are very similar in meaning. If one lives in Regex, why not another?
> I doubt that we'd actually be changing the definition of what an ERE is, the specification comes from POSIX and people are relying on the current semantics. On the other hand, the set of metacharacters for isSimpleWildcard is a property of how powerful TrigramIndex is, and I'd certainly expect it to change if we made TrigramIndex more powerful.
> 
> Besides, only SpecialCaseList mangles * into .*, which would seem to be critical to how we define this character list (i.e. you wouldn't be able to extract the trigram "abc" from the pattern "abc*" without this mangling).
>On the other hand, the set of metacharacters for isSimpleWildcard is a property of how powerful TrigramIndex is, and I'd certainly expect it to change if we made TrigramIndex more powerful.

This is a good argument. I buy it. Done.

================
Comment at: lib/Support/SpecialCaseList.cpp:67
+      Cnt++;
+      if (!Was.count(Tri)) {
+        Index[Tri].push_back(I);
----------------
pcc wrote:
> krasin wrote:
> > pcc wrote:
> > > I wonder how important it is to test `Was` here. It looks like things would still work if you allow trigrams to have multiple identical indices, and as you mention in the comment above it isn't worth worrying about popular trigrams.
> > Actually, pretty important. Consider a test case like the following:
> > 
> > blablablablabla
> > blabla
> > 
> > If we don't use Was, the index will look the following:
> > 
> > "bla" -> [0, 0, 0, 0]
> > "lab" -> [0, 0, 0, 0]
> > "abl" -> [0, 0, 0, 0]
> > 
> > and counts: [13, 0]. One of the counts is zero => check defeated => all queries go through a regex => SLOW.
> > 
> > In the current scheme the index will be:
> > "bla" -> [0, 1]
> > "lab" -> [0, 1]
> > "abl" -> [0, 1]
> > and counts: [13, 4]. That gives a way to filter out all queries without these trigrams.
> > 
> > A second issue is that when we count trigrams at eval time, duplicates will increase per-rule counts (while we still have non-inflated limits) => regexps are called much more often => SLOW
> > 
> > Unfortunately, while I have a test case like that, it does not really have a way to know if the check is defeated or not. I only test for correctness. It means that speed regressions like I have explained above are still possible without breaking any tests.
> > 
> > And I will need a separate header for the TrigramIndex to make a more granular unit test possible. I would be willing to extract it, if you think that is a good idea (as usual, I don't feel the conventions in the LLVM projects).
> Okay, I understand why this is necessary now; I hadn't had the insight that duplicate indices would effectively square the per-rule count.
> 
> As for your first point: it would seem that the index would be defeated under the current implementation with something like this:
> 
> blabla
> blablabla
> blablablabla
> blablablablabla
> blablablablablabla
> 
> or perhaps more realistically, something like:
> 
> class1.h
> class2.h
> class3.h
> class4.h
> class.h
> 
> It isn't clear how likely this would be in practice, though, and I suspect that you could mitigate against the more realistic scenarios by sorting the patterns by length beforehand.
> 
> Regarding testing: yes, we generally tend to extract algorithmically tricky classes and write unit tests for them (see the tests for CFI and devirtualization internals in llvm/unittests/Transforms/IPO).
Actually, in 

class1.h
class2.h
class3.h
class4.h

we have trigrams ss1, ss2, ss3, ss4 which will help us to keep it working.

Realistically, the current implementation works for CFI blacklist and almost works for UBSanVptr blacklist in Chromium. The latter fails due to using \+ in the regex, and I hope to address escaping in the next CL.

================
Comment at: lib/Support/SpecialCaseList.cpp:52
+  std::vector<unsigned> Counts;
+  // Index holds a list of rules indices for each trigram.
+  // If a trigram is too common (>4 rules with it), we stop tracking it,
----------------
pcc wrote:
> This should specify that these are indices into Counts.
Sorry, I have lost the context here after moving the code around. Probably addressed, but I am not sure.

https://reviews.llvm.org/D27188