[PATCH] D27188: Use trigrams to speed up SpecialCaseList.

Wed Nov 30 17:07:09 PST 2016

krasin added inline comments.

================
Comment at: lib/Support/SpecialCaseList.cpp:67
+      Cnt++;
+      if (!Was.count(Tri)) {
+        Index[Tri].push_back(I);
----------------
krasin wrote:
> krasin wrote:
> > pcc wrote:
> > > krasin wrote:
> > > > krasin wrote:
> > > > > pcc wrote:
> > > > > > krasin wrote:
> > > > > > > pcc wrote:
> > > > > > > > I wonder how important it is to test `Was` here. It looks like things would still work if you allow trigrams to have multiple identical indices, and as you mention in the comment above it isn't worth worrying about popular trigrams.
> > > > > > > Actually, pretty important. Consider a test case like the following:
> > > > > > > 
> > > > > > > blablablablabla
> > > > > > > blabla
> > > > > > > 
> > > > > > > If we don't use Was, the index will look the following:
> > > > > > > 
> > > > > > > "bla" -> [0, 0, 0, 0]
> > > > > > > "lab" -> [0, 0, 0, 0]
> > > > > > > "abl" -> [0, 0, 0, 0]
> > > > > > > 
> > > > > > > and counts: [13, 0]. One of the counts is zero => check defeated => all queries go through a regex => SLOW.
> > > > > > > 
> > > > > > > In the current scheme the index will be:
> > > > > > > "bla" -> [0, 1]
> > > > > > > "lab" -> [0, 1]
> > > > > > > "abl" -> [0, 1]
> > > > > > > and counts: [13, 4]. That gives a way to filter out all queries without these trigrams.
> > > > > > > 
> > > > > > > A second issue is that when we count trigrams at eval time, duplicates will increase per-rule counts (while we still have non-inflated limits) => regexps are called much more often => SLOW
> > > > > > > 
> > > > > > > Unfortunately, while I have a test case like that, it does not really have a way to know if the check is defeated or not. I only test for correctness. It means that speed regressions like I have explained above are still possible without breaking any tests.
> > > > > > > 
> > > > > > > And I will need a separate header for the TrigramIndex to make a more granular unit test possible. I would be willing to extract it, if you think that is a good idea (as usual, I don't feel the conventions in the LLVM projects).
> > > > > > Okay, I understand why this is necessary now; I hadn't had the insight that duplicate indices would effectively square the per-rule count.
> > > > > > 
> > > > > > As for your first point: it would seem that the index would be defeated under the current implementation with something like this:
> > > > > > 
> > > > > > blabla
> > > > > > blablabla
> > > > > > blablablabla
> > > > > > blablablablabla
> > > > > > blablablablablabla
> > > > > > 
> > > > > > or perhaps more realistically, something like:
> > > > > > 
> > > > > > class1.h
> > > > > > class2.h
> > > > > > class3.h
> > > > > > class4.h
> > > > > > class.h
> > > > > > 
> > > > > > It isn't clear how likely this would be in practice, though, and I suspect that you could mitigate against the more realistic scenarios by sorting the patterns by length beforehand.
> > > > > > 
> > > > > > Regarding testing: yes, we generally tend to extract algorithmically tricky classes and write unit tests for them (see the tests for CFI and devirtualization internals in llvm/unittests/Transforms/IPO).
> > > > > Actually, in 
> > > > > 
> > > > > class1.h
> > > > > class2.h
> > > > > class3.h
> > > > > class4.h
> > > > > 
> > > > > we have trigrams ss1, ss2, ss3, ss4 which will help us to keep it working.
> > > > > 
> > > > > Realistically, the current implementation works for CFI blacklist and almost works for UBSanVptr blacklist in Chromium. The latter fails due to using \+ in the regex, and I hope to address escaping in the next CL.
> > > > By "fails" I mean that the corresponding TrigramIndex is defeated. The blacklist still works correctly, just slower.
> > > Unless I'm mistaken, after adding class[1-4].h, the trigrams would look like this:
> > > 
> > > "cla" -> [0, 1, 2, 3]
> > > "las" -> [0, 1, 2, 3]
> > > "ass" -> [0, 1, 2, 3]
> > > "ss1" -> [0]
> > > "ss2" -> [1]
> > > "ss3" -> [2]
> > > "ss4" -> [3]
> > > 
> > > When we add "class.h", we will try to add only the trigrams "cla", "las" and "ass", all of which would fail due to exceeding the bound, no?
> > This is correct, and regexp will be executed for the query. But the check will still correctly predict that we don't need to do that for queries like "foo" or "bar". So, while the check is not exact, it's not defeated either.
> actually, not. The check will not execute query for class.h. I will add a test case to demonstrate that.
What happens during class.h matching is the following: it considers "cla", "las", "ass" and makes counts for all 4 rules equal to 3. But they need to be at least 4 to trigger a check.

An example of a query that will go through is claclaass.h. In this case, the count will become 4, and isDefinitelyOut will return false.

https://reviews.llvm.org/D27188