[PATCH] D27188: Use trigrams to speed up SpecialCaseList.

Wed Nov 30 16:51:12 PST 2016

krasin added inline comments.

================
Comment at: lib/Support/SpecialCaseList.cpp:67
+      Cnt++;
+      if (!Was.count(Tri)) {
+        Index[Tri].push_back(I);
----------------
krasin wrote:
> pcc wrote:
> > krasin wrote:
> > > pcc wrote:
> > > > I wonder how important it is to test `Was` here. It looks like things would still work if you allow trigrams to have multiple identical indices, and as you mention in the comment above it isn't worth worrying about popular trigrams.
> > > Actually, pretty important. Consider a test case like the following:
> > > 
> > > blablablablabla
> > > blabla
> > > 
> > > If we don't use Was, the index will look the following:
> > > 
> > > "bla" -> [0, 0, 0, 0]
> > > "lab" -> [0, 0, 0, 0]
> > > "abl" -> [0, 0, 0, 0]
> > > 
> > > and counts: [13, 0]. One of the counts is zero => check defeated => all queries go through a regex => SLOW.
> > > 
> > > In the current scheme the index will be:
> > > "bla" -> [0, 1]
> > > "lab" -> [0, 1]
> > > "abl" -> [0, 1]
> > > and counts: [13, 4]. That gives a way to filter out all queries without these trigrams.
> > > 
> > > A second issue is that when we count trigrams at eval time, duplicates will increase per-rule counts (while we still have non-inflated limits) => regexps are called much more often => SLOW
> > > 
> > > Unfortunately, while I have a test case like that, it does not really have a way to know if the check is defeated or not. I only test for correctness. It means that speed regressions like I have explained above are still possible without breaking any tests.
> > > 
> > > And I will need a separate header for the TrigramIndex to make a more granular unit test possible. I would be willing to extract it, if you think that is a good idea (as usual, I don't feel the conventions in the LLVM projects).
> > Okay, I understand why this is necessary now; I hadn't had the insight that duplicate indices would effectively square the per-rule count.
> > 
> > As for your first point: it would seem that the index would be defeated under the current implementation with something like this:
> > 
> > blabla
> > blablabla
> > blablablabla
> > blablablablabla
> > blablablablablabla
> > 
> > or perhaps more realistically, something like:
> > 
> > class1.h
> > class2.h
> > class3.h
> > class4.h
> > class.h
> > 
> > It isn't clear how likely this would be in practice, though, and I suspect that you could mitigate against the more realistic scenarios by sorting the patterns by length beforehand.
> > 
> > Regarding testing: yes, we generally tend to extract algorithmically tricky classes and write unit tests for them (see the tests for CFI and devirtualization internals in llvm/unittests/Transforms/IPO).
> Actually, in 
> 
> class1.h
> class2.h
> class3.h
> class4.h
> 
> we have trigrams ss1, ss2, ss3, ss4 which will help us to keep it working.
> 
> Realistically, the current implementation works for CFI blacklist and almost works for UBSanVptr blacklist in Chromium. The latter fails due to using \+ in the regex, and I hope to address escaping in the next CL.
By "fails" I mean that the corresponding TrigramIndex is defeated. The blacklist still works correctly, just slower.

https://reviews.llvm.org/D27188