[PATCH] D27188: Use trigrams to speed up SpecialCaseList.

Peter Collingbourne via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Nov 30 16:54:28 PST 2016


pcc added inline comments.


================
Comment at: lib/Support/SpecialCaseList.cpp:67
+      Cnt++;
+      if (!Was.count(Tri)) {
+        Index[Tri].push_back(I);
----------------
krasin wrote:
> krasin wrote:
> > pcc wrote:
> > > krasin wrote:
> > > > pcc wrote:
> > > > > I wonder how important it is to test `Was` here. It looks like things would still work if you allow trigrams to have multiple identical indices, and as you mention in the comment above it isn't worth worrying about popular trigrams.
> > > > Actually, pretty important. Consider a test case like the following:
> > > > 
> > > > blablablablabla
> > > > blabla
> > > > 
> > > > If we don't use Was, the index will look the following:
> > > > 
> > > > "bla" -> [0, 0, 0, 0]
> > > > "lab" -> [0, 0, 0, 0]
> > > > "abl" -> [0, 0, 0, 0]
> > > > 
> > > > and counts: [13, 0]. One of the counts is zero => check defeated => all queries go through a regex => SLOW.
> > > > 
> > > > In the current scheme the index will be:
> > > > "bla" -> [0, 1]
> > > > "lab" -> [0, 1]
> > > > "abl" -> [0, 1]
> > > > and counts: [13, 4]. That gives a way to filter out all queries without these trigrams.
> > > > 
> > > > A second issue is that when we count trigrams at eval time, duplicates will increase per-rule counts (while we still have non-inflated limits) => regexps are called much more often => SLOW
> > > > 
> > > > Unfortunately, while I have a test case like that, it does not really have a way to know if the check is defeated or not. I only test for correctness. It means that speed regressions like I have explained above are still possible without breaking any tests.
> > > > 
> > > > And I will need a separate header for the TrigramIndex to make a more granular unit test possible. I would be willing to extract it, if you think that is a good idea (as usual, I don't feel the conventions in the LLVM projects).
> > > Okay, I understand why this is necessary now; I hadn't had the insight that duplicate indices would effectively square the per-rule count.
> > > 
> > > As for your first point: it would seem that the index would be defeated under the current implementation with something like this:
> > > 
> > > blabla
> > > blablabla
> > > blablablabla
> > > blablablablabla
> > > blablablablablabla
> > > 
> > > or perhaps more realistically, something like:
> > > 
> > > class1.h
> > > class2.h
> > > class3.h
> > > class4.h
> > > class.h
> > > 
> > > It isn't clear how likely this would be in practice, though, and I suspect that you could mitigate against the more realistic scenarios by sorting the patterns by length beforehand.
> > > 
> > > Regarding testing: yes, we generally tend to extract algorithmically tricky classes and write unit tests for them (see the tests for CFI and devirtualization internals in llvm/unittests/Transforms/IPO).
> > Actually, in 
> > 
> > class1.h
> > class2.h
> > class3.h
> > class4.h
> > 
> > we have trigrams ss1, ss2, ss3, ss4 which will help us to keep it working.
> > 
> > Realistically, the current implementation works for CFI blacklist and almost works for UBSanVptr blacklist in Chromium. The latter fails due to using \+ in the regex, and I hope to address escaping in the next CL.
> By "fails" I mean that the corresponding TrigramIndex is defeated. The blacklist still works correctly, just slower.
Unless I'm mistaken, after adding class[1-4].h, the trigrams would look like this:

"cla" -> [0, 1, 2, 3]
"las" -> [0, 1, 2, 3]
"ass" -> [0, 1, 2, 3]
"ss1" -> [0]
"ss2" -> [1]
"ss3" -> [2]
"ss4" -> [3]

When we add "class.h", we will try to add only the trigrams "cla", "las" and "ass", all of which would fail due to exceeding the bound, no?


https://reviews.llvm.org/D27188





More information about the llvm-commits mailing list