[PATCH] D27188: Use trigrams to speed up SpecialCaseList.

Wed Nov 30 12:17:24 PST 2016

krasin added a comment.

Please, take another look.

================
Comment at: lib/Support/Regex.cpp:199
+  // Check for regex metacharacters other than '*' and '.'.
+  return Str.find_first_of(RegexAdvancedMetachars) == StringRef::npos;
+}
----------------
pcc wrote:
> Does this belong in `Regex`? It feels more like an implementation detail of `TrigramIndex`.
It belongs to Regex, because it uses RegexAdvancedMetachars constant that is very similar to RegexMetachars. If one is updated, another must be updated as well.

Also, isLiteralERE and isSimpleWildcard are very similar in meaning. If one lives in Regex, why not another?

================
Comment at: lib/Support/SpecialCaseList.cpp:34
+  std::vector<int> Counts;
+  std::unordered_map<unsigned, std::vector<size_t>> Index;
+
----------------
pcc wrote:
> Maybe use a SmallVector instead of std::vector since the size is bounded?
> 
> This code also needs more comments. Please at least leave comments for TrigramIndex and its fields.
SmallVector: done
Comments to TrigramIndex and its fields: done.

================
Comment at: lib/Support/SpecialCaseList.cpp:47
+    std::set<unsigned> Was;
+    size_t I = Counts.size();
+    unsigned Tri = 0;
----------------
pcc wrote:
> Maybe fold this into the use? I found it a little confusing to use a variable named "I" for something other than a loop counter.
Inlined and added a comment.

================
Comment at: lib/Support/SpecialCaseList.cpp:67
+      Cnt++;
+      if (!Was.count(Tri)) {
+        Index[Tri].push_back(I);
----------------
pcc wrote:
> I wonder how important it is to test `Was` here. It looks like things would still work if you allow trigrams to have multiple identical indices, and as you mention in the comment above it isn't worth worrying about popular trigrams.
Actually, pretty important. Consider a test case like the following:

blablablablabla
blabla

If we don't use Was, the index will look the following:

"bla" -> [0, 0, 0, 0]
"lab" -> [0, 0, 0, 0]
"abl" -> [0, 0, 0, 0]

and counts: [13, 0]. One of the counts is zero => check defeated => all queries go through a regex => SLOW.

In the current scheme the index will be:
"bla" -> [0, 1]
"lab" -> [0, 1]
"abl" -> [0, 1]
and counts: [13, 4]. That gives a way to filter out all queries without these trigrams.

A second issue is that when we count trigrams at eval time, duplicates will increase per-rule counts (while we still have non-inflated limits) => regexps are called much more often => SLOW

Unfortunately, while I have a test case like that, it does not really have a way to know if the check is defeated or not. I only test for correctness. It means that speed regressions like I have explained above are still possible without breaking any tests.

And I will need a separate header for the TrigramIndex to make a more granular unit test possible. I would be willing to extract it, if you think that is a good idea (as usual, I don't feel the conventions in the LLVM projects).

https://reviews.llvm.org/D27188