[PATCH] D40060: [clangd] Fuzzy match scorer

Thu Nov 30 10:46:13 PST 2017

sammccall added a comment.

Thanks @ilya-biryukov, @inspirer, @klimek for the helpful comments!

I've addressed hopefully the most important and added more rigorous testing.
Sorry for the large delta, the most invasive change was of course adding the extra dimension to the scoring table. (Which fixed a bunch of problems)

================
Comment at: clangd/FuzzyMatch.cpp:118
+    0x00, 0x00, 0x00, 0x00, // Control characters
+    0xff, 0xff, 0xff, 0xff, // Punctuation
+    0x55, 0x55, 0xf5, 0xff, // Numbers->Lower, more Punctuation.
----------------
ilya-biryukov wrote:
> I'm not sure if we care, but maybe we should treat `+`, `-` and other symbols that could be in operator names (e.g., `operator +`) differently for C++.
> Could also make sense for other languages with overloaded operators.
You might be right, but in the absence of concrete problems I think treating them as punctuation is actually the most conservative thing to do.

E.g. matching [op=] against "operator=" gets big penalties if we treat '=' as Lower, and treating it as Upper seems likely to have other weird effects... Punctuation/separators are treated pretty neutrally.

================
Comment at: clangd/FuzzyMatch.cpp:245
+  if (!P && WordRole[W] == Tail)
+    return AwfulScore;
+  ScoreT S = 0;
----------------
ilya-biryukov wrote:
> Does it mean I will get no matches in the following situation?
> 
> `Items = [printf, scanf]`
> `Pattern = f`
> 
> It may be a bit confusing, since I do have a match, even though is terrible and it's ok to put those items very low in the list.
> A more real example is:
> `Items = [fprintf, fscanf]`
> `Pattern = print`
> 
> Would `fprintf` match in that case? I think it should.
> 
> Another important one:
> `Items = [istream, ostream]`
> `Pattern = stream`
Done. VSCode will filter these out, but I agree these are important and don't seem to cause problems.

================
Comment at: clangd/FuzzyMatch.cpp:254
+  // Penalty: matching inside a segment (and previous char wasn't matched).
+  if (WordRole[W] == Tail && P && !Matched[P - 1][W - 1])
+    S -= 3;
----------------
inspirer wrote:
> You need a third boolean dimension in your DP table for this condition to work - "matches".
> 
> Consider matching "Abde" against "AbdDe". The result should be [Ab]d[De] and not [Abd]D[e]. While evaluating Abd against AbdD, you will have to choose between two ways to represent the match and no matter what you choose, scoring in this line will not know whether your previous char matched, since you merged two branches and kept only one of them.
> 
> This scoring works OK-ish since you check "if (Diag >= Left)" above and so you Matched table is full of trues, but you matches will gravitate towards the ends of the candidate string if you decide to show them in the UI.
Thank you for this! Fixed. The naming around Scores/ScoreInfo is a bit clumsy, happy to take suggestions :-(

I've also made all our tests assert the exact characters matched. We don't have an API or need this feature, but it makes the tests detect a lot more misbehavior that's hard to capture otherwise.

================
Comment at: clangd/FuzzyMatch.h:53
+
+  int PatN, WordN;                               // Length of pattern and word.
+  char Pat[MaxPat], Word[MaxWord];               // Raw pattern and word data.
----------------
ilya-biryukov wrote:
> Maybe we could split the data we store into two sections:
> 1. Pattern-specific data. Initialized on construction, never changed later.
> 2. Per-match data. Initialized per `match()` call.
> 
> Otherwise it is somewhat hard to certify whether everything is being initialized properly.
This hides the parallels between the Pattern and Word data, I'm not sure I like it better overall.

I've added a comment describing this split, reordered some variables, and renamed IsSubstring to WordContainsPattern, which I think clarifies this a bit. WDYT?

https://reviews.llvm.org/D40060