[PATCH] D44720: [clangd] Simplify fuzzy matcher (sequence alignment) by removing some condition checks.

Sam McCall via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Wed Mar 28 17:59:28 PDT 2018


sammccall added a comment.

Sorry for the delay, it took me a while to understand exactly what everything is doing.
If I understand right, there's actually no functional change (to match logic or scoring) being proposed here.
But some nice fixes indeed!

Most of the comments are readability nits. The existing code is pretty dense and hard to follow (my fault, thanks for picking through it!) so I might have misunderstood some things.



================
Comment at: clangd/FuzzyMatch.cpp:93
+  for (int I = PatN; I <= WordN; I++)
+    Best = std::max(Best, Scores[PatN][I][Match].Score);
   if (isAwful(Best))
----------------
MaskRay wrote:
> sammccall wrote:
> > this looks like a behavior change - why?
> This is a behavior change. Instead of choosing between `Match/Miss` in the last position, we enumerate the last matching position in `Word`.
> 
> This saves `if (P < PatN - 1) {` check in the main loop at the cost of a for loop here (use sites of ending values)
Ah, I see - the case where we match only part of the word is handled up here now.
(I think you mean this is not a behavior change? The result is the same AFAICS)

That does make more sense, but it's pretty subtle.
Can you add a comment like
 `// The pattern doesn't have to match the whole word (but the whole pattern must match).`


================
Comment at: clangd/FuzzyMatch.cpp:96
     return None;
   return ScoreScale * std::min(PerfectBonus * PatN, std::max<int>(0, Best));
 }
----------------
MaskRay wrote:
> I also don't understand why it clamps the value to zero here. Negative values are also meaningful to me. Given that perfectBonus is only 3 it is very easy to get a negative value here.
An important part of the contract of `match()` is that it returns a value in `[0,1]`.
We rely on this range to combine this with other scoring signals - we multiply this by a quality signal in code completion.
(Currently the quality signal is just derived from Sema, but the global index will provide the number of uses).

It would be possible to use a different squash function here, but I found max(kFloor,x) worked well for the examples I looked at - anything <= some floor value was "not really a useful match at all", and most of the variance below the floor seemed to be noise to me.
(Then I tuned the bonuses/penalties so the floor was at zero)


================
Comment at: clangd/FuzzyMatch.cpp:97
     return None;
-  if (!PatN)
-    return 1;
----------------
I'd prefer to keep this - the empty pattern case is very common, and buildGraph() isn't trivially cheap in this case.


================
Comment at: clangd/FuzzyMatch.cpp:174
                                   int N) {
-  assert(N > 0);
+  if (!N)
+    return;
----------------
Why this change? Previously this check was dynamic at the callsite in the constructor (which is cold), and omitted in the call to init() which is relatively hot.

Generally, here we expect the constructor to be called once per request and match() to be called thousands of times, so it's ok to do some wasteful initialization/work in the constructor, but we should avoid it on the match() path.


================
Comment at: clangd/FuzzyMatch.cpp:209
   std::copy(NewWord.begin(), NewWord.begin() + WordN, Word);
-  if (PatN == 0)
-    return true;
----------------
similarly this one.
(ideally we wouldn't do the work above, it's just there to make dumpLast work I think)


================
Comment at: clangd/FuzzyMatch.cpp:230
 void FuzzyMatcher::buildGraph() {
+  Scores[0][0][Miss] = Scores[0][0][Match] = {0, Miss};
   for (int W = 0; W < WordN; ++W) {
----------------
why this change?
this has also been moved from the cheaper constructor to the more expensive per-match call. (also the diagonal assignment added in the next loop)

Also, shouldn't [0][0][Match] be AwfulScore?



================
Comment at: clangd/FuzzyMatch.cpp:325
+  int W = PatN;
+  for (int I = PatN; ++I <= WordN; )
+    if (Scores[PatN][I][Match].Score > Scores[PatN][W][Match].Score)
----------------
nit: I -> P, move increment to the increment expression of the for loop?


================
Comment at: clangd/FuzzyMatch.cpp:340
+  A[WordN] = Miss;
+  for (int I = WordN, P = PatN; I > 0; I--) {
+    if (I == W)
----------------
W is the right name in this file for a variable iterating over word indices, please don't change this.
The new variable above could be EndW or so?


================
Comment at: clangd/FuzzyMatch.cpp:340
+  A[WordN] = Miss;
+  for (int I = WordN, P = PatN; I > 0; I--) {
+    if (I == W)
----------------
sammccall wrote:
> W is the right name in this file for a variable iterating over word indices, please don't change this.
> The new variable above could be EndW or so?
As far as I can see, this loop is setting `A[W+1:...] = Miss` and populating `A[0...W]` with the exsting logic.
I think this would be clearer as two loops, currently there's a lot of conditionals around Last that obscure what's actually happening.


================
Comment at: clangd/FuzzyMatch.cpp:340
+  A[WordN] = Miss;
+  for (int I = WordN, P = PatN; I > 0; I--) {
+    if (I == W)
----------------
sammccall wrote:
> sammccall wrote:
> > W is the right name in this file for a variable iterating over word indices, please don't change this.
> > The new variable above could be EndW or so?
> As far as I can see, this loop is setting `A[W+1:...] = Miss` and populating `A[0...W]` with the exsting logic.
> I think this would be clearer as two loops, currently there's a lot of conditionals around Last that obscure what's actually happening.
You've shifted P (and the old W, new I) by 1. This does reduce the number of +1 and -1 in this function, but it's inconsistent with how these are used elsewhere: P should be the index into Pat of the character that we're considering.


================
Comment at: clangd/FuzzyMatch.cpp:359
   }
-  if (A[WordN - 1] == Match)
-    Result.push_back(']');
 
   for (char C : StringRef(Word, WordN))
----------------
now that the end of the word could be anywhere, it might be nice to add an arrowhead `^' below the table pointing at it :) up to you


Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D44720





More information about the cfe-commits mailing list