[PATCH] D44720: [clangd] Simplify fuzzy matcher (sequence alignment) by removing some condition checks.

Thu Mar 29 11:10:53 PDT 2018

sammccall added inline comments.

================
Comment at: clangd/FuzzyMatch.cpp:96
     return None;
   return ScoreScale * std::min(PerfectBonus * PatN, std::max<int>(0, Best));
 }
----------------
MaskRay wrote:
> sammccall wrote:
> > MaskRay wrote:
> > > I also don't understand why it clamps the value to zero here. Negative values are also meaningful to me. Given that perfectBonus is only 3 it is very easy to get a negative value here.
> > An important part of the contract of `match()` is that it returns a value in `[0,1]`.
> > We rely on this range to combine this with other scoring signals - we multiply this by a quality signal in code completion.
> > (Currently the quality signal is just derived from Sema, but the global index will provide the number of uses).
> > 
> > It would be possible to use a different squash function here, but I found max(kFloor,x) worked well for the examples I looked at - anything <= some floor value was "not really a useful match at all", and most of the variance below the floor seemed to be noise to me.
> > (Then I tuned the bonuses/penalties so the floor was at zero)
> We could try other criteria in the future. I believe the current one can be improved because negative scores may be returned but the scoring shouldn't return 0 for all the cases.
Sure, we can try other things, and to gather more data.
(To be clear though - with the data I *did* look at, including the scores <0 did not add more information, only noise)

================
Comment at: clangd/FuzzyMatch.cpp:230
 void FuzzyMatcher::buildGraph() {
+  Scores[0][0][Miss] = Scores[0][0][Match] = {0, Miss};
   for (int W = 0; W < WordN; ++W) {
----------------
MaskRay wrote:
> sammccall wrote:
> > why this change?
> > this has also been moved from the cheaper constructor to the more expensive per-match call. (also the diagonal assignment added in the next loop)
> > 
> > Also, shouldn't [0][0][Match] be AwfulScore?
> > 
> "The more expensive per-match call" is just two value assignments.
> 
> I have removed the expensive table initialization in the constructor.
> 
> [0][0][*] can be any value.
> "The more expensive per-match call" is just two value assignments.
Oops, sorry - by "more expensive" I mean "called thousands of times more often".

> I have removed the expensive table initialization in the constructor.
I don't want to be rude, but I asked why you changed this, and you didn't answer. Unless there's a strong reason, I'd prefer to revert this change, as I find this harder to reason about.
(Roughly: in the old version of the code, any data that didn't need to change for the life of the object was initialized in the constructor. That way I didn't need to worry what was performance-critical and what wasn't - match() only did what was strictly necessary).

> [0][0][*] can be any value.
Can you please explain why?

================
Comment at: clangd/FuzzyMatch.cpp:325
+  int W = PatN;
+  for (int I = PatN; ++I <= WordN; )
+    if (Scores[PatN][I][Match].Score > Scores[PatN][W][Match].Score)
----------------
MaskRay wrote:
> sammccall wrote:
> > nit: I -> P, move increment to the increment expression of the for loop?
> I -> P.
> 
> > move increment to the increment expression of the for loop?
> 
> Not sure about the coding standard here, but if you insist I'll have to change it as you are the reviewer. If the loop variable was an iterator, `for (It I = std::next(...); I != E; ++I)` would be uglier than `for (It I = ...; ++I != E; )`
Uglier is subjective, but side-effects in the condition of a for-loop is sufficiently unusual and surprising that I'd prefer to avoid it in both cases.

================
Comment at: clangd/FuzzyMatch.cpp:340
+  A[WordN] = Miss;
+  for (int I = WordN, P = PatN; I > 0; I--) {
+    if (I == W)
----------------
MaskRay wrote:
> sammccall wrote:
> > sammccall wrote:
> > > sammccall wrote:
> > > > W is the right name in this file for a variable iterating over word indices, please don't change this.
> > > > The new variable above could be EndW or so?
> > > As far as I can see, this loop is setting `A[W+1:...] = Miss` and populating `A[0...W]` with the exsting logic.
> > > I think this would be clearer as two loops, currently there's a lot of conditionals around Last that obscure what's actually happening.
> > You've shifted P (and the old W, new I) by 1. This does reduce the number of +1 and -1 in this function, but it's inconsistent with how these are used elsewhere: P should be the index into Pat of the character that we're considering.
> I don't understand the rationale not to use the shifted indices. The code actually use `Scores[P][W][*]` to mean the optimal match of the first `P` characters of the pattern with the first `W` characters of the word, not the position of the character.
> 
> On the other hand, C++ reverse iterators use the shifted one for `for (I = rend(); I != rbegin(); ++I)`. The shifted one makes ending condition check easier.
> I don't understand the rationale not to use the shifted indices
The rationale is entirely consistency with the surrounding code. The consistency helps avoid off-by-one errors when similar loops have different conventions.

In this file, when looping over word or pattern dimensions, P and W respectively are used for loop variables, and can be interpreted as indices into Pat/Word.
Here the interpretation would be "did we match or miss character Word[W]"

================
Comment at: clangd/FuzzyMatch.cpp:93
+  for (int I = PatN; I <= WordN; I++)
+    Best = std::max(Best, Scores[PatN][I][Match].Score);
   if (isAwful(Best))
----------------
MaskRay wrote:
> sammccall wrote:
> > MaskRay wrote:
> > > sammccall wrote:
> > > > this looks like a behavior change - why?
> > > This is a behavior change. Instead of choosing between `Match/Miss` in the last position, we enumerate the last matching position in `Word`.
> > > 
> > > This saves `if (P < PatN - 1) {` check in the main loop at the cost of a for loop here (use sites of ending values)
> > Ah, I see - the case where we match only part of the word is handled up here now.
> > (I think you mean this is not a behavior change? The result is the same AFAICS)
> > 
> > That does make more sense, but it's pretty subtle.
> > Can you add a comment like
> >  `// The pattern doesn't have to match the whole word (but the whole pattern must match).`
> Added
> ```
>   // Find the optimal prefix of Word to match Pattern.
> ```
> 
> I meant this is a behavior change but it makes the first row and the rest rows of the score table more consistent.
That comment really doesn't capture what's significant about this line - it's the *policy*, rather than the mechanism, that needs highlighting here.

(Re: behavior change - I *think* there's no inputs for which we produce a different match result/score because of this patch, right?)

================
Comment at: clangd/FuzzyMatch.h:61
   bool allowMatch(int P, int W) const;
-  int skipPenalty(int W, Action Last) const;
-  int matchBonus(int P, int W, Action Last) const;
+  int missScore(int W, Action Last) const;
+  int matchScore(int P, int W, Action Last) const;
----------------
MaskRay wrote:
> MaskRay wrote:
> > sammccall wrote:
> > > FWIW, I don't think this is an improvement - I think the clarity of purpose in names is more useful than having consistent signs in this case.
> > Keep `matchBonus` but rename `skipPenalty` to `missPenalty` ?
> Also note in the original scheme, the skip score does not need to be negative. Because Scores[PatN][WordN][] is used and each path takes the same number of positions (WordN). We can add an offset to all positional scores to make all of them non-negative. In the new scheme it does make sense to make them negative, as each path has now different number of positions.
missPenalty SGTM.

(I don't see any particular reason to avoid negative numbers here - it has a natural interpretation: a positive increment means the match is better than if that action wasn't taken, negative means it's worse, etc)

Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D44720