[PATCH] D40060: [clangd] Fuzzy match scorer

Manuel Klimek via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Thu Nov 23 01:43:40 PST 2017


klimek added inline comments.


================
Comment at: clangd/FuzzyMatch.cpp:69
+    : NPat(std::min<int>(MaxPat, Pattern.size())), NWord(0),
+      ScoreScale(0.5f / NPat) {
+  memcpy(Pat, Pattern.data(), NPat);
----------------
Why .5?


================
Comment at: clangd/FuzzyMatch.cpp:88
+    return None;
+  return ScoreScale * std::min(2 * NPat, std::max<int>(0, Score[NPat][NWord]));
+}
----------------
Why 2 * NPat?


================
Comment at: clangd/FuzzyMatch.cpp:92
+// Segmentation of words and patterns
+// We mark each character as the Head or Tail of a segment, or a Separator.
+// e.g. XMLHttpRequest_Async
----------------
Do you mean "part of the Head or Tail"?
Also, explain that these are the CharRoles. A reader reads this first, and will search for what CharRole means in the code later. CharRole is defined in a different file, without comments, so figuring out how that all relates is super hard :)


================
Comment at: clangd/FuzzyMatch.cpp:101-103
+//   2. A characters segment role can be determined by the Types of itself and
+//      its neighbors. e.g. the middle entry in (Upper, Upper, Lower) is a Head.
+//      The Empty type is used to represent the start/end of string.
----------------
I think this is the only place where the roles are hinted at. Explain what roles mean and what we need them for.


================
Comment at: clangd/FuzzyMatch.cpp:108
+namespace {
+enum CharType { Empty, Lower, Upper, Punctuation };
+// 4 packed CharTypes per entry, indexed by char.
----------------
I'd spell out the numbers, as they are important (here and for CharRole).


================
Comment at: clangd/FuzzyMatch.cpp:110
+// 4 packed CharTypes per entry, indexed by char.
+constexpr uint8_t CharTypes[] = {
+    0x00, 0x00, 0x00, 0x00, // Control characters
----------------
Finding bugs in these will be hard :)


================
Comment at: clangd/FuzzyMatch.cpp:120
+    0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55, // Bytes over 127 -> Lower.
+    0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55, // This includes UTF-8.
+    0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55,
----------------
Can you expand in the comment why this works for utf-8?


================
Comment at: clangd/FuzzyMatch.cpp:137
+} // namespace
+void FuzzyMatcher::calculateRoles(const char *Text, CharRole *Out, int N) {
+  int Types = PackedLookup<CharType>(CharTypes, Text[0]);
----------------
The body of this needs more comments on what it does. I can slowly figure it out by doing bit math, but it should be spelled out what's expected to be in each value at each point.


================
Comment at: clangd/FuzzyMatch.cpp:207
+
+ScoreT FuzzyMatcher::matchBonus(int P, int W) {
+  // Forbidden: matching the first pattern character in the middle of a segment.
----------------
Perhaps add assert(LPat[P] == LWord[W]);


================
Comment at: clangd/FuzzyMatch.cpp:212
+  ScoreT S = 0;
+  // Bonus: pattern is part of a word prefix.
+  if (P == W)
----------------
Why does P == W imply that?


================
Comment at: clangd/FuzzyMatch.cpp:215
+    ++S;
+  // Bonus: case matches, or an asserted word break matches an actual.
+  if (Pat[P] == Word[W] || (PatRole[P] == Head && WordRole[W] == Head))
----------------
This is the first time the term "asserted word break" shows up, perhaps explain this when explaining the roles.


================
Comment at: clangd/FuzzyMatch.cpp:218
+    ++S;
+  // Penalty: matching inside a word where the previous didn't match.
+  if (WordRole[W] == Tail && P && !Matched[P - 1][W - 1])
----------------
The previous what didn't match?


================
Comment at: clangd/FuzzyMatch.h:31
+public:
+  FuzzyMatcher(llvm::StringRef Pattern);
+
----------------
Document that patterns larger than MaxPat will be silently cut.


================
Comment at: clangd/FuzzyMatch.h:39
+
+private:
+  constexpr static int MaxPat = 63;
----------------
I find most of the abbreviations here non-intuitive, and thus needing comments (Pat I get is for pattern :)
N - what does it mean? Number of characters in the pattern? I'd use Length instead.
LPat and LWord (no idea what L could stand for).



================
Comment at: clangd/FuzzyMatch.h:50
+
+  int NPat, NWord;
+  char Pat[MaxPat];
----------------
I'd use a StringRef instead, and call the storage *Storage or something.


================
Comment at: clangd/FuzzyMatch.h:60
+  float ScoreScale;
+  bool IsSubstring;
+};
----------------
Comment that this is not actually used inside the algorithm, just for debugging.


https://reviews.llvm.org/D40060





More information about the cfe-commits mailing list