[PATCH] D40060: [clangd] Fuzzy match scorer
Manuel Klimek via Phabricator via cfe-commits
cfe-commits at lists.llvm.org
Thu Nov 23 01:43:40 PST 2017
klimek added inline comments.
================
Comment at: clangd/FuzzyMatch.cpp:69
+ : NPat(std::min<int>(MaxPat, Pattern.size())), NWord(0),
+ ScoreScale(0.5f / NPat) {
+ memcpy(Pat, Pattern.data(), NPat);
----------------
Why .5?
================
Comment at: clangd/FuzzyMatch.cpp:88
+ return None;
+ return ScoreScale * std::min(2 * NPat, std::max<int>(0, Score[NPat][NWord]));
+}
----------------
Why 2 * NPat?
================
Comment at: clangd/FuzzyMatch.cpp:92
+// Segmentation of words and patterns
+// We mark each character as the Head or Tail of a segment, or a Separator.
+// e.g. XMLHttpRequest_Async
----------------
Do you mean "part of the Head or Tail"?
Also, explain that these are the CharRoles. A reader reads this first, and will search for what CharRole means in the code later. CharRole is defined in a different file, without comments, so figuring out how that all relates is super hard :)
================
Comment at: clangd/FuzzyMatch.cpp:101-103
+// 2. A characters segment role can be determined by the Types of itself and
+// its neighbors. e.g. the middle entry in (Upper, Upper, Lower) is a Head.
+// The Empty type is used to represent the start/end of string.
----------------
I think this is the only place where the roles are hinted at. Explain what roles mean and what we need them for.
================
Comment at: clangd/FuzzyMatch.cpp:108
+namespace {
+enum CharType { Empty, Lower, Upper, Punctuation };
+// 4 packed CharTypes per entry, indexed by char.
----------------
I'd spell out the numbers, as they are important (here and for CharRole).
================
Comment at: clangd/FuzzyMatch.cpp:110
+// 4 packed CharTypes per entry, indexed by char.
+constexpr uint8_t CharTypes[] = {
+ 0x00, 0x00, 0x00, 0x00, // Control characters
----------------
Finding bugs in these will be hard :)
================
Comment at: clangd/FuzzyMatch.cpp:120
+ 0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55, // Bytes over 127 -> Lower.
+ 0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55, // This includes UTF-8.
+ 0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55,
----------------
Can you expand in the comment why this works for utf-8?
================
Comment at: clangd/FuzzyMatch.cpp:137
+} // namespace
+void FuzzyMatcher::calculateRoles(const char *Text, CharRole *Out, int N) {
+ int Types = PackedLookup<CharType>(CharTypes, Text[0]);
----------------
The body of this needs more comments on what it does. I can slowly figure it out by doing bit math, but it should be spelled out what's expected to be in each value at each point.
================
Comment at: clangd/FuzzyMatch.cpp:207
+
+ScoreT FuzzyMatcher::matchBonus(int P, int W) {
+ // Forbidden: matching the first pattern character in the middle of a segment.
----------------
Perhaps add assert(LPat[P] == LWord[W]);
================
Comment at: clangd/FuzzyMatch.cpp:212
+ ScoreT S = 0;
+ // Bonus: pattern is part of a word prefix.
+ if (P == W)
----------------
Why does P == W imply that?
================
Comment at: clangd/FuzzyMatch.cpp:215
+ ++S;
+ // Bonus: case matches, or an asserted word break matches an actual.
+ if (Pat[P] == Word[W] || (PatRole[P] == Head && WordRole[W] == Head))
----------------
This is the first time the term "asserted word break" shows up, perhaps explain this when explaining the roles.
================
Comment at: clangd/FuzzyMatch.cpp:218
+ ++S;
+ // Penalty: matching inside a word where the previous didn't match.
+ if (WordRole[W] == Tail && P && !Matched[P - 1][W - 1])
----------------
The previous what didn't match?
================
Comment at: clangd/FuzzyMatch.h:31
+public:
+ FuzzyMatcher(llvm::StringRef Pattern);
+
----------------
Document that patterns larger than MaxPat will be silently cut.
================
Comment at: clangd/FuzzyMatch.h:39
+
+private:
+ constexpr static int MaxPat = 63;
----------------
I find most of the abbreviations here non-intuitive, and thus needing comments (Pat I get is for pattern :)
N - what does it mean? Number of characters in the pattern? I'd use Length instead.
LPat and LWord (no idea what L could stand for).
================
Comment at: clangd/FuzzyMatch.h:50
+
+ int NPat, NWord;
+ char Pat[MaxPat];
----------------
I'd use a StringRef instead, and call the storage *Storage or something.
================
Comment at: clangd/FuzzyMatch.h:60
+ float ScoreScale;
+ bool IsSubstring;
+};
----------------
Comment that this is not actually used inside the algorithm, just for debugging.
https://reviews.llvm.org/D40060
More information about the cfe-commits
mailing list