[PATCH] D40060: [clangd] Fuzzy match scorer

Wed Nov 29 03:09:01 PST 2017

ilya-biryukov added inline comments.

================
Comment at: clangd/FuzzyMatch.cpp:118
+    0x00, 0x00, 0x00, 0x00, // Control characters
+    0xff, 0xff, 0xff, 0xff, // Punctuation
+    0x55, 0x55, 0xf5, 0xff, // Numbers->Lower, more Punctuation.
----------------
I'm not sure if we care, but maybe we should treat `+`, `-` and other symbols that could be in operator names (e.g., `operator +`) differently for C++.
Could also make sense for other languages with overloaded operators.

================
Comment at: clangd/FuzzyMatch.cpp:244
+  // Forbidden: matching the first pattern character in the middle of a segment.
+  if (!P && WordRole[W] == Tail)
+    return AwfulScore;
----------------
Maybe use `P == 0` instead? It's equivalent, but a bit easier to read if you think of `P` as an offset.
Totally subjective, though, it's fine to have it either way.

================
Comment at: clangd/FuzzyMatch.cpp:245
+  if (!P && WordRole[W] == Tail)
+    return AwfulScore;
+  ScoreT S = 0;
----------------
Does it mean I will get no matches in the following situation?

`Items = [printf, scanf]`
`Pattern = f`

It may be a bit confusing, since I do have a match, even though is terrible and it's ok to put those items very low in the list.
A more real example is:
`Items = [fprintf, fscanf]`
`Pattern = print`

Would `fprintf` match in that case? I think it should.

Another important one:
`Items = [istream, ostream]`
`Pattern = stream`

================
Comment at: clangd/FuzzyMatch.h:31
+public:
+  // Characters beyond MaxPat are ignored.
+  FuzzyMatcher(llvm::StringRef Pattern);
----------------
Maybe move the truncating logic into the clients? The users of this code are certainly better suited to report warnings/reject requests that are too large.

================
Comment at: clangd/FuzzyMatch.h:53
+
+  int PatN, WordN;                               // Length of pattern and word.
+  char Pat[MaxPat], Word[MaxWord];               // Raw pattern and word data.
----------------
Maybe we could split the data we store into two sections:
1. Pattern-specific data. Initialized on construction, never changed later.
2. Per-match data. Initialized per `match()` call.

Otherwise it is somewhat hard to certify whether everything is being initialized properly.

https://reviews.llvm.org/D40060