[PATCH] D49591: [clangd] Introduce search Tokens and trigram generation algorithms

Sam McCall via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Mon Jul 23 09:24:17 PDT 2018


sammccall added a comment.

Looks really nice! Only major issue is the query trigrams don't look right.
Otherwise, some style nits and fixes that seem to have gotten lost.



================
Comment at: clang-tools-extra/clangd/index/dex/Token.cpp:28
+
+llvm::StringRef Token::data() const { return Data; }
+
----------------
nit: inline these trivial accessors, so they can be inlined at the callsite


================
Comment at: clang-tools-extra/clangd/index/dex/Token.h:33
+/// constructing complex iterator trees.
+class Token {
+public:
----------------
sammccall wrote:
> As discussed offline: think we misunderstood each other about the separate struct.
> I rather meant *this* class could just be a struct.
This one doesn't seem to be done - any reason to keep the class with accessors rather than just a struct for now?


================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:35
+  llvm::DenseSet<Token> UniqueTrigrams;
+  std::vector<Token> Trigrams;
+
----------------
sammccall wrote:
> just iterate at the end and collect them? order shouldn't matter here.
this was not done
(now also in `generateQueryTrigrams`)


================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:32
+//
+// Short names (1 segment with <3 characters) are currently ignored.
+std::vector<Token> generateIdentifierTrigrams(std::string Identifier) {
----------------
nit: total segment length <3 characters?
(e.g. u_p is also ignored)


================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:38
+                 llvm::makeMutableArrayRef(Roles.data(), Identifier.size()));
+  std::transform(begin(Identifier), end(Identifier), begin(Identifier),
+                 ::tolower);
----------------
nit: seems clearer to call `tolower` inline rather than mutating the input param - "identifier" doesn't really describe its new value


================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:39
+  std::transform(begin(Identifier), end(Identifier), begin(Identifier),
+                 ::tolower);
+
----------------
nit: prefer llvm::toLower, which returns `char`


================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:94
+  calculateRoles(Query, llvm::makeMutableArrayRef(Roles.data(), Query.size()));
+  std::transform(begin(Query), end(Query), begin(Query), ::tolower);
+
----------------
(again, `toLower` and consider moving to Chars.push_back() call)


================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:105
+
+    // New segment starts with Head: flush Chars.
+    if (Roles[I] == Head)
----------------
i.e. you don't want this


================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.h:37
+/// First, given Identifier (unqualified symbol name) is segmented using
+/// calculateRoles and downcasted to lowercase. After segmentation, the
+/// following technique is applied for generating trigrams: for each letter or
----------------
nit: using the same logic as FuzzyMatch
(callers don't know the name calculateRoles, probably)

nit: lowercased
(downcast seems a bit confusing)


================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.h:49
+/// belongs to more than one class it is only inserted once.
+std::vector<Token> generateIdentifierTrigrams(std::string Identifier);
+
----------------
nit: StringRef here and below?


================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.h:56
+/// segment are extracted and returned after deduplication.
+std::vector<Token> generateQueryTrigrams(std::string Query);
+
----------------
this appears to be too restrictive?

e.g. for TUDecl you require [tud, ude, dec, ecl]

I think you just want a sliding window of three successive head/tail characters.


================
Comment at: clang-tools-extra/unittests/clangd/DexIndexTests.cpp:75
+
+  EXPECT_EQ(generateQueryTrigrams("abc_def"), getTrigrams({"abc", "def"}));
+
----------------
seems like it should be [abc, bcd, cde, def]


https://reviews.llvm.org/D49591





More information about the cfe-commits mailing list