[PATCH] D49417: [clangd] Implement trigram generation algorithm for new symbol index

Eric Liu via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Thu Jul 19 02:58:21 PDT 2018


ioeric added inline comments.


================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:38
+
+// FIXME(kbobyrev): Deal with short symbol symbol names. A viable approach would
+// be generating unigrams and bigrams here, too. This would prevent symbol index
----------------
Please also add what the current behavior is for short names. Do we just ignore them?


================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:44
+generateTrigrams(const std::vector<llvm::SmallString<10>> &Segments) {
+  llvm::DenseSet<SearchToken> UniqueTrigrams;
+  std::vector<SearchToken> Trigrams;
----------------
Could we replace `UniqaueTrigrams`+`Trigrams` with a dense map from hash to token?


================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:47
+
+  // Extract trigrams consisting of first characters of tokens sorted by of
+  // token positions. Trigram generator is allowed to skip 1 word between each
----------------
nit: a redundant "of" EOL.


================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:57
+  // this case.
+  for (auto FirstSegment = Segments.begin(); FirstSegment != Segments.end();
+       ++FirstSegment) {
----------------
nit: Maybe S1, S2, S3 instead of FirstSegment, ...?


================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:68
+        SearchToken Trigram(
+            (*FirstSegment + *SecondSegment + *ThirdSegment).str(),
+            SearchToken::Kind::Trigram);
----------------
This seems wrong... wouldn't this give you a concatenation of three segments?


================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:68
+        SearchToken Trigram(
+            (*FirstSegment + *SecondSegment + *ThirdSegment).str(),
+            SearchToken::Kind::Trigram);
----------------
ioeric wrote:
> This seems wrong... wouldn't this give you a concatenation of three segments?
For trigrams, it might make sense to put 3 chars into a `SmallVector<3>` (can be reused) and std::move it into the constructor. Might be cheaper than creating a std::string


================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:87
+
+    for (size_t Position = 0; Position + 2 < Segment.size(); ++Position)
+      Trigrams.push_back(
----------------
nit: `Position <  Segment.size() - 2` seems more commonly used.


================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:92
+
+  for (auto FirstSegment = Segments.begin(); FirstSegment != Segments.end();
+       ++FirstSegment) {
----------------
Comment for this loop?


================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:135
+  // Skip underscores at the beginning, e.g. "__builtin_popcount".
+  while (SymbolName[SegmentStart] == '_')
+    ++SegmentStart;
----------------
use `trim`?


================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:138
+
+  for (size_t Index = SegmentStart; Index + 1 < SymbolName.size(); ++Index) {
+    const char CurrentSymbol = SymbolName[Index];
----------------
Maybe first split name on `_` and them run further upper-lower segmentation on the split segments?


================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.h:58
+  // Returns precomputed hash.
+  size_t operator()(const SearchToken &T) const { return Hash; }
+
----------------
Any reason this has to be `operator()` instead of a `hash` method? `operator()` for hash value is not trivial on the call site.


================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.h:69
+private:
+  friend llvm::hash_code hash_value(const SearchToken &Token) {
+    return Token.Hash;
----------------
Who's the user of this friend function? Could it just call `Token.hash()`?


================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.h:80
+  /// * Trigram: 3 bytes containing trigram characters
+  /// * Scope: full scope name, e.g. "foo::bar::baz"
+  /// * Path: full or relative path to the directory
----------------
nit: scope in clangd world is "foo::bar::baz::", but the global scope is "".


================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.h:159
+/// \brief Combines segmentIdentifier() and generateTrigrams(Segments).
+std::vector<SearchToken> generateTrigrams(llvm::StringRef SymbolName);
+
----------------
This seems to be the same as just `generateTrigrams(segmentIdentifier(Name))`, so I'd drop it.


https://reviews.llvm.org/D49417





More information about the cfe-commits mailing list