[PATCH] D49417: [clangd] Implement trigram generation algorithm for new symbol index
Eric Liu via Phabricator via cfe-commits
cfe-commits at lists.llvm.org
Thu Jul 19 02:58:21 PDT 2018
ioeric added inline comments.
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:38
+
+// FIXME(kbobyrev): Deal with short symbol symbol names. A viable approach would
+// be generating unigrams and bigrams here, too. This would prevent symbol index
----------------
Please also add what the current behavior is for short names. Do we just ignore them?
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:44
+generateTrigrams(const std::vector<llvm::SmallString<10>> &Segments) {
+ llvm::DenseSet<SearchToken> UniqueTrigrams;
+ std::vector<SearchToken> Trigrams;
----------------
Could we replace `UniqaueTrigrams`+`Trigrams` with a dense map from hash to token?
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:47
+
+ // Extract trigrams consisting of first characters of tokens sorted by of
+ // token positions. Trigram generator is allowed to skip 1 word between each
----------------
nit: a redundant "of" EOL.
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:57
+ // this case.
+ for (auto FirstSegment = Segments.begin(); FirstSegment != Segments.end();
+ ++FirstSegment) {
----------------
nit: Maybe S1, S2, S3 instead of FirstSegment, ...?
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:68
+ SearchToken Trigram(
+ (*FirstSegment + *SecondSegment + *ThirdSegment).str(),
+ SearchToken::Kind::Trigram);
----------------
This seems wrong... wouldn't this give you a concatenation of three segments?
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:68
+ SearchToken Trigram(
+ (*FirstSegment + *SecondSegment + *ThirdSegment).str(),
+ SearchToken::Kind::Trigram);
----------------
ioeric wrote:
> This seems wrong... wouldn't this give you a concatenation of three segments?
For trigrams, it might make sense to put 3 chars into a `SmallVector<3>` (can be reused) and std::move it into the constructor. Might be cheaper than creating a std::string
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:87
+
+ for (size_t Position = 0; Position + 2 < Segment.size(); ++Position)
+ Trigrams.push_back(
----------------
nit: `Position < Segment.size() - 2` seems more commonly used.
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:92
+
+ for (auto FirstSegment = Segments.begin(); FirstSegment != Segments.end();
+ ++FirstSegment) {
----------------
Comment for this loop?
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:135
+ // Skip underscores at the beginning, e.g. "__builtin_popcount".
+ while (SymbolName[SegmentStart] == '_')
+ ++SegmentStart;
----------------
use `trim`?
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:138
+
+ for (size_t Index = SegmentStart; Index + 1 < SymbolName.size(); ++Index) {
+ const char CurrentSymbol = SymbolName[Index];
----------------
Maybe first split name on `_` and them run further upper-lower segmentation on the split segments?
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.h:58
+ // Returns precomputed hash.
+ size_t operator()(const SearchToken &T) const { return Hash; }
+
----------------
Any reason this has to be `operator()` instead of a `hash` method? `operator()` for hash value is not trivial on the call site.
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.h:69
+private:
+ friend llvm::hash_code hash_value(const SearchToken &Token) {
+ return Token.Hash;
----------------
Who's the user of this friend function? Could it just call `Token.hash()`?
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.h:80
+ /// * Trigram: 3 bytes containing trigram characters
+ /// * Scope: full scope name, e.g. "foo::bar::baz"
+ /// * Path: full or relative path to the directory
----------------
nit: scope in clangd world is "foo::bar::baz::", but the global scope is "".
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.h:159
+/// \brief Combines segmentIdentifier() and generateTrigrams(Segments).
+std::vector<SearchToken> generateTrigrams(llvm::StringRef SymbolName);
+
----------------
This seems to be the same as just `generateTrigrams(segmentIdentifier(Name))`, so I'd drop it.
https://reviews.llvm.org/D49417
More information about the cfe-commits
mailing list