[PATCH] D49417: [clangd] Implement trigram generation algorithm for new symbol index

Fri Jul 20 01:38:49 PDT 2018

omtcyfz added inline comments.

================
Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.h:53
+  SearchAtom(llvm::StringRef Data, Namespace Type = Namespace::Trigram)
+      : Data(Data), Hash(std::hash<std::string>{}(Data)), Type(Type) {}
+
----------------
ioeric wrote:
> ioeric wrote:
> > Should we also incorporate `Type` into hash?
> I'm wondering if we should use different hashes for different token types. For example, a trigram token "xyz" can be encoded in a 4 -byte int with `('x'<<16) & ('y'<<8) & 'z'`, which seems cheaper than `std::hash`.
Discussed internally: we probably shouldn't since there will be collisions even inside a single token namespace when Data will be too long.

================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:44
+generateTrigrams(const std::vector<llvm::SmallString<10>> &Segments) {
+  llvm::DenseSet<SearchToken> UniqueTrigrams;
+  std::vector<SearchToken> Trigrams;
----------------
ioeric wrote:
> Could we replace `UniqaueTrigrams`+`Trigrams` with a dense map from hash to token?
I'm not sure how to iterate through the result then. Basically, having Trigrams ensures that these trigrams can be iterated after the function execution and inserted into the inverted index. Otherwise I should either expose a callback to add the generated trigrams or do something similar. Should I do that instead?

================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:57
+  // this case.
+  for (auto FirstSegment = Segments.begin(); FirstSegment != Segments.end();
+       ++FirstSegment) {
----------------
ioeric wrote:
> nit: Maybe S1, S2, S3 instead of FirstSegment, ...?
I think it's way less explicit and slightly more confusing. I try not to have one-letter names so I guess it's better to have full names instead.

================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:68
+        SearchToken Trigram(
+            (*FirstSegment + *SecondSegment + *ThirdSegment).str(),
+            SearchToken::Kind::Trigram);
----------------
ioeric wrote:
> ioeric wrote:
> > This seems wrong... wouldn't this give you a concatenation of three segments?
> For trigrams, it might make sense to put 3 chars into a `SmallVector<3>` (can be reused) and std::move it into the constructor. Might be cheaper than creating a std::string
But `std::string` would be created anyway, wouldn't it be?

================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:87
+
+    for (size_t Position = 0; Position + 2 < Segment.size(); ++Position)
+      Trigrams.push_back(
----------------
ioeric wrote:
> nit: `Position <  Segment.size() - 2` seems more commonly used.
If `Segment.size()` is 1 then this would be `1U - 2U`, which would be something very large.

================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.cpp:138
+
+  for (size_t Index = SegmentStart; Index + 1 < SymbolName.size(); ++Index) {
+    const char CurrentSymbol = SymbolName[Index];
----------------
ioeric wrote:
> Maybe first split name on `_` and them run further upper-lower segmentation on the split segments?
Resolving both of these for now (though they're not really fixed) since I will rethink the algorithm given Sam's patch.

================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.h:52
+
+    Custom,
+  };
----------------
sammccall wrote:
> what is this for?
Tags, for example. But yes, for now it's not very clear and probably not needed. Dropped this one.

https://reviews.llvm.org/D49417