[PATCH] D49417: [clangd] Implement trigram generation algorithm for new symbol index

Wed Jul 18 04:00:56 PDT 2018

ioeric added a comment.

Some high-level comments before jumping into details.

================
Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.cpp:23
+
+// FIXME(kbobyrev): Deal with short symbol symbol names.
+std::vector<SearchAtom> generateSearchAtoms(StringRef SymbolName) {
----------------
maybe also add how short symbol names are handled in the current algorithm and what the potential solution would be.

================
Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.cpp:28
+  // Apply lowercase text normalization.
+  for (auto &Token : Tokens)
+    std::for_each(Token.begin(), Token.end(), ::tolower);
----------------
Any reason not to do this in segmentation?

================
Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.h:27
+
+/// \brief Hashable SearchAtom, which represents a search token primitive.
+///
----------------
Any reason to call this `Atom` instead of `Token`? `Token` seems to be a more commonly used name for this (e.g. https://yomguithereal.github.io/mnemonist/inverted-index#tokens). 

================
Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.h:45
+public:
+  enum class Namespace : short {
+    Trigram,
----------------
`Namespace` can be easily confused with namespaces in clang world. Maybe `Kind` or `Type`?

================
Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.h:51
+
+  SearchAtom() = default;
+  SearchAtom(llvm::StringRef Data, Namespace Type = Namespace::Trigram)
----------------
What is an empty atom? Why do we need it?

================
Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.h:52
+  SearchAtom() = default;
+  SearchAtom(llvm::StringRef Data, Namespace Type = Namespace::Trigram)
+      : Data(Data), Hash(std::hash<std::string>{}(Data)), Type(Type) {}
----------------
Why default `Type` to `Trigram`? 

================
Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.h:52
+  SearchAtom() = default;
+  SearchAtom(llvm::StringRef Data, Namespace Type = Namespace::Trigram)
+      : Data(Data), Hash(std::hash<std::string>{}(Data)), Type(Type) {}
----------------
ioeric wrote:
> Why default `Type` to `Trigram`? 
Please add documentation. What is the semantic of `Data`? 

================
Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.h:53
+  SearchAtom(llvm::StringRef Data, Namespace Type = Namespace::Trigram)
+      : Data(Data), Hash(std::hash<std::string>{}(Data)), Type(Type) {}
+
----------------
Should we also incorporate `Type` into hash?

================
Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.h:53
+  SearchAtom(llvm::StringRef Data, Namespace Type = Namespace::Trigram)
+      : Data(Data), Hash(std::hash<std::string>{}(Data)), Type(Type) {}
+
----------------
ioeric wrote:
> Should we also incorporate `Type` into hash?
I'm wondering if we should use different hashes for different token types. For example, a trigram token "xyz" can be encoded in a 4 -byte int with `('x'<<16) & ('y'<<8) & 'z'`, which seems cheaper than `std::hash`.

================
Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.h:62
+
+  const llvm::StringRef getData() const { return Data; }
+
----------------
nit:
s/getData/data/
s/getType/type/

================
Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.h:76
+
+/// \brief Splits unqualified symbol name into tokens for trigram generation.
+///
----------------
nit: I'd avoid calling these `tokens`, as it they can be confused with tokens for the invert index. How about `segments`?  Similarly, `tokenize` should probably be something like `segment` or `segmentIdentifier`.

================
Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.h:141
+///  trigram belongs to more than one class it is only inserted once.
+std::vector<SearchAtom> generateSearchAtoms(llvm::StringRef SymbolName);
+
----------------
`generateSearchAtoms` is too generalized a name for this. Maybe just `trigram()`? I also think this could take a list of segments from `tokenize`.

https://reviews.llvm.org/D49417