[PATCH] D49417: [clangd] Implement trigram generation algorithm for new symbol index
Eric Liu via Phabricator via cfe-commits
cfe-commits at lists.llvm.org
Wed Jul 18 04:00:56 PDT 2018
ioeric added a comment.
Some high-level comments before jumping into details.
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.cpp:23
+
+// FIXME(kbobyrev): Deal with short symbol symbol names.
+std::vector<SearchAtom> generateSearchAtoms(StringRef SymbolName) {
----------------
maybe also add how short symbol names are handled in the current algorithm and what the potential solution would be.
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.cpp:28
+ // Apply lowercase text normalization.
+ for (auto &Token : Tokens)
+ std::for_each(Token.begin(), Token.end(), ::tolower);
----------------
Any reason not to do this in segmentation?
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.h:27
+
+/// \brief Hashable SearchAtom, which represents a search token primitive.
+///
----------------
Any reason to call this `Atom` instead of `Token`? `Token` seems to be a more commonly used name for this (e.g. https://yomguithereal.github.io/mnemonist/inverted-index#tokens).
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.h:45
+public:
+ enum class Namespace : short {
+ Trigram,
----------------
`Namespace` can be easily confused with namespaces in clang world. Maybe `Kind` or `Type`?
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.h:51
+
+ SearchAtom() = default;
+ SearchAtom(llvm::StringRef Data, Namespace Type = Namespace::Trigram)
----------------
What is an empty atom? Why do we need it?
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.h:52
+ SearchAtom() = default;
+ SearchAtom(llvm::StringRef Data, Namespace Type = Namespace::Trigram)
+ : Data(Data), Hash(std::hash<std::string>{}(Data)), Type(Type) {}
----------------
Why default `Type` to `Trigram`?
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.h:52
+ SearchAtom() = default;
+ SearchAtom(llvm::StringRef Data, Namespace Type = Namespace::Trigram)
+ : Data(Data), Hash(std::hash<std::string>{}(Data)), Type(Type) {}
----------------
ioeric wrote:
> Why default `Type` to `Trigram`?
Please add documentation. What is the semantic of `Data`?
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.h:53
+ SearchAtom(llvm::StringRef Data, Namespace Type = Namespace::Trigram)
+ : Data(Data), Hash(std::hash<std::string>{}(Data)), Type(Type) {}
+
----------------
Should we also incorporate `Type` into hash?
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.h:53
+ SearchAtom(llvm::StringRef Data, Namespace Type = Namespace::Trigram)
+ : Data(Data), Hash(std::hash<std::string>{}(Data)), Type(Type) {}
+
----------------
ioeric wrote:
> Should we also incorporate `Type` into hash?
I'm wondering if we should use different hashes for different token types. For example, a trigram token "xyz" can be encoded in a 4 -byte int with `('x'<<16) & ('y'<<8) & 'z'`, which seems cheaper than `std::hash`.
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.h:62
+
+ const llvm::StringRef getData() const { return Data; }
+
----------------
nit:
s/getData/data/
s/getType/type/
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.h:76
+
+/// \brief Splits unqualified symbol name into tokens for trigram generation.
+///
----------------
nit: I'd avoid calling these `tokens`, as it they can be confused with tokens for the invert index. How about `segments`? Similarly, `tokenize` should probably be something like `segment` or `segmentIdentifier`.
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchAtom.h:141
+/// trigram belongs to more than one class it is only inserted once.
+std::vector<SearchAtom> generateSearchAtoms(llvm::StringRef SymbolName);
+
----------------
`generateSearchAtoms` is too generalized a name for this. Maybe just `trigram()`? I also think this could take a list of segments from `tokenize`.
https://reviews.llvm.org/D49417
More information about the cfe-commits
mailing list