[PATCH] D49417: [clangd] Implement trigram generation algorithm for new symbol index
Sam McCall via Phabricator via cfe-commits
cfe-commits at lists.llvm.org
Thu Jul 19 03:03:04 PDT 2018
sammccall added a comment.
(just .h files. +1 to eric's comments except where noted)
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.h:2
+//===--- SearchToken.h- Symbol Search primitive ------------------*- C++
+//-*-===//
+//
----------------
nit: something went wrong here
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.h:28
+
+/// \brief Hashable SearchToken, which represents a search token primitive.
+///
----------------
nit: remove \brief
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.h:32
+///
+/// * Trigram for fuzzy search on unqualified symbol names.
+/// * Proximity path primitives, e.g. "symbol is defined in directory
----------------
maybe just giving one example here, and moving the concrete semantics to Kind?
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.h:34
+/// * Proximity path primitives, e.g. "symbol is defined in directory
+/// $HOME/dev/llvm or its prefix".
+/// * Scope primitives, e.g. "symbol belongs to namespace foo::bar or its
----------------
ITYM suffix
consider just "under directory"
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.h:35
+/// $HOME/dev/llvm or its prefix".
+/// * Scope primitives, e.g. "symbol belongs to namespace foo::bar or its
+/// prefix".
----------------
namespace tokens don't have prefix/suffix semantics, they're exact
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.h:44
+/// constructing complex iterator trees.
+class SearchToken {
+public:
----------------
nit: this is a really common type, and it's namespaced. `Token` is probably fine.
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.h:46
+public:
+ enum class Kind : short {
+ Trigram,
----------------
doc: the fact that Kind is a namespace for data, and (on each enum value) the semantics
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.h:46
+public:
+ enum class Kind : short {
+ Trigram,
----------------
sammccall wrote:
> doc: the fact that Kind is a namespace for data, and (on each enum value) the semantics
nit: drop `: short` as you don't seem to make use of it anywhere. Can add it later if we want to optimize in-memory structure (currently it has no effect)
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.h:48
+ Trigram,
+ Scope,
+ Path,
----------------
in the first patch, probably just want trigram and scope (to ensure generality) and drop the rest
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.h:52
+
+ Custom,
+ };
----------------
what is this for?
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.h:61
+ bool operator==(const SearchToken &Other) const {
+ return TokenKind == Other.TokenKind && Data == Other.Data;
+ }
----------------
(nit: since you have the hashcode, comparing it before the data is probably faster?)
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.h:64
+
+ const llvm::StringRef data() const { return Data; }
+
----------------
nit: first const is not useful
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.h:64
+
+ const llvm::StringRef data() const { return Data; }
+
----------------
sammccall wrote:
> nit: first const is not useful
instead of these accessors, can we just make it a struct with const members and a constructor that ensures the hash is correctly set?
That should reflect the lightweightness/lack of behavior in this object
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.h:69
+private:
+ friend llvm::hash_code hash_value(const SearchToken &Token) {
+ return Token.Hash;
----------------
ioeric wrote:
> Who's the user of this friend function? Could it just call `Token.hash()`?
This is the LLVM convention for hashing, see `ADT/Hashing.h`
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.h:76
+ ///
+ /// These are the examples of Data for different TokenKinds (Token
+ /// Namespaces):
----------------
I'd move these to the Kind enum - these shouldn't be examples but rather the canonical documentation for these
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.h:81
+ /// * Scope: full scope name, e.g. "foo::bar::baz"
+ /// * Path: full or relative path to the directory
+ /// * Type: full type name or the USR associated with this type
----------------
need to more fully spell this out: when is it full vs relative? is it a URI? Do directories have a trailing slash?
But really, leave this out for now.
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.h:83
+ /// * Type: full type name or the USR associated with this type
+ llvm::SmallString<3> Data;
+ /// \brief Precomputed hash which is used as a key for inverted index.
----------------
So SmallString always contains 3 pointers, i.e 24 bits. The very smallest SmallString that makes sense is <8>, as youll end up padded anyway.
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.h:89
+
+/// \brief Splits unqualified symbol name into segments and casts letters to
+/// lowercase for trigram generation.
----------------
(remove leading \brief here and elsewhere)
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.h:89
+
+/// \brief Splits unqualified symbol name into segments and casts letters to
+/// lowercase for trigram generation.
----------------
sammccall wrote:
> (remove leading \brief here and elsewhere)
Please move trigram-related stuff to `Trigrams.h`
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.h:115
+/// FIXME(kbobyrev): Change name of this and everywhere else.
+std::vector<llvm::SmallString<10>>
+segmentIdentifier(llvm::StringRef SymbolName);
----------------
Please don't implement a different set of segmentation rules to those that exist in FuzzyMatch.
Happy to chat about a sensible API to expose and where it should live.
================
Comment at: clang-tools-extra/clangd/index/noctem/SearchToken.h:115
+/// FIXME(kbobyrev): Change name of this and everywhere else.
+std::vector<llvm::SmallString<10>>
+segmentIdentifier(llvm::StringRef SymbolName);
----------------
sammccall wrote:
> Please don't implement a different set of segmentation rules to those that exist in FuzzyMatch.
> Happy to chat about a sensible API to expose and where it should live.
given your examples always return substrings, these should be stringrefs or offsets. But per above comment, we should rethink this function.
https://reviews.llvm.org/D49417
More information about the cfe-commits
mailing list