[PATCH] D49591: [clangd] Introduce search Tokens and trigram generation algorithms
Sam McCall via Phabricator via cfe-commits
cfe-commits at lists.llvm.org
Fri Jul 20 03:34:40 PDT 2018
sammccall added inline comments.
================
Comment at: clang-tools-extra/clangd/index/dex/Token.cpp:25
+ Data.size() == 3 && "Trigram should contain three characters.");
+ switch (TokenKind) {
+ case Kind::Trigram:
----------------
specializing the hash function looks like premature optimization (or possibly pessimisation)
One the one hand, it's a perfect hash!
On the other hand, when used with a power-of-two sized hashtable (particularly 256 is a fun example), all the trigrams are going to hash into the alphanumeric buckets, and other buckets will be empty!
DenseMap is a power-of-two hashtable!
================
Comment at: clang-tools-extra/clangd/index/dex/Token.cpp:30
+ default:
+ Hash = std::hash<std::string>{}(Data);
+ break;
----------------
llvm::hash_combine(static_cast<int>(TokenKind),Data)
================
Comment at: clang-tools-extra/clangd/index/dex/Token.h:10
+//
+// Tokens are keys for inverted index which are mapped to the
+// corresponding posting lists. Token objects represent a characteristic
----------------
nit: swap these sentences around? conceptual explanation first, how they're used later?
could even give an example:
```
// The symbol std::cout might have the tokens:
// * Scope "std::"
// * Trigram "cou"
// * Trigram "out"
// * Type "std::ostream"
```
================
Comment at: clang-tools-extra/clangd/index/dex/Token.h:28
+
+/// Hashable Token, which represents a search token primitive, such as
+/// trigram for fuzzy search on unqualified symbol names.
----------------
nit: "Hashable" isn't really important here.
nit: it doesn't *represent* a search primitive, it *is* a search primitive.
Maybe
```
/// A Token represents an attribute of a symbol, such as a particular
/// trigram present in the name (used for fuzzy search).
```
================
Comment at: clang-tools-extra/clangd/index/dex/Token.h:33
+/// constructing complex iterator trees.
+class Token {
+public:
----------------
As discussed offline: think we misunderstood each other about the separate struct.
I rather meant *this* class could just be a struct.
================
Comment at: clang-tools-extra/clangd/index/dex/Token.h:36
+ /// Kind specifies Token type which defines semantics for the internal
+ /// representation (Data field), examples of such types are:
+ ///
----------------
don't give examples of the enum values immediately before defining the enum values :-)
Just put the comments on the values themselves, and omit the ones that aren't here yet (or leave a TODO)
================
Comment at: clang-tools-extra/clangd/index/dex/Token.h:47
+ ///
+ /// * Trigram: 3 bytes containing trigram characters
+ /// * Scope: full scope name, such as "foo::bar::baz::" or "" (global scope)
----------------
move these to enum values
================
Comment at: clang-tools-extra/clangd/index/dex/Token.h:58
+
+ Token(llvm::StringRef Data, Kind TokenKind);
+
----------------
nit: probably reverse the order here since Token(TRIGRAM, "str") reads more naturally than Token("str", TRIGRAM) - context is present to understand the data
================
Comment at: clang-tools-extra/clangd/index/dex/Token.h:60
+
+ // Returns precomputed hash.
+ size_t hash(const Token &T) const { return Hash; }
----------------
can we just use the standard hash_value() function? Why this accessor?
================
Comment at: clang-tools-extra/clangd/index/dex/Token.h:79
+ /// Precomputed hash which is used as a key for inverted index.
+ size_t Hash;
+ Kind TokenKind;
----------------
as discussed offline: this precomputed hash looks like premature optimization to me.
================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:35
+ llvm::DenseSet<Token> UniqueTrigrams;
+ std::vector<Token> Trigrams;
+
----------------
just iterate at the end and collect them? order shouldn't matter here.
================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:37
+
+ // Extract trigrams consisting of first characters of tokens sorted bytoken
+ // positions. Trigram generator is allowed to skip 1 word between each token.
----------------
ah, you're also handling this by cases here.
This doesn't seem necessary.
In fact I think there's a fairly simple way to express this that works naturally with the segmentation structure produced by fuzzymatch (so doesn't need a version of segmentIdentifier to wrap it).
But it's taking me too long to work out the details, will write a followup comment or discuss offline.
================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:46
+ // this case.
+ for (auto FirstSegment = Segments.begin(); FirstSegment != Segments.end();
+ ++FirstSegment) {
----------------
please use short names like I, S1, etc for iterators and loop counters.
================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:90
+ ++SecondSegment) {
+ for (size_t FirstSegmentIndex = 0;
+ FirstSegmentIndex < FirstSegment->size(); ++FirstSegmentIndex) {
----------------
There seems to be a lot of overlap between this one and the third loop.
================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.h:10
+//
+// T
+//
----------------
This would be a great place to describe *how* trigrams are used by fuzzy matching (the idea that you generate trigrams T for the identifier, trigrams Q for the query, and require Q ⊆ T)
================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.h:56
+
+/// Returns list of unique fuzzy-search trigrams from unqualified symbol.
+///
----------------
Per the examples above, segmentIdentifier doesn't case-fold its output, but all the inputs shown here are lowercase.
Is the caller expected to do that in between calling here, or does generateTrigrams case-fold as part of generation?
(please doc)
================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.h:58
+///
+/// Runs trigram generation for fuzzy-search index on segments produced by
+/// segmentIdentifier(SymbolName);
----------------
seem to have two summaries here, drop or merge with previous?
================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.h:64
+/// are 3-char suffixes of paths through the fuzzy matching automaton. There are
+/// four classes of extracted trigrams:
+///
----------------
This is true, but I think treating these as four cases doesn't give a clear understanding of how they're related.
What about:
```
Trigrams can start at any character in the input. Then we can choose to move to the next character, move to the start of the next segment, or skip over a segment.
For input [abstract factory producer impl]:
* the simplest trigrams are consecutive letters from one segment (abs, bst, ..., mpl)
* but may cross two segments (abf, bpr)
* or even three (api, ypi)
https://reviews.llvm.org/D49591
More information about the cfe-commits
mailing list