[PATCH] D49591: [clangd] Introduce search Tokens and trigram generation algorithms

Fri Jul 20 03:34:40 PDT 2018

sammccall added inline comments.

================
Comment at: clang-tools-extra/clangd/index/dex/Token.cpp:25
+         Data.size() == 3 && "Trigram should contain three characters.");
+  switch (TokenKind) {
+  case Kind::Trigram:
----------------
specializing the hash function looks like premature optimization (or possibly pessimisation)
One the one hand, it's a perfect hash!

On the other hand, when used with a power-of-two sized hashtable (particularly 256 is a fun example), all the trigrams are going to hash into the alphanumeric buckets, and other buckets will be empty!

DenseMap is a power-of-two hashtable!

================
Comment at: clang-tools-extra/clangd/index/dex/Token.cpp:30
+  default:
+    Hash = std::hash<std::string>{}(Data);
+    break;
----------------
llvm::hash_combine(static_cast<int>(TokenKind),Data)

================
Comment at: clang-tools-extra/clangd/index/dex/Token.h:10
+//
+// Tokens are keys for inverted index which are mapped to the
+// corresponding posting lists. Token objects represent a characteristic
----------------
nit: swap these sentences around? conceptual explanation first, how they're used later?

could even give an example:
```
// The symbol std::cout might have the tokens:
// * Scope "std::"
// * Trigram "cou"
// * Trigram "out"
// * Type "std::ostream"
```

================
Comment at: clang-tools-extra/clangd/index/dex/Token.h:28
+
+/// Hashable Token, which represents a search token primitive, such as
+/// trigram for fuzzy search on unqualified symbol names.
----------------
nit: "Hashable" isn't really important here.
nit: it doesn't *represent* a search primitive, it *is* a search primitive.

Maybe
```
/// A Token represents an attribute of a symbol, such as a particular
/// trigram present in the name (used for fuzzy search).
```

================
Comment at: clang-tools-extra/clangd/index/dex/Token.h:33
+/// constructing complex iterator trees.
+class Token {
+public:
----------------
As discussed offline: think we misunderstood each other about the separate struct.
I rather meant *this* class could just be a struct.

================
Comment at: clang-tools-extra/clangd/index/dex/Token.h:36
+  /// Kind specifies Token type which defines semantics for the internal
+  /// representation (Data field), examples of such types are:
+  ///
----------------
don't give examples of the enum values immediately before defining the enum values :-)
Just put the comments on the values themselves, and omit the ones that aren't here yet (or leave a TODO)

================
Comment at: clang-tools-extra/clangd/index/dex/Token.h:47
+  ///
+  /// * Trigram: 3 bytes containing trigram characters
+  /// * Scope: full scope name, such as "foo::bar::baz::" or "" (global scope)
----------------
move these to enum values

================
Comment at: clang-tools-extra/clangd/index/dex/Token.h:58
+
+  Token(llvm::StringRef Data, Kind TokenKind);
+
----------------
nit: probably reverse the order here since  Token(TRIGRAM, "str") reads more naturally than Token("str", TRIGRAM) - context is present to understand the data

================
Comment at: clang-tools-extra/clangd/index/dex/Token.h:60
+
+  // Returns precomputed hash.
+  size_t hash(const Token &T) const { return Hash; }
----------------
can we just use the standard hash_value() function? Why this accessor?

================
Comment at: clang-tools-extra/clangd/index/dex/Token.h:79
+  /// Precomputed hash which is used as a key for inverted index.
+  size_t Hash;
+  Kind TokenKind;
----------------
as discussed offline: this precomputed hash looks like premature optimization to me.

================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:35
+  llvm::DenseSet<Token> UniqueTrigrams;
+  std::vector<Token> Trigrams;
+
----------------
just iterate at the end and collect them? order shouldn't matter here.

================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:37
+
+  // Extract trigrams consisting of first characters of tokens sorted bytoken
+  // positions. Trigram generator is allowed to skip 1 word between each token.
----------------
ah, you're also handling this by cases here.

This doesn't seem necessary.
In fact I think there's a fairly simple way to express this that works naturally with the segmentation structure produced by fuzzymatch (so doesn't need a version of segmentIdentifier to wrap it).
But it's taking me too long to work out the details, will write a followup comment or discuss offline.

================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:46
+  // this case.
+  for (auto FirstSegment = Segments.begin(); FirstSegment != Segments.end();
+       ++FirstSegment) {
----------------
please use short names like I, S1, etc for iterators and loop counters.

================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:90
+         ++SecondSegment) {
+      for (size_t FirstSegmentIndex = 0;
+           FirstSegmentIndex < FirstSegment->size(); ++FirstSegmentIndex) {
----------------
There seems to be a lot of overlap between this one and the third loop.

================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.h:10
+//
+// T
+//
----------------
This would be a great place to describe *how* trigrams are used by fuzzy matching (the idea that you generate trigrams T for the identifier, trigrams Q for the query, and require Q ⊆ T)

================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.h:56
+
+/// Returns list of unique fuzzy-search trigrams from unqualified symbol.
+///
----------------
Per the examples above, segmentIdentifier doesn't case-fold its output, but all the inputs shown here are lowercase.
Is the caller expected to do that in between calling here, or does generateTrigrams case-fold as part of generation?

(please doc)

================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.h:58
+///
+/// Runs trigram generation for fuzzy-search index on segments produced by
+/// segmentIdentifier(SymbolName);
----------------
seem to have two summaries here, drop or merge with previous?

================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.h:64
+/// are 3-char suffixes of paths through the fuzzy matching automaton. There are
+/// four classes of extracted trigrams:
+///
----------------
This is true, but I think treating these as four cases doesn't give a clear understanding of how they're related.

What about:

```
Trigrams can start at any character in the input. Then we can choose to move to the next character, move to the start of the next segment, or skip over a segment.

For input [abstract factory producer impl]:
* the simplest trigrams are consecutive letters from one segment (abs, bst, ..., mpl)
* but may cross two segments (abf, bpr)
* or even three (api, ypi)

https://reviews.llvm.org/D49591