[PATCH] D50517: [clangd] Generate incomplete trigrams for the Dex index

Fri Aug 10 02:26:34 PDT 2018

ioeric added a comment.

Thanks for the patch!

In https://reviews.llvm.org/D50517#1194955, @kbobyrev wrote:

> Complete the tests, finish the implementation.
>
> One thought about prefix match suggestion: we should either make it more explicit for the index (e.g. introduce `prefixMatch` and dispatch `fuzzyMatch` to prefix matching in case query only contains one "true" symbol) or document this properly. While, as I wrote earlier, I totally support the idea of prefix matching queries of length 1 it might not align with some user expectations and it's also very implicit if we just generate tokens this way and don't mention it anywhere in the `DexIndex` implementation.
>
> @ioeric, @ilya-biryukov any thoughts?

(copied my inline comment :)
We should definitely add documentation about it. It should be pretty simple IMO. As the behavior should be easy to infer from samples, and it shouldn't be too surprising for users, I think it would be OK to consider it as implementation detail (like details in how exactly trigrams are generated) without exposing new interfaces for them.

================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:26

-// FIXME(kbobyrev): Deal with short symbol symbol names. A viable approach would
-// be generating unigrams and bigrams here, too. This would prevent symbol index
-// from applying fuzzy matching on a tremendous number of symbols and allow
-// supplementary retrieval for short queries.
-//
-// Short names (total segment length <3 characters) are currently ignored.
+// FIXME(kbobyrev): Posting lists for incomplete trigrams (one/two symbols) are
+// likely to be very dense and hence require special attention so that the index
----------------
It's a nice to optimization have when we run into oversized posting lists, but this is not necessarily restricted to unigram posting lists. I think the FIXME should live near the general posting list code. I think it's probably also ok to leave it out; it's hard to forget if we do run into problem in the future ;)

================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:74
+    // symbol of the identifier.
+    if (!FoundFirstSymbol) {
+      FoundFirstSymbol = true;
----------------
Could this be pulled out of the loop? I think what we want is just `LowercaseIdentifier[0]` right?

I'd probably also pulled that into a function, as the function body is getting larger.

================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:87
+
+      Chars = {{LowercaseIdentifier[I], LowercaseIdentifier[J], END_MARKER, 0}};
+      const auto Bigram = Token(Token::Kind::Trigram, Chars.data());
----------------
I think we could be more restrictive on bigram generation. I think a bigram prefix of identifier and a bigram prefix of the HEAD substring should work pretty well in practice. For example, for `StringStartsWith`, you would have `st$` and `ss$` (prefix of "SSW").

WDYT?

================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:115
+// FIXME(kbobyrev): Correctly handle empty trigrams "$$$".
 std::vector<Token> generateQueryTrigrams(llvm::StringRef Query) {
   // Apply fuzzy matching text segmentation.
----------------
It seems to me that what we need for short queries is simply:
```
if (Query.empty()) {
   // return empty token
}
if (Query.size() == 1) return {Query + "$$"};
if (Query.size() == 2) return {Query + "$"};

// Longer queries...
```
?

================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.h:39
+// member of Token even though it's Trigram-specific?
+const auto END_MARKER = '$';
+
----------------
Any reason why this should be exposed?

================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.h:62
 /// belongs to more than one class it is only inserted once.
+// FIXME(kbobyrev): Document somewhere in DexIndex that for queries of size 1
+// it will do prefix matching instead of fuzzy matching on the identifier names,
----------------
The behavior should be easy to infer from samples. As long as it's not totally expected, I think it would be OK to consider treat as implementation detail (like details in how trigrams are generated).

================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.h:74
+/// For short queries (Query contains less than 3 letters and digits) this
+/// returns a single trigram with all valid symbols.
 std::vector<Token> generateQueryTrigrams(llvm::StringRef Query);
----------------
I'm not quite sure what this means. Could you elaborate?

https://reviews.llvm.org/D50517