[PATCH] D50517: [clangd] Generate incomplete trigrams for the Dex index

Kirill Bobyrev via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Fri Aug 10 03:13:12 PDT 2018


kbobyrev added inline comments.


================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:74
+    // symbol of the identifier.
+    if (!FoundFirstSymbol) {
+      FoundFirstSymbol = true;
----------------
ioeric wrote:
> Could this be pulled out of the loop? I think what we want is just `LowercaseIdentifier[0]` right?
> 
> I'd probably also pulled that into a function, as the function body is getting larger.
Same as elsewhere, if we have `__builtin_whatever` the it's not actually the first symbol of the lowercase identifier.


================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:87
+
+      Chars = {{LowercaseIdentifier[I], LowercaseIdentifier[J], END_MARKER, 0}};
+      const auto Bigram = Token(Token::Kind::Trigram, Chars.data());
----------------
ioeric wrote:
> I think we could be more restrictive on bigram generation. I think a bigram prefix of identifier and a bigram prefix of the HEAD substring should work pretty well in practice. For example, for `StringStartsWith`, you would have `st$` and `ss$` (prefix of "SSW").
> 
> WDYT?
Good idea!


================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:115
+// FIXME(kbobyrev): Correctly handle empty trigrams "$$$".
 std::vector<Token> generateQueryTrigrams(llvm::StringRef Query) {
   // Apply fuzzy matching text segmentation.
----------------
ioeric wrote:
> It seems to me that what we need for short queries is simply:
> ```
> if (Query.empty()) {
>    // return empty token
> }
> if (Query.size() == 1) return {Query + "$$"};
> if (Query.size() == 2) return {Query + "$"};
> 
> // Longer queries...
> ```
> ?
That would mean that we expect the query to be "valid", i.e. only consist of letters and digits. My concern is about what happens if we have `"u_"` or something similar (`"_u", "_u_", "$u$"`, etc) - in that case we would actually still have to identify the first valid symbol for the trigram, process the string (trim it, etc) which looks very similar to what FuzzyMatching `calculateRoles` does.

The current approach is rather straightforward and generic, but I can try to change it if you want. My biggest concern is fighting some corner cases and ensuring that the query is "right" on the user (index) side, which might turn out to be more code and ensuring that the "state" is valid throughout the pipeline.


================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.h:74
+/// For short queries (Query contains less than 3 letters and digits) this
+/// returns a single trigram with all valid symbols.
 std::vector<Token> generateQueryTrigrams(llvm::StringRef Query);
----------------
ioeric wrote:
> I'm not quite sure what this means. Could you elaborate?
Added an example and reflected in the other comment.


https://reviews.llvm.org/D50517





More information about the cfe-commits mailing list