[PATCH] D50517: [clangd] Generate incomplete trigrams for the Dex index

Eric Liu via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Fri Aug 10 03:53:53 PDT 2018


ioeric added inline comments.


================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:74
+    // symbol of the identifier.
+    if (!FoundFirstSymbol) {
+      FoundFirstSymbol = true;
----------------
kbobyrev wrote:
> ioeric wrote:
> > Could this be pulled out of the loop? I think what we want is just `LowercaseIdentifier[0]` right?
> > 
> > I'd probably also pulled that into a function, as the function body is getting larger.
> Same as elsewhere, if we have `__builtin_whatever` the it's not actually the first symbol of the lowercase identifier.
I would argue that I would start by typing "_" if I actually want `__builtin_whatever`. I'm also not sure if this is the case we should optimize for as well; __builtin symbols  are already penalized in code completion ranking.


================
Comment at: clang-tools-extra/clangd/index/dex/Trigram.cpp:115
+// FIXME(kbobyrev): Correctly handle empty trigrams "$$$".
 std::vector<Token> generateQueryTrigrams(llvm::StringRef Query) {
   // Apply fuzzy matching text segmentation.
----------------
kbobyrev wrote:
> ioeric wrote:
> > It seems to me that what we need for short queries is simply:
> > ```
> > if (Query.empty()) {
> >    // return empty token
> > }
> > if (Query.size() == 1) return {Query + "$$"};
> > if (Query.size() == 2) return {Query + "$"};
> > 
> > // Longer queries...
> > ```
> > ?
> That would mean that we expect the query to be "valid", i.e. only consist of letters and digits. My concern is about what happens if we have `"u_"` or something similar (`"_u", "_u_", "$u$"`, etc) - in that case we would actually still have to identify the first valid symbol for the trigram, process the string (trim it, etc) which looks very similar to what FuzzyMatching `calculateRoles` does.
> 
> The current approach is rather straightforward and generic, but I can try to change it if you want. My biggest concern is fighting some corner cases and ensuring that the query is "right" on the user (index) side, which might turn out to be more code and ensuring that the "state" is valid throughout the pipeline.
It's not clear what we would want to match with "*_", except for `u_` in `unique_ptr` (maybe).  

Generally, as short queries tend to match many more symbols, I think we should try to make them more restrictive and optimize for the most common use case. 


https://reviews.llvm.org/D50517





More information about the cfe-commits mailing list