[PATCH] D51481: [clangd] Implement proximity path boosting for Dex

Thu Aug 30 07:59:41 PDT 2018

kbobyrev added inline comments.

================
Comment at: clang-tools-extra/clangd/index/dex/DexIndex.h:45
+  void build(std::shared_ptr<std::vector<const Symbol *>> Symbols,
+             llvm::ArrayRef<std::string> URISchemes);

----------------
sammccall wrote:
> ioeric wrote:
> > URI schemes are property of `Symbols`. We shouldn't need to pass specify the schemes again. Dex can collect all possible URI schemes when building the index.
> > 
> > I think we could generate proximity paths for index symbols without knowing about schemes (when building the index). For example, we could `canonicalize` the URI (as in `FileDistance.cpp`), for example, by converting `test:///x/y/z` to `/test:/x/y/z`.  For a query, we would first convert query proximity paths to URIs (of all possible schemes), perform the same canonicalization on URIs, and then use the canonicalized paths to compute proximity.
> We had an offline discussion about URIs which might be interesting.
> 
> Fetching posting lists for all URI schemes at query time seems wasteful, @ilya-biryukov pointed out that in practice each file is going to exist in some preferred URI space, and we should only compare that one.
> The only obstacle to doing that is that the preference order for URI schemes is not global, each place we need it it's accepted as configuration.
> 
> Maybe we should change that: as part of registering the URI schemes, we define a preferred order. And instead of the current operation `makeURI(File, scheme)` we should just offer APIs like `makePreferredURI(File)` and `makeFileURI(File)`.
I also thought about extracting schemes during the build stage earlier today, couldn't figure out whether it's a good thing: it's probably tiny overhead for the build stage (if we just pass schemes the server is aware of we avoid looking for new `scheme:` in the prefixes), but it might be worth considering the architecture (and the fact that even though the server might be aware of many schemes, some of them might not occur in the index and that would induce small overhead on the query side).

Could you please elaborate your suggestion about the `canonicalize`? I'm not sure I caught that: IIUC the current approach (which doesn't `canonicalize` URIs) should be sufficient for our needs. I'm not sure why we would want `/test:/x/y/z/` over `test:///x/y/z/` if the extracted tokens are `test:///`, `test:///x/`, `test:///x/y/`, `test:///x/y/z/` (which is the case right now).

================
Comment at: clang-tools-extra/clangd/index/dex/DexIndex.h:64

-private:
+  /// For fuzzyFind() Dex retrieves getRetrievalItemsMultiplier() more items
+  /// than requested via Req.MaxCandidateCount in the first stage of filtering.
----------------
ioeric wrote:
> Why are values of multipliers interesting to users? Could these be implementation details in the cpp file?
Actually, my understanding is that users might want to have full access to the multipliers at some point to control the performance/quality ratio.

And it's also useful for the tests: otherwise the last one would have to hard-code number of generated symbols to ensure only boosted ones are in the returned list. It would have to be updated each time these internal multipliers are and we might update them often/make logic less clear (by allowing users to control these parameters).

What do you think?

https://reviews.llvm.org/D51481