[PATCH] D47935: [clangd] Boost completion score according to file proximity.

Fri Jun 8 05:59:34 PDT 2018

ioeric added a comment.

In https://reviews.llvm.org/D47935#1126283, @sammccall wrote:

> Can you benchmark this? I'm nervous about the URI stuff in the hot path.
>  Timing CodeCompleteFlow::measureResults before/after with index enabled seems like a reasonable test.
>  (But you might want to make this apply to sema first too for realistic numbers?)

Sure! I had some numbers but they are on some paper that I don't access to right now... will collect some new figures (with URI manipulations in sema).

================
Comment at: clangd/Quality.cpp:171
+/// "uri:///a/b/c" will be treated as /a/b/c
+static float uriProximity(StringRef UX, StringRef UY) {
+  auto SchemeSplitX = UX.split(':');
----------------
sammccall wrote:
> This doesn't look quite right to me.
> We can tune the details later, but in practice this seems like it's *very* hard to get zero proximity, which is our neutral score - you need to be 18 directories away?
> 
> FWIW, fozzie appears to give an additive boost proportional to 5-up, where up is the number of directories from the context you have to traverse up from the context to get to a parent of the symbol. (There's no penalty for down-traversals probably for implementation reasons, this should be smaller than the up-traversal penalty I think)
The numbers are guessed... definitely happy to tune.
> We can tune the details later, but in practice this seems like it's *very* hard to get zero proximity, which is our neutral score - you need to be 18 directories away?
It's 18 directories away if one file is in an ancestor directories of the other (i.e. only traverse up or down). If you need to traverse up and down, the penalty for each directory is 0.1, which takes 10 directories (up+down, so 
5 up in average). I think it's useful to make this distinction because I think it's more likely for a file to use a header if it's in the file path.  

I'm not sure if we should use zero as the neutral score. For example, if a codebase has deep directory structure, most scores are probably going to be small; conversely, most scores would be relatively big. I think relative scores are more useful.

> (There's no penalty for down-traversals probably for implementation reasons, this should be smaller than the up-traversal penalty I think)
Why do you think down-traversal should take less penalty?

================
Comment at: clangd/Quality.cpp:208
+    if (auto U = URI::create(ProximityPath, SymURI.split(':').first)) {
+      ProximityScore += uriProximity(SymURI, U->toString());
+    } else {
----------------
sammccall wrote:
> Why U->toString() rather than ->body()?
Because both URIs need to be parsed in order to use `body()`. Here we don't parse `SymURI`.

================
Comment at: clangd/Quality.cpp:215

 void SymbolRelevanceSignals::merge(const CodeCompletionResult &SemaCCResult) {
   if (SemaCCResult.Availability == CXAvailability_NotAvailable ||
----------------
sammccall wrote:
> proximity path needs to be set here too
Alternatively, I wonder if we could give sema result a fixed proximity score as they are symbols that are already included? 

================
Comment at: clangd/Quality.h:74
+  // If set, this is used to compute proximity from symbol's declaring file.
+  llvm::StringRef ProximityPath;
   /// Proximity between best declaration and the query. [0-1], 1 is closest.
----------------
sammccall wrote:
> sammccall wrote:
> > It seems OK to have ProximityPath or ProximityScore, but we shouldn't have both: drop proximityscore and calculate it during evaluate()?
> what's the plan for associated-header? should this be a smallvector<2>?
Just want to make sure I understand. We would copy the symbol URI to use in `merge` right?

================
Comment at: clangd/Quality.h:74
+  // If set, this is used to compute proximity from symbol's declaring file.
+  llvm::StringRef ProximityPath;
   /// Proximity between best declaration and the query. [0-1], 1 is closest.
----------------
ioeric wrote:
> sammccall wrote:
> > sammccall wrote:
> > > It seems OK to have ProximityPath or ProximityScore, but we shouldn't have both: drop proximityscore and calculate it during evaluate()?
> > what's the plan for associated-header? should this be a smallvector<2>?
> Just want to make sure I understand. We would copy the symbol URI to use in `merge` right?
I think it should be easy to change this to vector when it's actually needed?

Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D47935