[PATCH] D47935: [clangd] Boost completion score according to file proximity.

Wed Jun 13 08:04:46 PDT 2018

ioeric added a comment.

In https://reviews.llvm.org/D47935#1129987, @sammccall wrote:

> Sorry for the delay on this change. There's a bunch of complexity in this problem that I haven't seen how to slice through:
>
> 1. the signals needed seem like a weird fit for the Symbol*Signals structs for some reason (maybe my misdesign)

According to offline discussion, I added a structure  `SymbolRelevanceContext`  that captures per-query signals like proximity paths. Not sure about the name though.

> 2. the inconsistency between how we do this for Sema and for Index results has... only slightly good reasons

The proximity scores for index and sema are now explicitly separated to make it easier to understand and debug.

> 3. the URI vs filename thing is awkward
> 4. with all this, the actual scoring still seems ad-hoc and is missing important parts (main header, transitive includes)
> 
>   Not all your fault that the code reflects this, the problem is tangly. But it's hard for me to reason about APIs or performance or layering.
> 
>   Looking at the last point (scoring model) because it seems the most tractable. I think this is basically an edit distance problem? (We can call the result "proximity", start at one, and multiply by <1, or call it "distance" and start at 0 and add penalties, but it's equivalent).
> 5. we're computing distances between files (glossing over URI-space vs file-space)
> 6. the roots are the main file, and maybe the matching header
> 7. edits take us from a filepath to a related filepath:
>   - from a file to a header it `#include`s
>   - from a file to its parent directory
>   - from a parent directory to a child directory
>   - from a parent directory to a file in it
> 8. the distance is the smallest sum-of-penalties for any path leading from the root to the symbol
> 
>   What do you think of this model?
> 
>   ----
> 
>   **If** the model seems reasonable, then it suggests an approach of building a one-per-query data structure that computes the needed edit-distance recursively, memoizing results. `SymbolRelevanceResults` could store the symbol path and a pointer to the edit-distance machine, and for debugging the machine would know how to describe its configuration. URI/path mapping wouldn't be a performance concern (I think) if the memoization captured it.

I like how this model addresses the proximity for src/ and include/ setup. I think we could start with something simple and iterate, although I agree that we should strike for a design that would be easy to replace the proximity algorithm in the future.

> Let's chat offline?

================
Comment at: clangd/Quality.cpp:171
+/// "uri:///a/b/c" will be treated as /a/b/c
+static float uriProximity(StringRef UX, StringRef UY) {
+  auto SchemeSplitX = UX.split(':');
----------------
sammccall wrote:
> ioeric wrote:
> > sammccall wrote:
> > > This doesn't look quite right to me.
> > > We can tune the details later, but in practice this seems like it's *very* hard to get zero proximity, which is our neutral score - you need to be 18 directories away?
> > > 
> > > FWIW, fozzie appears to give an additive boost proportional to 5-up, where up is the number of directories from the context you have to traverse up from the context to get to a parent of the symbol. (There's no penalty for down-traversals probably for implementation reasons, this should be smaller than the up-traversal penalty I think)
> > The numbers are guessed... definitely happy to tune.
> > > We can tune the details later, but in practice this seems like it's *very* hard to get zero proximity, which is our neutral score - you need to be 18 directories away?
> > It's 18 directories away if one file is in an ancestor directories of the other (i.e. only traverse up or down). If you need to traverse up and down, the penalty for each directory is 0.1, which takes 10 directories (up+down, so 
> > 5 up in average). I think it's useful to make this distinction because I think it's more likely for a file to use a header if it's in the file path.  
> > 
> > I'm not sure if we should use zero as the neutral score. For example, if a codebase has deep directory structure, most scores are probably going to be small; conversely, most scores would be relatively big. I think relative scores are more useful.
> > 
> > > (There's no penalty for down-traversals probably for implementation reasons, this should be smaller than the up-traversal penalty I think)
> > Why do you think down-traversal should take less penalty?
> > 
> > If you need to traverse up and down, the penalty for each directory is 0.1, which takes 10 directories (up+down, so 5 up in average).
> I think you've halved twice there - it still seems to be 10, which is a lot.
> 
> > I'm not sure if we should use zero as the neutral score.
> Well, zero is currently the neutral score, and this patch doesn't change it :-)
> I think starting at 1 for the current file and *multiplying* by p<1 to apply penalties should give a reasonable 0-1 score that's relatively sane even for codebases of different sizes. Happy to have a different model, but you need to explain/implement how it combines with other signals.
> 
> > Why do you think down-traversal should take less penalty?
> Intuitively, because subprojects are more closely related than superprojects. But this didn't occur to me until someone mentioned it, we should check with Matei and Alexander.
> I think you've halved twice there - it still seems to be 10, which is a lot.
OK you are right. It would behave badly when there are many ups and only one down.

> I think starting at 1 for the current file and *multiplying* by p<1 to apply penalties should give a reasonable 0-1 score that's relatively sane even for codebases of different sizes. 
Sounds good. Picked p=0.7 which seems to give reasonable scores.

Repository:
  rCTE Clang Tools Extra

https://reviews.llvm.org/D47935