[PATCH] D51481: [clangd] Implement proximity path boosting for Dex

Mon Sep 3 01:38:29 PDT 2018

ioeric added inline comments.

================
Comment at: clang-tools-extra/clangd/index/dex/DexIndex.cpp:36
+    Result.push_back(PathToken);
+  }
   return Result;
----------------
nit: no need for braces. [llvm-style]

================
Comment at: clang-tools-extra/clangd/index/dex/DexIndex.cpp:126
+    std::vector<std::unique_ptr<Iterator>> BoostingIterators;
+    // This should generate proximity path boosting iterators for each different
+    // URI Scheme the Index is aware of.
----------------
I thought we wanted to take the first scheme that works, given that `URISchemes` is sorted by preference? 

================
Comment at: clang-tools-extra/clangd/index/dex/DexIndex.cpp:137
+          BoostingIterators.push_back(
+              createBoost(create(It->second), PARENTS_TO_BOOST + 1 - P.second));
+      }
----------------
`PARENTS_TO_BOOST + 1 - P.second` seems to be too much boosting. We should align the boosting function with the one we use in clangd/Quality.cpp, e.g. proximity score should be would be in the range [0, 1], and we can boost by `1+proximity`. 

================
Comment at: clang-tools-extra/clangd/index/dex/DexIndex.cpp:161

+    // Sort items using boosting score as the key.
+    // FIXME(kbobyrev): Consider merging this stage with the next one? This is
----------------
nit: in general, this should be a combination of relevance score (e.g. proximity) and quality score (e.g. #references).

================
Comment at: clang-tools-extra/clangd/index/dex/DexIndex.cpp:169
+                  const std::pair<DocID, float> &RHS) {
+                return SymbolQuality.find((*Symbols)[LHS.first])->second *
+                           LHS.second >
----------------
nit: `SymbolQuality[(*Symbols)[LHS.first]]`?

For performance reason, could we instead make `SymbolQuality` a vector indexed by `DocID`?

================
Comment at: clang-tools-extra/clangd/index/dex/DexIndex.h:42
 public:
+  DexIndex(llvm::ArrayRef<std::string> URISchemes) : URISchemes(URISchemes) {}
+
----------------
nit: explicit?

================
Comment at: clang-tools-extra/clangd/index/dex/DexIndex.h:42
 public:
+  DexIndex(llvm::ArrayRef<std::string> URISchemes) : URISchemes(URISchemes) {}
+
----------------
ioeric wrote:
> nit: explicit?
Add documentation about `URISchemes`? 

================
Comment at: clang-tools-extra/clangd/index/dex/Token.cpp:18
+
+std::vector<Token> generateProximityPaths(llvm::StringRef URIPath,
+                                          size_t Limit) {
----------------
Add `FIXME` mentioning that the parsing and encoding here are not necessary and could potentially be optimized. 

================
Comment at: clang-tools-extra/clangd/index/dex/Token.cpp:28
+      "Non-empty argument of generateProximityPaths() should be a valid URI.");
+  const auto Scheme = ParsedURI.get().scheme();
+  const auto Authority = ParsedURI.get().authority();
----------------
nit: `ParsedURI->scheme();`

================
Comment at: clang-tools-extra/clangd/index/dex/Token.cpp:32
+  // Get parent directory of the file with symbol declaration.
+  Path = llvm::sys::path::parent_path(Path);
+  while (!Path.empty() && Limit--) {
----------------
I think you need to set `posix` style explicitly here. On windows, the separator is '/'.

================
Comment at: clang-tools-extra/clangd/index/dex/Token.cpp:32
+  // Get parent directory of the file with symbol declaration.
+  Path = llvm::sys::path::parent_path(Path);
+  while (!Path.empty() && Limit--) {
----------------
ioeric wrote:
> I think you need to set `posix` style explicitly here. On windows, the separator is '/'.
I think the original path should also be used as a proximity path.

================
Comment at: clang-tools-extra/clangd/index/dex/Token.cpp:51
+    // should be just skipped.
+    if (!static_cast<bool>(URI)) {
+      // Ignore the error.
----------------
You could use `llvm::consumeError(URI.takeError())`.

================
Comment at: clang-tools-extra/clangd/index/dex/Token.h:45
+  // FIXME(kbobyrev): Storing Data hash would be more efficient than storing raw
+  // strings.
   enum class Kind {
----------------
nit: For example, .... (URI is an good example I think)

================
Comment at: clang-tools-extra/clangd/index/dex/Token.h:58
+    ///
+    /// Data stores URI the parent directory of symbol declaration location.
+    ///
----------------
Do we also consider the original URI as a proximity path? 

================
Comment at: clang-tools-extra/clangd/index/dex/Token.h:94
+///
+/// When Limit is set, returns no more than Limit tokens.
+std::vector<Token>
----------------
nit: also explain how Limit is used? e.g. closer parent directories are preferred?

================
Comment at: clang-tools-extra/clangd/index/dex/Token.h:96
+std::vector<Token>
+generateProximityPaths(llvm::StringRef Path,
+                       size_t Limit = std::numeric_limits<size_t>::max());
----------------
As we are assuming that `Path` is an URI. We should be more explicit in the function names.

================
Comment at: clang-tools-extra/clangd/index/dex/Token.h:104
+                            llvm::ArrayRef<std::string> URISchemes,
+                            size_t Count = 5);
+
----------------
I'm wondering whether we should make `Count` customizable or keep it as implementation detail. `generateProximityPaths` and `generateQueryProximityPaths` should have the same Limit/Count in practice, and I'm not sure if users could usefully customize the value.

https://reviews.llvm.org/D51481