[PATCH] D72874: [clangd] Add a textual fallback for go-to-definition

Mon Mar 2 05:02:56 PST 2020

sammccall added a comment.

Thanks! The scope looks good to me now, on to implementation details.
I'm being a bit picky on the behaivor because go-to-def is a heavily-used feature, many users won't be expecting what we're doing here, and we can't reasonably expect them to understand the failure modes.
So, let's try hard not to fail :-)

This reminds me: it's not completely obvious what set of "act on symbol under the cursor" things this should (eventually) apply to.
I think not having e.g. find-references work makes sense - user should navigate to a "real" occurrence to resolve the ambiguity, and things like code actions are right out.
However having `textDocument/hover` work when we have high confidence in results would be really cool.
Obviously nothing in scope for this patch, but it seems worth writing this down somewhere, precisely because we shouldn't do it now.

In D72874#1900149 <https://reviews.llvm.org/D72874#1900149>, @nridge wrote:

> I currently handle `lowerCamel`, `UpperCamel`, `CAPS`, and `under_scores`. I've left the others as follow-ups.

(sorry for shifting goalposts, I think `CAPS` may be too broad. Left a comment inline)

>> - if you get more than 3 results, and none from current file, maybe don't return anything, as confidence is too low. Or try a stricter query...
> 
> I implemented this, but my testing shows this causes a lot of results for class names to be excluded. The reason appears to be that `fuzzyFind()` returns the class and each of its constructors as distinct results, so if a class has more than two constructors, we'll have more than 3 results (and typically the class is declared in a different file).

I think we should just drop constructor results, they'll always have this problem.
(There are other cases but this is the biggest).

>> - handle the most common case of non-indexable symbols (local symbols) by running the query against the closest occurrence of the token in code.
> 
> I've left this as a follow-up.

Makes sense. I think this there's not a lot of new complexity here, we have the major pieces (getWordAtPosition, TokenBuffer, SelectionTree, targetDecl, index) but integration is definitely substantial.

I'd suggest we go down that path before adding complexity for the indexed-based path though, because I suspect it's going to handle many of the practical situations where the index-based approach needs a lot of help (and vice-versa).

================
Comment at: clang-tools-extra/clangd/SourceCode.cpp:313

 SourceLocation getBeginningOfIdentifier(const Position &Pos,
                                         const SourceManager &SM,
----------------
@kadircet is working on getting rid of this function because creating raw lexers is is wasteful and not actually very powerful. Mostly we're moving to syntax::TokenBuffer, which records actual lexed tokens, but that doesn't apply here.

The examples in the tests seem like they'd be covered by something *really* simple, like enclosing identifier chars:

```
unsigned Begin, End;
for (Begin = Offset; Begin > 0 && isIdentifierBody(Code[Begin-1]); --BeginEnd) {}
for (End = Offset; End < Code.size() && isIdentifierBody(Code[End]); ++End) {}
return Code.slice(Begin, End);
```

(Lexer::isIdentifierBodyChar requires langopts but just passes through DollarIdents to isIdentifierBody, and I don't think we care much about identifiers with $ in them.)

If we really want to do something more subtle here, we should check it in SourceCodeTests.

================
Comment at: clang-tools-extra/clangd/SourceCode.h:93
+/// the entire comment or string token.
+SourceRange getWordAtPosition(const Position &Pos, const SourceManager &SM,
+                              const LangOptions &LangOpts);
----------------
consider moving the isLikelyToBeIdentifier check inside. The current API is pretty general and it's not clear yet what (else) it's good for so it's nice to direct towards intended usage.

Also doing the identifier check inside this function is more convenient when it relies on markers outside the identifier range (like doxygen `\p` or backtick-quoted identifiers)

That said, you may still want to return the range when it's not a likely identifier, with a signature like `StringRef getWordAtPosition(bool *LikelyIdentifier = nullptr)`. I'm thinking of the future case where the caller wants to find a nearby matching token and resolve it - resolving belongs in the caller so there's not much point having this function duplicate the check.

================
Comment at: clang-tools-extra/clangd/SourceCode.h:93
+/// the entire comment or string token.
+SourceRange getWordAtPosition(const Position &Pos, const SourceManager &SM,
+                              const LangOptions &LangOpts);
----------------
sammccall wrote:
> consider moving the isLikelyToBeIdentifier check inside. The current API is pretty general and it's not clear yet what (else) it's good for so it's nice to direct towards intended usage.
> 
> Also doing the identifier check inside this function is more convenient when it relies on markers outside the identifier range (like doxygen `\p` or backtick-quoted identifiers)
> 
> That said, you may still want to return the range when it's not a likely identifier, with a signature like `StringRef getWordAtPosition(bool *LikelyIdentifier = nullptr)`. I'm thinking of the future case where the caller wants to find a nearby matching token and resolve it - resolving belongs in the caller so there's not much point having this function duplicate the check.
This doesn't use the SourceManager-structure of the file, so the natural signature would be `StringRef getWordAtPosition(StringRef Code, unsigned Offset)`.

(what are the practical cases where langopts is relevant?)

================
Comment at: clang-tools-extra/clangd/XRefs.cpp:198
+bool isLikelyToBeIdentifier(StringRef Word) {
+  // Word contains underscore
+  if (Word.contains('_')) {
----------------
nit: mention snake_case, MACRO_CASE?

================
Comment at: clang-tools-extra/clangd/XRefs.cpp:202
+  }
+  // Word contains capital letter other than at beginning
+  if (Word.substr(1).find_if([](char C) { return std::isupper(C); }) !=
----------------
nit: can you mention this catches lowerCamel and UpperCamel

================
Comment at: clang-tools-extra/clangd/XRefs.cpp:203
+  // Word contains capital letter other than at beginning
+  if (Word.substr(1).find_if([](char C) { return std::isupper(C); }) !=
+      StringRef::npos) {
----------------
nit: prefer llvm::isUppercase to avoid locales

================
Comment at: clang-tools-extra/clangd/XRefs.cpp:203
+  // Word contains capital letter other than at beginning
+  if (Word.substr(1).find_if([](char C) { return std::isupper(C); }) !=
+      StringRef::npos) {
----------------
sammccall wrote:
> nit: prefer llvm::isUppercase to avoid locales
this will fire for initialisms like `HTTP`.
I think we want to require *both* upper and lowercase letters.

================
Comment at: clang-tools-extra/clangd/XRefs.cpp:222
+      return L.first > R.first;
+    return L.second.Name < R.second.Name; // Earlier name is better.
+  }
----------------
I think this is dead - we're just sorting by score.

================
Comment at: clang-tools-extra/clangd/XRefs.cpp:226
+
+std::vector<LocatedSymbol> navigationFallback(ParsedAST &AST,
+                                              const SymbolIndex *Index,
----------------
this function should have a high-level comment describing the strategy and the limitations (e.g. idea of extending it to resolve nearby matching tokens).

A name like `locateSymbolNamedTextuallyAt` would better describe what this does, rather than what its caller does.

I would strongly consider exposing this function publicly for the detailed tests, and only smoke-testing it through `locateSymbolAt`. Having to break the AST in tests or otherwise rely on the "primary" logic not working is brittle and hard to verify.

================
Comment at: clang-tools-extra/clangd/XRefs.cpp:240
+  Req.ProximityPaths = {MainFilePath};
+  // FIXME: Set Req.Scopes to the lexically enclosing scopes.
+  // For extra strictness, consider AnyScope=false.
----------------
FWIW the API for this is visibleNamespaces() from SourceCode.cpp.
(No enclosing classes, but I suspect we can live without them once we have a nearby-tokens solution too)

================
Comment at: clang-tools-extra/clangd/XRefs.cpp:243
+  Req.AnyScope = true;
+  // Choose a limit that's large enough that it contains the user's desired
+  // target even in the presence of some false positives, but small enough that
----------------
If we're bailing out on >3, I think this limit should be aiming to detect when there's >3, and avoid fetching way too much data, but *not* trying to avoid noise.
(I'd suggest 10 or so)

================
Comment at: clang-tools-extra/clangd/XRefs.cpp:248
+  TopN<ScoredLocatedSymbol, ScoredSymbolGreater> Top(*Req.Limit);
+  FuzzyMatcher Filter(Req.Query);
+  auto MainFileURI = URIForFile::canonicalize(MainFilePath, MainFilePath);
----------------
This seems dead, you're requiring exact matches, these will always have the same score.

================
Comment at: clang-tools-extra/clangd/XRefs.cpp:250
+  auto MainFileURI = URIForFile::canonicalize(MainFilePath, MainFilePath);
+  bool HaveResultInMainFile = false;
+  Index->fuzzyFind(Req, [&](const Symbol &Sym) {
----------------
This is an interesting signal, I think there are two sensible ways to go about it:
 - assume results in this file are more likely accurate than those in other files. In this case we should at minimum be using this in ranking, but really we should just drop all cross-file results if we have an in-file one.
 - don't rely on index for main-file cases, and rely on "find nearby matching token and resolve it instead". That can easily handled cases defined/referenced in the main-file with sufficient accuracy, including non-indexed symbols. So here we can assume this signal is always false, and drop it.

================
Comment at: clang-tools-extra/clangd/XRefs.cpp:251
+  bool HaveResultInMainFile = false;
+  Index->fuzzyFind(Req, [&](const Symbol &Sym) {
+    auto Loc = symbolToLocation(Sym, MainFilePath);
----------------
BTW I think the answer for constructors is just to drop all constructor results here.

(This also affects template specializations which I think we can not worry about, and virtual method hierarchies which are more painful but I also wouldn't try to fix now)

================
Comment at: clang-tools-extra/clangd/XRefs.cpp:252
+  Index->fuzzyFind(Req, [&](const Symbol &Sym) {
+    auto Loc = symbolToLocation(Sym, MainFilePath);
+    if (!Loc) {
----------------
I'm not sure why we're using SymbolToLocation here:
 - Main file URI check: the `Symbol` has URIs. They need to be canonicalized to file URIs before comparison. This allows checking both decl and def location.
 - PreferredDeclaration and Definition can be more easily set directly from the `Symbol`

================
Comment at: clang-tools-extra/clangd/XRefs.cpp:258
+
+    // For now, only consider exact name matches, including case.
+    // This is to avoid too many false positives.
----------------
I wouldn't bother qualifying this as "for now". Any code is subject to change in the future, but requiring an exact name match for index-based results seems more like a design decision than a fixme.

================
Comment at: clang-tools-extra/clangd/XRefs.cpp:284
+    else {
+      log("Navigation fallback: {0} didn't match query {1}", Sym.Name,
+          Filter.pattern());
----------------
I don't think this should be logged, particularly by default - it doesn't really indicate anything other than we should have a "look up symbol by name" API

(ok, actually I think this is just dead code because we've already checked name above)

================
Comment at: clang-tools-extra/clangd/unittests/XRefsTests.cpp:589

+TEST(LocateSymbol, Textual) {
+  const char *Tests[] = {
----------------
`#ifdef`'d out code is another interesting motivation worth testing.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D72874/new/

https://reviews.llvm.org/D72874