[PATCH] D143496: [clangd] Add support for missing includes analysis.

Wed Feb 15 05:10:55 PST 2023

kadircet added a comment.

Thanks! Details mostly looks good, but i think we had different ideas about how the final diagnostics should look like. Let's try to clear that up a bit. My suggested proposal is:

- Having a main diagnostic for each symbol that doesn't have a satisfying include in the main file, attached to the first use of the symbol, with message `<foo> providing 'ns::Foo' is not directly included`, with severity warning
- Having notes attached to this main diagnostic, on rest of the references of the same symbol inside the main file, with message `Also used here`, with severity info
- Having a single fix attached to main diagnostic inserting the first provider.

I believe in the future the main diagnostic message should change to `No header providing 'ns::Foo' is directly included`, and we should have multiple fixes attached. But there's value in explicitly spelling the header name when there's only one fix to get rid of indirections.

This means:

  + we'll emit different diagnostics for each unsatisfied symbol even if they're satisfied by the same provider.
  + the user will be able to see usage sites for each symbol as a whole (as we're attaching them as notes), and if need be can make a decision around dropping the dependency or introducing a forward decl.
  + we won't "butcher" the diagnostic panel with 100 duplicated diagnostics all talking about the same thing.
  - in certain editors, like vscode, related information doesn't show up as extra decorations in code. so if the user ignores diagnostic panel and the first usage of a symbol is not visible. they'll likely miss the finding.

WDYT about this approach? any other alternatives you've in mind? I think it might make sense to get others opinion's here as well. Especially around the diagnostic message, as usually whatever is convenient to a single person isn't necessarily clear/obvious for every one :D

================
Comment at: clang-tools-extra/clangd/IncludeCleaner.cpp:516
+
+  std::vector<include_cleaner::SymbolReference> Macros =
+      collectMacroReferences(AST);
----------------
VitaNuo wrote:
> kadircet wrote:
> > 
> Why would you use `auto` here? The return type is not obvious from the function call.
> 
> The style guide says: "types that you and your reviewer experience as unnecessary clutter will very often provide useful information to others. For example, you can assume that the return type of `make_unique<Foo>()` is obvious, but the return type of `MyWidgetFactory()` probably isn't." 
> (http://go/cstyle#Type_deduction)
well, feel free to keep it. but at least, inside clangd codebase, we use auto frequently (maybe more than we should). especially in cases like this where a declaration would otherwise fit a single line and because the name of the initializer implies this will be a container of macro references and any extra details about the container is probably irrelevant.

================
Comment at: clang-tools-extra/clangd/IncludeCleaner.cpp:547
+convertIncludes(const SourceManager &SM,
+                std::vector<Inclusion> MainFileIncludes) {
+  include_cleaner::Includes Includes;
----------------
VitaNuo wrote:
> kadircet wrote:
> > you can just pass an `llvm::ArrayRef<Inclusion>` to prevent a copy
> By preventing a copy, do you mean that the construction of `llvm::ArrayRef<Inclusion>` will only copy a pointer to the data rather than the whole vector? AFAIU `const std::vector<Inclusion>&` should be even better then, no copies involved. CMIIW.
well from performance-wise they're pretty close, passing by const ref doesn't mean you don't do any copies, it'll still require address of the entity to be signalled somehow, which is a pointer copy. passing an arrayref implies copying a pointer to data and the size (so it's slightly worse in that regard). but it can represent any chunk of contiguous memory, the data source doesn't need to be a vector. moreover you can easily pass a slice of a vector rather than the whole vector etc.

the performance implications rarely matters in practice, and the abstraction it provides on the interfaces is usually a benefit, e.g. if we were to change underlying type from vector to llvm::SmallVector, the APIs would still work and you don't need to think about any concrete types. hence we tend to prefer having ArrayRef on API boundaries whenever possible.

================
Comment at: clang-tools-extra/clangd/IncludeCleaner.cpp:512
+include_cleaner::Includes
+convertIncludes(const SourceManager &SM,
+                const std::vector<Inclusion> &MainFileIncludes) {
----------------
since this is a local symbol, either mark it as `static` or move it to anonymous namespace above so that it won't be visible to other translation units.

================
Comment at: clang-tools-extra/clangd/IncludeCleaner.cpp:513
+convertIncludes(const SourceManager &SM,
+                const std::vector<Inclusion> &MainFileIncludes) {
+  include_cleaner::Includes Includes;
----------------
nit: you can use `llvm::ArrayRef` instead of a `const std::vector &`. `ArrayRef` is a trivially copyable/constructible type for providing views into a consecutive set of elements.

================
Comment at: clang-tools-extra/clangd/IncludeCleaner.cpp:521
+        SM.getComposedLoc(SM.getMainFileID(), Inc.HashOffset);
+    TransformedInc.Line = Inc.HashLine;
+    TransformedInc.Angled = WrittenRef.starts_with("<");
----------------
include_cleaner::Includes::Line is 1-based, whereas clang::clangd::Inclusion::HashLine is 0-based. so we need to have a `+1` on the RHS. we are probably missing some test coverage if this didn't fail

================
Comment at: clang-tools-extra/clangd/IncludeCleaner.cpp:523
+    TransformedInc.Angled = WrittenRef.starts_with("<");
+    if (auto FE = SM.getFileManager().getFile(Inc.Resolved))
+      TransformedInc.Resolved = *FE;
----------------
i don't think we should convert any unresolved includes. it usually means header wasn't found, so we can't perform any reliable analysis on them. anything i am missing?

================
Comment at: clang-tools-extra/clangd/IncludeCleaner.cpp:535
+  Diag D;
+  D.Message = llvm::formatv("header {0} is missing", Spelling);
+  D.Name = "missing-includes";
----------------
i think the symbol name should also be part of the diagnostic. as editors can show these diagnostics without context (e.g. you've got a big file open, there's a diagnostic panel displaying messages/fixes for all of the file. you should be able to reason about the diagnostics and fixes without jumping all over the file). so maybe something like:

`formatv("{0} providing '{1}' is not directly included", Header, Symbol)`

================
Comment at: clang-tools-extra/clangd/IncludeCleaner.cpp:545
+                          offsetToPosition(Code, Range.endOffset())};
+  tooling::HeaderIncludes HeaderIncludes(FileName, Code,
+                                         tooling::IncludeStyle{});
----------------
this is actually an expensive object to create (it scans the file for existing includes). so we should create this once for the whole file. i think it's better to generate whole set of diagnostics in this function directly, rather than creating one by one.

================
Comment at: clang-tools-extra/clangd/IncludeCleaner.cpp:549
+  bool Angled = HeaderRef.starts_with("<");
+  std::optional<tooling::Replacement> Replacement = HeaderIncludes.insert(
+      HeaderRef.trim("\"<>"), Angled, tooling::IncludeDirective::Include);
----------------
this returns nullopt only if Header is already included in the main file. our analysis should never suggest such a header for include, unless the include is coming from a PP-disabled region.

so i think if Replacement generation fails, we should drop the diagnostic completely rather than just dropping the fix. WDYT?

================
Comment at: clang-tools-extra/clangd/IncludeCleaner.cpp:561
+
+Diag computeUnusedIncludesDiagnostic(std::string FileName, const Inclusion *Inc,
+                                     llvm::StringRef Code) {
----------------
s/computeUnusedIncludesDiagnostic/generateUnusedIncludeDiagnostic/

================
Comment at: clang-tools-extra/clangd/IncludeCleaner.cpp:588
+
+std::string resolveSpelledHeader(ParsedAST &AST, const FileEntry *MainFile,
+                                 include_cleaner::Header Provider) {
----------------
s/resolveSpelledHeader/spellHeader/

again make this static or put into anon namespace above

================
Comment at: clang-tools-extra/clangd/IncludeCleaner.cpp:590
+                                 include_cleaner::Header Provider) {
+  std::string SpelledHeader;
+  if (Provider.kind() == include_cleaner::Header::Physical) {
----------------
nit: you can rewrite this as:
```
// Give URI schemes a chance to customize header spellings
if(Provider.kind() == Physical) {
   if(auto CanPath = getCanonicalPath(Provider.physical(), AST.getSourceManager())) {
         // URI::includeSpelling only fails when URI scheme is unknown. Since we're creating URI ourselves here, we can't get an unknown scheme.
         std::string Header = llvm::cantFail(URI::includeSpelling(URI::create(*CanonicalPath));
         if (!Header.empty())
             return Header;
   }
}
return spellHeader(...);
```

================
Comment at: clang-tools-extra/clangd/IncludeCleaner.cpp:613
+  const auto &Includes = AST.getIncludeStructure();
+  include_cleaner::Includes IncludeCleanerIncludes =
+      convertIncludes(AST.getSourceManager(), Includes.MainFileIncludes);
----------------
s/IncludeCleanerIncludes/ConvertedIncludes/ ?

================
Comment at: clang-tools-extra/clangd/IncludeCleaner.cpp:636
+        bool Satisfied = false;
+        for (const auto &H : Providers) {
+          if (!IncludeCleanerIncludes.match(H).empty()) {
----------------
you can directly use `IncludeCleanerIncludes` now, without the need for matching the header against clangd's Includes. e.g:

```
bool Satisfied = false;
for (auto &H : Providers) {
   if (H.kind() == Physical && H.physical() == MainFile) {
      Satisfied = true;
      continue;
   }
   for (auto &Inc : IncludeCleanerIncludes.match(H)) {
      Satisfied = true;
      auto HeaderID = Includes.getID(Inc.Resolved);
      assert(HeaderID.has_value() && "IncludeCleanerIncludes only contain resolved includes.");
      Used.insert(*HeaderID);
   }
}
```

this way you can also get rid of the logic that generates `BySpelling` mapping above.

================
Comment at: clang-tools-extra/clangd/IncludeCleaner.cpp:662
+          std::string SpelledHeader =
+              resolveSpelledHeader(AST, MainFile, Providers.front());
+
----------------
sorry wasn't thinking this through in the initial set of comments. but i think we should be grouping findings per Symbol rather than per Header spelling, because:
- for reasons i pointed elsewhere, we should be including symbol name in the diagnostic messages
- again as pointed elsewhere, we should have diagnostics attached to all references not just the first one.
- in the future we'll likely extend this to cover providing fixes for alternative providers. we shouldn't be emitting multiple diags for each header but rather have multiple fixes on the same diagnostic.

so probably a mapping like:

llvm::DenseMap<Symbol, pair</*Providers*/vector<Header>, /*reference locations*/vector<Range>>; // instead of a pair, a dedicated struct would look better

we also should delay spelling of the Header, as it's expensive and we might not be diagnosing missing includes.

later on we can consume the whole map at once and convert them into a set of clangd::Diag. WDYT?

================
Comment at: clang-tools-extra/clangd/IncludeCleaner.cpp:668
+          if (SpelledForExpanded.has_value())
+            MissingIncludes.insert(
+                {SpelledHeader, std::move(SpelledForExpanded.value())});
----------------
any reason for storing tokens? you can store a range directly here as a `clangd::Range`, or a `syntax::FileRange` if you don't want to pay for `offsetToPosition` calls if we're not going to emit.

================
Comment at: clang-tools-extra/clangd/IncludeCleaner.cpp:668
+          if (SpelledForExpanded.has_value())
+            MissingIncludes.insert(
+                {SpelledHeader, std::move(SpelledForExpanded.value())});
----------------
kadircet wrote:
> any reason for storing tokens? you can store a range directly here as a `clangd::Range`, or a `syntax::FileRange` if you don't want to pay for `offsetToPosition` calls if we're not going to emit.
this is not a multimap, so we're only retaining the range for a missing header on the first range that referenced a symbol provided by it. i think we should be collecting all the reference ranges instead. e.g. if i have a file:
```
#include <all>

void foo() {
  std::string x;
  std::string y;
}
```

there should be missing include diagnostics attached to both mentions of `std::string` not only the first (again in big files it might be impossible to see the first range the diagnostic is attached to and people have a tendency to only care about the parts of the code they've touched). does that make sense?

================
Comment at: clang-tools-extra/clangd/IncludeCleaner.cpp:720
+
   std::string FileName =
       AST.getSourceManager()
----------------
nit: while here you can replace all uses of `FileName` to `AST.tuPath()`

================
Comment at: clang-tools-extra/clangd/IncludeCleaner.cpp:729
+  }
+  for (const llvm::StringMapEntry<llvm::ArrayRef<syntax::Token>> &Missing :
+       MissingIncludes) {
----------------
convention is definitely to use `auto` in place of such detailed type names. e.g. `auto &Entry : MissingIncludes`

================
Comment at: clang-tools-extra/clangd/ParsedAST.cpp:694
+        issueIncludeCleanerDiagnostics(Result, Inputs.Contents);
     Result.Diags->insert(Result.Diags->end(),
+                         make_move_iterator(IncludeCleanerDiags.begin()),
----------------
nit: `llvm::move(issueIncludeCleanerDiagnostics(...), std::back_inserter(*Result.Diags))`

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D143496/new/

https://reviews.llvm.org/D143496