[PATCH] D143496: [clangd] Add support for missing includes analysis.

Fri Feb 17 09:18:12 PST 2023

VitaNuo added a comment.

Thanks for the extensive explanations! See the updated version, it should also take the discussions on the document into account.

================
Comment at: clang-tools-extra/clangd/IncludeCleaner.cpp:512
+include_cleaner::Includes
+convertIncludes(const SourceManager &SM,
+                const std::vector<Inclusion> &MainFileIncludes) {
----------------
kadircet wrote:
> since this is a local symbol, either mark it as `static` or move it to anonymous namespace above so that it won't be visible to other translation units.
Oh yeah makes sense.

================
Comment at: clang-tools-extra/clangd/IncludeCleaner.cpp:513
+convertIncludes(const SourceManager &SM,
+                const std::vector<Inclusion> &MainFileIncludes) {
+  include_cleaner::Includes Includes;
----------------
kadircet wrote:
> nit: you can use `llvm::ArrayRef` instead of a `const std::vector &`. `ArrayRef` is a trivially copyable/constructible type for providing views into a consecutive set of elements.
Ok.

================
Comment at: clang-tools-extra/clangd/IncludeCleaner.cpp:521
+        SM.getComposedLoc(SM.getMainFileID(), Inc.HashOffset);
+    TransformedInc.Line = Inc.HashLine;
+    TransformedInc.Angled = WrittenRef.starts_with("<");
----------------
kadircet wrote:
> include_cleaner::Includes::Line is 1-based, whereas clang::clangd::Inclusion::HashLine is 0-based. so we need to have a `+1` on the RHS. we are probably missing some test coverage if this didn't fail
Ok, got it. We only use the include conversion for matching, and matching seems not to use line numbers. I guess it's the reason nothing fails.
I can't write a test for this function directly since it's in an anonymous namespace.

================
Comment at: clang-tools-extra/clangd/IncludeCleaner.cpp:523
+    TransformedInc.Angled = WrittenRef.starts_with("<");
+    if (auto FE = SM.getFileManager().getFile(Inc.Resolved))
+      TransformedInc.Resolved = *FE;
----------------
kadircet wrote:
> i don't think we should convert any unresolved includes. it usually means header wasn't found, so we can't perform any reliable analysis on them. anything i am missing?
Ok I will skip unresolved includes. But I am not sure I fully understand. We do the following:
1. Convert clangd includes to include-cleaner includes.
2. Match include-cleaner includes with symbol providers.
3. If match found, symbol reference is satisfied.

How does it matter in this scenario if the include is resolved? AFAIU as long as the header is spelled in the main file + it's matched with a symbol provider, we should say that the symbol reference is satisfied.

Otherwise, it seems like we'll say that the header is missing, although it's there in the main file and unresolved.

I don't know if this is in any way a realistic scenario. I am just approaching it with general logic, and in this sense having more "satisfied" symbols seems better than having less => leads to less false positives. It can lead to false negatives, too, but AFAIU false negatives are much less of a risk for missing include management. 

================
Comment at: clang-tools-extra/clangd/IncludeCleaner.cpp:535
+  Diag D;
+  D.Message = llvm::formatv("header {0} is missing", Spelling);
+  D.Name = "missing-includes";
----------------
kadircet wrote:
> i think the symbol name should also be part of the diagnostic. as editors can show these diagnostics without context (e.g. you've got a big file open, there's a diagnostic panel displaying messages/fixes for all of the file. you should be able to reason about the diagnostics and fixes without jumping all over the file). so maybe something like:
> 
> `formatv("{0} providing '{1}' is not directly included", Header, Symbol)`
Sure. The new design does this, as well as skipping the header name.

================
Comment at: clang-tools-extra/clangd/IncludeCleaner.cpp:549
+  bool Angled = HeaderRef.starts_with("<");
+  std::optional<tooling::Replacement> Replacement = HeaderIncludes.insert(
+      HeaderRef.trim("\"<>"), Angled, tooling::IncludeDirective::Include);
----------------
kadircet wrote:
> this returns nullopt only if Header is already included in the main file. our analysis should never suggest such a header for include, unless the include is coming from a PP-disabled region.
> 
> so i think if Replacement generation fails, we should drop the diagnostic completely rather than just dropping the fix. WDYT?
Ok, sure. What's a PP-disabled region? Are you talking about #ifdef's and such?

================
Comment at: clang-tools-extra/clangd/IncludeCleaner.cpp:590
+                                 include_cleaner::Header Provider) {
+  std::string SpelledHeader;
+  if (Provider.kind() == include_cleaner::Header::Physical) {
----------------
kadircet wrote:
> nit: you can rewrite this as:
> ```
> // Give URI schemes a chance to customize header spellings
> if(Provider.kind() == Physical) {
>    if(auto CanPath = getCanonicalPath(Provider.physical(), AST.getSourceManager())) {
>          // URI::includeSpelling only fails when URI scheme is unknown. Since we're creating URI ourselves here, we can't get an unknown scheme.
>          std::string Header = llvm::cantFail(URI::includeSpelling(URI::create(*CanonicalPath));
>          if (!Header.empty())
>              return Header;
>    }
> }
> return spellHeader(...);
> ```
Ok, great. Didn't know it was Ok to test `std::optional` directly in an `if` clause.

================
Comment at: clang-tools-extra/clangd/IncludeCleaner.cpp:636
+        bool Satisfied = false;
+        for (const auto &H : Providers) {
+          if (!IncludeCleanerIncludes.match(H).empty()) {
----------------
kadircet wrote:
> you can directly use `IncludeCleanerIncludes` now, without the need for matching the header against clangd's Includes. e.g:
> 
> ```
> bool Satisfied = false;
> for (auto &H : Providers) {
>    if (H.kind() == Physical && H.physical() == MainFile) {
>       Satisfied = true;
>       continue;
>    }
>    for (auto &Inc : IncludeCleanerIncludes.match(H)) {
>       Satisfied = true;
>       auto HeaderID = Includes.getID(Inc.Resolved);
>       assert(HeaderID.has_value() && "IncludeCleanerIncludes only contain resolved includes.");
>       Used.insert(*HeaderID);
>    }
> }
> ```
> 
> this way you can also get rid of the logic that generates `BySpelling` mapping above.
Oh this is cool. Didn't realize we can do straight from a resolved include to the ID. Thanks.

================
Comment at: clang-tools-extra/clangd/IncludeCleaner.cpp:662
+          std::string SpelledHeader =
+              resolveSpelledHeader(AST, MainFile, Providers.front());
+
----------------
kadircet wrote:
> sorry wasn't thinking this through in the initial set of comments. but i think we should be grouping findings per Symbol rather than per Header spelling, because:
> - for reasons i pointed elsewhere, we should be including symbol name in the diagnostic messages
> - again as pointed elsewhere, we should have diagnostics attached to all references not just the first one.
> - in the future we'll likely extend this to cover providing fixes for alternative providers. we shouldn't be emitting multiple diags for each header but rather have multiple fixes on the same diagnostic.
> 
> so probably a mapping like:
> 
> llvm::DenseMap<Symbol, pair</*Providers*/vector<Header>, /*reference locations*/vector<Range>>; // instead of a pair, a dedicated struct would look better
> 
> we also should delay spelling of the Header, as it's expensive and we might not be diagnosing missing includes.
> 
> later on we can consume the whole map at once and convert them into a set of clangd::Diag. WDYT?
Yes, storing per Symbol totally makes sense. Let's discuss specifics in the corresponding document.

================
Comment at: clang-tools-extra/clangd/IncludeCleaner.cpp:668
+          if (SpelledForExpanded.has_value())
+            MissingIncludes.insert(
+                {SpelledHeader, std::move(SpelledForExpanded.value())});
----------------
kadircet wrote:
> kadircet wrote:
> > any reason for storing tokens? you can store a range directly here as a `clangd::Range`, or a `syntax::FileRange` if you don't want to pay for `offsetToPosition` calls if we're not going to emit.
> this is not a multimap, so we're only retaining the range for a missing header on the first range that referenced a symbol provided by it. i think we should be collecting all the reference ranges instead. e.g. if i have a file:
> ```
> #include <all>
> 
> void foo() {
>   std::string x;
>   std::string y;
> }
> ```
> 
> there should be missing include diagnostics attached to both mentions of `std::string` not only the first (again in big files it might be impossible to see the first range the diagnostic is attached to and people have a tendency to only care about the parts of the code they've touched). does that make sense?
Yes, this comment goes along the lines of the design discussion we are having at the moment. 

> again in big files it might be impossible to see the first range the diagnostic is attached to and people have a tendency to only care about the parts of the code they've touched

this is AFAIU in conflict with the suggestion that the diagnostic should only be attached to the first reference.

================
Comment at: clang-tools-extra/clangd/IncludeCleaner.cpp:668
+          if (SpelledForExpanded.has_value())
+            MissingIncludes.insert(
+                {SpelledHeader, std::move(SpelledForExpanded.value())});
----------------
VitaNuo wrote:
> kadircet wrote:
> > kadircet wrote:
> > > any reason for storing tokens? you can store a range directly here as a `clangd::Range`, or a `syntax::FileRange` if you don't want to pay for `offsetToPosition` calls if we're not going to emit.
> > this is not a multimap, so we're only retaining the range for a missing header on the first range that referenced a symbol provided by it. i think we should be collecting all the reference ranges instead. e.g. if i have a file:
> > ```
> > #include <all>
> > 
> > void foo() {
> >   std::string x;
> >   std::string y;
> > }
> > ```
> > 
> > there should be missing include diagnostics attached to both mentions of `std::string` not only the first (again in big files it might be impossible to see the first range the diagnostic is attached to and people have a tendency to only care about the parts of the code they've touched). does that make sense?
> Yes, this comment goes along the lines of the design discussion we are having at the moment. 
> 
> > again in big files it might be impossible to see the first range the diagnostic is attached to and people have a tendency to only care about the parts of the code they've touched
> 
> this is AFAIU in conflict with the suggestion that the diagnostic should only be attached to the first reference.
> any reason for storing tokens? 

I was primarily avoiding `clang::Range` since it requires `llvm::Code` to build ranges, and didn't want `computeIncludeCleanerFindings` to depend on anything but the AST. But `syntax::FileRange` sounds good.

================
Comment at: clang-tools-extra/clangd/IncludeCleaner.cpp:729
+  }
+  for (const llvm::StringMapEntry<llvm::ArrayRef<syntax::Token>> &Missing :
+       MissingIncludes) {
----------------
kadircet wrote:
> convention is definitely to use `auto` in place of such detailed type names. e.g. `auto &Entry : MissingIncludes`
sure.

================
Comment at: clang-tools-extra/clangd/IncludeCleaner.cpp:547
+convertIncludes(const SourceManager &SM,
+                std::vector<Inclusion> MainFileIncludes) {
+  include_cleaner::Includes Includes;
----------------
kadircet wrote:
> VitaNuo wrote:
> > kadircet wrote:
> > > you can just pass an `llvm::ArrayRef<Inclusion>` to prevent a copy
> > By preventing a copy, do you mean that the construction of `llvm::ArrayRef<Inclusion>` will only copy a pointer to the data rather than the whole vector? AFAIU `const std::vector<Inclusion>&` should be even better then, no copies involved. CMIIW.
> well from performance-wise they're pretty close, passing by const ref doesn't mean you don't do any copies, it'll still require address of the entity to be signalled somehow, which is a pointer copy. passing an arrayref implies copying a pointer to data and the size (so it's slightly worse in that regard). but it can represent any chunk of contiguous memory, the data source doesn't need to be a vector. moreover you can easily pass a slice of a vector rather than the whole vector etc.
> 
> the performance implications rarely matters in practice, and the abstraction it provides on the interfaces is usually a benefit, e.g. if we were to change underlying type from vector to llvm::SmallVector, the APIs would still work and you don't need to think about any concrete types. hence we tend to prefer having ArrayRef on API boundaries whenever possible.
Thank you for the great explanation!

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D143496/new/

https://reviews.llvm.org/D143496