[PATCH] D136723: [include-cleaner] Record main-file macro occurences and includes

Sam McCall via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Tue Nov 8 13:27:46 PST 2022


sammccall added a comment.

In D136723#3915853 <https://reviews.llvm.org/D136723#3915853>, @maryammo wrote:

> This commit causes build failure on  `https://lab.llvm.org/buildbot/#/builders/121/builds/24947`  : 
> I was able to reproduce the failure and by reverting this commit locally it passed, can you please take a look? @sammccall

It looks like this was already fixed in
https://github.com/llvm/llvm-project/commit/d19ba74dee0b9ab553bd8a6ef5b67ff349f4bf13



================
Comment at: clang-tools-extra/include-cleaner/include/clang-include-cleaner/Record.h:81
+    llvm::StringMap<llvm::SmallVector<unsigned>> BySpelling;
+    llvm::DenseMap<const FileEntry *, llvm::SmallVector<unsigned>> ByFile;
+  } Includes;
----------------
kadircet wrote:
> it feels like having a set of includes corresponding to the same fileentry, rather than a single one, feels subtle enough to deserve a comment.
> 
> the only case i can think of is, two logically different files resolving to same physical file (i.e. symlinks). these will actually have completely different spellings, and probably it'll make things hard when deciding which include to "keep" or "insert". are there other cases where we can have multiple includes corresponding to the same fileentry?
Sure:
```
#include "foo.h"
#include "bar.h"
#include "foo.h"
```

it isn't useful, but people write it (by mistake).
Maybe we want to clean it up, maybe we don't, but I don't see much reason to model it wrong on purpose - it doesn't really simplify anything, and may further confuse a confusing situation.

I can add a comment - what do you want it to say?
"There may be more than one #include of the same file" just seems to restate the obvious interpretation of the type here. (And this is the header, so it's not an ideal place to dump a bunch of implementation comments)


================
Comment at: clang-tools-extra/include-cleaner/include/clang-include-cleaner/Record.h:54
+struct RecordedPP {
+  // The callback (when installed into clang) tracks macros/includes in this.
+  std::unique_ptr<PPCallbacks> record(const Preprocessor &PP);
----------------
kadircet wrote:
> s/this/the main file/
"this" as in the current object, as the data flow here isn't obvious. Changed to `*this`, clearer?


================
Comment at: clang-tools-extra/include-cleaner/include/clang-include-cleaner/Record.h:75
+    //  - for a logical file like <vector>, we check Spelled
+    llvm::SmallVector<const Include *> match(Header H) const;
+
----------------
kadircet wrote:
> hokein wrote:
> > sammccall wrote:
> > > in the prototype I reimplemented this function in clangd, but I expect we can just reuse the RecordedIncludes class instead, copying from clangd's includes list is cheap.
> > > 
> > > (That's one argument for putting this in a different header, which I can do already if you prefer)
> > that's an interesting idea, but we need to do it carefully because of `FileEntry*`, the `Include` structure has a filed of `FileEntry*`, it is not feasible for clangd to propagate it (clangd's `Inclusion` doesn't have it, instead it uses a `HeaderID` which is based on the `fs::UniqueID` to identify a physical file).
> > 
> > But I think this is not something we should worry about at the moment, and the current interfaces look quite good.
> right, let's keep it to the minimum now. we can extend/move-around later on
Good point. We could give Include a UniqueID instead I think. Let's defer this though.


================
Comment at: clang-tools-extra/include-cleaner/include/clang-include-cleaner/Record.h:75
+    //  - for a logical file like <vector>, we check Spelled
+    llvm::SmallVector<const Include *> match(Header H) const;
+
----------------
sammccall wrote:
> kadircet wrote:
> > hokein wrote:
> > > sammccall wrote:
> > > > in the prototype I reimplemented this function in clangd, but I expect we can just reuse the RecordedIncludes class instead, copying from clangd's includes list is cheap.
> > > > 
> > > > (That's one argument for putting this in a different header, which I can do already if you prefer)
> > > that's an interesting idea, but we need to do it carefully because of `FileEntry*`, the `Include` structure has a filed of `FileEntry*`, it is not feasible for clangd to propagate it (clangd's `Inclusion` doesn't have it, instead it uses a `HeaderID` which is based on the `fs::UniqueID` to identify a physical file).
> > > 
> > > But I think this is not something we should worry about at the moment, and the current interfaces look quite good.
> > right, let's keep it to the minimum now. we can extend/move-around later on
> Good point. We could give Include a UniqueID instead I think. Let's defer this though.
> that's an interesting idea, but we need to do it carefully because of `FileEntry*`, the `Include` structure has a filed of `FileEntry*`, it is not feasible for clangd to propagate it (clangd's `Inclusion` doesn't have it, instead it uses a `HeaderID` which is based on the `fs::UniqueID` to identify a physical file).
> 
> But I think this is not something we should worry about at the moment, and the current interfaces look quite good.




================
Comment at: clang-tools-extra/include-cleaner/include/clang-include-cleaner/Types.h:52
 struct Symbol {
   enum Kind {
     // A canonical clang declaration.
----------------
hokein wrote:
>  probably worth adding a comment:  the order of the enumerators must match the  the order of template arguments in Storage. 
Done in the base patch. D136710


================
Comment at: clang-tools-extra/include-cleaner/include/clang-include-cleaner/Types.h:117
+  llvm::StringRef Spelled;             // e.g. vector
+  const FileEntry *Resolved = nullptr; // e.g. /path/to/c++/v1/vector
+  SourceLocation HashLocation;         // of hash in #include <vector>
----------------
kadircet wrote:
> i think it's worth leaving a comment here mentioning that this can be null (couldn't resolve, or maybe we shouldn't be caring, e.g. logical stl include).
> 
> even better, maybe we should actually have a `Header` here? that way this would be conceptual equivalent of SymbolReference, but for includes?
Done (comment.)

> maybe we should actually have a Header here
A `Header` is conceptually something like a `Matcher<Include>`. If we put Header here then we're communicating what the reference is going to look like: `#include "foo.h"` is going to satisfy a symbol usage in `/path/to/foo.h` or one in a file with `// IWYU pragma private: include "foo.h"`, which we don't know.
(And it could be both, which would no longer be expressible).

Technically you could store an Header that *always* has a FileEntry, but that just seems like obfuscation.


================
Comment at: clang-tools-extra/include-cleaner/include/clang-include-cleaner/Types.h:41
+struct Macro {
+  IdentifierInfo *Name;
+  // The location of the Name where the macro is defined.
----------------
kadircet wrote:
> what about just a stringref to name?
That's twice as big, and this is already the critical path for sizeof(SymbolReference) which me can accumulate a lot of.

I don't think it's critical, but I also don't see why StringRef is better. (Less indirection vs carries some semantics + lifetime information)


================
Comment at: clang-tools-extra/include-cleaner/include/clang-include-cleaner/Types.h:46
+  bool operator==(const Macro &S) const {
+    return Name == S.Name && Definition == S.Definition;
+  }
----------------
kadircet wrote:
> OOC: can we really have multiple macro identifiers defined at the same source location? I guess it's cheap to compare another pointer, but might be easier if we actually didn't make the identifierinfo part of the identity.
Probably not? Removed from comparison.
(I think it's *simpler* to include it in the comparison, but it doesn't matter much)


================
Comment at: clang-tools-extra/include-cleaner/include/clang-include-cleaner/Types.h:118
+  const FileEntry *Resolved = nullptr; // e.g. /path/to/c++/v1/vector
+  SourceLocation Location;             // of hash in #include <vector>
+  unsigned Line = 0;                   // 1-based line number for #include
----------------
hokein wrote:
> nit: HashLoc seems clearer than `Location`.
> 
> And it looks like `Location` and `Line` feels somewhat redundant -- we can get the Line loc from the `Location` with `SourceManager`. But I think it is fair to hold both, Line is an important property of the include, which will probably widely used.
Agree, we could one of these but it's a bit sad: we need SourceLocation to diagnose things, but Line is much more convenient the rest of the time.

Renamed, leaving this open to see what Kadir thinks too.


================
Comment at: clang-tools-extra/include-cleaner/lib/Record.cpp:20
 
+class PPRecorder : public PPCallbacks {
+public:
----------------
hokein wrote:
> this should be hidden in anonymous namespace, right? 
Right! This used to be friended.


================
Comment at: clang-tools-extra/include-cleaner/lib/Record.cpp:139
+  case Header::Physical:
+    for (unsigned I : ByFile.lookup(H.physical()))
+      Result.push_back(&All[I]);
----------------
kadircet wrote:
> we might have different handles (fileentry*) to the same file (if we end up having multiple filemanagers in action, which is unlikely as things stand today, but still). so what about using uniqueids for comparison here?
> 
> AFAICT, filemanager won't have multiple `fileenrty`s with the same `uniqueid` anyway. so we wouldn't really lose any generality.
This is approximately the same issue as above (`> that's an interesting idea, but we need to do it carefully because of `FileEntry*`).
I agree and it doesn't really add code complexity but it does add subtlety, especially if we use FileEntry in the data structure and UniqueID for matching.
So I think deferring *all* the multi-file-manager concerns makes sense (or handling them all now if you prefer)


================
Comment at: clang-tools-extra/include-cleaner/lib/Record.cpp:28
+                           FileID PrevFID) override {
+    Active = SM.isWrittenInMainFile(Loc);
+  }
----------------
kadircet wrote:
> we can keep a include depth based on the `Reason` and bail out on other operations when depth is not `1` (getFileID on Loc here probably hits the SourceManager cache anyways, so might not be important, but i feel like it's better to not rely on that).
That sounds complicated, wait until it shows up in a profile?

The conceptual stack definitely isn't properly maintained in all situations. e.g. a normal parse leaks 1 entry for... builtins or something, and clangd's parse with preamble leaks 3. I'm not specifically sure there's a problem in this scenario, but it's *not* a trivial thing that obviously can't go wrong.

(& I think if we're afraid to rely on the cache for the file we're *currently lexing* then we should seriously consider fixing the cache instead)


================
Comment at: clang-tools-extra/include-cleaner/unittests/RecordTest.cpp:122
+// Matches an Include with a particular spelling.
+MATCHER_P(spelled, S, "") { return arg.Spelled == S; }
+
----------------
kadircet wrote:
> i know these look more like functions, but convention is to use `PascalCase` for matchers, so `Spelled` instead?
These are functions - I don't think there's a convention here, just confusion and bugs.
OK to fix `named` intead?



================
Comment at: clang-tools-extra/include-cleaner/unittests/RecordTest.cpp:186
+      Header.point("def"));
+  ASSERT_NE(Recorded.MacroReferences.size(), 0u);
+  Symbol OrigX = Recorded.MacroReferences.front().Symbol;
----------------
kadircet wrote:
> nit: `ASSERT_FALSE(Recorded.MacroReferences.empty())`
Why? This means we get "expected empty() is false, got true" instead of "expected size() is 0, got 7", which seems strictly worse...

Added a somewhat-useful printer for SymbolReference and switched to IsEmpty() instead.


================
Comment at: clang-tools-extra/include-cleaner/unittests/RecordTest.cpp:180
+  Inputs.ExtraFiles["header.h"] = Header.code();
+  Inputs.ErrorOK = true; // missing header
+  auto AST = build();
----------------
hokein wrote:
> IIUC, ErrorOK is irrelevant, and should be removed.
Hmm, it's not irrelevant, it means I get to make lots of mistakes in my examples :-)

Removed.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D136723/new/

https://reviews.llvm.org/D136723



More information about the cfe-commits mailing list