[PATCH] D139921: [include-cleaner] Ranking of providers based on hints

Wed Dec 14 08:01:17 PST 2022

sammccall added inline comments.

================
Comment at: clang-tools-extra/include-cleaner/lib/AnalysisInternal.h:84

+/// Represents properties of a symbol provider.
+enum class Hint : uint8_t {
----------------
along with `SymbolLocation`, I think mixing our fundamental types and analysis functions in this header makes the structure harder to understand and think they both belong in `Types.h`

================
Comment at: clang-tools-extra/include-cleaner/lib/AnalysisInternal.h:85
+/// Represents properties of a symbol provider.
+enum class Hint : uint8_t {
+  None = 0x00,
----------------
hokein wrote:
> IIUC, we're using a union bit for symbols and headers: some properties are for both (e.g. Complete); some properties are just for headers (e.g. Public, Canonical).
> 
> I found that this model is hard to follow (e.g. for each property, I need to keep thinking about whether this property is for just symbols/headers or both, the hint conversion step from Symbol to Header is not obvious).
> 
> What do you think about the following alternative ?
> 
> - define two hints (`SymbolHint`, `HeaderHint`)
> - an explicit conversion from `SymbolHint` to `HeaderHint`
> - `SymbolLocation` class has a `SymbolHint` member (for `Header` as well)
> 
> 
> 
> 
I agree the model isn't obvious, but I think it is the right one and **should be explained**.
Breaking into a bunch of different types would be a mistake IMO. It would make it look less like algebra and more like domain logic. The problem is the algebra is important, e.g. it tells you how to merge hints to rank providers.

I think the model is:
 - we have a "provides" DAG, headers provide locations which provide symbols which satisfy AST nodes
 - these hints are "capabilities" that can be provided by edges (or subpaths).
 - if **any** path from Header => AST Node sees a capability, the header provides that capability.
 - this explains why most hints are expressed positively, and we take the union. (if a header provides a complete definition and an incomplete one, we want to call that "complete")

================
Comment at: clang-tools-extra/include-cleaner/lib/AnalysisInternal.h:87
+  None = 0x00,
+  /// Complete definition for the symbol, only set for cases that it makes a
+  /// difference. e.g. types or templated functions/classes.
----------------
As discussed elsewhere, we could define this as "provides a generally-usable definition for a symbol", i.e. clear it for template/class forward-decls and set it for everything else. This isn't really more code, and makes it less likely we're going to subtly misinterpret the value.

================
Comment at: clang-tools-extra/include-cleaner/lib/AnalysisInternal.h:90
+  Complete = 0x01,
+  /// Provides a non-private declaration for the symbol. Only absent in the
+  /// cases where file is explicitly marked as such, e.g. non self-contained or
----------------
hokein wrote:
> I think this is no non-private declaration. The `Public` property is for headers, it indicates the symbol is provided by a non-private header.
I agree with this one: because hints aggregate as union, it would be a mistake to flag a Symbol as Public - that would mean any header that provided it would end up with the `Public` hint, even if the header file is private!

I wonder if we should use a naming-convention for this, e.g. `CompleteSymbol`, `PublicHeader` etc, indicating the layer at which the signal is expected to be set.

================
Comment at: clang-tools-extra/include-cleaner/lib/AnalysisInternal.h:95
+  /// Declaration is explicitly marked as canonical, e.g. with a IWYU private
+  /// pragma that points at this provider.
+  Canonical = 0x04,
----------------
hokein wrote:
> This comment is hard to follow, it doesn't explain the `Canonical`. From the design doc, it means `NameMatch` and `Annotated` headers, I think it would be nice to document them in the comment.
> 
> I'd suggest avoid using the `Canonical` name, it remains me too much about the canonical decl in clang. Maybe `Preferred`?
+1 to Preferred and also documenting the conditions when a signal is set here in some detail (also for Public and Complete).

I think we really have to understand what the signal is quite precisely to make correct use of it.

================
Comment at: clang-tools-extra/include-cleaner/lib/AnalysisInternal.h:103
+  Hint Hints;
+  Hinted(T &&Wrapped, Hint H) : T(std::forward<T>(Wrapped)), Hints(H) {}
+};
----------------
std::forward should just be std::move here, T is not a reference type

================
Comment at: clang-tools-extra/include-cleaner/lib/AnalysisInternal.h:118

 /// A set of locations that provides the declaration.
+std::vector<Hinted<SymbolLocation>> locateSymbol(const Symbol &S);
----------------
missed this in the previous review: both `locateSymbol` and `headersForSymbol` should probably be above `writeHTMLReport`

================
Comment at: clang-tools-extra/include-cleaner/lib/AnalysisInternal.h:121
+
+/// Gets all the providers for a symbol by tarversing each location.
+/// Returned headers are sorted per relevance, i.e. first element is the most
----------------
tarversing -> traversing
sorted per -> sorted by

(uber-nit: `i.e.` should be `e.g.` or simply deleted, as "top element first" is not equivalent to "list is sorted")

================
Comment at: clang-tools-extra/include-cleaner/lib/FindHeaders.cpp:42
+llvm::SmallVector<Header> ranked(llvm::SmallVector<Hinted<Header>> Headers) {
+  auto Score = [](const Hinted<Header> &H) {
+    return static_cast<int>(H.Hints);
----------------
I would suggest making this function public (or `operator<` on `Hinted<Header>`) and unit-testing it to demonstrate the intended behavior.

The implementation is trivial now but may not always be, and editing enum constants at the end rather than in the correct position seems like an easy mistake to make.

================
Comment at: clang-tools-extra/include-cleaner/lib/FindHeaders.cpp:51
+               if (LHSScore == RHSScore)
+                 return printHeader(LHS) < printHeader(RHS);
+               return LHSScore > RHSScore;
----------------
doing string rendering in a comparator seems gratuitously wasteful.

call stable_sort instead? the input order was used prior to this patch and seems OK.
(or define a meaningful `operator<` on `Header` and call that, but it need not be done now)

================
Comment at: clang-tools-extra/include-cleaner/lib/FindHeaders.cpp:60
+  for (auto &H : Headers) {
+    auto SpelledH = printHeader(H);
+    llvm::StringRef SpelledHRef(SpelledH);
----------------
again, this seems unneccesary - switch over the variants instead?

================
Comment at: clang-tools-extra/include-cleaner/lib/FindHeaders.cpp:89
+    if (!PI)
+      return {{FE, Hint::Public}};
     while (FE) {
----------------
why never canonical?

In general, handling `!PI` as a special case up front seems like a bad idea: the purpose of supporting it being nullable is to allow the code below to used/tested without one. Seems we can get away with guarding PI-bits with `if (PI)` below.
(Avoiding the ternary initializer for `CurrentHints` would be a readability side-benefit!)

================
Comment at: clang-tools-extra/include-cleaner/lib/FindHeaders.cpp:97
       for (const auto *Export : PI->getExporters(FE, SM.getFileManager()))
-        Results.push_back(Header(Export));
+        Results.emplace_back(Export, CurrentHints);

----------------
hokein wrote:
> I think `Exporter` is the case we need to handle -- the exporter header should have a higher value (we probably need to add another hint bit).
I think we discussed and thought that was a bit unsafe. `export` can simply mean "downstream code can assumes this symbol is available", not "i'm the real owner of these symbols". 

Prominent example: https://github.com/protocolbuffers/protobuf/blob/14e579436fb53d0e8271c4d26d3d88c394d96454/src/google/protobuf/stubs/port.h#L90
(it's not relevant that the exported header in question is a standardish one)

================
Comment at: clang-tools-extra/include-cleaner/lib/FindHeaders.cpp:100
+      if (auto Verbatim = PI->getPublic(FE); !Verbatim.empty()) {
+        Results.emplace_back(Verbatim, CurrentHints | Hint::Canonical);
         break;
----------------
hokein wrote:
> 
> if FE is `private`, do we want  the `Verbatim` set the public bit? My inclination is yes (the current code behavior is no), it feels counterintuitive that  headers from `getPublic()` is not marked as `Public`.
+1, conceptually this is a reference to a sibling edge in the DAG, not an edge chaining off this one, so it should not inherit hints

================
Comment at: clang-tools-extra/include-cleaner/lib/FindHeaders.cpp:127
+    Headers.append(applyHints(findHeaders(Loc, SM, PI), Loc.Hints));
+  // We just want to de-duplicate providers by merging them. So use some sort of
+  // ordering to make sure duplicates are grouped together.
----------------
why not define `operator<` on `Header` if you need one? this will come up again.

================
Comment at: clang-tools-extra/include-cleaner/lib/FindHeaders.cpp:132
+  };
+  // FIXME: This doesn't de-duplicate if the same header gets mentioned in
+  // multiple ways (e.g. one through physical includes and another through a
----------------
not sure if this is really a FIXME: it's a pretty deep consequence of the design that will never be fixed.

Since you're relying on the distinction between "header" and "Header" here, maybe be a bit more explicit? "If two Headers probably refer to the same file (e.g. Verbatim(foo.h) and Physical(/path/to/foo.h), we won't deduplicate them or merge their hints".

================
Comment at: clang-tools-extra/include-cleaner/lib/FindHeaders.cpp:153
+  {
+    llvm::raw_string_ostream OS(SymbolName);
+    OS << S;
----------------
again, no need to stringify and then reparse here (in fact `operator<<` for Symbol is **extra**-inefficient, assuming that we never call it in algorithms), we have a perfectly good underlying string that's pretty easy to get at in the cases where matching is appropriate

================
Comment at: clang-tools-extra/include-cleaner/lib/FindHeaders.cpp:157
+  llvm::StringRef DeclName(SymbolName);
+  if (auto LastQualifier = DeclName.rfind("::"); LastQualifier != DeclName.npos)
+    DeclName = DeclName.drop_front(LastQualifier);
----------------
BTW This stringify-and-parse isn't quite correct, e.g. for `operator std::string` it yields `operator string` but for `operator typename T::foo()` it yields `foo`!

(I realize we do this in clangd, which isn't great...)

================
Comment at: clang-tools-extra/include-cleaner/unittests/FindHeadersTest.cpp:161

+class HeadersForSymbolTest : public FindHeadersTest {
+protected:
----------------
hokein wrote:
> I think we should have some tests for the hints bit (verify the hint bits are set correctly).
these tests seem to be in the wrong file

================
Comment at: clang-tools-extra/include-cleaner/unittests/FindHeadersTest.cpp:161

+class HeadersForSymbolTest : public FindHeadersTest {
+protected:
----------------
sammccall wrote:
> hokein wrote:
> > I think we should have some tests for the hints bit (verify the hint bits are set correctly).
> these tests seem to be in the wrong file
yes, I think this will be easier to maintain & make complete if you test the hints extracted, test the sort order, and only smoke-test the interaction between the two

================
Comment at: clang-tools-extra/include-cleaner/unittests/FindHeadersTest.cpp:263
+  )cpp");
+  Inputs.ExtraFiles["test/foo.proto.h"] = guard("struct foo;");
+  buildAST();
----------------
This is confusing, the most obvious interpretation is "this is the generated header for the .proto file that defines foo", but such a header would have a complete definition.

I can't think of a remotely realistic proto example that actually triggers this case, so maybe drop the references to proto? It seems to give some false intuitions about whether this behavior is correct.

================
Comment at: clang-tools-extra/include-cleaner/unittests/FindHeadersTest.cpp:266
+  EXPECT_THAT(headersForFoo(),
+              ElementsAre(physicalHeader("test/foo.proto.h"),
+                          physicalHeader("public_complete.h")));
----------------
hokein wrote:
> The result is correct according to the `(canonical, public, complete)` triple order.
> 
> But we can come up with a counterexample, e.g. 
>  
> ```
> // transitive.h
> #include "public_complete.h"
> 
> // main.cpp
> #include "transitive.h"
> Foo foo;
> ```
> 
> We will remove the `transitive.h` and insert the `foo.proto.h`, then we cause a compiling error (incomplete type of `Foo`). Not sure what we should do about it.
yeah, this is one of the tradeoffs of not examining completeness requirements at use sites.

However it's also a reason to believe that `operator<` on `Hints` may not be a simple lexicographic compare, several "lesser" hints may be able to overpower a "greater" one ==> another reason to make the comparison public & test it.

================
Comment at: clang-tools-extra/include-cleaner/unittests/FindHeadersTest.cpp:280
+  auto &SM = AST->sourceManager();
+  // FIXME: Mainfile should be boosted.
+  EXPECT_THAT(headersForFoo(),
----------------
This comment describes an implementation (IMO is unlikely to be the right one) to solve a problem (which is left implicit) - it would be nice to state the problem instead.

I think we should have some special handling for the main-file, but it's not clear that boosting it is the way to go.

For example, if the main-file provides a forward-decl for a class, we probably shouldn't consider any other decl of that class used. (The forward decl clearly signals that it doesn't intend to rely on some external decl). So maybe `headersForSymbol` should only return the main file, or return some sentinel "provided in main file" value instead of a list.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D139921/new/

https://reviews.llvm.org/D139921