[PATCH] D120306: [clangd] IncludeCleaner: Add support for IWYU pragma private

Tue Feb 22 06:15:55 PST 2022

sammccall added a comment.

In D120306#3337490 <https://reviews.llvm.org/D120306#3337490>, @kbobyrev wrote:

> In D120306#3337374 <https://reviews.llvm.org/D120306#3337374>, @sammccall wrote:
>
>> In D120306#3337288 <https://reviews.llvm.org/D120306#3337288>, @kbobyrev wrote:
>>
>>> In D120306#3337212 <https://reviews.llvm.org/D120306#3337212>, @sammccall wrote:
>>>
>>>> What's the policy this patch intends to implement?
>>>>
>>>> I'm a little concerned by building up maps of filenames with segment limits - it looks like some kind of heuristic/partial matching.
>>>
>>> Basically, handling the mappings that. IWYU pragmas provide.
>>
>> Yes, I'm asking you to describe that handling in a couple of sentences so I can understand what you intend it to be.
>>
>> e.g. for includecleaner in general, the approach is: walk the AST, map each node to decls it uses, map decls to locations, map locations to files, then check which includes point to files that were not identified.
>
> I'm trying to reuse the information about `IWYU pragma private` that is collected in `Canoncalncludes` to find the header responsible for including the private ones. Canonicalncludes stores information about the headers with `IWYU pragma: prviate, include <XYZ>`.

This makes sense, though note that CanonicalIncludes predates IncludeStructure and had a pretty narrow purpose, it _might_ make more sense to record/store the data elsewhere, we should consider all options.

> When we're figuring out the responsible header in IncludeCleaner, I'm checking if the header has this pragma and then I'm trying to find the `<XYZ>` header that is mentioned in the pragma comment.
> For that I'm using a heuristic that the include stack of the main file should have a header whose suffix is <XYZ>.

So all we *know* is that the header that should be included can be spelled `<XYZ>`.
We don't know that it's on the include stack for the header defining the symbol, but we also don't know that it's included at all! (This is similar to the stdlib case).

It's true that for "unused include" warning we can ignore it if not included.
Still, it seems like a bad sign that we're trying to model headers referenced in this was as FileIDs rather than e.g. strings.
(And it will be harder to reuse for "missing include" warnings).

> This because we never really have the full header name in the pragma comment an?d
> it might not be visible from the private header (and hence can not be resolved from there)

Do you have an example? For include insertion purposes, we assume the header name can be included verbatim.
That implies it should be resolvable against the include path of any TU that sees the private header. (That said, performing IO might have its own problems)

> An alternative would probably be:
>
> - Walk #include directives and try to find the "umbrella headers"
> - For each include directive, query the included file and figure out if it has "IWYU pragma: private, include <XYZ>" through `CanonicalIncludes`
> - If it does, figure out if the supposed umbrella header name has the `<XYZ>` suffix, in that case attribute the private header to this public one
>
> This all would need to be done after `Canonicalncludes` are complete, meaning this would probably belong to `IncludeStructure` instead. Maybe this is more precise, WDYT?

I don't really understand this alternative, because "try to find the umbrella headers" seems to be begging the question.
What is an umbrella header apart from one referenced by an IWYU private pragma, and how would we identify them?

>> - whether we're comparing filenames among a small (e.g. the include stack - heuristics OK) or a large set (need to be very precise)
>
> Yes, I'm using just the include stack of the main file and the preamble. When you say "large set", do you mean the whole set of project headers?

Small set = include stack = chain of includes from main file -> private header. (But public header may not be on that stack).
Large set = include graph = all preamble headers.

> The problem here is that for the include stack we would miss the right diagnostic if the public header responsible for the private one is not accessible. Maybe we should throw an "unused" warning either way here, I don't know what would be right.
>
>> - whether the partial matching is semantically important or a performance optimitzation
>
> The matching itself is an optimisation, right: what I'm dealing with is that `gtest.h` header is called `/home/user/llvm-project/llvm/util/googletest/.../gtest/gtest.h` but the private headers will only say `include "gtest/gtest.h"`. What I'm trying to do is find the header whose real path name ends with `gtest/gtest.h`.

There's a few issues here:

- there might be multiple `"gtest/gtest.h"` (`#include_next`) though typically only in the stdlib
- ignoring the include path might lead to an incorrect match (`private; include "config.h"` should not match `<llvm/config.h>`)
- "real" path name *sounds* nice, but may not be the right one

I think the thing we're really trying to express is: `#include "gtest/gtest.h"` is not unused.
We can loosen that to match `#include <gtest/gtest.h>`, `#include "path/gtest/gtest.h"` etc if we want, but ideally we'd do that explicitly rather than as a side-effect of the implementation.

Again, I think the clearest way to express this intent is simply to bubble the string "gtest/gtest.h" all the way up to the "which inclusions are unused" check.

>> - whether the concept of "real name" is significant or likely to cause problems
>
> Hm, I don't know; what kind of problems do you think might appear?
>
>> - whether storing another copy of the names of all the transitive headers is actually necessary for what you're trying to do
>
> The problem here is that I'll need `FileID` in `headerResponsbile` ut I can't reuse `FileID`s between preamble and main file, right?

When we'd discussed this in the past, we'd talked about doing this in HeaderID space instead of FileID space.
headerResponsible() is the right API/point in the code for non-include-guarded headers, but may not be the right fit for pragmas, even if it means a bit of unfortunate duplication.

(Again, I think strings are probably better than HeaderIDs )

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D120306/new/

https://reviews.llvm.org/D120306