[cfe-dev] RFC: Easier AST Matching by Default

Sat Jan 18 07:09:24 PST 2020

On Fri, Jan 17, 2020 at 7:55 AM Dmitri Gribenko <gribozavr at gmail.com> wrote:
>
> On Fri, Jan 17, 2020 at 12:41 PM Stephen Kelly <steveire at gmail.com> wrote:
>>
>>
>> On 13/01/2020 16:39, Dmitri Gribenko wrote:
>>
>> On Fri, Jan 10, 2020 at 4:34 PM Aaron Ballman via cfe-dev <cfe-dev at lists.llvm.org> wrote:
>>>
>>> I've not seen much to suggest the community is not in favor of this
>>> direction, so I think you're all set to post patches for review. If
>>> there are community concerns, we can address them during the review
>>> process. Thank you for working on this!
>>
>>
>> Sorry, I just came back from vacation. I'm not sure if it is a good idea, in particular, for the following reasons:
>>
>> - The document specifies the goals in terms of a solution: "Goal: Users should not have to account for implicit AST nodes when writing AST Matchers".
>>
>>
>> The goal is to make AST Matchers more accessible/easy to use for newcomers/non-experts in the clang AST by no-longer confronting them with 100% of the complexity on first use.
>>
>> I think it's important to bridge that gap. See my EuroLLVM talk for more.
>
> I totally agree with the goal of making AST Matchers, refactoring, and tooling more accessible (which is why I'm helping with libraries like Stencil and Syntax trees). However the goal that I quoted from the proposal on this thread ("Goal: Users should not have to account for implicit AST nodes when writing AST Matchers") constrains the solution to exactly what is being proposed.

FWIW, I think the proposed solution is a valuable goal. Implicit nodes
are *hard* for even experienced users to deal with and one of the more
common pieces of review feedback I find myself giving users boils down
to things like "do you need to ignore paren expressions here" and the
likes, and that feedback is inconsistent because I'm a human and
despite the massive amount of clang-tidy code I review, *I* still
forget about implicit nodes myself. I think that having a traversal
mode which ignores implicit nodes is a valuable feature and that
needing to match implicit nodes in AST matchers is the 5% case rather
than the 95% case.

>> - The complexity for implementers (although it also applies to https://reviews.llvm.org/D61837, which landed a while ago, and I didn't see it). Allowing to run the same matcher in multiple modes increases our support and testing matrix.
>>
>> I don't think it increases the test matrix more than any other feature. Can you expand on this?
>
> https://reviews.llvm.org/D61837 makes it possible to run every matcher with different traversal semantics. Our current tests only test one mode for each matcher, the default mode. There are two ways to define custom matchers: the advanced way (through the AST_MATCHER macro) that is needed when you want to traverse the AST yourself, and a simpler way -- by combining existing matchers.
>
> Consider the `makeIteratorLoopMatcher` defined in clang-tools-extra/clang-tidy/modernize/LoopConvertCheck.cpp. The caller can run the newly made matcher in any traversal mode, even though the matcher was probably written with the default mode in mind.
>
> Clang codebase does not have many matchers that are combinations of existing matchers, but code that uses Clang as a library (and often written by engineers who are not compiler experts and just need to get some refactoring done) does that very often in my experience working at Google.

clang-tidy uses a fair amount of matchers that are combinations of
existing matchers, but they're localized to the individual checks. The
AST matcher implementation itself composes matchers as well. Composed
matchers is something that the proposed functionality will have to
take into account and I fully agree that this feature will require
additional testing, but I don't see anything here that suggests this
is an undue increase in testing or maintenance burden based on what
you described. Maybe I'm looking at it from the wrong angle still,
though.

>> - The complexity cliff for users. Many non-trivial matchers need to explicitly manage implicit AST nodes.
>>
>>
>> You just described the complexity cliff which is (after my change) not the first thing a newcomer hits.
>>
>> It's good to move that cliff back so that the newcomer is not confronted with it immediately.
>
> As soon as you pretty-print the AST (which I think most people should do before writing a matcher), you see implicit nodes.

I think you mean "when you dump the AST" as opposed to "pretty-print
the AST". When you pretty print, you do not typically see implicit
nodes because we're reconstructing the source. When you AST dump, you
currently see implicit AST nodes but 1) they're typically explicitly
marked as being implicit nodes, and 2) there's no reason we can't give
an option to control AST output generation. In support of point #2,
there's already a need for this with the JSON AST dumper to be able to
dump in compact vs non-compact mode, but we haven't devised a general
way to control AST dump output yet. Selecting whether to dump implicit
or explicit nodes is just another kind of option along those lines.

>> - The cost of ClangTidy. ClangTidy runs all matchers simultaneously in one AST traversal. If we implement implicit/no-implicit modes as separate AST traversals, ClangTidy would need to run two traversals, one for "easier" matchers (skipping implicit code), and one for "expert" matchers (traversing implicit code). We would also need to build two parent maps.
>>
>>
>> Will Syntax Trees ever be used in clang-tidy? Or will their use be forbidden in clang-tidy for the same reason of adding to a runtime cost when used in a mixture?
>>
>> If your concern is valid it would seem to apply to Syntax Trees to a greater extent than this patch does.
>
> Syntax trees will be used in ClangTidy once they are more fleshed out.

Is this a hope, or was this a decision made already?

> You are absolutely right about syntax trees having their cost. However, the runtime cost of syntax trees comes with a benefit of actually providing an API that was designed to model syntax trees. The proposal to visit only explicitly written AST nodes will only make matching look like it matches syntax, but it won't help with any further inspection of the AST nodes, mutation, or even viewing the AST before you write a matcher.
>>
>> That said, I think there is some scope for optimization in my approach. The optimizations I have in mind (being more efficient with parent maps) wouldn't apply to Syntax Trees though as that is a completely separate system.
>>
>>
>>
>> My best suggestion is to investigate implementing AST Matchers for syntax trees, and allow jumping between syntax and semantic nodes in one matcher -- to match syntax first, and then validate the necessary semantic constraints.
>>
>> My previous email quoted the Syntax Trees RFC saying that the first step is to "use [the current ASTMatchers API] to find interesting Clang AST nodes". You seem to be suggesting that the opposite would be done. Maybe things have evolved since that RFC, or maybe both approaches should work equally well.
>
>
> I discussed it with Ilya in person and we concluded that the way it came through as written in the RFC did not reflect our understanding of Syntax Trees. See Ilya's response on this thread.
>
>>
>> As long as the current AST Matchers APIs exist, they should be improved. Perhaps they will be less important in the future when we have Syntax Trees for these use-cases instead. However, for now, I don't think the ease of use improvement in my patches should be prevented based existence of a future alternative system.
>
> I agree that we should not block improvements on existence of a future system. However, each specific change should be evaluated based on its merits and costs. I think there are non-trivial downsides to offering a different mode in existing AST matchers, which create maintenance and testing costs that we will be paying perpetually.

I still don't see these costs as being undue, however, I definitely
see the pain point this RFC is intended to lessen. It's something that
comes up with a high degree of frequency and I'm worried about
ignoring that pain because we hope a totally new approach will pan out
in the future despite any practical experience with that new approach.
That said, we've lived with requiring used to match on implicit nodes
for this long, it would not kill us to wait longer if we needed to.

> It instantly creates a huge testing gap for existing AST matchers (which we currently test only in one mode). It also creates traps for existing code that will be able to start accidentally calling existing matchers in a traversal mode that they were not designed for.

I think the testing gap can be solved with software solutions
(judicious use of macros and/or templates). The traps for existing
code are a bit of a concern, but truthfully, no more of a concern than
use of StringRef is. If the composed matcher is publicly exposed, it's
something the author definitely has to account for. If the composed
matcher is private to a particular check or other usage (which is the
majority case as best I can tell), then the author really only has to
care about the mode being used for that usage.

> On the other hand, my objection primarily applies to https://reviews.llvm.org/D61837, which already landed, not to changing the default. Changing the default just exacerbates the problem significantly by affecting the semantics of existing code silently. However, Clang does not aim to provide a stable API even in the tooling space. So maybe it is too late for me to object, and unless we also remove the traverse() matchers, my objection can't be fully addressed.

FWIW, this was something I was concerned by as well, but eventually I
was okay with it specifically because we make no stability guarantees
in this space as it stands. I'd love it if the breakage could be
non-silent though.

> If you want to still proceed with changing the default, my request would be to provide detailed guidance (probably as a section in release notes) about updating existing code that uses matchers, including ClangTidy checkers.

Absolutely agreed that this would be necessary.

~Aaron

>
> Dmitri
>
> --
> main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if
> (j){printf("%d\n",i);}}} /*Dmitri Gribenko <gribozavr at gmail.com>*/