[cfe-dev] [RFC] Improving Support for Debugging Matchers

Mon Nov 29 10:18:34 PST 2021

Thank you for proposing this! I think making AST matchers easier to
use and debug is a fantastic goal. Some thoughts below.

On Tue, Nov 23, 2021 at 1:20 PM James King via cfe-dev
<cfe-dev at lists.llvm.org> wrote:
>
> Objective
>
> The goal for this proposal is to improve the experience of debugging Clang’s AST matchers and of understanding what they are doing.
>
> Background
>
> AST Matchers are a powerful tool that allows users to pick out nodes in the Clang AST, provided they match the specified constraints and criteria. Yet, creating a well-formed matcher can be difficult. Matchers can become complex when multiple layers of matchers are nested to create a larger compound matcher, which can be necessary to select parts of the AST that fulfill strict criteria. Since matchers only provide a boolean response as to whether they obtained any matches or did not obtain any matches, the user receives no explanation for when a matcher fails to produce any bindings. Even when a matcher produces bindings, performing introspection to debug false positive bindings can be tedious. When the result of a matcher is not expected, determining how the result was obtained is made difficult by this limited response. Without any other options to interact with matchers, other than trial-and-error based on binary feedback, this lack of transparency makes it difficult to introspect matcher behavior and identify problems (such as which part of a compound matcher is inconsistent with expectations). This often causes the process of debugging a faulty matcher to be slow, confusing, and frustrating.
>
> Design ideas
>
> Proposed Solution
>
> The proposed solution is composed of two parts.
>
> The first part (https://reviews.llvm.org/D113917) is to expose a getName method on the Matcher class. Currently, there is no easy way to access any form of identification information for any matcher. This method will act as a hook that allows for accessing the name of a matcher programmatically. While this method will be helpful for the implementation for the withTag matcher, it could also prove to be useful for other forms of matcher introspection/debugging.
>
> The second part (https://reviews.llvm.org/D113943) is to create an introspection node matcher (named the withTag matcher). This matcher will act as a way to enable verbose logging for matchers, which introduces the possibility for a workflow similar to printf debugging. This matcher is templated - it should accept an inner matcher of any type, and itself be a matcher of that type. For example, if the withTag matcher has an inner matcher of type Matcher<TypeLocMatcher>, it will return a Matcher<TypeLocMatcher>. In terms of functionality, the withTag matcher will not affect the outcome of the inner matcher and should directly return whatever the inner matcher returns.
>
> Before:
>
> varDecl(hasName("x"), hasTypeLoc(typeLoc().bind("TL")))
>
>
> After:
>
> withTag(varDecl(hasName("x"), hasTypeLoc(typeLoc().bind("TL"))))
>
>
> This matcher will be able to provide information such as:
>
> Matcher Type
>
> Matcher Name
>
> Matcher Success
>
> Node Kind
>
> Node Source
>
> The matcher will provide output by writing to stderr.

I'm not opposed to it, but making this user configurable seems like a
better approach (not everyone wants to write to stderr or go through
redirections to get the data into a file). Also, we may want to
consider what formats we want to provide this information. You give an
example below, but this sounds a lot like a case where we might want
to provide structured output (e.g., JSON) so that other tools can
consume it to provide a better user interface. (That seems like
something that can be done in follow-ups rather than something
required for the initial feature.)

> Though the exact format of the output is subject to change, it could look like this, for the matcher withTag(varDecl(withTag(hasName("x")), hasTypeLoc(pointerTypeLoc()))):
>
> ⭐ Attempting new match
>
> Matcher Name: `VarDecl` Node
>
> Node Kind: TypedefDecl
>
> Node Value:
>
> ```
>
> typedef char *__builtin_va_list
>
> ```
>
> Node AST:
>
> TypedefDecl 0x479af10 <<invalid sloc>> <invalid sloc> implicit __builtin_va_list 'char *'
>
> `-PointerType 0x47e6360 'char *'
>
>   `-BuiltinType 0x47e54d0 'char'
>
> Result: Unsuccessful
>
> ✔️ Concluding attempt
>
>
> ⭐ Attempting new match
>
> Matcher Name: `VarDecl` Node
>
> Node Kind: VarDecl
>
> Node Value:
>
> ```
>
> void *x
>
> ```
>
> Node AST:
>
> VarDecl 0x479af80 <input.cc:1:1, col:7> col:7 x 'void *'
>
> ⭐ Attempting new match
>
> Matcher Name: HasName
>
> Node Kind: VarDecl
>
> Node Value:
>
> ```
>
> void *x
>
> ```
>
> Node AST:
>
> VarDecl 0x479af80 <input.cc:1:1, col:7> col:7 x 'void *'
>
> Result: Successful
>
> ✔️ Concluding attempt
>
>
> Result: Successful
>
> ✔️ Concluding attempt
>
>
>
> From this output, a user is able to see things such as information about each node the matcher attempts to match, information about which part of the compound matcher is being matched, and the success of each matcher.
>
>
> Pros
>
> The design revolves around a matcher, which is a familiar pattern/construct
>
> The design has low overhead for using and removing - it requires wrapping one or more matchers with the withTag matcher.
>
> The design allows for a great degree of control - one can pick out certain matchers within a compound matcher to analyze.
>
> Cons
>
> One withTag matcher will only be able to report information about the inner matcher, and not necessarily matchers within the inner matcher
>
> However, it may be possible to provide a hook to access the inner matcher’s inner matcher

Another con that I see is: this is likely to get very chatty when
working on complex matchers. We might need some notion of a verbosity
level for the information.

> Printf debugging is just one method of diagnosing issues, and may not be everyone’s preferred approach to debugging. Alternative debugging approaches could include something like steveire’s debugger (https://steveire.wordpress.com/2019/04/16/debugging-clang-ast-matchers/).

FWIW, I loved a lot of the concepts Stephen was developing, so
enabling this kind of technology would be fantastic. One thing I think
is worth considering is that clang-query is currently the go-to tool
for developing complex AST matchers because of the REPL nature of the
tool. While printf-style debugging gets more information, I do wonder
if we should be considering how to integrate this into clang-query
more directly.

> Questions
>
> Currently, getName returns a string. Should we consider returning an enum, instead?

We sometimes rename matchers, so I think using an enumeration may have
some nice (or possibly annoying, depending on perspective) properties.
If we rename something and getName() returns a string, it's more of a
silent change; using an enumeration is more likely to produce a
compile error if someone is relying on the name for some reason. I
don't currently have a strong preference because I can see arguments
either way.

> One reason we have opted for a string, as of now, is because users may define their own private matchers which may not have a value in the enum.

That's a pretty compelling reason to go with the string approach.

> If getName were to return a string, what formats/conventions should we adhere to?
>
> Currently, the returned strings are not named in a very formal approach. For example, most traversal and narrowing matchers have a getName that returns the name of the matcher exactly, whereas most node matchers have a getName that returns a string containing the name of the node (ie. VarDecl instead of varDecl)

Naively, I think users are going to expect getName() to return the
name of the matcher as it is written (that also makes it easier to
visually distinguish between the matcher and the matched node).

~Aaron

>
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev