[cfe-dev] Patch: AST Matcher Framework and an example tool

Tue Jun 14 22:46:46 PDT 2011

On Jun 14, 2011, at 10:10 PM, Manuel Klimek wrote:
>> The library we implemented based on those findings allows a very declarative way of expressing interesting parts of the AST that you want to work on, with a callback mechanism to have an imperative part that does the actual transformation. The way this library uses templates to build up a reverse hierarchy of the clang AST allows us to expand the library of matchers without needing to manually copy the structure of the AST while still providing full compile time type checking.
>> We have used this library for many internal transformations (one of the more broadly applicable ones is included as an example tool), and Nico may be able to explain more about how he's using the infrastructure for Chromium.
> 
> My primary concern with this work is that it is built as a series of *compile-time* code constructs.  This means that someone working on a rewriter needs to rebuild the clang executable to do this.  Instead of doing this as a compile-time construct, have you considered building this as an extensible pattern matching tool where the pattern and replacement can be specified at runtime?  I'm envisioning something like a "regular expression for an AST".  You don't need to rebuild grep every time you want to search for something new in text.
> 
> Even in the case where compiled-in code is required, having a more dynamic matcher would greatly simplify the code doing the matching.  Have you considered this approach?
> 
> Yes, we considered this approach early on. We looked into both Java- and C-based solutions (see for example http://nighthacks.com/roller/jag/; for Java there are really bad examples that map Java to XML and run XPath queries). The problem is that building that pattern matching language would not be straightforward (simply writing C++ with a few globs would not be enough for our current use cases, since C++ has a lot of implicit stuff going on which we want to match, and just creating an arbitrary new language doesn't necessarily seem better than the in-language DSL).

I'm not suggesting that you parse "C++ with holes in it" as the pattern matching language.  It would be perfectly acceptable to represent the patterns as S expressions for example.  You currently use this sort of thing at compile time:

ConstructorCall(HasDeclaration(Method(HasName(StringConstructor))),
                ArgumentCountIs(2),...

This is basically already S-expressions; you could either use this sort of thing unmodified if you think it looks nice, or convert it to proper S-expressions, as in:

(constructorcall (hasdeclaration (method (hasname stringconstructor)))
                             (argumentcountis 2) ...

The point is to make the matching language have really simple and trivial syntax, but syntax that is usable without rebuilding the compiler.  There are tons of examples of tree pattern matches (including things like Burg) to draw inspiration from.
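
As a rough illustration of how little syntax such a runtime pattern language needs, here is a minimal sketch of an S-expression reader plus a structural matcher. Everything in it is hypothetical: Node, parsePattern and matches are names invented for this example, the "tree" is a toy stand-in rather than the clang AST, and predicate names such as hasname are treated as plain labels instead of real predicates.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical node type, used both for parsed patterns and for a toy tree.
// This is NOT a clang AST node.
struct Node {
  std::string Head;
  std::vector<Node> Children;
};

static bool isSpace(char C) {
  return std::isspace(static_cast<unsigned char>(C)) != 0;
}

static void skipSpace(const std::string &Text, std::size_t &Pos) {
  while (Pos < Text.size() && isSpace(Text[Pos]))
    ++Pos;
}

// Reads one atom: a run of characters up to whitespace or a parenthesis.
static std::string parseAtom(const std::string &Text, std::size_t &Pos) {
  std::size_t Start = Pos;
  while (Pos < Text.size() && !isSpace(Text[Pos]) && Text[Pos] != '(' &&
         Text[Pos] != ')')
    ++Pos;
  assert(Pos > Start && "caller must point at the first character of an atom");
  return Text.substr(Start, Pos - Start);
}

// Parses "(head child ...)" or a bare atom. Returns false on malformed input.
static bool parseExpr(const std::string &Text, std::size_t &Pos, Node &Out) {
  skipSpace(Text, Pos);
  if (Pos >= Text.size() || Text[Pos] == ')')
    return false;
  if (Text[Pos] != '(') {
    Out.Head = parseAtom(Text, Pos);
    return true;
  }
  ++Pos; // consume '('
  skipSpace(Text, Pos);
  if (Pos >= Text.size() || Text[Pos] == '(' || Text[Pos] == ')')
    return false; // a list must start with an atom
  Out.Head = parseAtom(Text, Pos);
  for (;;) {
    skipSpace(Text, Pos);
    if (Pos >= Text.size())
      return false; // missing ')'
    if (Text[Pos] == ')') {
      ++Pos;
      return true;
    }
    Node Child;
    if (!parseExpr(Text, Pos, Child))
      return false;
    Out.Children.push_back(Child);
  }
}

// Parses a whole pattern string; trailing garbage is an error.
bool parsePattern(const std::string &Text, Node &Out) {
  std::size_t Pos = 0;
  Out = Node();
  if (!parseExpr(Text, Pos, Out))
    return false;
  skipSpace(Text, Pos);
  return Pos == Text.size();
}

// A pattern matches a tree node if the heads are equal and every pattern
// child matches at least one child of the tree node (order is ignored).
bool matches(const Node &Pattern, const Node &Tree) {
  if (Pattern.Head != Tree.Head)
    return false;
  for (const Node &PChild : Pattern.Children) {
    bool Found = false;
    for (const Node &TChild : Tree.Children) {
      if (matches(PChild, TChild)) {
        Found = true;
        break;
      }
    }
    if (!Found)
      return false;
  }
  return true;
}
```

A driver could read the pattern string from the command line or a file, call parsePattern once, and then run matches against each candidate node, so trying a new pattern would not require a rebuild. A real tool would map each head to an actual predicate over AST nodes rather than comparing labels.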

> Considering that we want to eventually get a dynamic pattern matching language, but we also want to get it right, we are currently spending our time on the in-language DSL, and especially for the large scale stuff the developers we work with need surprisingly little help (the included example for replacing redundant c_str() calls was created by a contributor who has not worked on the implementation of the matcher library).

I'm not sure I get this logic.  You're saying that you're afraid you won't get the matching language right, so you're avoiding it and doing something you know is wrong ;-).  I expect much iteration on this, but all that requires is to tell people to expect breakage as you get experience with it and evolve things.

> And when we get to the higher-level refactoring tools, the dynamic aspect will be parameters to the refactoring, so the non-dynamic nature of the AST matchers does not matter for that case.

Well, the second step is to be able to specify rewrites dynamically the same way you specify predicates.  In the case of a "real" refactoring engine, you'll probably want the power of a scripting language or something to write your predicates in.

> When we look at the actual transformations, being in C++ again provides the benefit that we can just work with the AST nodes we matched instead of having to define some new way of dynamically specifying the transformations - and re-binding the AST in a dynamic language is definitely out-of-scope for us...

I see the convenience in this, but still think it is the wrong way to go.

> In the end, I agree that the vision to have a really nice dynamic description of the matches is the ultimate goal, but for us this is currently still a few quarters out. The C++ code provides really useful abstractions to quickly describe matches and transformations on the AST, with little code (as we can use C++ to provide the type safety and thus the error checking on the AST nodes). The cost of a link-step while writing the tools has so far not been a big obstacle, especially considering that our main target users currently are a) C++ experts doing large scale code transformations and b) developers writing refactoring tools that end users can use without any knowledge about the AST.

Beyond being "the wrong way to go" IMHO, there are several other problems with the code as proposed:

1. It doesn't follow the LLVM coding standards, particularly around naming, using std::map<std::string, using C headers like <assert.h>, and a bunch of other stuff.

2. You're building substantial new functionality into clang.  The clang binary is already overly bloated with the static analyzer and other functionality that it keeps accreting.  It would be better to use (and improve) the clang plugin mechanism to build this as a plugin.  I'd also like the analyzer to move off to being a plugin as well.  One carrot that we can give for people to build things as plugins is that they can use C++'0x features even though the clang compiler has to stay C++'98 for the foreseeable future.

3. The tooling infrastructure adds python stuff to do the rewrites.  This seems pretty half-baked to me.  If the whole reason to compile stuff in is to make things simpler, why do we need external scripts?

4. Building this as compile-time stuff requires things like VariadicFunctions.h, which (if generally useful) should be in LLVM, not clang.  It is better to define away the problem though by not doing this stuff at compile time.

5. Adding major new stuff like JSON parsing, etc. All of these (if they even make sense) should be independently reviewed and submitted, not taken as one mega patch.

Overall, this is exactly the sort of thing that happens when someone develops a large amount of code out of tree, without input from other contributors, and then tries to spring it on an open source project.  While I really laud your goals and really want to push refactoring forward, this is not the right direction to start from. Trying to push a huge patch in isn't the way to get to something that is truly great in the mainline tree.

-Chris
