[cfe-dev] Patch: AST Matcher Framwork and an example tool

Tue Jun 14 23:21:24 PDT 2011

On Tue, Jun 14, 2011 at 10:46 PM, Chris Lattner <clattner at apple.com> wrote:

> On Jun 14, 2011, at 10:10 PM, Manuel Klimek wrote:
>
> The library we implemented based on those findings allows a very
>> declarative way of expressing interesting parts of the AST that you want to
>> work on, with a callback mechanism to have an imperative part that does the
>> actual transformation. The way this library uses templates to build up a
>> reverse hierarchy of the clang AST allows us to expand the library of
>> matchers without needing to manually copy the structure of the AST while
>> still providing full compile time type checking.
>> We have used this library for many internal transformations (one more
>> broadly applicable of them being included as an example tool), and Nico may
>> be able to explain more how he's using the infrastructure for Chromium.
>>
>>
>> My primary concern with this work is that it is built as a series of
>> *compile-time* code constructs.  This means that someone working on a
>> rewriter needs to rebuild the clang executable to do this.  Instead of doing
>> this as a compile-time construct, have you considered building this as an
>> extensible pattern matching tool where the pattern and replacement can be
>> specified at runtime?  I'm envisioning something like a "regular expression
>> for an AST".  You don't need to rebuild grep every time you want to search
>> for something new in text.
>>
>> Even in the case where compiled-in code is required, having a more dynamic
>> matcher would greatly simplify the code doing the matching.  Have you
>> considered this approach?
>>
>
> Yes, we've considered this approach early on. We looked into both some Java
>  and C-based solutions (see for example http://nighthacks.com/roller/jag/;
> for Java there are really really bad examples that match Java to XML and do
> xpath queries). The problem is that building that pattern matcher language
> would not be straight-forward (simply writing C++ with a few globs would not
> be enough for our current use cases, since C++ has a lot of implicit stuff
> going on which we want to match, and just creating an arbitrary new language
> doesn't necessarily seem better than the in-language DSL).
>
>
> I'm not suggesting that you parse "C++ with holes in it" as the pattern
> matching language.  It would be perfectly acceptable to represent the
> patterns as S expressions for example.  You currently use this sort of thing
> at compile time:
>
> ConstructorCall(HasDeclaration(Method(HasName(StringConstructor))),
>                 ArgumentCountIs(2),...
>
>
> This is basically already s expressions, you could either use this sort of
> thing unmodified if you think it is nice looking, or convert to proper s
> exprs, as in:
>
> (constructorcall (hasdeclaration (method (hasname stringconstructor)))
>                              (argumentcountis 2) ...
>
>
> The point is to make the matching language have really simple and trivial
> syntax, but syntax that is usable without rebuilding the compiler.  There
> are tons of examples of tree pattern matches (including things like Burg) to
> draw inspiration from.
>

> Considering that we want to eventually get a dynamic pattern matching
> language, but we also want to get it right, we are currently spending our
> time on the in-language DSL, and especially for the large scale stuff the
> developers we work with need surprisingly little help (the included example
> for replacing reduncant c_str() calls was created by a contributor who's not
> worked on the implementation of the matcher library).
>
>
> I'm not sure I get this logic.  You're saying that you're afraid you won't
> get the matching language right, so you'd avoiding it and doing something
> you know is wrong ;-).  I expect much iteration on this, but all that
> requires is to tell people to expect breakage as you get experience with it
> and evolve things.
>

I don't think our current code is "wrong". I think it is useful for a
certain amount of tools, and as the basis of more dynamic tools. The main
problem with building a dynamic language is that it's significantly more
effort, as it will require to be a lot more complete before being useful. In
the C++ world the user can always fall back to using the AST methods
available.

>
> And when we get to the higher-level refactoring tools, the dynamic aspect
> will be parameters to the refactoring, so the non-dynamic nature of the AST
> matchers does not matter for that case.
>
>
> Well, the second step is to be able to specify rewrites dynamically the
> same way you specify predicates.  In the case of a "real" refactoring
> engine, you'll probably want the power of a scripting language or something
> to write your predicates in.
>

Again, a lot more effort, and I think we can base a more dynamic solution on
what we currently develop.

When we look at the actual transformations, being in C++ again provides the
> benefit that we can just work with the AST nodes we matched instead of
> having to define some new way of dynamically specifying the transformations
> - and re-binding the AST in a dynamic language is definitely out-of-scope
> for us...
>
>
> I see the convenience in this, but still think it is the wrong way to go.
>
> In the end, I agree that the vision to have a really nice dynamic
> description of the matches is the ultimate goal, but for us this is
> currently still a few quarters out. The C++ code provides really useful
> abstractions to quickly describe matches and transformations on the AST,
> with little code (as we can use C++ to provide the type safety and thus the
> error checking on the AST nodes). The cost of a link-step while writing the
> tools has so far not been a big obstacle, especially considering that our
> main target users currently are a) C++ experts doing large scale code
> transformations and b) writing refactoring tools that end users can use
> without any knowledge about the AST.
>
>
> Beyond being "the wrong way to go" IMHO, there are several other problems
> with the code as proposed:
>
> 1. It doesn't following the LLVM coding standards, particularly around
> naming, using std::map<std::string, using C headers like <assert.h>, and a
> bunch of other stuff.
>

I tried to make it llvm coding style conforming, and had Chandler and
Zhanyong review it - I'd be happy to change whatever we missed.

>
> 2. You're building substantial new functionality into clang.  The clang
> binary is already overly bloated with the static analyzer and other
> functionality that it keeps accreting .  It would be better to use (and
> improve) the clang plugin mechanism to build this as a plugin.  I'd also
> like the analyzer to move off to being a plugin as well.  One carrot that we
> can give for people to build things as plugins is that they can use C++'0x
> features even though the clang compiler has to stay C++'98 for the
> forseeable future.
>

The point why I don't want to run tools as a plugin is that the command line
syntax for tools can be significantly different from running the compiler. I
don't think this needs to be linked into the clang binary though - as
nothing in clang depends on it (confused...)

>
> 3. The tooling infrastructure adds python stuff to do the rewrites.  This
> seems pretty half-baked to me.  If the whole reason to compile stuff in is
> to make things simpler, why do we need external scripts?
>

Yes, this is an intermediate step that will not be necessary soon. This is
work in progress...

>
> 4. Building this as compile-time stuff requires things
> like VariadicFunctions.h, which (if generally useful) should be in LLVM, not
> clang.  It is better to define away the problem though by not doing this
> stuff at compile time.
>
> 5. Adding major new stuff like JSON parsing, etc. All of these (if they
> even make sense) should be independently reviewed and submitted, not taking
> as one mega patch.
>

This was actually going in as multiple smaller patches. I just deleted
everything at once, as it was all submitted with Doug saying that we should
go ahead and he'll review post-submit, but after you intervened I thought it
doesn't make sense to leave half of the changes in. Then I re-diffed the
whole thing and put it in as one giant patch that would revert the delete.

> Overall, this is exactly the sort of thing that happens when someone
> develops a large amount of code out of tree, without input from other
> contributors, and then tries to spring it on an open source project.  While
> I really laud your goals and really want to push refactoring forward, this
> is not the right direction to start from. Trying to push a huge patch in
> isn't the way to get to something that is truly great in the mainline tree.
>

I certainly don't want to "spring it" on clang - and I'm sorry if I gave
this impression. I really tried to make sure that each step I did was ok,
trying to split it up into manageable pieces, confirming each check in with
Doug, and trying to get feedback - again, I'm sorry if that was not the
right approach. If you don't think this is useful to host in clang, because
you think it's not the right approach, than this is completely your call.

I definitely think what we have is useful for developers who use clang, and
as such I think the clang code base would be a good place to host it,
especially since I'd really appreciate feedback from you guys while
developing the code. But if you decide that we should host this code
elsewhere I'm fine with that, too.

Cheers,
/Manuel
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20110614/31441be5/attachment.html>