[cfe-dev] [RFC] Dynamic AST Matcher library

Wed Dec 5 15:35:53 PST 2012

On Wed, Dec 5, 2012 at 4:59 PM, Sean Silva <silvas at purdue.edu> wrote:

> On Wed, Dec 5, 2012 at 3:47 PM, Samuel Benzaquen <sbenza at google.com>
> wrote:
> > There are two use cases for the dynamic matchers:
> >  - Faster development time for someone creating matcher expressions for a
> > specific purpose.
> >  - Generic services that accept arbitrary matchers at runtime. Eg. a
> > grep-like tool that uses matchers instead of regex.
> >
> > The latter was my main motivation for this work.
>
> Ah, ok.
>
> I think that a grep-like tool is a good idea for running on single
> translation units (I once mentioned to klimek how I wanted something
> like `clang -ast-print -ast-match=<arbitrary ASTMatcher expression>
> foo.cc`).
>
> I have some thoughts about the utility of this approach for searching
> a full codebase, which I think is a general (albeit long-term-ish)
> goal for everyone:
>
> My concern is that a "grep-like" tool does a linear search, which is
> going to take so long that I'm not sure that it will really be a huge
> benefit to having "dynamic matchers" compared to just writing C++.

It simplifies the usage of matchers.
Writing/compiling/running a matcher directly in C++ takes way longer than
just running a prebuilt binary with a string argument.

I haven't timed "building" with just -fsyntax-only, but I wouldn't
> expect it to be faster than 5 minutes on my machine just to parse all
> of LLVM/Clang. Larger code bases will be even more problematic.
>

>From what I saw, parsing and creating the AST is way more expensive than
running the matchers.
The extra overhead of the dynamic dispatched layer does not seem to have
any effect on the cost of matching.
The AST could be cached in a serialized form to reduce the effect of
parsing the code.

> Obviously if you have 10000 machines to run this on and the
> infrastructure to rapidly distribute jobs to them---as you guys do at
> Google---then this could be completely parallelized and have the
> critical path being the longest translation unit + dispatching
> latency; however, I would be surprised if there were more than a
> handful of companies (let alone individual developers) with the
> infrastructure in place to permit the creation of an
> acceptably-performant tool for codebase-wide search with this approach
> (AFAIK most distributed processing that companies have in place is
> oriented at batch jobs where nontrivial (e.g. 10 sec, 1 min) minimum
> dispatching latency is acceptable). Even if you have the
> infrastructure available, the operational expenses that this would
> incur might be prohibitive: nowadays any developer can easily spin up
> their own EC2/whatever cloud, but the cost would be quite large to get
> enough parallelism (utilization would also be extremely low). So from
> my perspective the "grep-like" approach for full codebase search seems
> like a piece of the puzzle.
>

> Have you considered looking into a general mechanism for indexing the
> AST, so that searches can be performed quickly?

I have given it some thought.
For example, on a matcher like:

memberCallExpr(thisPointerType(recordDecl(hasName(class_name_)))),
               callExpr(callee(methodDecl(hasName(call_name_)))))

using an index for hasName() would be easy.
Just filtering out compilation units that do not know about a class called
class_name_ can help in most cases.
Indexing common sub-expressions like
memberCallExpr(thisPointerType(recordDecl(hasName(class_name_))))) could
also have some benefit.
I don't think you could have an index that benefits all matchers, but the
most common ones could be taken care of.

That would allow for
> the creation of a general-purpose tool which, say, every C/C++
> developer could have available to them and could run locally on their
> machine. For example, this could be a service provided by clangd, and
> the index could be updated as a by-product of the build process (e.g.
> CXXFLAGS+='-update-index=~/foo.idx'). The index need not necessarily
> contain everything needed for an exact match, but just enough to
> greatly cut down the dataset that needs to be "grepped". This is very
> similar to Russ Cox's description of how google code search used to
> work <http://swtch.com/~rsc/regexp/regexp4.html>, except that instead
> of node patterns in the clang AST, they dealt with regexes matching
> flat text (it is a fantastic article btw, you should read it if you
> haven't). I think that approach could be made to work for the AST.
>

> These are longer-term thoughts but they might be worth keeping in mind.
>
> -- Sean Silva
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20121205/70773de3/attachment.html>