[cfe-dev] [RFC] Dynamic AST Matcher library

Thu Dec 6 01:54:31 PST 2012

As additional context, note that:
- we can run servers that have all of the source code pre-parsed in memory;
being able to run multiple matchers in parallel over a pre-built AST is
thus possible as long as we don't need to recompile
- jit-ing C++ code inside those servers is both not simple (yay C++) and
has security implications that not everybody might be happy with
- the proposed "language" is something-like-s-expressions with some
constants on top of it; that's hardly C++

Cheers,
/Manuel

On Thu, Dec 6, 2012 at 12:35 AM, Samuel Benzaquen <sbenza at google.com> wrote:

>
>
>
> On Wed, Dec 5, 2012 at 4:59 PM, Sean Silva <silvas at purdue.edu> wrote:
>
>> On Wed, Dec 5, 2012 at 3:47 PM, Samuel Benzaquen <sbenza at google.com>
>> wrote:
>> > There are two use cases for the dynamic matchers:
>> >  - Faster development time for someone creating matcher expressions for
>> a
>> > specific purpose.
>> >  - Generic services that accept arbitrary matchers at runtime. Eg. a
>> > grep-like tool that uses matchers instead of regex.
>> >
>> > The latter was my main motivation for this work.
>>
>> Ah, ok.
>>
>> I think that a grep-like tool is a good idea for running on single
>> translation units (I once mentioned to klimek how I wanted something
>> like `clang -ast-print -ast-match=<arbitrary ASTMatcher expression>
>> foo.cc`).
>>
>> I have some thoughts about the utility of this approach for searching
>> a full codebase, which I think is a general (albeit long-term-ish)
>> goal for everyone:
>>
>> My concern is that a "grep-like" tool does a linear search, which is
>> going to take so long that I'm not sure that it will really be a huge
>> benefit to having "dynamic matchers" compared to just writing C++.
>
>
> It simplifies the usage of matchers.
> Writing/compiling/running a matcher directly in C++ takes way longer than
> just running a prebuilt binary with a string argument.
>
> I haven't timed "building" with just -fsyntax-only, but I wouldn't
>> expect it to be faster than 5 minutes on my machine just to parse all
>> of LLVM/Clang. Larger code bases will be even more problematic.
>>
>
> From what I saw, parsing and creating the AST is way more expensive than
> running the matchers.
> The extra overhead of the dynamic dispatched layer does not seem to have
> any effect on the cost of matching.
> The AST could be cached in a serialized form to reduce the effect of
> parsing the code.
>
>
>> Obviously if you have 10000 machines to run this on and the
>> infrastructure to rapidly distribute jobs to them---as you guys do at
>> Google---then this could be completely parallelized and have the
>> critical path being the longest translation unit + dispatching
>> latency; however, I would be surprised if there were more than a
>> handful of companies (let alone individual developers) with the
>> infrastructure in place to permit the creation of an
>> acceptably-performant tool for codebase-wide search with this approach
>> (AFAIK most distributed processing that companies have in place is
>> oriented at batch jobs where nontrivial (e.g. 10 sec, 1 min) minimum
>> dispatching latency is acceptable). Even if you have the
>> infrastructure available, the operational expenses that this would
>> incur might be prohibitive: nowadays any developer can easily spin up
>> their own EC2/whatever cloud, but the cost would be quite large to get
>> enough parallelism (utilization would also be extremely low). So from
>> my perspective the "grep-like" approach for full codebase search seems
>> like a piece of the puzzle.
>>
>
>> Have you considered looking into a general mechanism for indexing the
>> AST, so that searches can be performed quickly?
>
>
> I have given it some thought.
> For example, on a matcher like:
>
> memberCallExpr(thisPointerType(recordDecl(hasName(class_name_)))),
>                callExpr(callee(methodDecl(hasName(call_name_)))))
>
> using an index for hasName() would be easy.
> Just filtering out compilation units that do not know about a class called
> class_name_ can help in most cases.
> Indexing common sub-expressions like
> memberCallExpr(thisPointerType(recordDecl(hasName(class_name_))))) could
> also have some benefit.
> I don't think you could have an index that benefits all matchers, but the
> most common ones could be taken care of.
>
> That would allow for
>> the creation of a general-purpose tool which, say, every C/C++
>> developer could have available to them and could run locally on their
>> machine. For example, this could be a service provided by clangd, and
>> the index could be updated as a by-product of the build process (e.g.
>> CXXFLAGS+='-update-index=~/foo.idx'). The index need not necessarily
>> contain everything needed for an exact match, but just enough to
>> greatly cut down the dataset that needs to be "grepped". This is very
>> similar to Russ Cox's description of how google code search used to
>> work <http://swtch.com/~rsc/regexp/regexp4.html>, except that instead
>> of node patterns in the clang AST, they dealt with regexes matching
>> flat text (it is a fantastic article btw, you should read it if you
>> haven't). I think that approach could be made to work for the AST.
>>
>
>> These are longer-term thoughts but they might be worth keeping in mind.
>>
>> -- Sean Silva
>>
>
>
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20121206/4c1d8e7e/attachment.html>