[cfe-dev] [RFC] Dynamic AST Matcher library

Wed Dec 5 13:59:12 PST 2012

On Wed, Dec 5, 2012 at 3:47 PM, Samuel Benzaquen <sbenza at google.com> wrote:
> There are two use cases for the dynamic matchers:
>  - Faster development time for someone creating matcher expressions for a
> specific purpose.
>  - Generic services that accept arbitrary matchers at runtime. Eg. a
> grep-like tool that uses matchers instead of regex.
>
> The latter was my main motivation for this work.

Ah, ok.

I think that a grep-like tool is a good idea for running on single
translation units (I once mentioned to klimek how I wanted something
like `clang -ast-print -ast-match=<arbitrary ASTMatcher expression>
foo.cc`).

I have some thoughts about the utility of this approach for searching
a full codebase, which I think is a general (albeit long-term-ish)
goal for everyone:

My concern is that a "grep-like" tool does a linear search, which is
going to take so long that I'm not sure that it will really be a huge
benefit to having "dynamic matchers" compared to just writing C++. I
haven't timed "building" with just -fsyntax-only, but I wouldn't
expect it to be faster than 5 minutes on my machine just to parse all
of LLVM/Clang. Larger code bases will be even more problematic.
Obviously if you have 10000 machines to run this on and the
infrastructure to rapidly distribute jobs to them---as you guys do at
Google---then this could be completely parallelized and have the
critical path being the longest translation unit + dispatching
latency; however, I would be surprised if there were more than a
handful of companies (let alone individual developers) with the
infrastructure in place to permit the creation of an
acceptably-performant tool for codebase-wide search with this approach
(AFAIK most distributed processing that companies have in place is
oriented at batch jobs where nontrivial (e.g. 10 sec, 1 min) minimum
dispatching latency is acceptable). Even if you have the
infrastructure available, the operational expenses that this would
incur might be prohibitive: nowadays any developer can easily spin up
their own EC2/whatever cloud, but the cost would be quite large to get
enough parallelism (utilization would also be extremely low). So from
my perspective the "grep-like" approach for full codebase search seems
like a piece of the puzzle.

Have you considered looking into a general mechanism for indexing the
AST, so that searches can be performed quickly? That would allow for
the creation of a general-purpose tool which, say, every C/C++
developer could have available to them and could run locally on their
machine. For example, this could be a service provided by clangd, and
the index could be updated as a by-product of the build process (e.g.
CXXFLAGS+='-update-index=~/foo.idx'). The index need not necessarily
contain everything needed for an exact match, but just enough to
greatly cut down the dataset that needs to be "grepped". This is very
similar to Russ Cox's description of how google code search used to
work <http://swtch.com/~rsc/regexp/regexp4.html>, except that instead
of node patterns in the clang AST, they dealt with regexes matching
flat text (it is a fantastic article btw, you should read it if you
haven't). I think that approach could be made to work for the AST.

These are longer-term thoughts but they might be worth keeping in mind.

-- Sean Silva