<div style="font-family: arial, helvetica, sans-serif; font-size: 10pt"><div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On Wed, Dec 5, 2012 at 4:59 PM, Sean Silva <span dir="ltr"><<a href="mailto:silvas@purdue.edu" target="_blank" class="cremed">silvas@purdue.edu</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div class="im">On Wed, Dec 5, 2012 at 3:47 PM, Samuel Benzaquen <<a href="mailto:sbenza@google.com" class="cremed">sbenza@google.com</a>> wrote:<br>


> There are two use cases for the dynamic matchers:<br>

>  - Faster development time for someone creating matcher expressions for a<br>

> specific purpose.<br>

>  - Generic services that accept arbitrary matchers at runtime. Eg. a<br>

> grep-like tool that uses matchers instead of regex.<br>

><br>

> The latter was my main motivation for this work.<br>

<br>

</div>Ah, ok.<br>

<br>

I think that a grep-like tool is a good idea for running on single<br>

translation units (I once mentioned to klimek how I wanted something<br>

like `clang -ast-print -ast-match=<arbitrary ASTMatcher expression><br>

foo.cc`).<br>

<br>

I have some thoughts about the utility of this approach for searching<br>

a full codebase, which I think is a general (albeit long-term-ish)<br>

goal for everyone:<br>

<br>

My concern is that a "grep-like" tool does a linear search, which is<br>

going to take so long that I'm not sure that it will really be a huge<br>

benefit to having "dynamic matchers" compared to just writing C++.</blockquote><div><br></div><div style>It simplifies the usage of matchers.</div><div style>Writing/compiling/running a matcher directly in C++ takes way longer than just running a prebuilt binary with a string argument.</div>

<div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">I haven't timed "building" with just -fsyntax-only, but I wouldn't<br>


expect it to be faster than 5 minutes on my machine just to parse all<br>

of LLVM/Clang. Larger code bases will be even more problematic.<br></blockquote><div><br></div><div>From what I saw, parsing and creating the AST is way more expensive than running the matchers.</div><div>The extra overhead of the dynamic dispatched layer does not seem to have any effect on the cost of matching.</div>

<div style>The AST could be cached in a serialized form to reduce the effect of parsing the code.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">


Obviously if you have 10000 machines to run this on and the<br>

infrastructure to rapidly distribute jobs to them---as you guys do at<br>

Google---then this could be completely parallelized and have the<br>

critical path being the longest translation unit + dispatching<br>

latency; however, I would be surprised if there were more than a<br>

handful of companies (let alone individual developers) with the<br>

infrastructure in place to permit the creation of an<br>

acceptably-performant tool for codebase-wide search with this approach<br>

(AFAIK most distributed processing that companies have in place is<br>

oriented at batch jobs where nontrivial (e.g. 10 sec, 1 min) minimum<br>

dispatching latency is acceptable). Even if you have the<br>

infrastructure available, the operational expenses that this would<br>

incur might be prohibitive: nowadays any developer can easily spin up<br>

their own EC2/whatever cloud, but the cost would be quite large to get<br>

enough parallelism (utilization would also be extremely low). So from<br>

my perspective the "grep-like" approach for full codebase search seems<br>

like a piece of the puzzle.<br></blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><br>

Have you considered looking into a general mechanism for indexing the<br>

AST, so that searches can be performed quickly?</blockquote><div><br></div><div style>I have given it some thought.</div><div style>For example, on a matcher like:<br></div><div style><br><div><font face="courier new, monospace">memberCallExpr(thisPointerType(recordDecl(hasName(class_name_)))),</font></div>

<div><font face="courier new, monospace">               callExpr(callee(methodDecl(hasName(call_name_)))))</font></div><br></div><div style>using an index for hasName() would be easy.</div><div style>Just filtering out compilation units that do not know about a class called class_name_ can help in most cases.</div>

<div style>Indexing common sub-expressions like <span style="font-family:'courier new',monospace">memberCallExpr(thisPointerType(recordDecl(hasName(class_name_))))) </span>could also have some benefit.</div><div style>

I don't think you could have an index that benefits all matchers, but the most common ones could be taken care of.</div><div style><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

That would allow for<br>

the creation of a general-purpose tool which, say, every C/C++<br>

developer could have available to them and could run locally on their<br>

machine. For example, this could be a service provided by clangd, and<br>

the index could be updated as a by-product of the build process (e.g.<br>

CXXFLAGS+='-update-index=~/foo.idx'). The index need not necessarily<br>

contain everything needed for an exact match, but just enough to<br>

greatly cut down the dataset that needs to be "grepped". This is very<br>

similar to Russ Cox's description of how google code search used to<br>

work <<a href="http://swtch.com/~rsc/regexp/regexp4.html" target="_blank" class="cremed">http://swtch.com/~rsc/regexp/regexp4.html</a>>, except that instead<br>

of node patterns in the clang AST, they dealt with regexes matching<br>

flat text (it is a fantastic article btw, you should read it if you<br>

haven't). I think that approach could be made to work for the AST.<br></blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<br>

These are longer-term thoughts but they might be worth keeping in mind.<br>

<span class=""><font color="#888888"><br>

-- Sean Silva<br>

</font></span></blockquote></div><br></div></div></div>