[cfe-dev] Two pass analysis framework: AST merging approach

Manuel Klimek via cfe-dev cfe-dev at lists.llvm.org
Tue Jul 12 05:09:03 PDT 2016


On Tue, Jul 12, 2016 at 1:59 PM Gábor Horváth <xazax.hun at gmail.com> wrote:

> Hi!
>
> (As a ping), I would like to summarize the measurements I done since the
> original e-mail:
>
> The approach is to first serialize all the translation units to the
> storage, create an index of the functions, and then load them lazily on
> demand to achieve cross translation unit support. This does not modify the
> inter-procedural analysis of the Static Analyzer and could be used for
> Clang Tidy as well. Once a new inter-procedural analysis is introduced for
> the Static Analyzer, the cross tu support would profit from it immediately.
>
> Benchmarks:
> rAthena, a 150k LOC C project:
> The size of the serialized ASTs was: 140MB
> The size of the indexes: 4.4MB
> The time of the analysis was bellow 4X
> The amount of memory consumed was bellow 2X
> The number of reports is 3 times more
>
> Xerces, 150k LOC C++ project:
> The size of the serialized ASTs was:  800MB
> The size of the indexes: 90MB
> The analysis time using CTU was the half of the one without CTU
>
> LLVM + Clang + Clang tools extra:
> The size of the serialized ASTs was: 45.4 GB
> The size of the indexes:  1,6GB
>
> Some optimization effort to reduce te size of the CFG:
> TU ASTs after omitting function bodies from headers: 42.7 GB
> TU ASTs after omitting STL: 34.0 GB
> TU ASTs after skipping implicit instantiations: 21.5 GB
> TU ASTs after omitting STL and implicit instantiations: 16.0 GB
>
> Considering that the build directory of a debug build is also about 40 GB
> on my platform, I do not consider the size of the serialized ASTs a
> blocking issue. However, in case it is a concern, according to the attached
> statistics about the serialized AST dumps, there are some optimization
> possibilities in the binary format.
>
> This solution also works in a multi process build/analysis environment.
> Some of the necessary framework, for example ASTImporter code is being
> accepted into mainline clang right now.
>
> All in all, this approach:
> - Can discover new bug reports as is.
> - Feasible to implement, does not need sever modifications to the Static
> Analyzer or Clang Tidy.
> - Has acceptable performance for lots of the real world projects.
>
> I think, this would be a useful addition to the clang source tree. Do you
> agree?
>

I definitely think this is interesting - I'd be curious if we could design
the interfaces in a way that we can
a) fully split out indexing from analysis
b) allow storing the indexing data in a distributed storage system


> Regards,
> Gábor
>
>
> On 4 May 2016 at 15:09, Gábor Horváth <xazax.hun at gmail.com> wrote:
>
>> Hi!
>>
>> This e-mail is a proposal based on the work done by Yury Gibrov et al.:
>> http://lists.llvm.org/pipermail/cfe-dev/2015-December/046299.html
>>
>> They accomplished a two pass analysis, the first pass is serializing the
>> AST of every translation unit and creates an index of functions, the second
>> pass does the real analysis, which can load the AST of function bodies on
>> demand.
>>
>> This approach can be used to achieve cross translation unit analysis for
>> the clang Static Analyzer to some extent, but similar approach could be
>> applicable to Clang Tidy and other clang based tools.
>>
>> While this method is not likely to be a silver bullet for the Static
>> Analyzer, I did some benchmarks to see how feasible this approach is. The
>> baseline was running the Static Analyzer without the two pass analyis, the
>> second one was running using the framework linked above.
>>
>> For a 150k LOC C projects I got the following results:
>> The size of the serialized ASTs was: 140MB
>> The size of the indexes (textual representation): 4.4MB
>> The time of the analysis was bellow 4X
>> The amount of memory consumed was bellow 2X
>>
>> All in all it looks like a feasible approach for some use cases.
>>
>> I also tried to do a benchmark on the LLVM+Clang codebase. Unfortunately
>> I was not able to run the analysis due to some missing features in the AST
>> Importer. But I was able to serialize the ASTs and generate the indices:
>> The siye of the serialized ASTs: 45.4 GB
>> The siye of the function index: 1,6GB
>>
>> While these numbers are less promising, I think there are some
>> opportunities to reduce them significantly.
>>
>> I propose the introduction of an analysis mode for exporting ASTs. In
>> analysis mode the AST exporter would not emit the function body of a
>> function for several cases:
>> - In case a function is defined in a header, do not emit the body.
>> - In case the function was defined in an implicit template
>> specialisation, do not emit the body.
>>
>> I think after similar optimizations it might be feasible to use this
>> approach on LLVM scale projects as well, and it would be much easier to
>> implement Clang based tools that can utilize cross translation unit
>> capabilities.
>>
>> In case the analyzer gets a new interprocedural analysis method that
>> would increase the performance the users of this framework would profit
>> from that approach immediately.
>>
>> Does a framework like this worth mainlining and working on? What do you
>> think?
>>
>> (Note that, AST Importer related improvements are already being mainlined
>> by Yury et al. My question is about the "analysis mode" for exporting ASTs,
>> and a general framework to consume those exported ASTs.)
>>
>> Regards,
>> Gábor
>>
>>
>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20160712/2c354153/attachment.html>


More information about the cfe-dev mailing list