[cfe-dev] Two pass analysis framework: AST merging approach

Gábor Horváth via cfe-dev cfe-dev at lists.llvm.org
Tue Jul 12 04:59:50 PDT 2016


Hi!

(As a ping), I would like to summarize the measurements I have done since the
original e-mail:

The approach is to first serialize all the translation units to the
storage, create an index of the functions, and then load them lazily on
demand to achieve cross translation unit support. This does not modify the
inter-procedural analysis of the Static Analyzer and could be used for
Clang Tidy as well. Once a new inter-procedural analysis is introduced for
the Static Analyzer, the cross-TU support would benefit from it immediately.
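
As a rough illustration of the index-plus-lazy-loading idea, here is a minimal
Python sketch. It is not the actual implementation: the class FunctionIndex,
its register and lookup methods, and the load_ast callback are all
hypothetical names, and a loaded AST dump is modelled as a plain dictionary.

```python
from typing import Callable, Dict, Optional


class FunctionIndex:
    """Maps a function name to the serialized AST file that defines it."""

    def __init__(self, load_ast: Callable[[str], dict]) -> None:
        self._load_ast = load_ast            # hypothetical loader for one AST dump
        self._where: Dict[str, str] = {}     # function name -> AST file path
        self._loaded: Dict[str, dict] = {}   # AST file path -> loaded unit (cache)

    def register(self, function: str, ast_file: str) -> None:
        # First pass: record which dump contains the definition.
        self._where[function] = ast_file

    def lookup(self, function: str) -> Optional[object]:
        # Second pass: load the defining unit only when it is first needed.
        ast_file = self._where.get(function)
        if ast_file is None:
            return None
        if ast_file not in self._loaded:
            self._loaded[ast_file] = self._load_ast(ast_file)
        return self._loaded[ast_file].get(function)
```

The point of the cache is that a translation unit is deserialized at most
once, and only if the analysis actually reaches a function defined in it.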

Benchmarks:
rAthena, a 150k LOC C project:
The size of the serialized ASTs was: 140MB
The size of the indexes: 4.4MB
The analysis time was below 4X that of the baseline (analysis without CTU)
The amount of memory consumed was below 2X that of the baseline
The number of reports was 3 times higher

Xerces, a 150k LOC C++ project:
The size of the serialized ASTs was: 800MB
The size of the indexes: 90MB
The analysis time using CTU was half of the time without CTU

LLVM + Clang + Clang tools extra:
The size of the serialized ASTs was: 45.4 GB
The size of the indexes: 1.6 GB

Some optimization efforts to reduce the size of the serialized ASTs:
TU ASTs after omitting function bodies from headers: 42.7 GB
TU ASTs after omitting STL: 34.0 GB
TU ASTs after skipping implicit instantiations: 21.5 GB
TU ASTs after omitting STL and implicit instantiations: 16.0 GB
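
To put these sizes in perspective, the following Python snippet computes the
relative reductions. It assumes that each figure is measured against the
unoptimized 45.4 GB total; the names BASELINE_GB, SIZES_GB and
reduction_percent are only illustrative.

```python
BASELINE_GB = 45.4  # size of all serialized ASTs before any optimization

SIZES_GB = {
    "omit function bodies from headers": 42.7,
    "omit STL": 34.0,
    "skip implicit instantiations": 21.5,
    "omit STL and implicit instantiations": 16.0,
}


def reduction_percent(size_gb: float, baseline_gb: float = BASELINE_GB) -> float:
    """Return how much smaller size_gb is than the baseline, in percent."""
    return (1.0 - size_gb / baseline_gb) * 100.0


for name, size in SIZES_GB.items():
    print(f"{name}: {size} GB ({reduction_percent(size):.1f}% smaller)")
```

Under that assumption the reductions are roughly 6%, 25%, 53% and 65%
respectively.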

Considering that the build directory of a debug build is also about 40 GB
on my platform, I do not consider the size of the serialized ASTs a
blocking issue. However, in case it is a concern, according to the attached
statistics about the serialized AST dumps, there are some optimization
possibilities in the binary format.

This solution also works in a multi-process build/analysis environment.
Some of the necessary framework code, for example the ASTImporter, is being
accepted into mainline clang right now.

All in all, this approach:
- Can produce new bug reports as is.
- Is feasible to implement and does not need severe modifications to the Static
Analyzer or Clang Tidy.
- Has acceptable performance for many real-world projects.

I think this would be a useful addition to the clang source tree. Do you
agree?

Regards,
Gábor


On 4 May 2016 at 15:09, Gábor Horváth <xazax.hun at gmail.com> wrote:

> Hi!
>
> This e-mail is a proposal based on the work done by Yury Gibrov et al.:
> http://lists.llvm.org/pipermail/cfe-dev/2015-December/046299.html
>
> They accomplished a two pass analysis: the first pass serializes the
> AST of every translation unit and creates an index of functions; the second
> pass does the real analysis and can load the AST of function bodies on
> demand.
>
> This approach can be used to achieve cross translation unit analysis for
> the clang Static Analyzer to some extent, but a similar approach could be
> applicable to Clang Tidy and other clang based tools.
>
> While this method is not likely to be a silver bullet for the Static
> Analyzer, I did some benchmarks to see how feasible this approach is. The
> baseline was running the Static Analyzer without the two pass analysis; the
> second run used the framework linked above.
>
> For a 150k LOC C project I got the following results:
> The size of the serialized ASTs was: 140MB
> The size of the indexes (textual representation): 4.4MB
> The time of the analysis was below 4X
> The amount of memory consumed was below 2X
>
> All in all it looks like a feasible approach for some use cases.
>
> I also tried to do a benchmark on the LLVM+Clang codebase. Unfortunately I
> was not able to run the analysis due to some missing features in the AST
> Importer. But I was able to serialize the ASTs and generate the indices:
> The size of the serialized ASTs: 45.4 GB
> The size of the function index: 1.6 GB
>
> While these numbers are less promising, I think there are some
> opportunities to reduce them significantly.
>
> I propose the introduction of an analysis mode for exporting ASTs. In
> analysis mode the AST exporter would not emit the function body of a
> function for several cases:
> - In case a function is defined in a header, do not emit the body.
> - In case the function was defined in an implicit template specialisation,
> do not emit the body.
>
> I think after similar optimizations it might be feasible to use this
> approach on LLVM scale projects as well, and it would be much easier to
> implement Clang based tools that can utilize cross translation unit
> capabilities.
>
> In case the analyzer gets a new interprocedural analysis method that would
> increase the performance, the users of this framework would profit from that
> approach immediately.
>
> Is a framework like this worth mainlining and working on? What do you
> think?
>
> (Note that AST Importer related improvements are already being mainlined
> by Yury et al. My question is about the "analysis mode" for exporting ASTs,
> and a general framework to consume those exported ASTs.)
>
> Regards,
> Gábor
>
>
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: statistics.json
Type: application/json
Size: 34207 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20160712/51f44e64/attachment.json>

