[cfe-dev] Two pass analysis framework: AST merging approach

Mon Jun 20 01:30:45 PDT 2016

Hi!

I did some additional benchmarks.

On 4 May 2016 at 15:09, Gábor Horváth <xazax.hun at gmail.com> wrote:

> Hi!
>
> This e-mail is a proposal based on the work done by Yury Gibrov et al.:
> http://lists.llvm.org/pipermail/cfe-dev/2015-December/046299.html
>
> They implemented a two-pass analysis: the first pass serializes the AST
> of every translation unit and creates an index of functions; the second
> pass does the real analysis and can load the ASTs of function bodies on
> demand.
>
> This approach can be used to achieve cross translation unit analysis for
> the Clang Static Analyzer to some extent, but a similar approach could be
> applicable to Clang Tidy and other Clang-based tools.
>
> While this method is not likely to be a silver bullet for the Static
> Analyzer, I did some benchmarks to see how feasible this approach is. The
> baseline was running the Static Analyzer without the two-pass analysis; the
> second run used the framework linked above.
>
> For a 150k LOC C project I got the following results:
> The size of the serialized ASTs was: 140 MB
> The size of the indexes (textual representation): 4.4 MB
> The analysis time was below 4x the baseline
> The amount of memory consumed was below 2x the baseline
>

I also tried this approach on the Xerces XML parsing library, and,
surprisingly, the analysis was faster with the cross translation unit
method. The size of the AST dumps was about 800 MB.
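
For readers new to the thread, the two-pass scheme quoted above (serialize
every translation unit, build a function index, load ASTs on demand) can be
sketched roughly as follows. This is only an illustration in Python, not the
framework's code: the one-pair-per-line index format, the names parse_index
and LazyAstStore, and the loader callback are all assumptions made for the
example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


def parse_index(text: str) -> Dict[str, str]:
    """Parse a textual index with one '<function-id> <ast-file>' pair per line.

    The format is hypothetical; the real index layout may differ.
    """
    index: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        func_id, ast_file = line.split(maxsplit=1)
        index[func_id] = ast_file
    return index


@dataclass
class LazyAstStore:
    """Second pass helper: load a serialized AST only when a function needs it."""

    index: Dict[str, str]
    load_ast: Callable[[str], object]  # hypothetical deserializer callback
    _cache: Dict[str, object] = field(default_factory=dict)

    def unit_for(self, func_id: str) -> Optional[object]:
        """Return the loaded unit that defines func_id, or None if unknown."""
        ast_file = self.index.get(func_id)
        if ast_file is None:
            return None  # no definition was seen in the first pass
        if ast_file not in self._cache:
            # Deserialize each AST file at most once.
            self._cache[ast_file] = self.load_ast(ast_file)
        return self._cache[ast_file]
```

The cache keeps each dump in memory after its first use, which trades memory
for time; that is in line with the higher memory consumption reported above,
although the sketch does not claim to explain those numbers.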

>
> All in all it looks like a feasible approach for some use cases.
>
> I also tried to do a benchmark on the LLVM+Clang codebase. Unfortunately I
> was not able to run the analysis due to some missing features in the AST
> Importer. But I was able to serialize the ASTs and generate the indices:
> The size of the serialized ASTs: 45.4 GB
> The size of the function index: 1.6 GB
>
>

> While these numbers are less promising, I think there are some
> opportunities to reduce them significantly.
>
> I propose the introduction of an analysis mode for exporting ASTs. In
> analysis mode the AST exporter would not emit the body of a function in
> the following cases:
> - In case a function is defined in a header, do not emit the body.
> - In case the function was defined in an implicit template specialisation,
> do not emit the body.
>

Not emitting function bodies at all reduced the size of the AST dumps by
only about 10%. I think, however, that there are some other possibilities
to reduce the size:
* While using modules to reduce the size would be awesome, it is not
feasible for every project. In my opinion it is feasible, however, to use
modules partially! For example, the STL and some other libraries that are
known to be well behaved (in the sense that they can be built with module
support) can be serialized as one module, and then each translation unit
can be a separate one. This way the ASTs of these libraries are
deduplicated, which can result in massive savings.
* I think it is possible to make the serialized AST representation more
compact. I have attached statistics in JSON format that show which parts of
the AST contribute the most to the size and how many abbreviations are
used. The number of abbreviations could be increased in some cases, and
lots of the boolean fields are stored in 64-bit integer fields.
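
To illustrate why the width of boolean fields matters, here is a small
Python sketch that compares one 64-bit integer per flag with one bit per
flag. It is not a model of the Clang serialization format or of the
bitstream container, where the actual on-disk cost depends on the
abbreviation in use; encode_wide, encode_packed and decode_packed are
hypothetical helpers written only for this comparison.

```python
import struct
from typing import List, Sequence


def encode_wide(flags: Sequence[bool]) -> bytes:
    """Store every boolean in its own 64-bit little-endian integer."""
    return b"".join(struct.pack("<Q", int(f)) for f in flags)


def encode_packed(flags: Sequence[bool]) -> bytes:
    """Store one bit per boolean, padded up to a whole number of bytes."""
    out = bytearray((len(flags) + 7) // 8)
    for i, f in enumerate(flags):
        if f:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def decode_packed(data: bytes, count: int) -> List[bool]:
    """Recover the first `count` booleans from a packed buffer."""
    return [bool((data[i // 8] >> (i % 8)) & 1) for i in range(count)]
```

For nine flags the wide encoding takes 9 * 8 = 72 bytes and the packed one
takes 2 bytes. How much of that gap exists in the real dumps is exactly what
the attached statistics would have to show.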

What do you think?

> I think after similar optimizations it might be feasible to use this
> approach on LLVM scale projects as well, and it would be much easier to
> implement Clang based tools that can utilize cross translation unit
> capabilities.
>
> In case the analyzer gets a new interprocedural analysis method that
> increases performance, the users of this framework would profit from that
> approach immediately.
>
> Is a framework like this worth mainlining and working on? What do you
> think?
>
> (Note that AST Importer-related improvements are already being mainlined
> by Yury et al. My question is about the "analysis mode" for exporting ASTs
> and a general framework to consume those exported ASTs.)
>
> Regards,
> Gábor
>
>
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: statistics.json
Type: application/json
Size: 34207 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20160620/2f1a89a6/attachment.json>