[cfe-dev] Two pass analysis framework: AST merging approach

Tue Jul 12 05:45:03 PDT 2016

On Tue, Jul 12, 2016 at 2:36 PM Gábor Horváth <xazax.hun at gmail.com> wrote:

> On 12 July 2016 at 14:09, Manuel Klimek <klimek at google.com> wrote:
>
>> On Tue, Jul 12, 2016 at 1:59 PM Gábor Horváth <xazax.hun at gmail.com>
>> wrote:
>>
>>> Hi!
>>>
>>> (As a ping), I would like to summarize the measurements I done since the
>>> original e-mail:
>>>
>>> The approach is to first serialize all the translation units to the
>>> storage, create an index of the functions, and then load them lazily on
>>> demand to achieve cross translation unit support. This does not modify the
>>> inter-procedural analysis of the Static Analyzer and could be used for
>>> Clang Tidy as well. Once a new inter-procedural analysis is introduced for
>>> the Static Analyzer, the cross tu support would profit from it immediately.
>>>
>>> Benchmarks:
>>> rAthena, a 150k LOC C project:
>>> The size of the serialized ASTs was: 140MB
>>> The size of the indexes: 4.4MB
>>> The time of the analysis was bellow 4X
>>> The amount of memory consumed was bellow 2X
>>> The number of reports is 3 times more
>>>
>>> Xerces, 150k LOC C++ project:
>>> The size of the serialized ASTs was:  800MB
>>> The size of the indexes: 90MB
>>> The analysis time using CTU was the half of the one without CTU
>>>
>>> LLVM + Clang + Clang tools extra:
>>> The size of the serialized ASTs was: 45.4 GB
>>> The size of the indexes:  1,6GB
>>>
>>> Some optimization effort to reduce te size of the CFG:
>>> TU ASTs after omitting function bodies from headers: 42.7 GB
>>> TU ASTs after omitting STL: 34.0 GB
>>> TU ASTs after skipping implicit instantiations: 21.5 GB
>>> TU ASTs after omitting STL and implicit instantiations: 16.0 GB
>>>
>>> Considering that the build directory of a debug build is also about 40
>>> GB on my platform, I do not consider the size of the serialized ASTs a
>>> blocking issue. However, in case it is a concern, according to the attached
>>> statistics about the serialized AST dumps, there are some optimization
>>> possibilities in the binary format.
>>>
>>> This solution also works in a multi process build/analysis environment.
>>> Some of the necessary framework, for example ASTImporter code is being
>>> accepted into mainline clang right now.
>>>
>>> All in all, this approach:
>>> - Can discover new bug reports as is.
>>> - Feasible to implement, does not need sever modifications to the Static
>>> Analyzer or Clang Tidy.
>>> - Has acceptable performance for lots of the real world projects.
>>>
>>> I think, this would be a useful addition to the clang source tree. Do
>>> you agree?
>>>
>>
>> I definitely think this is interesting - I'd be curious if we could
>> design the interfaces in a way that we can
>> a) fully split out indexing from analysis
>>
>
> In the current implementation the analysis and dumping and indexing ASTs
> are two separate passes, so they can be used completely independently.
> Which is a very good design choice in my opinion.
>
>
>> b) allow storing the indexing data in a distributed storage system
>>
>
> I do not have much experience with distributed storage systems, but I
> think that should work. The only problem might be the size of the
> individual AST dumps.
>

Yea, I think my main point is that I'd want the interface on the clang side
to not be coupled to things being on a file system (basically have an
interface for loading the indexed files).

>
>
>>
>>
>>> Regards,
>>> Gábor
>>>
>>>
>>> On 4 May 2016 at 15:09, Gábor Horváth <xazax.hun at gmail.com> wrote:
>>>
>>>> Hi!
>>>>
>>>> This e-mail is a proposal based on the work done by Yury Gibrov et al.:
>>>> http://lists.llvm.org/pipermail/cfe-dev/2015-December/046299.html
>>>>
>>>> They accomplished a two pass analysis, the first pass is serializing
>>>> the AST of every translation unit and creates an index of functions, the
>>>> second pass does the real analysis, which can load the AST of function
>>>> bodies on demand.
>>>>
>>>> This approach can be used to achieve cross translation unit analysis
>>>> for the clang Static Analyzer to some extent, but similar approach could be
>>>> applicable to Clang Tidy and other clang based tools.
>>>>
>>>> While this method is not likely to be a silver bullet for the Static
>>>> Analyzer, I did some benchmarks to see how feasible this approach is. The
>>>> baseline was running the Static Analyzer without the two pass analyis, the
>>>> second one was running using the framework linked above.
>>>>
>>>> For a 150k LOC C projects I got the following results:
>>>> The size of the serialized ASTs was: 140MB
>>>> The size of the indexes (textual representation): 4.4MB
>>>> The time of the analysis was bellow 4X
>>>> The amount of memory consumed was bellow 2X
>>>>
>>>> All in all it looks like a feasible approach for some use cases.
>>>>
>>>> I also tried to do a benchmark on the LLVM+Clang codebase.
>>>> Unfortunately I was not able to run the analysis due to some missing
>>>> features in the AST Importer. But I was able to serialize the ASTs and
>>>> generate the indices:
>>>> The siye of the serialized ASTs: 45.4 GB
>>>> The siye of the function index: 1,6GB
>>>>
>>>> While these numbers are less promising, I think there are some
>>>> opportunities to reduce them significantly.
>>>>
>>>> I propose the introduction of an analysis mode for exporting ASTs. In
>>>> analysis mode the AST exporter would not emit the function body of a
>>>> function for several cases:
>>>> - In case a function is defined in a header, do not emit the body.
>>>> - In case the function was defined in an implicit template
>>>> specialisation, do not emit the body.
>>>>
>>>> I think after similar optimizations it might be feasible to use this
>>>> approach on LLVM scale projects as well, and it would be much easier to
>>>> implement Clang based tools that can utilize cross translation unit
>>>> capabilities.
>>>>
>>>> In case the analyzer gets a new interprocedural analysis method that
>>>> would increase the performance the users of this framework would profit
>>>> from that approach immediately.
>>>>
>>>> Does a framework like this worth mainlining and working on? What do you
>>>> think?
>>>>
>>>> (Note that, AST Importer related improvements are already being
>>>> mainlined by Yury et al. My question is about the "analysis mode" for
>>>> exporting ASTs, and a general framework to consume those exported ASTs.)
>>>>
>>>> Regards,
>>>> Gábor
>>>>
>>>>
>>>>
>>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20160712/0c3271ad/attachment.html>