[cfe-dev] Two pass analysis framework: AST merging approach

Wed May 4 07:11:51 PDT 2016

Manuel,

Yes, you're right that the "one large glob" approach may be unsuitable
for analyzing a whole big program. However, in CSA we're not going to
build an AST for the entire program, since our analysis depth is limited.
We just merge a limited number of declarations into the current AST, so
its growth is not too big.
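
To illustrate the idea, here is a toy sketch of depth-limited, on-demand
merging. It is not the actual CSA or ASTImporter code: the names
FunctionDef, build_index, merge_on_demand and max_depth are made up for
this example, and plain dictionaries stand in for serialized ASTs.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class FunctionDef:
    """Hypothetical stand-in for a function definition in a serialized AST."""
    name: str
    callees: tuple = ()


def build_index(units):
    """First pass: map every function name to the unit that defines it."""
    index = {}
    for unit_name, functions in units.items():
        for fn_name in functions:
            index[fn_name] = unit_name
    return index


def merge_on_demand(root, current_unit, units, index, max_depth):
    """Second pass: collect the external definitions reachable from `root`
    within `max_depth` calls. Only these would be merged into the current AST."""
    merged = {}
    visited = {root}
    queue = deque([(root, 0)])  # breadth-first, so the first visit is the shallowest
    while queue:
        name, depth = queue.popleft()
        unit = index.get(name)
        if unit is None:
            continue  # no definition available in any unit
        if unit != current_unit:
            merged[name] = unit
        if depth >= max_depth:
            continue  # analysis depth exhausted, do not follow further calls
        for callee in units[unit][name].callees:
            if callee not in visited:
                visited.add(callee)
                queue.append((callee, depth + 1))
    return merged


# Toy "program" with three translation units.
units = {
    "a.c": {"main": FunctionDef("main", ("parse",))},
    "b.c": {"parse": FunctionDef("parse", ("lex",))},
    "c.c": {
        "lex": FunctionDef("lex", ("read_char",)),
        "read_char": FunctionDef("read_char"),
    },
}
index = build_index(units)
merged = merge_on_demand("main", "a.c", units, index, max_depth=2)
```

With a depth limit of 2, only parse and lex are pulled into the AST of
a.c; read_char lies beyond the limit and is never loaded, so the merged
AST stays small regardless of the size of the whole program.
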
>> Hi!
>>
>> This e-mail is a proposal based on the work done by Yury Gibrov et al.:
>> http://lists.llvm.org/pipermail/cfe-dev/2015-December/046299.html
>>
>> They accomplished a two pass analysis: the first pass serializes the
>> AST of every translation unit and creates an index of functions; the
>> second pass does the real analysis, which can load the AST of function
>> bodies on demand.
>>
>> This approach can be used to achieve cross translation unit analysis for
>> the Clang Static Analyzer to some extent, but a similar approach could be
>> applicable to Clang Tidy and other Clang based tools.
>>
>> While this method is not likely to be a silver bullet for the Static
>> Analyzer, I did some benchmarks to see how feasible this approach is. The
>> baseline run used the Static Analyzer without the two pass analysis; the
>> second run used the framework linked above.
>>
>> For a 150k LOC C project I got the following results:
>> The size of the serialized ASTs was: 140 MB
>> The size of the indexes (textual representation): 4.4 MB
>> The analysis time was below 4X the baseline
>> The amount of memory consumed was below 2X the baseline
>>
>> All in all it looks like a feasible approach for some use cases.
>>
>> I also tried to do a benchmark on the LLVM+Clang codebase. Unfortunately I
>> was not able to run the analysis due to some missing features in the AST
>> Importer. But I was able to serialize the ASTs and generate the indices:
>> The size of the serialized ASTs: 45.4 GB
>> The size of the function index: 1.6 GB
>>
>> While these numbers are less promising, I think there are some
>> opportunities to reduce them significantly.
>>
>> I propose the introduction of an analysis mode for exporting ASTs. In
>> analysis mode the AST exporter would not emit the body of a function in
>> the following cases:
>> - In case a function is defined in a header, do not emit the body.
>> - In case the function was defined in an implicit template specialisation,
>> do not emit the body.
>>
>> I think that after such optimizations it might be feasible to use this
>> approach on LLVM-scale projects as well, and it would be much easier to
>> implement Clang based tools that can utilize cross translation unit
>> capabilities.
>>
> I agree that we want cross translation unit analysis to be simpler to
> implement, but I think that parallelization of single steps will still be
> key for usability. Thus, I'm not convinced the "one large glob" approach is
> going to play out (but I might well be wrong).
>
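
The export filter proposed in the quoted message could be sketched roughly
as follows. This is only an illustration under assumptions: FunctionInfo,
its two flags and should_emit_body are hypothetical names, not part of
Clang's AST exporter.

```python
from dataclasses import dataclass


@dataclass
class FunctionInfo:
    """Hypothetical summary of a function seen by the AST exporter."""
    name: str
    defined_in_header: bool = False
    implicit_template_specialization: bool = False


def should_emit_body(fn):
    """Analysis mode: skip bodies in the two cases listed in the proposal."""
    return not (fn.defined_in_header or fn.implicit_template_specialization)


functions = [
    FunctionInfo("parse_file"),
    FunctionInfo("inline_helper", defined_in_header=True),
    FunctionInfo("vector<int>::push_back", implicit_template_specialization=True),
]
emitted = [fn.name for fn in functions if should_emit_body(fn)]
```

In this toy input only parse_file keeps its body in the exported AST; the
other two are presumably already visible to any translation unit that
includes the header, which is where the size reduction would come from.
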

-- 
Best regards,
Aleksei Sidorin
Software Engineer,
IMSWL-IMCG, SRR, Samsung Electronics