[cfe-dev] [analyzer][RFC] Get info from the LLVM IR for precision

Artem Dergachev via cfe-dev cfe-dev at lists.llvm.org
Sun Aug 16 12:57:01 PDT 2020

Technically, all these analyses can be conducted in a source-based manner.

And as John says, that'd have the advantage of being more predictable; 
we'd no longer have to investigate sudden changes in analysis results 
that are in fact caused by backend changes. We are already quite 
unpredictable in our analysis, with arbitrary things affecting 
arbitrary things, and that's OK, but it doesn't mean we should make it 
worse. In particular, I'm worried about people who treat analyzer 
warnings as errors in their builds; for them, any compiler update could 
now cause the build to fail, even if we didn't change anything in the 
static analyzer. (Well, for the same reason I generally don't recommend 
treating analyzer warnings as errors.)

So I believe that implementing as many of these analyses as possible 
over the Clang CFG (or, in many cases, over the AST) would be 
beneficial and should be done regardless of this experiment. Gábor, how 
far did you get with that? I believe you should try it and compare the 
results, at least for some analyses that are easy to implement.

The reason the use of LLVM IR in the static analyzer gets really 
interesting is that a huge number of analyses are already implemented 
over it, and getting access to them "for free" (in terms of 
implementation cost) is fairly tempting. I think that's the only real 
reason; it's a pretty big reason, though, because the amount of work 
we save ourselves this way could be quite large if we put a lot of 
those analyses to good use.

On 14.08.2020 04:19, Gábor Márton wrote:
> John, thank you for your reply.
> > Is this really the most reasonable way to get the information you want?
> Here is a list of information we would like to have access to (this 
> list is not comprehensive; Artem could probably extend it):
> 1) Is a function pure?
> 2) Does a function read/write only the memory pointed to by its arguments?
> 3) Does a callee make any copies of a pointer argument that outlive 
> the callee itself?
> 4) Value ranges.
> 5) Is a loop dead?
> 6) Is a parameter or returned pointer dereferenceable?
> How could we use this information?
> With 1-3 we could make the analysis more precise by improving the 
> over-approximation done by invalidation during conservative evaluation.
> Using the info from 1-4 we could create "summaries" for functions and 
> skip their inlining-based evaluation. This would be especially 
> beneficial in cross-translation-unit analysis, where the inlining 
> stack can grow really deep.
> With 5, we could skip the analysis of dead loops and thus save 
> symbolic-execution budget in the CSA.
> By using 6, we could eliminate some false-positive reports, thereby 
> improving correctness.
> Some of the analyses that provide the needed information can be 
> implemented properly only on the SSA form; value range propagation, 
> for example. We could implement our own lowering to SSA, or our own 
> alias analysis for the pureness info, but that would repeat work that 
> has already been done and well tested in LLVM.
> > It’s also pretty expensive.
> I completely agree that we should not pay for optimization passes 
> whose results we cannot use in the CSA. In the first version of the 
> patch I used the whole O2 pipeline, but I have since updated it to 
> run only the passes needed to compute the pureness information 
> (GlobalsAA and PostOrderFunctionAttrs).
> Also, static analysis is generally considered to be slower than 
> compilation even with optimizations enabled; we even advertise this on 
> our official webpage (here <https://clang-analyzer.llvm.org/>). 
> And this extension will never be more expensive than a regular O2/O3 
> compilation, which implies that a 2-4x slowdown of the CSA could 
> become a 3-5x slowdown compared to an O2 compilation. In CTU mode, 
> the analysis is currently even slower, so the additional CodeGen 
> would be less noticeable. The slowdown may not be affordable for some 
> clients, so users must explicitly enable CodeGen in the CSA via a 
> command-line switch. I plan to provide precise measurements of the 
> slowdown on open-source projects. On top of that, it would be 
> interesting to see for what fraction of all functions (all call 
> sites, all loops) we can obtain the desired information.
> Gabor.
> On Fri, Aug 14, 2020 at 6:46 AM John McCall <rjmccall at apple.com 
> <mailto:rjmccall at apple.com>> wrote:
>     On 13 Aug 2020, at 10:15, Gábor Márton wrote:
>     > Artem, John,
>     >
>     > How should we proceed with this?
>     >
>     > John, you mention in the patch that this is a huge architectural
>     > change.
>     > Could you please elaborate? Are you concerned about the additional
>     > libs that are being linked to the static analyzer libraries? The
>     > clang binary already depends on LLVM libs, and both CodeGen and
>     > the CSA are built into the clang binary. Are you concerned about
>     > having a MultiplexConsumer as an ASTConsumer? ... I am open to any
>     > suggestions, but I need more input from you.
>     Well, it’s adding a major new dependency to the static analyzer and a
>     major new client to IRGen.  In both cases, the dependency/client
>     happens
>     to be another part of Clang, but still, it seems like a huge deal for
>     static analysis to start depending on potentially arbitrary
>     details of
>     code generation and LLVM optimization.  It’s also pretty expensive.
>     Is this really the most reasonable way to get the information you
>     want?
>     John.
