[cfe-dev] [analyzer][RFC] Get info from the LLVM IR for precision

Thu Aug 6 10:20:34 PDT 2020

Umm, ok!~

Static analysis is commonly run on debug builds, and those are typically 
unoptimized. Few projects have a release+asserts build, but we rely on 
asserts for analysis, so debug builds are what usually gets analyzed. 
If your project completely ignores debug builds, its usefulness drops 
a lot.

Sounds like we want to disconnect this new fake codegen from compiler 
flags entirely. Like, the AST will depend on compiler flags, but we 
should not be taking -O flags into account at all, but pick some default 
-O2 regardless of flags; and ideally all flags should be ignored by 
default, to keep the experience as consistent as possible.

You'd also have to make sure that running CodeGen doesn't have unwanted 
side effects such as emitting a .o file.

Would something like that actually work?

And if it would, would this also address the usual concerns about making 
warnings depend on optimizations? Because, like, optimizations now 
remain consistent and no longer depend on optimization flags used for 
actual code generation or interact with code generation; they're now 
simply another analysis performed on the AST that depends solely on the AST.
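
To make the "pick some default -O2 regardless of flags" idea concrete, here 
is a minimal sketch using only the standard library. The helper name 
pinOptLevel is hypothetical and is not part of Clang; it only shows the 
intended normalization of a compile command before it reaches the fake 
codegen, and it does not address the other flags or the .o side effect.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not a Clang API): drop every -O<level> flag from a
// compile command and append a fixed -O2, so the IR used for analysis does
// not depend on the optimization level the project was built with.
std::vector<std::string> pinOptLevel(const std::vector<std::string> &Args) {
  std::vector<std::string> Out;
  for (const std::string &A : Args) {
    // rfind(prefix, 0) == 0 is a "starts with" check. The output flag "-o"
    // is lowercase, so it is left untouched.
    if (A.rfind("-O", 0) == 0)
      continue;
    Out.push_back(A);
  }
  Out.push_back("-O2");
  return Out;
}
```

With this, a command line containing -O0 or -Ofast yields the same pinned 
level as one containing no -O flag at all.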

On 8/6/20 2:06 AM, Gábor Márton wrote:
> > you're "just" generating llvm::Function for a given AST FunctionDecl 
> "real quick" and looking at the attributes. This is happening 
> on-demand and cached, right?
> This works differently. We generate the llvm code for the whole 
> translation unit during parsing. It is the Parser and the Sema that 
> calls into the callbacks of the CodeGenerator via the ASTConsumer 
> interface. This is the exact same mechanism that is used for the 
> Backend (see the BackendConsumer). We register both the CodeGenerator 
> ast consumer and the AnalysisAstConsumer with the AnalysisAction (we 
> use a MultiplexConsumer). By the time we start the symbolic execution 
> in AnalysisConsumer::HandleTranslationUnit, the CodeGen is already 
> done (since CodeGen is added first to the MultiplexConsumer so 
> its HandleTranslationUnit and other callbacks are called back 
> earlier). About caching: the llvm code is cached, we generate it 
> only once, and during function call evaluation we look the function 
> up in the llvm::Module using the mangled name as the key (we don't 
> cache the mangled names now, but we could).
> It would be possible to directly call the callbacks of the 
> CodeGenerator on-demand, without registering that to the 
> FrontendAction. Actually, my first attempt was to call 
> HandleTopLevelDecl for a given FunctionDecl on demand when we needed 
> the llvm code. However, this turned out to be a dead end for the 
> following reasons: (1) I could not support ObjC/C++ because I could 
> not get all the information that the Sema has when it calls 
> HandleTopLevelDeclInObjCContainer. In fact, I think calling these 
> callbacks directly is not supported, only indirectly through a 
> registered ASTConsumer, because we may not know how the Parser and 
> the Sema call them. (2) It is not enough to get the llvm code for a 
> function in isolation. E.g., for the "readonly" attribute we must 
> enable alias analysis on global variables (see GlobalsAAResult), so we 
> must emit llvm code for global variables.
>
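The ordering argument above (CodeGen is registered first, so its results 
exist by the time the analysis runs and can be looked up by mangled name) 
can be sketched with a small standard-library model. Every type below is a 
hypothetical stand-in, not the Clang class of the similar name, and the 
summary struct is an assumption made for illustration.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins; none of these are the real Clang/LLVM types.
struct FunctionSummary { bool ReadOnly = false; };
using ModuleModel = std::map<std::string, FunctionSummary>; // key: mangled name

struct Consumer {
  virtual ~Consumer() = default;
  virtual void handleTranslationUnit() = 0;
};

// Models the code generator: fills the module once for the whole TU.
struct CodeGenModel : Consumer {
  ModuleModel &Mod;
  explicit CodeGenModel(ModuleModel &M) : Mod(M) {}
  void handleTranslationUnit() override {
    Mod["_Z3foov"].ReadOnly = true; // Itanium mangling of foo()
  }
};

// Models the analysis: looks a callee up by mangled name.
struct AnalysisModel : Consumer {
  const ModuleModel &Mod;
  bool SawReadOnlyFoo = false;
  explicit AnalysisModel(const ModuleModel &M) : Mod(M) {}
  void handleTranslationUnit() override {
    auto It = Mod.find("_Z3foov");
    SawReadOnlyFoo = It != Mod.end() && It->second.ReadOnly;
  }
};

// Models a multiplexing consumer: forwards callbacks in registration order.
struct MultiplexModel : Consumer {
  std::vector<Consumer *> Consumers;
  void handleTranslationUnit() override {
    for (Consumer *C : Consumers)
      C->handleTranslationUnit();
  }
};
```

If the two consumers were registered in the opposite order, the lookup in 
this model would find nothing, which is why the registration order matters.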
> > 1.1. But it sounds like for the CTU users it may amplify the 
> imperfections of ASTImporter.
> > 2.1. Again, it's worse with CTU because imported ASTs have so far 
> never been tested for compatibility with CodeGen.
> We should not call the CodeGen on a merged AST. ASTImporter does not 
> support the ASTConsumer interface. In the case of CTU, I think we 
> should generate the IR for each TU in isolation. And we would 
> probably want to extend the CrossTranslationUnit interface to give 
> back the llvm::Function for a given FunctionDecl. Or we could make 
> this more transparent and the IRContext in this prototype could be CTU 
> aware.
>
> > Just to be clear, we should definitely avoid having our analysis 
> results depend on optimization levels. It should be possible to avoid 
> that, right?
> There is a dependency we will never be able to get rid of: CodeGen 
> generates lifetime markers 
> <https://llvm.org/docs/LangRef.html#memory-use-markers> only when the 
> optimization level is greater than or equal to 2 (-O2, -O3). These 
> lifetime markers are needed to get precise pureness info out of 
> GlobalsAA.
>
> > The way i imagined this, we're only interested in picking up LLVM 
> analyses, which can be run over unoptimized IR just fine(?)
> Yes, but we need to set the optimization level so CodeGen generates 
> lifetime markers. Indeed, there are many llvm analyses that simply do 
> not change the IR and just populate their results. And we could simply 
> use the results in CSA.
> > We should probably not be optimizing the IR at all in the process(?)
> Some llvm passes may invalidate the results of previous analyses and 
> then we need to rerun those. I am not an expert, but I think if we run 
> an analysis again after a pass that optimizes the IR (i.e., 
> truncates it) then our results could be more precise. And that is the 
> reason why we see multiple passes for the same analyses when we do 
> optimizations. And perhaps this is the exact job of the PassManager to 
> orchestrate this (?).
> There are passes that extend the IR (e.g., InferFunctionAttrsPass); we 
> may not need these strictly speaking, but I really don't know how the 
> different analyses use the function attributes.
> Maybe we need the IR both in unoptimized form and in optimized form. 
> Also, we may want to have our own CSA specific pipeline, but having 
> the default O2 pipeline seems to simplify things.
>
> On Wed, Aug 5, 2020 at 11:22 PM Artem Dergachev <noqnoqneo at gmail.com> 
> wrote:
>
>     Just to be clear, we should definitely avoid having our analysis
>     results depend on optimization levels. It should be possible to
>     avoid that, right? The way i imagined this, we're only interested
>     in picking up LLVM analyses, which can be run over unoptimized IR
>     just fine(?) We should probably not be optimizing the IR at all in
>     the process(?)
>
>     On 05.08.2020 12:17, Artem Dergachev wrote:
>>     I'm excited that this is actually moving somewhere!
>>
>>     Let's see what consequences we have here. I have some thoughts
>>     but i don't immediately see any architecturally catastrophic
>>     consequences; you're "just" generating llvm::Function for a given
>>     AST FunctionDecl "real quick" and looking at the attributes. This
>>     is happening on-demand and cached, right??? I'd love to hear more
>>     opinions. Here's what i see:
>>
>>     1. We can no longer mutate the AST for analysis purposes without
>>     the risk of screwing up subsequent codegen. And the risk would be
>>     pretty high because hand-crafting ASTs is extremely difficult.
>>     Good thing we aren't actually doing this.
>>         1.1. But it sounds like for the CTU users it may amplify the
>>     imperfections of ASTImporter.
>>
>>     2. Ok, yeah, we now may have crashes in CodeGen during analysis.
>>     Normally they shouldn't be that bad because this would mean that
>>     CodeGen would crash during normal compilation as well. And that's
>>     rare; codegen crashes are much more rare than analyzer crashes.
>>     Of course a difference can be triggered by #ifndef
>>     __clang_analyzer__ but it still remains a proof of valid crashing
>>     code, so that should be rare.
>>         2.1. Again, it's worse with CTU because imported ASTs have so
>>     far never been tested for compatibility with CodeGen.
>>
>>     Let's also talk about the benefits. First of all, *we still need
>>     the source code available during analysis*. This isn't about
>>     peeking into binary dependencies and it doesn't immediately aid
>>     CTU in any way; this is entirely about improving upon
>>     conservative evaluation on the currently available AST, for
>>     functions that are already available for inlining but are not
>>     being inlined for whatever reason. In fact, in some cases we may
>>     later prefer such LLVM IR-based evaluation to inlining, which may
>>     improve analysis performance (i.e., less path explosion) *and*
>>     correctness (e.g., avoid unjustified state splits).
>>
>>     On 05.08.2020 08:29, Gábor Márton via cfe-dev wrote:
>>>     Hi,
>>>
>>>     I have been working on a prototype that makes it possible to
>>>     access the IR from the components of the Clang Static Analyzer.
>>>     https://reviews.llvm.org/D85319
>>>
>>>     There are many important and useful analyses in the LLVM layer
>>>     that we can use during the path sensitive analysis. Most
>>>     notably, the "readnone" and "readonly" function attributes
>>>     (https://llvm.org/docs/LangRef.html) which can be used to
>>>     identify "pure" functions (those without side effects). In the
>>>     prototype I am using the pureness info from the IR to avoid
>>>     invalidation of any variables during conservative evaluation
>>>     (when we evaluate a pure function). There are cases when we get
>>>     false positives exactly because of the too conservative
>>>     invalidation.
>>>
>>>     Some further ideas to use info from the IR:
>>>     - We should invalidate only the arg regions for functions with
>>>     "argmemonly" attribute.
>>>     - Use the smarter invalidation in cross translation unit
>>>     analysis too. We can get the IR for the other TUs as well.
>>>     - Run the Attributor
>>>     <https://llvm.org/doxygen/structllvm_1_1Attributor.html> passes
>>>     on the IR. We could get range values for return values or for
>>>     arguments. These range values then could be fed to
>>>     StdLibraryFunctionsChecker to make the proper assumptions. And
>>>     we could do this in CTU mode too, these attributes could form
>>>     some sort of a summary of these functions. Note that I don't
>>>     expect a meaningful summary for more than a few percent of all
>>>     the available functions.
>>>
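The invalidation ideas above ("readnone"/"readonly" means invalidate 
nothing, "argmemonly" means invalidate only the argument regions) can be 
summarized as a small decision function. This is a sketch under stated 
assumptions: the enum, the string-based region sets and the function name 
are hypothetical and do not mirror the prototype in D85319 or any analyzer 
API.

```cpp
#include <set>
#include <string>

// Hypothetical model of the memory-effect attributes named in the thread.
enum class MemEffect { ReadNone, ReadOnly, ArgMemOnly, Unknown };

// Regions are modelled as plain names for illustration only.
using Regions = std::set<std::string>;

// Returns the regions to invalidate after a conservatively evaluated call.
Regions regionsToInvalidate(MemEffect E, const Regions &ArgRegions,
                            const Regions &AllRegions) {
  switch (E) {
  case MemEffect::ReadNone:
  case MemEffect::ReadOnly:
    return {}; // pure call: nothing is written, so nothing is invalidated
  case MemEffect::ArgMemOnly:
    return ArgRegions; // only memory reachable through the arguments
  case MemEffect::Unknown:
    return AllRegions; // today's conservative behaviour
  }
  return AllRegions; // unreachable; keeps compilers without exhaustiveness checks quiet
}
```

The Unknown case keeps the existing conservative behaviour, so in this 
model the IR information can only shrink the invalidated set.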
>>>     Please let me know if you have any further ideas about how we
>>>     could use IR attributes (or anything else) during the symbolic
>>>     execution.
>>>
>>>     There are some concerns as well. There may be some source code
>>>     that we cannot CodeGen, but we can still analyse with the
>>>     current CSA. That is why I suppress CodeGen diagnostics in the
>>>     prototype. But in the worst case we may run into assertions in
>>>     the CodeGen and this may cause regression in the whole analysis
>>>     experience. This may be the case especially when we get a
>>>     compile_commands.json from a project that is compiled only with
>>>     e.g. GCC.
>>>
>>>     Thanks,
>>>     Gabor
>>>
>>>
>>>     _______________________________________________
>>>     cfe-dev mailing list
>>>     cfe-dev at lists.llvm.org  <mailto:cfe-dev at lists.llvm.org>
>>>     https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>
>
