[cfe-dev] [analyzer][RFC] Get info from the LLVM IR for precision

Gábor Márton via cfe-dev cfe-dev at lists.llvm.org
Thu Aug 6 12:42:37 PDT 2020


Yes, this is a good point, and a reason to assemble our own CSA-specific
LLVM pipeline that avoids such removal of static functions. We may want to
skip the inliner pass. Or, since I assume it is a module pass that removes
the unused static functions, a better alternative could be to skip just
that pass from the pipeline.

On Thu, Aug 6, 2020 at 8:24 PM Gábor Horváth <xazax.hun at gmail.com> wrote:

> I like the idea of piggybacking some analysis on the LLVM IR. However, I
> have some concerns as well. I am not well versed in the LLVM optimizer, but
> I do see potential side effects. E.g., what if a static function is inlined
> into ALL call sites, so that the original function can be removed? We would
> no longer be able to get any of the useful info for that function. It would
> be unfortunate if the analysis results depended on inlining heuristics; it
> would make the analyzer even harder to debug and understand.
>
> On Thu, 6 Aug 2020 at 19:20, Artem Dergachev via cfe-dev <
> cfe-dev at lists.llvm.org> wrote:
>
>> Umm, ok!~
>>
>> Static analysis is commonly run on debug builds, and those are typically
>> unoptimized. It is not common for a project to have a release+asserts build,
>> yet we rely on asserts for analysis, so debug builds are what typically get
>> analyzed. If your project completely ignores debug builds, its usefulness
>> drops a lot.
>>
>> Sounds like we want to disconnect this new fake codegen from compiler
>> flags entirely. Like, the AST will depend on compiler flags, but we should
>> not be taking -O flags into account at all; instead we should pick some
>> default, say -O2, regardless of the flags. Ideally all codegen-related
>> flags should be ignored by default, to make the experience as consistent
>> as possible.
>>
>> You'd also have to make sure that running CodeGen doesn't have unwanted
>> side effects such as emitting a .o file.
>>
>> Would something like that actually work?
>>
>> And if it would, would this also address the usual concerns about making
>> warnings depend on optimizations? Because, like, optimizations now remain
>> consistent and no longer depend on optimization flags used for actual code
>> generation or interact with code generation; they're now simply another
>> analysis performed on the AST that depends solely on the AST.
>>
>> On 8/6/20 2:06 AM, Gábor Márton wrote:
>>
>> > you're "just" generating llvm::Function for a given AST FunctionDecl
>> "real quick" and looking at the attributes. This is happening on-demand and
>> cached, right?
>> This works differently. We generate the LLVM IR for the whole translation
>> unit during parsing. It is the Parser and the Sema that call into the
>> callbacks of the CodeGenerator via the ASTConsumer interface. This is the
>> exact same mechanism that is used for the backend (see the
>> BackendConsumer). We register both the CodeGenerator AST consumer and the
>> AnalysisASTConsumer with the AnalysisAction (via a MultiplexConsumer).
>> By the time we start the symbolic execution in
>> AnalysisConsumer::HandleTranslationUnit, CodeGen is already done, since
>> CodeGen is added first to the MultiplexConsumer and so
>> its HandleTranslationUnit and other callbacks are invoked earlier.
>> As for caching: the LLVM IR is cached in the sense that we generate it
>> only once; then, during function-call evaluation, we look the function up
>> in the llvm::Module using its mangled name as the key (we don't cache the
>> mangled names yet, but we could).
>> It would be possible to call the callbacks of the CodeGenerator directly,
>> on demand, without registering it with the FrontendAction. Actually, my
>> first attempt was to call HandleTopLevelDecl for a given FunctionDecl on
>> demand, when we needed the LLVM code. However, this approach failed for
>> the following reasons: (1) It could not support ObjC/C++, because I could
>> not get all the information that Sema has when it calls
>> HandleTopLevelDeclInObjCContainer. In fact, I think calling these
>> callbacks directly is not supported, only indirectly through a registered
>> ASTConsumer, because we cannot know how the Parser and Sema call into
>> them. (2) It is not enough to get the LLVM code for a function in
>> isolation. E.g., for the "readonly" attribute we must enable alias analysis
>> on global variables (see GlobalsAAResult), so we must emit LLVM code for
>> global variables too.
>>
>> > 1.1. But it sounds like for the CTU users it may amplify the
>> imperfections of ASTImporter.
>> > 2.1. Again, it's worse with CTU because imported ASTs have so far never
>> been tested for compatibility with CodeGen.
>> We should not run CodeGen on a merged AST; ASTImporter does not support
>> the ASTConsumer interface. In the case of CTU, I think we should generate
>> the IR for each TU in isolation. We should probably extend the
>> CrossTranslationUnit interface to hand back the llvm::Function for a given
>> FunctionDecl. Or we could make this more transparent, and the IRContext in
>> this prototype could be made CTU-aware.
>>
>> > Just to be clear, we should definitely avoid having our analysis
>> results depend on optimization levels. It should be possible to avoid that,
>> right?
>> There is one dependency we will never be able to get rid of: CodeGen
>> generates lifetime markers
>> <https://llvm.org/docs/LangRef.html#memory-use-markers> only when the
>> optimization level is greater than or equal to 2 (-O2, -O3). These
>> lifetime markers are needed to get precise pureness info out of GlobalsAA.
>>
>> > The way i imagined this, we're only interested in picking up LLVM
>> analyses, which can be run over unoptimized IR just fine(?)
>> Yes, but we need to set the optimization level so that CodeGen generates
>> lifetime markers. Indeed, there are many LLVM analyses that do not change
>> the IR at all and just populate their results, and we could simply use
>> those results in the CSA.
>> > We should probably not be optimizing the IR at all in the process(?)
>> Some LLVM transformation passes may invalidate the results of previous
>> analyses, and then we need to rerun those. I am not an expert, but I think
>> that if we rerun an analysis after a pass that optimizes the IR (i.e.,
>> shrinks it), our results could be more precise. That is why we see
>> multiple runs of the same analyses when we optimize, and perhaps
>> orchestrating this is exactly the job of the PassManager(?). There are
>> also passes that annotate the IR (e.g., InferFunctionAttrsPass); strictly
>> speaking we may not need these, but I really don't know how the different
>> analyses use the function attributes.
>> Maybe we need the IR both in unoptimized and in optimized form. Also, we
>> may want to have our own CSA-specific pipeline, but reusing the default O2
>> pipeline seems to simplify things.
>>
>> On Wed, Aug 5, 2020 at 11:22 PM Artem Dergachev <noqnoqneo at gmail.com>
>> wrote:
>>
>>> Just to be clear, we should definitely avoid having our analysis results
>>> depend on optimization levels. It should be possible to avoid that, right?
>>> The way i imagined this, we're only interested in picking up LLVM analyses,
>>> which can be run over unoptimized IR just fine(?) We should probably not be
>>> optimizing the IR at all in the process(?)
>>>
>>> On 05.08.2020 12:17, Artem Dergachev wrote:
>>>
>>> I'm excited that this is actually moving somewhere!
>>>
>>> Let's see what consequences we have here. I have some thoughts but i
>>> don't immediately see any architecturally catastrophic consequences; you're
>>> "just" generating an llvm::Function for a given AST FunctionDecl "real
>>> quick" and looking at the attributes. This is happening on-demand and
>>> cached, right? I'd love to hear more opinions. Here's what i see:
>>>
>>> 1. We can no longer mutate the AST for analysis purposes without the
>>> risk of screwing up subsequent codegen. And the risk would be pretty high
>>> because hand-crafting ASTs is extremely difficult. Good thing we aren't
>>> actually doing this.
>>>     1.1. But it sounds like for the CTU users it may amplify the
>>> imperfections of ASTImporter.
>>>
>>> 2. Ok, yeah, we now may have crashes in CodeGen during analysis.
>>> Normally they shouldn't be that bad, because they would mean that CodeGen
>>> would crash during normal compilation as well. And that's rare; CodeGen
>>> crashes are much rarer than analyzer crashes. Of course, a difference can
>>> be triggered by #ifndef __clang_analyzer__, but that still constitutes a
>>> proof of valid code that crashes CodeGen, so it should be rare.
>>>     2.1. Again, it's worse with CTU because imported ASTs have so far
>>> never been tested for compatibility with CodeGen.
>>>
>>> Let's also talk about the benefits. First of all, *we still need the
>>> source code available during analysis*. This isn't about peeking into
>>> binary dependencies and it doesn't immediately aid CTU in any way; this is
>>> entirely about improving upon conservative evaluation on the currently
>>> available AST, for functions that are already available for inlining but
>>> are not being inlined for whatever reason. In fact, in some cases we may
>>> later prefer such LLVM IR-based evaluation to inlining, which may improve
>>> analysis performance (i.e., less path explosion) *and* correctness (e.g.,
>>> avoiding unjustified state splits).
>>>
>>> On 05.08.2020 08:29, Gábor Márton via cfe-dev wrote:
>>>
>>> Hi,
>>>
>>> I have been working on a prototype that makes it possible to access the
>>> IR from the components of the Clang Static Analyzer.
>>> https://reviews.llvm.org/D85319
>>>
>>> There are many important and useful analyses in the LLVM layer that we
>>> could use during the path-sensitive analysis. Most notably, the "readnone"
>>> and "readonly" function attributes (https://llvm.org/docs/LangRef.html),
>>> which can be used to identify "pure" functions (those without side
>>> effects). In the prototype I use the pureness info from the IR to avoid
>>> invalidating any variables during conservative evaluation (when we
>>> evaluate a pure function). There are cases where we get false positives
>>> exactly because of the overly conservative invalidation.
>>>
>>> Some further ideas to use info from the IR:
>>> - We should invalidate only the arg regions for functions with
>>> "argmemonly" attribute.
>>> - Use the smarter invalidation in cross translation unit analysis too.
>>> We can get the IR for the other TUs as well.
>>> - Run the Attributor
>>> <https://llvm.org/doxygen/structllvm_1_1Attributor.html> passes on the
>>> IR. We could get range values for return values or for arguments. These
>>> range values could then be fed to StdLibraryFunctionsChecker to make the
>>> proper assumptions. We could do this in CTU mode too; these attributes
>>> could form some sort of summary of the functions. Note that I don't
>>> expect a meaningful summary for more than a few percent of all the
>>> available functions.
>>>
>>> Please let me know if you have any further ideas about how we could use
>>> IR attributes (or anything else) during the symbolic execution.
>>>
>>> There are some concerns as well. There may be source code that we cannot
>>> CodeGen but can still analyse with the current CSA. That is why I
>>> suppress CodeGen diagnostics in the prototype. But in the worst case we
>>> may run into assertions in CodeGen, and this could regress the whole
>>> analysis experience. This may happen especially when we get a
>>> compile_commands.json from a project that is compiled only with, e.g., GCC.
>>>
>>> Thanks,
>>> Gabor
>>>
>>>
>>>
>>> _______________________________________________
>>> cfe-dev mailing list
>>> cfe-dev at lists.llvm.org
>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>>
>>>
>>>
>>>
>>
>

