<div dir="ltr"><div>Speaking of the pipeline, I think we should strive for a general architecture.</div><div><br></div><div>Basically, the proposal is using the analyses of LLVM IR as an oracle for certain properties of conservatively evaluated functions. In the (possibly far) future, Clang might get additional analyses based on the CFG or a new middle level IR. With the optimal solution it should be possible to replace, add, remove, or maybe even combine oracles easily. I do not insist on large efforts for generalizing as we do not have multiple oracles to verify the approach, but whenever we make a design decision I think this is something that we want to keep in mind.<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, 6 Aug 2020 at 21:42, Gábor Márton <<a href="mailto:martongabesz@gmail.com">martongabesz@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Yes, this is a good point. And a reason to assemble our own CSA specific llvm pipeline to avoid such removal of the static functions. We may want to skip the inliner pass. Or ... I assume there is a module pass that removes the unused static function, so as a better alternative, we could skip that from the pipeline.</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Aug 6, 2020 at 8:24 PM Gábor Horváth <<a href="mailto:xazax.hun@gmail.com" target="_blank">xazax.hun@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>I like the idea of piggybacking some analysis in the LLVM IR. However, I have some concerns as well. I am not well versed in the LLVM optimizer, but I do see potential side effects. E.g. what if a static function is inlined to ALL call sites, thus the original function can be removed. We will no longer be able to get all the useful info for that function? It would be unfortunate if the analysis result would depend on inlining heuristics. It would make the analyzer even harder to debug or understand.<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, 6 Aug 2020 at 19:20, Artem Dergachev via cfe-dev <<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

  <div>

    Umm, ok!~<br>

    <br>

    Static analysis is commonly run in debug builds and those are

    typically unoptimized. It is not common for a project to have a

    release+asserts build but we are relying on asserts for analysis, so

    debug builds are commonly used for analysis. If your project

    completely ignores debug builds its usefulness drops a lot.<br>

    <br>

    Sounds like we want to disconnect this new fake codegen from

    compiler flags entirely. Like, the AST will depend on compiler

    flags, but we should not be taking -O flags into account at all, but

    pick some default -O2 regardless of flags; and ideally all flags

    should be ignored by default, to ensure experience as consistent as

    possible.<br>

    <br>

    You'd also have to make sure that running CodeGen doesn't have

    unwanted side effects such as emitting a .o file.<br>

    <br>

    Would something like that actually work?<br>

    <br>

    And if it would, would this also address the usual concerns about

    making warnings depend on optimizations? Because, like,

    optimizations now remain consistent and no longer depend on

    optimization flags used for actual code generation or interact with

    code generation; they're now simply another analysis performed on

    the AST that depends solely on the AST.<br>

    <br>

    <div>On 8/6/20 2:06 AM, Gábor Márton wrote:<br>

    </div>

    <blockquote type="cite">

      <div dir="ltr">

        <div dir="ltr">> you're "just" generating llvm::Function for

          a given AST FunctionDecl "real quick" and looking at the

          attributes. This is happening on-demand and cached, right?</div>

        <div>This works differently. We generate the llvm code for the

          whole translation unit during parsing. It is the Parser and

          the Sema that calls into the callbacks of the CodeGenerator

          via the ASTConsumer interface. This is the exact same

          mechanism that is used for the Backend (see the

          BackendConsumer). We register both the CodeGenerator ast

          consumer and the AnalysisAstConsumer with the AnalysisAction

          (we use a MultiplexConsumer). By the time we start the

          symbolic execution in AnalysisConsumer::HandleTranslationUnit,

          the CodeGen is already done (since CodeGen is added first to

          the MultiplexConsumer so its HandleTranslationUnit and other

          callbacks are called back earlier). About caching, the llvm

          code is cached, we generate that only once, then during the

          function call evaluation we search it in the llvm::Module

          using the mangled name as the key (we don't cache the mangled

          names now, but we could).</div>

        <div>It would be possible to directly call the callbacks of the

          CodeGenerator on-demand, without registering that to the

          FrontendAction. Actually, my first attempt was to call

          HandleTopLevelDecl for a given FunctionDecl on demand when we

          needed the llvm code. However, this is a false attempt for the

          following reasons: (1) Could not support ObjC/C++ because I

          could not get all the information that the Sema has when it

          calls to HandleTopLevelDeclInObjCContainer. In fact, I think

          it is not supported to call these callbacks directly, just

          indirectly through a registered ASTConsumer because we may not

          know how the Parser and the Sema calls to these. (2) It is not

          enough to get the llvm code for a function in isolation.

          E.g., for the "readonly" attribute we must enable alias

          analysis on global variables (see GlobalsAAResult), so we must

          emit llvm code for global variables.</div>

        <div><br>

        </div>

        <div>> 1.1. But it sounds like for the CTU users it may

          amplify the imperfections of ASTImporter.</div>

        <div>> 2.1. Again, it's worse with CTU because imported ASTs

          have so far never been tested for compatibility with CodeGen.</div>

        <div>We should not call the CodeGen on a merged AST. ASTImporter

          does not support the ASTConsumer interface. In the case of

          CTU, I think we should generate the IR for each TU in

          isolation. And we should probably want to extend the

          CrossTranslationUnit interface to give back the llvm::Function

          for a given FunctionDecl. Or we could make this more

          transparent and the IRContext in this prototype could be CTU

          aware.</div>

        <div><br>

        </div>

        <div>> Just to be clear, we should definitely avoid having

          our analysis results depend on optimization levels. It should

          be possible to avoid that, right? </div>

        <div>There is a dependency we will never be able to get rid of:

          CodeGen generates<a href="https://llvm.org/docs/LangRef.html#memory-use-markers" target="_blank"> lifetime markers</a> only when the

          optimization level is greater or eq to 2 (-O2, -O3) .These

          lifetime markers are needed to get the precise pureness info

          out of GlobalsAA.</div>

        <div><br>

        </div>

        <div>> The way i imagined this, we're only interested in

          picking up LLVM analyses, which can be run over unoptimized IR

          just fine(?)</div>

        <div>Yes, but we need to set the optimization level so CodeGen

          generates lifetime markers. Indeed, there are many llvm

          analyses that simply do not change the IR and just populate

          their results. And we could simply use the results in CSA.<br>

        </div>

        <div>> We should probably not be optimizing the IR at all in

          the process(?)<br>

        </div>

        <div>Some llvm passes may invalidate the results of previous

          analyses and then we need to rerun those. I am not an expert,

          but I think if we run an analysis again after another analysis

          that optimizes the IR (i.e truncates it) then our results

          could be more precise. And that is the reason why we see

          multiple passes for the same analyses when we do

          optimizations. And perhaps this is the exact job of the

          PassManager to orchestrate this (?). </div>

        <div>There are passes that extend the IR

          (e.g InferFunctionAttrsPass), we may not need these strictly

          speaking, but I really don't know how the different analyses

          use the function attributes.<br>

        </div>

        <div>Maybe we need the IR both in unoptimized form and in

          optimized form. Also, we may want to have our own CSA specific

          pipeline, but having the default O2 pipeline seems to simplify

          things.</div>

        <br>

        <div class="gmail_quote">

          <div dir="ltr" class="gmail_attr">On Wed, Aug 5, 2020 at 11:22

            PM Artem Dergachev <<a href="mailto:noqnoqneo@gmail.com" target="_blank">noqnoqneo@gmail.com</a>>

            wrote:<br>

          </div>

          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

            <div> Just to be clear, we should definitely avoid having

              our analysis results depend on optimization levels. It

              should be possible to avoid that, right? The way i

              imagined this, we're only interested in picking up LLVM

              analyses, which can be run over unoptimized IR just

              fine(?) We should probably not be optimizing the IR at all

              in the process(?)<br>

              <br>

              <div>On 05.08.2020 12:17, Artem Dergachev wrote:<br>

              </div>

              <blockquote type="cite"> I'm excited that this is actually

                moving somewhere!<br>

                <br>

                Let's see what consequences do we have here. I have some

                thoughts but i don't immediately see any architecturally

                catastrophic consequences; you're "just" generating

                llvm::Function for a given AST FunctionDecl "real quick"

                and looking at the attributes. This is happening

                on-demand and cached, right??? I'd love to hear more

                opinions. Here's what i see:<br>

                <br>

                1. We can no longer mutate the AST for analysis purposes

                without the risk of screwing up subsequent codegen. And

                the risk would be pretty high because hand-crafting ASTs

                is extremely difficult. Good thing we aren't actually

                doing this.<br>

                    1.1. But it sounds like for the CTU users it may

                amplify the imperfections of ASTImporter.<br>

                <br>

                2. Ok, yeah, we now may have crashes in CodeGen during

                analysis. Normally they shouldn't be that bad because

                this would mean that CodeGen would crash during normal

                compilation as well. And that's rare; codegen crashes

                are much more rare than analyzer crashes. Of course a

                difference can be triggered by #ifndef

                __clang_analyzer__ but it still remains a proof of valid

                crashing code, so that should be rare.<br>

                    2.1. Again, it's worse with CTU because imported

                ASTs have so far never been tested for compatibility

                with CodeGen.<br>

                <br>

                Let's also talk about the benefits. First of all, *we

                still need the source code available during analysis*.

                This isn't about peeking into binary dependencies and it

                doesn't immediately aid CTU in any way; this is entirely

                about improving upon conservative evaluation on the

                currently available AST, for functions that are already

                available for inlining but are not being inlined for

                whatever reason. In fact, in some cases we may later

                prefer such LLVM IR-based evaluation to inlining, which

                may improve analysis performance (i.e., less path

                explosion) *and* correctness (eg., avoid unjustified

                state splits).<br>

                <br>

                <div>On 05.08.2020 08:29, Gábor Márton via cfe-dev

                  wrote:<br>

                </div>

                <blockquote type="cite">

                  <div dir="ltr">Hi,<br>

                    <div><br>

                    </div>

                    <div>I have been working on a prototype that makes

                      it possible to access the IR from the components

                      of the Clang Static Analyzer.</div>

                    <div><a href="https://reviews.llvm.org/D85319" target="_blank">https://reviews.llvm.org/D85319</a><br>

                    </div>

                    <div><br>

                    </div>

                    <div>There are many important and useful analyses in

                      the LLVM layer that we can use during the path

                      sensitive analysis. Most notably, the "readnone"

                      and "readonly" function attributes (<a href="https://llvm.org/docs/LangRef.html" target="_blank">https://llvm.org/docs/LangRef.html</a>)

                      which can be used to identify "pure" functions

                      (those without side effects). In the prototype I

                      am using the pureness info from the IR to avoid

                      invalidation of any variables during conservative

                      evaluation (when we evaluate a pure function).

                      There are cases when we get false positives

                      exactly because of the too conservative

                      invalidation.</div>

                    <div><br>

                    </div>

                    <div>Some further ideas to use info from the IR:</div>

                    <div>- We should invalidate only the arg regions for

                      functions with "argmemonly" attribute.</div>

                    <div>- Use the smarter invalidation in cross

                      translation unit analysis too. We can get the IR

                      for the other TUs as well.</div>

                    <div>- Run the <a href="https://llvm.org/doxygen/structllvm_1_1Attributor.html" target="_blank">Attributor</a>

                      passes on the IR. We could get range values for

                      return values or for arguments. These range values

                      then could be fed to StdLibraryFunctionsChecker to

                      make the proper assumptions. And we could do this

                      in CTU mode too, these attributes could form some

                      sort of a summary of these functions. Note that I

                      don't expect a meaningful summary for more than a

                      few percent of all the available functions.</div>

                    <div><br>

                    </div>

                    <div>Please let me know if you have any further

                      ideas about how we could use IR attributes (or

                      anything else) during the symbolic execution.</div>

                    <div><br>

                    </div>

                    <div>There are some concerns as well. There may be

                      some source code that we cannot CodeGen, but we

                      can still analyse with the current CSA. That is

                      why I suppress CodeGen diagnostics in the

                      prototype. But in the worst case we may run into

                      assertions in the CodeGen and this may cause

                      regression in the whole analysis experience. This

                      may be the case especially when we get a

                      compile_commands.json from a project that is

                      compiled only with e.g. GCC.</div>

                    <div><br>

                    </div>

                    <div>Thanks,</div>

                    <div>Gabor</div>

                    <div><br>

                    </div>

                    <div> </div>

                  </div>

                  <br>

                  <fieldset></fieldset>

                  <pre>_______________________________________________

cfe-dev mailing list

<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a>

<a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev" target="_blank">https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev</a>

</pre>

                </blockquote>

                <br>

              </blockquote>

              <br>

            </div>

          </blockquote>

        </div>

      </div>

    </blockquote>

    <br>

  </div>

_______________________________________________<br>

cfe-dev mailing list<br>

<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a><br>

<a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev" rel="noreferrer" target="_blank">https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev</a><br>

</blockquote></div>

</blockquote></div>

</blockquote></div>