<div dir="ltr">Artem, John,<div><br></div><div>How should we proceed with this?</div><div><br></div><div>John, you mention in the patch that this is a huge architectural change. Could you please elaborate? Are you concerned about the additional libs that are being linked to the static analyzer libraries? The clang binary is already dependent on LLVM libs and on the CodeGen and CSA is builtin to the clang binary. Are you concerned about having a MultiplexConsumer as an ASTConsumer? ... I am open to any suggestions, but I need more input from you.</div><div><br></div><div>Many thanks,</div><div>Gabor</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Aug 11, 2020 at 5:49 PM Gábor Márton <<a href="mailto:martongabesz@gmail.com">martongabesz@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">I have updated the patch and addressed all concerns hopefully. Now the pipeline contains only those passes that are needed to get the pureness information (GlobalsAA and PostOrderFunctionAttrs). So, we have our CSA specific pipeline now. I also added some unittest (and changed the lit test) to demonstrate that we can get attributes for static functions.<div>Inlining is now omitted from our pipeline, but I have a gut feeling that this could result in less precise results for some other llvm analyses which we might want to run in the future. But, for now let's keep the pipeline to the minimum, later we may have several pipelines for different needs.</div><div><br></div><div>> Clang might get additional analyses based on the CFG or a new middle level IR.</div><div>Whenever we'll have middle level IR, then we could build middle level pipelines with similar changes in the architecture: adding a new ASTConsumer for the middle level codegen. But this does not seem to happen in the foreseeable future, so I'd suggest let's focus on the LLVM IR for now.</div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Aug 7, 2020 at 10:35 AM Gábor Horváth <<a href="mailto:xazax.hun@gmail.com" target="_blank">xazax.hun@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Speaking of the pipeline, I think we should strive for a general architecture.</div><div><br></div><div>Basically, the proposal is using the analyses of LLVM IR as an oracle for certain properties of conservatively evaluated functions. In the (possibly far) future, Clang might get additional analyses based on the CFG or a new middle level IR. With the optimal solution it should be possible to replace, add, remove, or maybe even combine oracles easily. I do not insist on large efforts for generalizing as we do not have multiple oracles to verify the approach, but whenever we make a design decision I think this is something that we want to keep in mind.<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, 6 Aug 2020 at 21:42, Gábor Márton <<a href="mailto:martongabesz@gmail.com" target="_blank">martongabesz@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Yes, this is a good point. And a reason to assemble our own CSA specific llvm pipeline to avoid such removal of the static functions. We may want to skip the inliner pass. Or ... I assume there is a module pass that removes the unused static function, so as a better alternative, we could skip that from the pipeline.</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Aug 6, 2020 at 8:24 PM Gábor Horváth <<a href="mailto:xazax.hun@gmail.com" target="_blank">xazax.hun@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>I like the idea of piggybacking some analysis in the LLVM IR. However, I have some concerns as well. I am not well versed in the LLVM optimizer, but I do see potential side effects. E.g. what if a static function is inlined to ALL call sites, thus the original function can be removed. We will no longer be able to get all the useful info for that function? It would be unfortunate if the analysis result would depend on inlining heuristics. It would make the analyzer even harder to debug or understand.<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, 6 Aug 2020 at 19:20, Artem Dergachev via cfe-dev <<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
Umm, ok!~<br>
<br>
Static analysis is commonly run in debug builds and those are
typically unoptimized. It is not common for a project to have a
release+asserts build but we are relying on asserts for analysis, so
debug builds are commonly used for analysis. If your project
completely ignores debug builds its usefulness drops a lot.<br>
<br>
Sounds like we want to disconnect this new fake codegen from
compiler flags entirely. Like, the AST will depend on compiler
flags, but we should not be taking -O flags into account at all, but
pick some default -O2 regardless of flags; and ideally all flags
should be ignored by default, to ensure experience as consistent as
possible.<br>
<br>
You'd also have to make sure that running CodeGen doesn't have
unwanted side effects such as emitting a .o file.<br>
<br>
Would something like that actually work?<br>
<br>
And if it would, would this also address the usual concerns about
making warnings depend on optimizations? Because, like,
optimizations now remain consistent and no longer depend on
optimization flags used for actual code generation or interact with
code generation; they're now simply another analysis performed on
the AST that depends solely on the AST.<br>
<br>
<div>On 8/6/20 2:06 AM, Gábor Márton wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div dir="ltr">> you're "just" generating llvm::Function for
a given AST FunctionDecl "real quick" and looking at the
attributes. This is happening on-demand and cached, right?</div>
<div>This works differently. We generate the llvm code for the
whole translation unit during parsing. It is the Parser and
the Sema that calls into the callbacks of the CodeGenerator
via the ASTConsumer interface. This is the exact same
mechanism that is used for the Backend (see the
BackendConsumer). We register both the CodeGenerator ast
consumer and the AnalysisAstConsumer with the AnalysisAction
(we use a MultiplexConsumer). By the time we start the
symbolic execution in AnalysisConsumer::HandleTranslationUnit,
the CodeGen is already done (since CodeGen is added first to
the MultiplexConsumer so its HandleTranslationUnit and other
callbacks are called back earlier). About caching, the llvm
code is cached, we generate that only once, then during the
function call evaluation we search it in the llvm::Module
using the mangled name as the key (we don't cache the mangled
names now, but we could).</div>
<div>It would be possible to directly call the callbacks of the
CodeGenerator on-demand, without registering that to the
FrontendAction. Actually, my first attempt was to call
HandleTopLevelDecl for a given FunctionDecl on demand when we
needed the llvm code. However, this is a false attempt for the
following reasons: (1) Could not support ObjC/C++ because I
could not get all the information that the Sema has when it
calls to HandleTopLevelDeclInObjCContainer. In fact, I think
it is not supported to call these callbacks directly, just
indirectly through a registered ASTConsumer because we may not
know how the Parser and the Sema calls to these. (2) It is not
enough to get the llvm code for a function in isolation.
E.g., for the "readonly" attribute we must enable alias
analysis on global variables (see GlobalsAAResult), so we must
emit llvm code for global variables.</div>
<div><br>
</div>
<div>> 1.1. But it sounds like for the CTU users it may
amplify the imperfections of ASTImporter.</div>
<div>> 2.1. Again, it's worse with CTU because imported ASTs
have so far never been tested for compatibility with CodeGen.</div>
<div>We should not call the CodeGen on a merged AST. ASTImporter
does not support the ASTConsumer interface. In the case of
CTU, I think we should generate the IR for each TU in
isolation. And we should probably want to extend the
CrossTranslationUnit interface to give back the llvm::Function
for a given FunctionDecl. Or we could make this more
transparent and the IRContext in this prototype could be CTU
aware.</div>
<div><br>
</div>
<div>> Just to be clear, we should definitely avoid having
our analysis results depend on optimization levels. It should
be possible to avoid that, right? </div>
<div>There is a dependency we will never be able to get rid of:
CodeGen generates<a href="https://llvm.org/docs/LangRef.html#memory-use-markers" target="_blank"> lifetime markers</a> only when the
optimization level is greater or eq to 2 (-O2, -O3) .These
lifetime markers are needed to get the precise pureness info
out of GlobalsAA.</div>
<div><br>
</div>
<div>> The way i imagined this, we're only interested in
picking up LLVM analyses, which can be run over unoptimized IR
just fine(?)</div>
<div>Yes, but we need to set the optimization level so CodeGen
generates lifetime markers. Indeed, there are many llvm
analyses that simply do not change the IR and just populate
their results. And we could simply use the results in CSA.<br>
</div>
<div>> We should probably not be optimizing the IR at all in
the process(?)<br>
</div>
<div>Some llvm passes may invalidate the results of previous
analyses and then we need to rerun those. I am not an expert,
but I think if we run an analysis again after another analysis
that optimizes the IR (i.e truncates it) then our results
could be more precise. And that is the reason why we see
multiple passes for the same analyses when we do
optimizations. And perhaps this is the exact job of the
PassManager to orchestrate this (?). </div>
<div>There are passes that extend the IR
(e.g InferFunctionAttrsPass), we may not need these strictly
speaking, but I really don't know how the different analyses
use the function attributes.<br>
</div>
<div>Maybe we need the IR both in unoptimized form and in
optimized form. Also, we may want to have our own CSA specific
pipeline, but having the default O2 pipeline seems to simplify
things.</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Wed, Aug 5, 2020 at 11:22
PM Artem Dergachev <<a href="mailto:noqnoqneo@gmail.com" target="_blank">noqnoqneo@gmail.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div> Just to be clear, we should definitely avoid having
our analysis results depend on optimization levels. It
should be possible to avoid that, right? The way i
imagined this, we're only interested in picking up LLVM
analyses, which can be run over unoptimized IR just
fine(?) We should probably not be optimizing the IR at all
in the process(?)<br>
<br>
<div>On 05.08.2020 12:17, Artem Dergachev wrote:<br>
</div>
<blockquote type="cite"> I'm excited that this is actually
moving somewhere!<br>
<br>
Let's see what consequences do we have here. I have some
thoughts but i don't immediately see any architecturally
catastrophic consequences; you're "just" generating
llvm::Function for a given AST FunctionDecl "real quick"
and looking at the attributes. This is happening
on-demand and cached, right??? I'd love to hear more
opinions. Here's what i see:<br>
<br>
1. We can no longer mutate the AST for analysis purposes
without the risk of screwing up subsequent codegen. And
the risk would be pretty high because hand-crafting ASTs
is extremely difficult. Good thing we aren't actually
doing this.<br>
1.1. But it sounds like for the CTU users it may
amplify the imperfections of ASTImporter.<br>
<br>
2. Ok, yeah, we now may have crashes in CodeGen during
analysis. Normally they shouldn't be that bad because
this would mean that CodeGen would crash during normal
compilation as well. And that's rare; codegen crashes
are much more rare than analyzer crashes. Of course a
difference can be triggered by #ifndef
__clang_analyzer__ but it still remains a proof of valid
crashing code, so that should be rare.<br>
2.1. Again, it's worse with CTU because imported
ASTs have so far never been tested for compatibility
with CodeGen.<br>
<br>
Let's also talk about the benefits. First of all, *we
still need the source code available during analysis*.
This isn't about peeking into binary dependencies and it
doesn't immediately aid CTU in any way; this is entirely
about improving upon conservative evaluation on the
currently available AST, for functions that are already
available for inlining but are not being inlined for
whatever reason. In fact, in some cases we may later
prefer such LLVM IR-based evaluation to inlining, which
may improve analysis performance (i.e., less path
explosion) *and* correctness (eg., avoid unjustified
state splits).<br>
<br>
<div>On 05.08.2020 08:29, Gábor Márton via cfe-dev
wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">Hi,<br>
<div><br>
</div>
<div>I have been working on a prototype that makes
it possible to access the IR from the components
of the Clang Static Analyzer.</div>
<div><a href="https://reviews.llvm.org/D85319" target="_blank">https://reviews.llvm.org/D85319</a><br>
</div>
<div><br>
</div>
<div>There are many important and useful analyses in
the LLVM layer that we can use during the path
sensitive analysis. Most notably, the "readnone"
and "readonly" function attributes (<a href="https://llvm.org/docs/LangRef.html" target="_blank">https://llvm.org/docs/LangRef.html</a>)
which can be used to identify "pure" functions
(those without side effects). In the prototype I
am using the pureness info from the IR to avoid
invalidation of any variables during conservative
evaluation (when we evaluate a pure function).
There are cases when we get false positives
exactly because of the too conservative
invalidation.</div>
<div><br>
</div>
<div>Some further ideas to use info from the IR:</div>
<div>- We should invalidate only the arg regions for
functions with "argmemonly" attribute.</div>
<div>- Use the smarter invalidation in cross
translation unit analysis too. We can get the IR
for the other TUs as well.</div>
<div>- Run the <a href="https://llvm.org/doxygen/structllvm_1_1Attributor.html" target="_blank">Attributor</a>
passes on the IR. We could get range values for
return values or for arguments. These range values
then could be fed to StdLibraryFunctionsChecker to
make the proper assumptions. And we could do this
in CTU mode too, these attributes could form some
sort of a summary of these functions. Note that I
don't expect a meaningful summary for more than a
few percent of all the available functions.</div>
<div><br>
</div>
<div>Please let me know if you have any further
ideas about how we could use IR attributes (or
anything else) during the symbolic execution.</div>
<div><br>
</div>
<div>There are some concerns as well. There may be
some source code that we cannot CodeGen, but we
can still analyse with the current CSA. That is
why I suppress CodeGen diagnostics in the
prototype. But in the worst case we may run into
assertions in the CodeGen and this may cause
regression in the whole analysis experience. This
may be the case especially when we get a
compile_commands.json from a project that is
compiled only with e.g. GCC.</div>
<div><br>
</div>
<div>Thanks,</div>
<div>Gabor</div>
<div><br>
</div>
<div> </div>
</div>
<br>
<fieldset></fieldset>
<pre>_______________________________________________
cfe-dev mailing list
<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a>
<a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev" target="_blank">https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev</a>
</pre>
</blockquote>
<br>
</blockquote>
<br>
</div>
</blockquote>
</div>
</div>
</blockquote>
<br>
</div>
_______________________________________________<br>
cfe-dev mailing list<br>
<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a><br>
<a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev" rel="noreferrer" target="_blank">https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev</a><br>
</blockquote></div>
</blockquote></div>
</blockquote></div>
</blockquote></div>
</blockquote></div>