[cfe-dev] Reporting UBSan issues at compile-time

Thu Mar 23 17:55:06 PDT 2017

On 3/23/2017 4:59 PM, David Blaikie via cfe-dev wrote:
> On Thu, Mar 23, 2017 at 4:47 PM Vedant Kumar <vsk at apple.com 
> <mailto:vsk at apple.com>> wrote:
>
>     >
>     >> On Mar 23, 2017, at 4:32 PM, David Blaikie <dblaikie at gmail.com
>     <mailto:dblaikie at gmail.com>> wrote:
>     >>
>     >>
>     >>
>     >> On Thu, Mar 23, 2017 at 4:23 PM Vedant Kumar via cfe-dev
>     <cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>> wrote:
>     >>
>     >> > On Mar 23, 2017, at 1:42 PM, Anna Zaks via cfe-dev
>     <cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>> wrote:
>     >> >
>     >> >
>     >> >> On Mar 23, 2017, at 11:48 AM, Reid Kleckner via cfe-dev
>     <cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>> wrote:
>     >> >>
>     >> >> On Thu, Mar 23, 2017 at 10:55 AM, David Blaikie
>     <dblaikie at gmail.com <mailto:dblaikie at gmail.com>> wrote:
>     >> >> On Thu, Mar 23, 2017 at 10:45 AM Reid Kleckner
>     <rnk at google.com <mailto:rnk at google.com>> wrote:
>     >> >> I don't think it will be feasible to generalize UBSan's
>     knowledge to the static analyzer.
>     >> >>
>     >> >> Why not? The rough idea I meant would be to express the
>     constraints UBSan is checking into the static analyzer - I realize
>     the current layering (UBSan being in Clang's IRGen) doesn't make
>     that trivial/obvious, but it seems to me that the constraints
>     could be shared in some way - with some work.
>     >> >>
>     >> >> Maybe I am not imaginative enough, but I cannot envision a
>     clean way to express the conditions that trigger UB that is useful
>     for both static analysis and dynamic instrumentation. The best I
>     can come up with is AST-level instrumentation: synthesizing AST
>     nodes that can be translated to IR or used for analysis. That
>     doesn't seem reasonable, so I think getting ubsan into the static
>     analyzer would end up duplicating the knowledge of what conditions
>     trigger UB.
>     >> >>
>     >> >> The static analyzer CFG is also at best an approximation of
>     the real CFG, especially for C++.
>     >> >>
>     >> >
>     >> > First, the CFG is not only used by the analyzer, but it is
>     also used by Sema (ex: unreachable code warnings and uninitialized
>     variables).
>     >> > Second, while there are corners of C++ that are not
>     supported, it has high fidelity otherwise.
>     >> >
>     >> >> Sure enough - and I believe some of the people
>     working/caring about it would like to fix that. I think Manuel &
>     Chandler have expressed the notion that the best way to do that
>     would be to move to a world where the CFG is used for CodeGen, so
>     it's a single/consistent source of truth.
>     >> >>
>     >> >> Yes, we could do all that work, but LLVM's CFG today is
>     already precise for C++. If we allow ourselves to emit diagnostics
>     from the middle-end, we can save all that work.
>     >> >>
>     >> >> Going down the high-effort path of extending the CFG and
>     abstracting or duplicating UBSan’s checks as static analyses on
>     that CFG would definitely provide a better diagnostic experience,
>     but it's worth re-examining conventional wisdom and exploring
>     alternatives first.
>     >> >
>     >> > The idea of analysis based on top of LLVM IR is not new and
>     have been discussed before. My personal belief is that having
>     access to the AST (or just code as was written by the user) is
>     very important.
>     >>
>     >> UBSan diagnostics provide a filename, line #, and column #.
>     There is enough information here for a clang-based tool to
>     retrieve an AST node, and generate an appropriate diagnostic. In
>     the non-LTO case, we'd have an ASTContext ready to go at the point
>     where we get the diagnostics back from LLVM. In the LTO case, we
>     could serialize the diagnostics and pass them to a clang tool for
>     processing.
>     >>
>     >>
>     >> > It ensures we can provide precise diagnostics. It also allows
>     us to see when users want to suppress an issue report by changing
>     the way the source code is uttered. For example, allows to tell
>     the developer that they can suppress with a cast.
>     >>
>     >> When users of a hypothetical "-Woptimizer" need to suppress
>     diagnostics, they could use the no_sanitize function attribute, or
>     sanitizer blacklists. These are big hammers: if it proves to be
>     too clumsy, we could find a way to do more fine-grained sanitizer
>     issue suppression.
>     >>
>     >>
>     >> > Ensuring that clang CFG completely supports C++ is a bit
>     challenging but not insurmountable task. In addition, fixing-up
>     the unsupported corners in the CFG would benefit not only the
>     clang static analyzer, but all the other “users” such as clang
>     warnings, clang-tidy, possibly even refactoring in the future.
>     >> >
>     >> > Here is a thread where this has been discussed before:
>     >> >
>     http://clang-developers.42468.n3.nabble.com/LLVM-Dev-meeting-Slides-amp-Minutes-from-the-Static-Analyzer-BoF-td4048418.html
>     >>
>     >> I've tried to summarize this thread:
>     >>
>     >> --- Slides & Minutes from the Static Analyzer BoF ---
>     >>
>     >> * Cross Translation Unit Static Analysis: The proposed approach
>     is to generate summaries for each TU, and to teach the analyses to
>     emit diagnostics based on the summaries. I don't know the current
>     status of this, or what using summaries does to false-positive rates.
>     >>
>     >> * A Clang IR: Modeling things like temporaries is hard, so is
>     dealing with language changes. One proposal is to create an
>     alternate IR for C, then lower it to LLVM IR. The argument is that
>     analyses would be easier to perform on the "C/C++" IR, because
>     CodeGen/StaticAnalyzer would operate on the same structures. There
>     are concerns raised about the complexity of this undertaking
>     >>
>     >> * A Rich Type System in LLVM IR: An alternate proposal is to
>     add a rich, language-agnostic type system to LLVM IR, with
>     pointers back to frontend AST's. Under this proposal, I'm assuming
>     LLVM would be responsible for generating diagnostics. There were
>     concerns raised about having optimizations preserve the type
>     information, about AST node lifetime issues, and about layering
>     violations (i.e LLVM shouldn't be dealing with frontend stuff).
>     >>
>     >> ---
>     >>
>     >> ISTM that work on improving C++ support in the clang CFG can
>     proceed in parallel to work on reporting sanitizer diagnostics
>     from the middle-end. I'd like to know what others think about this.
>     >>
>     >> Just to recap, the approach I have in mind is different from
>     the options discussed in the thread Anna linked to. It's
>     advantages are that;
>     >>
>     >> * There is a clear path towards reporting issues that arise
>     when LTO is enabled.
>     >>
>     >> * We can design it s.t there aren't false positives, with the
>     caveat that we will annoy users about bugs in dead code.
>     >>
>     >> * It's extensible, in case we ever want to explore statically
>     reporting diagnostics from other sanitizers.
>     >>
>     >> * The classes of UB bugs we'd be able to diagnose statically
>     would always be the same as the ones we diagnose at runtime,
>     without writing much (or any) extra code.
>     >>
>     >> * It's relatively simple to implement.
>     >>
>     >> The main disadvantage are that;
>     >>
>     >> * It constitutes a layering violation, because we'd need to
>     teach LLVM about the names of some ubsan diagnostic handlers. IMO
>     this isn't too onerous, we already do this with builtins.
>     >>
>     >> * It's awkward to have llvm issue diagnostics, because this is
>     usually done by Sema/clang-tidy/the static analyzer. I think this
>     is something we could get used to.
>     >
>     > I'm not sure this last bulletpoint quite captures the
>     difficulties of having diagnostics powered by optimizers. The
>     concern is generally that this makes diagnostics unreliable and
>     surprising to users because what would appear to be unrelated
>     changes (especially to the degree of changing the optimization
>     level/tweaks, or maybe changing architecture/target) cause
>     diagnostics to appear/disappear. This is the sort of behavior
>     that's been seen with some of GCC's diagnostics that are
>     optimizer-powered, and something Clang's done a fair bit to avoid.
>
>     That's fair, I should have examined that last point in more detail.
>
>     I think we could minimize the confusion by making it clear that
>     the warnings are dependent on optimizations. In part, this would
>     mean writing good documentation. We'd also need to find the right
>     language for the warnings. Something like: "warning: foo.cpp:N:M:
>     At the -OX optimization level, undefined behavior due to <...> was
>     detected [-Woptimizer]". We could teach clang to emit Fixits in
>     some cases, or even notes which explain that the code could be
>     removed by the optimizer.
>
>
> Fixits, notes, and other high precision source information (even in 
> the case of a basic diagnostic - highlighting the expression range, 
> etc) would be pretty challenging for a feature like this, btw. The 
> source locations used by LLVM to report backend diagnostics (LLVM has 
> some - for things like instruction selection/inline asm issues, etc - 
> as well as the optimization report diagnostics) are from debug info, 
> only line/col. Stitching that back to any part of the AST would be 
> pretty ambiguous in most/many cases I'd imagine.

The source locations clang uses for inline asm diagnostics are not from 
debug info; we serialize a clang::SourceLocation into the metadata.  See 
http://llvm.org/docs/LangRef.html#inline-asm-metadata .  We could (and 
probably should) extend this to other diagnostics.

Granted, that isn't really the problem here.  The static analyzer is 
very carefully designed to generate diagnostics which are useful; it 
shows the assumptions it makes, and a dynamic path through the 
function.  With diagnostics coming from the optimizer, we have no way to 
explain why a function has undefined behavior: we throw away the CFG in 
the process of optimizing the code.

-Eli

-- 
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20170323/2167a349/attachment.html>