[cfe-dev] Reporting UBSan issues at compile-time
Friedman, Eli via cfe-dev
cfe-dev at lists.llvm.org
Thu Mar 23 17:55:06 PDT 2017
On 3/23/2017 4:59 PM, David Blaikie via cfe-dev wrote:
> On Thu, Mar 23, 2017 at 4:47 PM Vedant Kumar <vsk at apple.com
> <mailto:vsk at apple.com>> wrote:
>
> >
> >> On Mar 23, 2017, at 4:32 PM, David Blaikie <dblaikie at gmail.com
> <mailto:dblaikie at gmail.com>> wrote:
> >>
> >>
> >>
> >> On Thu, Mar 23, 2017 at 4:23 PM Vedant Kumar via cfe-dev
> <cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>> wrote:
> >>
> >> > On Mar 23, 2017, at 1:42 PM, Anna Zaks via cfe-dev
> <cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>> wrote:
> >> >
> >> >
> >> >> On Mar 23, 2017, at 11:48 AM, Reid Kleckner via cfe-dev
> <cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>> wrote:
> >> >>
> >> >> On Thu, Mar 23, 2017 at 10:55 AM, David Blaikie
> <dblaikie at gmail.com <mailto:dblaikie at gmail.com>> wrote:
> >> >> On Thu, Mar 23, 2017 at 10:45 AM Reid Kleckner
> <rnk at google.com <mailto:rnk at google.com>> wrote:
> >> >> I don't think it will be feasible to generalize UBSan's
> knowledge to the static analyzer.
> >> >>
> >> >> Why not? The rough idea I meant would be to express the
> constraints UBSan is checking into the static analyzer - I realize
> the current layering (UBSan being in Clang's IRGen) doesn't make
> that trivial/obvious, but it seems to me that the constraints
> could be shared in some way - with some work.
> >> >>
> >> >> Maybe I am not imaginative enough, but I cannot envision a
> clean way to express the conditions that trigger UB that is useful
> for both static analysis and dynamic instrumentation. The best I
> can come up with is AST-level instrumentation: synthesizing AST
> nodes that can be translated to IR or used for analysis. That
> doesn't seem reasonable, so I think getting ubsan into the static
> analyzer would end up duplicating the knowledge of what conditions
> trigger UB.
> >> >>
> >> >> The static analyzer CFG is also at best an approximation of
> the real CFG, especially for C++.
> >> >>
> >> >
> >> > First, the CFG is not only used by the analyzer, but it is
> also used by Sema (ex: unreachable code warnings and uninitialized
> variables).
> >> > Second, while there are corners of C++ that are not
> supported, it has high fidelity otherwise.
> >> >
> >> >> Sure enough - and I believe some of the people
> working/caring about it would like to fix that. I think Manuel &
> Chandler have expressed the notion that the best way to do that
> would be to move to a world where the CFG is used for CodeGen, so
> it's a single/consistent source of truth.
> >> >>
> >> >> Yes, we could do all that work, but LLVM's CFG today is
> already precise for C++. If we allow ourselves to emit diagnostics
> from the middle-end, we can save all that work.
> >> >>
> >> >> Going down the high-effort path of extending the CFG and
> abstracting or duplicating UBSan’s checks as static analyses on
> that CFG would definitely provide a better diagnostic experience,
> but it's worth re-examining conventional wisdom and exploring
> alternatives first.
> >> >
> >> > The idea of analysis based on top of LLVM IR is not new and
> have been discussed before. My personal belief is that having
> access to the AST (or just code as was written by the user) is
> very important.
> >>
> >> UBSan diagnostics provide a filename, line #, and column #.
> There is enough information here for a clang-based tool to
> retrieve an AST node, and generate an appropriate diagnostic. In
> the non-LTO case, we'd have an ASTContext ready to go at the point
> where we get the diagnostics back from LLVM. In the LTO case, we
> could serialize the diagnostics and pass them to a clang tool for
> processing.
> >>
> >>
> >> > It ensures we can provide precise diagnostics. It also allows
> us to see when users want to suppress an issue report by changing
> the way the source code is uttered. For example, allows to tell
> the developer that they can suppress with a cast.
> >>
> >> When users of a hypothetical "-Woptimizer" need to suppress
> diagnostics, they could use the no_sanitize function attribute, or
> sanitizer blacklists. These are big hammers: if it proves to be
> too clumsy, we could find a way to do more fine-grained sanitizer
> issue suppression.
> >>
> >>
> >> > Ensuring that clang CFG completely supports C++ is a bit
> challenging but not insurmountable task. In addition, fixing-up
> the unsupported corners in the CFG would benefit not only the
> clang static analyzer, but all the other “users” such as clang
> warnings, clang-tidy, possibly even refactoring in the future.
> >> >
> >> > Here is a thread where this has been discussed before:
> >> >
> http://clang-developers.42468.n3.nabble.com/LLVM-Dev-meeting-Slides-amp-Minutes-from-the-Static-Analyzer-BoF-td4048418.html
> >>
> >> I've tried to summarize this thread:
> >>
> >> --- Slides & Minutes from the Static Analyzer BoF ---
> >>
> >> * Cross Translation Unit Static Analysis: The proposed approach
> is to generate summaries for each TU, and to teach the analyses to
> emit diagnostics based on the summaries. I don't know the current
> status of this, or what using summaries does to false-positive rates.
> >>
> >> * A Clang IR: Modeling things like temporaries is hard, so is
> dealing with language changes. One proposal is to create an
> alternate IR for C, then lower it to LLVM IR. The argument is that
> analyses would be easier to perform on the "C/C++" IR, because
> CodeGen/StaticAnalyzer would operate on the same structures. There
> are concerns raised about the complexity of this undertaking
> >>
> >> * A Rich Type System in LLVM IR: An alternate proposal is to
> add a rich, language-agnostic type system to LLVM IR, with
> pointers back to frontend AST's. Under this proposal, I'm assuming
> LLVM would be responsible for generating diagnostics. There were
> concerns raised about having optimizations preserve the type
> information, about AST node lifetime issues, and about layering
> violations (i.e LLVM shouldn't be dealing with frontend stuff).
> >>
> >> ---
> >>
> >> ISTM that work on improving C++ support in the clang CFG can
> proceed in parallel to work on reporting sanitizer diagnostics
> from the middle-end. I'd like to know what others think about this.
> >>
> >> Just to recap, the approach I have in mind is different from
> the options discussed in the thread Anna linked to. It's
> advantages are that;
> >>
> >> * There is a clear path towards reporting issues that arise
> when LTO is enabled.
> >>
> >> * We can design it s.t there aren't false positives, with the
> caveat that we will annoy users about bugs in dead code.
> >>
> >> * It's extensible, in case we ever want to explore statically
> reporting diagnostics from other sanitizers.
> >>
> >> * The classes of UB bugs we'd be able to diagnose statically
> would always be the same as the ones we diagnose at runtime,
> without writing much (or any) extra code.
> >>
> >> * It's relatively simple to implement.
> >>
> >> The main disadvantage are that;
> >>
> >> * It constitutes a layering violation, because we'd need to
> teach LLVM about the names of some ubsan diagnostic handlers. IMO
> this isn't too onerous, we already do this with builtins.
> >>
> >> * It's awkward to have llvm issue diagnostics, because this is
> usually done by Sema/clang-tidy/the static analyzer. I think this
> is something we could get used to.
> >
> > I'm not sure this last bulletpoint quite captures the
> difficulties of having diagnostics powered by optimizers. The
> concern is generally that this makes diagnostics unreliable and
> surprising to users because what would appear to be unrelated
> changes (especially to the degree of changing the optimization
> level/tweaks, or maybe changing architecture/target) cause
> diagnostics to appear/disappear. This is the sort of behavior
> that's been seen with some of GCC's diagnostics that are
> optimizer-powered, and something Clang's done a fair bit to avoid.
>
> That's fair, I should have examined that last point in more detail.
>
> I think we could minimize the confusion by making it clear that
> the warnings are dependent on optimizations. In part, this would
> mean writing good documentation. We'd also need to find the right
> language for the warnings. Something like: "warning: foo.cpp:N:M:
> At the -OX optimization level, undefined behavior due to <...> was
> detected [-Woptimizer]". We could teach clang to emit Fixits in
> some cases, or even notes which explain that the code could be
> removed by the optimizer.
>
>
> Fixits, notes, and other high precision source information (even in
> the case of a basic diagnostic - highlighting the expression range,
> etc) would be pretty challenging for a feature like this, btw. The
> source locations used by LLVM to report backend diagnostics (LLVM has
> some - for things like instruction selection/inline asm issues, etc -
> as well as the optimization report diagnostics) are from debug info,
> only line/col. Stitching that back to any part of the AST would be
> pretty ambiguous in most/many cases I'd imagine.
The source locations clang uses for inline asm diagnostics are not from
debug info; we serialize a clang::SourceLocation into the metadata. See
http://llvm.org/docs/LangRef.html#inline-asm-metadata . We could (and
probably should) extend this to other diagnostics.
Granted, that isn't really the problem here. The static analyzer is
very carefully designed to generate diagnostics which are useful; it
shows the assumptions it makes, and a dynamic path through the
function. With diagnostics coming from the optimizer, we have no way to
explain why a function has undefined behavior: we throw away the CFG in
the process of optimizing the code.
-Eli
--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20170323/2167a349/attachment.html>
More information about the cfe-dev
mailing list