[cfe-dev] Reporting UBSan issues at compile-time

Thu Mar 23 19:32:39 PDT 2017

On Thu, Mar 23, 2017 at 5:55 PM Friedman, Eli <efriedma at codeaurora.org>
wrote:

> On 3/23/2017 4:59 PM, David Blaikie via cfe-dev wrote:
>
> On Thu, Mar 23, 2017 at 4:47 PM Vedant Kumar <vsk at apple.com> wrote:
>
> >
> >> On Mar 23, 2017, at 4:32 PM, David Blaikie <dblaikie at gmail.com> wrote:
> >>
> >>
> >>
> >> On Thu, Mar 23, 2017 at 4:23 PM Vedant Kumar via cfe-dev <
> cfe-dev at lists.llvm.org> wrote:
> >>
> >> > On Mar 23, 2017, at 1:42 PM, Anna Zaks via cfe-dev <
> cfe-dev at lists.llvm.org> wrote:
> >> >
> >> >
> >> >> On Mar 23, 2017, at 11:48 AM, Reid Kleckner via cfe-dev <
> cfe-dev at lists.llvm.org> wrote:
> >> >>
> >> >> On Thu, Mar 23, 2017 at 10:55 AM, David Blaikie <dblaikie at gmail.com>
> wrote:
> >> >> On Thu, Mar 23, 2017 at 10:45 AM Reid Kleckner <rnk at google.com>
> wrote:
> >> >> I don't think it will be feasible to generalize UBSan's knowledge to
> the static analyzer.
> >> >>
> >> >> Why not? The rough idea I meant would be to express the constraints
> UBSan is checking into the static analyzer - I realize the current layering
> (UBSan being in Clang's IRGen) doesn't make that trivial/obvious, but it
> seems to me that the constraints could be shared in some way - with some
> work.
> >> >>
> >> >> Maybe I am not imaginative enough, but I cannot envision a clean way
> to express the conditions that trigger UB that is useful for both static
> analysis and dynamic instrumentation. The best I can come up with is
> AST-level instrumentation: synthesizing AST nodes that can be translated to
> IR or used for analysis. That doesn't seem reasonable, so I think getting
> ubsan into the static analyzer would end up duplicating the knowledge of
> what conditions trigger UB.
> >> >>
> >> >> The static analyzer CFG is also at best an approximation of the real
> CFG, especially for C++.
> >> >>
> >> >
> >> > First, the CFG is not only used by the analyzer, but it is also used
> by Sema (ex: unreachable code warnings and uninitialized variables).
> >> > Second, while there are corners of C++ that are not supported, it has
> high fidelity otherwise.
> >> >
> >> >> Sure enough - and I believe some of the people working/caring about
> it would like to fix that. I think Manuel & Chandler have expressed the
> notion that the best way to do that would be to move to a world where the
> CFG is used for CodeGen, so it's a single/consistent source of truth.
> >> >>
> >> >> Yes, we could do all that work, but LLVM's CFG today is already
> precise for C++. If we allow ourselves to emit diagnostics from the
> middle-end, we can save all that work.
> >> >>
> >> >> Going down the high-effort path of extending the CFG and abstracting
> or duplicating UBSan’s checks as static analyses on that CFG would
> definitely provide a better diagnostic experience, but it's worth
> re-examining conventional wisdom and exploring alternatives first.
> >> >
> >> > The idea of analysis based on top of LLVM IR is not new and have been
> discussed before. My personal belief is that having access to the AST (or
> just code as was written by the user) is very important.
> >>
> >> UBSan diagnostics provide a filename, line #, and column #. There is
> enough information here for a clang-based tool to retrieve an AST node, and
> generate an appropriate diagnostic. In the non-LTO case, we'd have an
> ASTContext ready to go at the point where we get the diagnostics back from
> LLVM. In the LTO case, we could serialize the diagnostics and pass them to
> a clang tool for processing.
> >>
> >>
> >> > It ensures we can provide precise diagnostics. It also allows us to
> see when users want to suppress an issue report by changing the way the
> source code is uttered. For example, allows to tell the developer that they
> can suppress with a cast.
> >>
> >> When users of a hypothetical "-Woptimizer" need to suppress
> diagnostics, they could use the no_sanitize function attribute, or
> sanitizer blacklists. These are big hammers: if it proves to be too clumsy,
> we could find a way to do more fine-grained sanitizer issue suppression.
> >>
> >>
> >> > Ensuring that clang CFG completely supports C++ is a bit challenging
> but not insurmountable task. In addition, fixing-up the unsupported corners
> in the CFG would benefit not only the clang static analyzer, but all the
> other “users” such as clang warnings, clang-tidy, possibly even refactoring
> in the future.
> >> >
> >> > Here is a thread where this has been discussed before:
> >> >
> http://clang-developers.42468.n3.nabble.com/LLVM-Dev-meeting-Slides-amp-Minutes-from-the-Static-Analyzer-BoF-td4048418.html
> >>
> >> I've tried to summarize this thread:
> >>
> >> --- Slides & Minutes from the Static Analyzer BoF ---
> >>
> >> * Cross Translation Unit Static Analysis: The proposed approach is to
> generate summaries for each TU, and to teach the analyses to emit
> diagnostics based on the summaries. I don't know the current status of
> this, or what using summaries does to false-positive rates.
> >>
> >> * A Clang IR: Modeling things like temporaries is hard, so is dealing
> with language changes. One proposal is to create an alternate IR for C,
> then lower it to LLVM IR. The argument is that analyses would be easier to
> perform on the "C/C++" IR, because CodeGen/StaticAnalyzer would operate on
> the same structures. There are concerns raised about the complexity of this
> undertaking
> >>
> >> * A Rich Type System in LLVM IR: An alternate proposal is to add a
> rich, language-agnostic type system to LLVM IR, with pointers back to
> frontend AST's. Under this proposal, I'm assuming LLVM would be responsible
> for generating diagnostics. There were concerns raised about having
> optimizations preserve the type information, about AST node lifetime
> issues, and about layering violations (i.e LLVM shouldn't be dealing with
> frontend stuff).
> >>
> >> ---
> >>
> >> ISTM that work on improving C++ support in the clang CFG can proceed in
> parallel to work on reporting sanitizer diagnostics from the middle-end.
> I'd like to know what others think about this.
> >>
> >> Just to recap, the approach I have in mind is different from the
> options discussed in the thread Anna linked to. It's advantages are that;
> >>
> >> * There is a clear path towards reporting issues that arise when LTO is
> enabled.
> >>
> >> * We can design it s.t there aren't false positives, with the caveat
> that we will annoy users about bugs in dead code.
> >>
> >> * It's extensible, in case we ever want to explore statically reporting
> diagnostics from other sanitizers.
> >>
> >> * The classes of UB bugs we'd be able to diagnose statically would
> always be the same as the ones we diagnose at runtime, without writing much
> (or any) extra code.
> >>
> >> * It's relatively simple to implement.
> >>
> >> The main disadvantage are that;
> >>
> >> * It constitutes a layering violation, because we'd need to teach LLVM
> about the names of some ubsan diagnostic handlers. IMO this isn't too
> onerous, we already do this with builtins.
> >>
> >> * It's awkward to have llvm issue diagnostics, because this is usually
> done by Sema/clang-tidy/the static analyzer. I think this is something we
> could get used to.
> >
> > I'm not sure this last bulletpoint quite captures the difficulties of
> having diagnostics powered by optimizers. The concern is generally that
> this makes diagnostics unreliable and surprising to users because what
> would appear to be unrelated changes (especially to the degree of changing
> the optimization level/tweaks, or maybe changing architecture/target) cause
> diagnostics to appear/disappear. This is the sort of behavior that's been
> seen with some of GCC's diagnostics that are optimizer-powered, and
> something Clang's done a fair bit to avoid.
>
> That's fair, I should have examined that last point in more detail.
>
> I think we could minimize the confusion by making it clear that the
> warnings are dependent on optimizations. In part, this would mean writing
> good documentation. We'd also need to find the right language for the
> warnings. Something like: "warning: foo.cpp:N:M: At the -OX optimization
> level, undefined behavior due to <...> was detected [-Woptimizer]". We
> could teach clang to emit Fixits in some cases, or even notes which explain
> that the code could be removed by the optimizer.
>
>
> Fixits, notes, and other high precision source information (even in the
> case of a basic diagnostic - highlighting the expression range, etc) would
> be pretty challenging for a feature like this, btw. The source locations
> used by LLVM to report backend diagnostics (LLVM has some - for things like
> instruction selection/inline asm issues, etc - as well as the optimization
> report diagnostics) are from debug info, only line/col. Stitching that back
> to any part of the AST would be pretty ambiguous in most/many cases I'd
> imagine.
>
>
> The source locations clang uses for inline asm diagnostics are not from
> debug info; we serialize a clang::SourceLocation into the metadata.  See
> http://llvm.org/docs/LangRef.html#inline-asm-metadata .  We could (and
> probably should) extend this to other diagnostics.
>

Oh, right! neat.

> Granted, that isn't really the problem here.  The static analyzer is very
> carefully designed to generate diagnostics which are useful; it shows the
> assumptions it makes, and a dynamic path through the function.  With
> diagnostics coming from the optimizer, we have no way to explain why a
> function has undefined behavior: we throw away the CFG in the process of
> optimizing the code.
>

Fair points, all. Thanks for mentioning. (& I love that part of the Static
Analyzer - walking through the sequence of execution needed to reach the
problematic case, etc)

- Dave

>
>
> -Eli
>
> --
> Employee of Qualcomm Innovation Center, Inc.
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20170324/d8e75596/attachment.html>