[cfe-dev] Reporting UBSan issues at compile-time

Thu Mar 23 16:59:20 PDT 2017

On Thu, Mar 23, 2017 at 4:47 PM Vedant Kumar <vsk at apple.com> wrote:

>
>> On Mar 23, 2017, at 4:32 PM, David Blaikie <dblaikie at gmail.com> wrote:
>>
>>
>>
>> On Thu, Mar 23, 2017 at 4:23 PM Vedant Kumar via cfe-dev <
cfe-dev at lists.llvm.org> wrote:
>>
>> > On Mar 23, 2017, at 1:42 PM, Anna Zaks via cfe-dev <
cfe-dev at lists.llvm.org> wrote:
>> >
>> >
>> >> On Mar 23, 2017, at 11:48 AM, Reid Kleckner via cfe-dev <
cfe-dev at lists.llvm.org> wrote:
>> >>
>> >> On Thu, Mar 23, 2017 at 10:55 AM, David Blaikie <dblaikie at gmail.com>
wrote:
>> >> On Thu, Mar 23, 2017 at 10:45 AM Reid Kleckner <rnk at google.com> wrote:
>> >> I don't think it will be feasible to generalize UBSan's knowledge to
the static analyzer.
>> >>
>> >> Why not? The rough idea I meant would be to express the constraints
UBSan is checking into the static analyzer - I realize the current layering
(UBSan being in Clang's IRGen) doesn't make that trivial/obvious, but it
seems to me that the constraints could be shared in some way - with some
work.
>> >>
>> >> Maybe I am not imaginative enough, but I cannot envision a clean way
to express the conditions that trigger UB that is useful for both static
analysis and dynamic instrumentation. The best I can come up with is
AST-level instrumentation: synthesizing AST nodes that can be translated to
IR or used for analysis. That doesn't seem reasonable, so I think getting
ubsan into the static analyzer would end up duplicating the knowledge of
what conditions trigger UB.
>> >>
>> >> The static analyzer CFG is also at best an approximation of the real
CFG, especially for C++.
>> >>
>> >
>> > First, the CFG is not only used by the analyzer, but it is also used
by Sema (ex: unreachable code warnings and uninitialized variables).
>> > Second, while there are corners of C++ that are not supported, it has
high fidelity otherwise.
>> >
>> >> Sure enough - and I believe some of the people working/caring about
it would like to fix that. I think Manuel & Chandler have expressed the
notion that the best way to do that would be to move to a world where the
CFG is used for CodeGen, so it's a single/consistent source of truth.
>> >>
>> >> Yes, we could do all that work, but LLVM's CFG today is already
precise for C++. If we allow ourselves to emit diagnostics from the
middle-end, we can save all that work.
>> >>
>> >> Going down the high-effort path of extending the CFG and abstracting
or duplicating UBSan’s checks as static analyses on that CFG would
definitely provide a better diagnostic experience, but it's worth
re-examining conventional wisdom and exploring alternatives first.
>> >
>> > The idea of analysis based on top of LLVM IR is not new and have been
discussed before. My personal belief is that having access to the AST (or
just code as was written by the user) is very important.
>>
>> UBSan diagnostics provide a filename, line #, and column #. There is
enough information here for a clang-based tool to retrieve an AST node, and
generate an appropriate diagnostic. In the non-LTO case, we'd have an
ASTContext ready to go at the point where we get the diagnostics back from
LLVM. In the LTO case, we could serialize the diagnostics and pass them to
a clang tool for processing.
>>
>>
>> > It ensures we can provide precise diagnostics. It also allows us to
see when users want to suppress an issue report by changing the way the
source code is uttered. For example, allows to tell the developer that they
can suppress with a cast.
>>
>> When users of a hypothetical "-Woptimizer" need to suppress diagnostics,
they could use the no_sanitize function attribute, or sanitizer blacklists.
These are big hammers: if it proves to be too clumsy, we could find a way
to do more fine-grained sanitizer issue suppression.
>>
>>
>> > Ensuring that clang CFG completely supports C++ is a bit challenging
but not insurmountable task. In addition, fixing-up the unsupported corners
in the CFG would benefit not only the clang static analyzer, but all the
other “users” such as clang warnings, clang-tidy, possibly even refactoring
in the future.
>> >
>> > Here is a thread where this has been discussed before:
>> >
http://clang-developers.42468.n3.nabble.com/LLVM-Dev-meeting-Slides-amp-Minutes-from-the-Static-Analyzer-BoF-td4048418.html
>>
>> I've tried to summarize this thread:
>>
>> --- Slides & Minutes from the Static Analyzer BoF ---
>>
>> * Cross Translation Unit Static Analysis: The proposed approach is to
generate summaries for each TU, and to teach the analyses to emit
diagnostics based on the summaries. I don't know the current status of
this, or what using summaries does to false-positive rates.
>>
>> * A Clang IR: Modeling things like temporaries is hard, so is dealing
with language changes. One proposal is to create an alternate IR for C,
then lower it to LLVM IR. The argument is that analyses would be easier to
perform on the "C/C++" IR, because CodeGen/StaticAnalyzer would operate on
the same structures. There are concerns raised about the complexity of this
undertaking
>>
>> * A Rich Type System in LLVM IR: An alternate proposal is to add a rich,
language-agnostic type system to LLVM IR, with pointers back to frontend
AST's. Under this proposal, I'm assuming LLVM would be responsible for
generating diagnostics. There were concerns raised about having
optimizations preserve the type information, about AST node lifetime
issues, and about layering violations (i.e LLVM shouldn't be dealing with
frontend stuff).
>>
>> ---
>>
>> ISTM that work on improving C++ support in the clang CFG can proceed in
parallel to work on reporting sanitizer diagnostics from the middle-end.
I'd like to know what others think about this.
>>
>> Just to recap, the approach I have in mind is different from the options
discussed in the thread Anna linked to. It's advantages are that;
>>
>> * There is a clear path towards reporting issues that arise when LTO is
enabled.
>>
>> * We can design it s.t there aren't false positives, with the caveat
that we will annoy users about bugs in dead code.
>>
>> * It's extensible, in case we ever want to explore statically reporting
diagnostics from other sanitizers.
>>
>> * The classes of UB bugs we'd be able to diagnose statically would
always be the same as the ones we diagnose at runtime, without writing much
(or any) extra code.
>>
>> * It's relatively simple to implement.
>>
>> The main disadvantage are that;
>>
>> * It constitutes a layering violation, because we'd need to teach LLVM
about the names of some ubsan diagnostic handlers. IMO this isn't too
onerous, we already do this with builtins.
>>
>> * It's awkward to have llvm issue diagnostics, because this is usually
done by Sema/clang-tidy/the static analyzer. I think this is something we
could get used to.
>
> I'm not sure this last bulletpoint quite captures the difficulties of
having diagnostics powered by optimizers. The concern is generally that
this makes diagnostics unreliable and surprising to users because what
would appear to be unrelated changes (especially to the degree of changing
the optimization level/tweaks, or maybe changing architecture/target) cause
diagnostics to appear/disappear. This is the sort of behavior that's been
seen with some of GCC's diagnostics that are optimizer-powered, and
something Clang's done a fair bit to avoid.

That's fair, I should have examined that last point in more detail.

I think we could minimize the confusion by making it clear that the
warnings are dependent on optimizations. In part, this would mean writing
good documentation. We'd also need to find the right language for the
warnings. Something like: "warning: foo.cpp:N:M: At the -OX optimization
level, undefined behavior due to <...> was detected [-Woptimizer]". We
could teach clang to emit Fixits in some cases, or even notes which explain
that the code could be removed by the optimizer.

Fixits, notes, and other high precision source information (even in the
case of a basic diagnostic - highlighting the expression range, etc) would
be pretty challenging for a feature like this, btw. The source locations
used by LLVM to report backend diagnostics (LLVM has some - for things like
instruction selection/inline asm issues, etc - as well as the optimization
report diagnostics) are from debug info, only line/col. Stitching that back
to any part of the AST would be pretty ambiguous in most/many cases I'd
imagine.

vedant
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20170323/9f4c5876/attachment.html>