<html>

  <head>

    <meta content="text/html; charset=utf-8" http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <div class="moz-cite-prefix">On 3/23/2017 4:59 PM, David Blaikie via

      cfe-dev wrote:<br>

    </div>

    <blockquote

cite="mid:CAENS6Eup_Kmz2xgNktsm7NJ1T6Pwpao_HPYWxgNh7UZDmWEGYg@mail.gmail.com"

      type="cite">

      <div dir="ltr">

        <div dir="ltr" class="gmail_msg">On Thu, Mar 23, 2017 at 4:47 PM

          Vedant Kumar <<a moz-do-not-send="true"

            href="mailto:vsk@apple.com" class="gmail_msg"

            target="_blank">vsk@apple.com</a>> wrote:<br>

        </div>

        <div class="gmail_quote gmail_msg">

          <blockquote class="gmail_quote gmail_msg" style="margin:0 0 0

            .8ex;border-left:1px #ccc solid;padding-left:1ex">><br

              class="gmail_msg">

            >> On Mar 23, 2017, at 4:32 PM, David Blaikie <<a

              moz-do-not-send="true" href="mailto:dblaikie@gmail.com"

              class="gmail_msg" target="_blank">dblaikie@gmail.com</a>>

            wrote:<br class="gmail_msg">

            >><br class="gmail_msg">

            >><br class="gmail_msg">

            >><br class="gmail_msg">

            >> On Thu, Mar 23, 2017 at 4:23 PM Vedant Kumar via

            cfe-dev <<a moz-do-not-send="true"

              href="mailto:cfe-dev@lists.llvm.org" class="gmail_msg"

              target="_blank">cfe-dev@lists.llvm.org</a>> wrote:<br

              class="gmail_msg">

            >><br class="gmail_msg">

            >> > On Mar 23, 2017, at 1:42 PM, Anna Zaks via

            cfe-dev <<a moz-do-not-send="true"

              href="mailto:cfe-dev@lists.llvm.org" class="gmail_msg"

              target="_blank">cfe-dev@lists.llvm.org</a>> wrote:<br

              class="gmail_msg">

            >> ><br class="gmail_msg">

            >> ><br class="gmail_msg">

            >> >> On Mar 23, 2017, at 11:48 AM, Reid

            Kleckner via cfe-dev <<a moz-do-not-send="true"

              href="mailto:cfe-dev@lists.llvm.org" class="gmail_msg"

              target="_blank">cfe-dev@lists.llvm.org</a>> wrote:<br

              class="gmail_msg">

            >> >><br class="gmail_msg">

            >> >> On Thu, Mar 23, 2017 at 10:55 AM, David

            Blaikie <<a moz-do-not-send="true"

              href="mailto:dblaikie@gmail.com" class="gmail_msg"

              target="_blank">dblaikie@gmail.com</a>> wrote:<br

              class="gmail_msg">

            >> >> On Thu, Mar 23, 2017 at 10:45 AM Reid

            Kleckner <<a moz-do-not-send="true"

              href="mailto:rnk@google.com" class="gmail_msg"

              target="_blank">rnk@google.com</a>> wrote:<br

              class="gmail_msg">

            >> >> I don't think it will be feasible to

            generalize UBSan's knowledge to the static analyzer.<br

              class="gmail_msg">

            >> >><br class="gmail_msg">

            >> >> Why not? The rough idea I meant would be

            to express the constraints UBSan is checking into the static

            analyzer - I realize the current layering (UBSan being in

            Clang's IRGen) doesn't make that trivial/obvious, but it

            seems to me that the constraints could be shared in some way

            - with some work.<br class="gmail_msg">

            >> >><br class="gmail_msg">

            >> >> Maybe I am not imaginative enough, but I

            cannot envision a clean way to express the conditions that

            trigger UB that is useful for both static analysis and

            dynamic instrumentation. The best I can come up with is

            AST-level instrumentation: synthesizing AST nodes that can

            be translated to IR or used for analysis. That doesn't seem

            reasonable, so I think getting ubsan into the static

            analyzer would end up duplicating the knowledge of what

            conditions trigger UB.<br class="gmail_msg">

            >> >><br class="gmail_msg">

            >> >> The static analyzer CFG is also at best an

            approximation of the real CFG, especially for C++.<br

              class="gmail_msg">

            >> >><br class="gmail_msg">

            >> ><br class="gmail_msg">

            >> > First, the CFG is not only used by the

            analyzer, but it is also used by Sema (ex: unreachable code

            warnings and uninitialized variables).<br class="gmail_msg">

            >> > Second, while there are corners of C++ that

            are not supported, it has high fidelity otherwise.<br

              class="gmail_msg">

            >> ><br class="gmail_msg">

            >> >> Sure enough - and I believe some of the

            people working/caring about it would like to fix that. I

            think Manuel & Chandler have expressed the notion that

            the best way to do that would be to move to a world where

            the CFG is used for CodeGen, so it's a single/consistent

            source of truth.<br class="gmail_msg">

            >> >><br class="gmail_msg">

            >> >> Yes, we could do all that work, but LLVM's

            CFG today is already precise for C++. If we allow ourselves

            to emit diagnostics from the middle-end, we can save all

            that work.<br class="gmail_msg">

            >> >><br class="gmail_msg">

            >> >> Going down the high-effort path of

            extending the CFG and abstracting or duplicating UBSan’s

            checks as static analyses on that CFG would definitely

            provide a better diagnostic experience, but it's worth

            re-examining conventional wisdom and exploring alternatives

            first.<br class="gmail_msg">

            >> ><br class="gmail_msg">

            >> > The idea of analysis based on top of LLVM IR

            is not new and have been discussed before. My personal

            belief is that having access to the AST (or just code as was

            written by the user) is very important.<br class="gmail_msg">

            >><br class="gmail_msg">

            >> UBSan diagnostics provide a filename, line #, and

            column #. There is enough information here for a clang-based

            tool to retrieve an AST node, and generate an appropriate

            diagnostic. In the non-LTO case, we'd have an ASTContext

            ready to go at the point where we get the diagnostics back

            from LLVM. In the LTO case, we could serialize the

            diagnostics and pass them to a clang tool for processing.<br

              class="gmail_msg">

            >><br class="gmail_msg">

            >><br class="gmail_msg">

            >> > It ensures we can provide precise diagnostics.

            It also allows us to see when users want to suppress an

            issue report by changing the way the source code is uttered.

            For example, allows to tell the developer that they can

            suppress with a cast.<br class="gmail_msg">

            >><br class="gmail_msg">

            >> When users of a hypothetical "-Woptimizer" need to

            suppress diagnostics, they could use the no_sanitize

            function attribute, or sanitizer blacklists. These are big

            hammers: if it proves to be too clumsy, we could find a way

            to do more fine-grained sanitizer issue suppression.<br

              class="gmail_msg">

            >><br class="gmail_msg">

            >><br class="gmail_msg">

            >> > Ensuring that clang CFG completely supports

            C++ is a bit challenging but not insurmountable task. In

            addition, fixing-up the unsupported corners in the CFG would

            benefit not only the clang static analyzer, but all the

            other “users” such as clang warnings, clang-tidy, possibly

            even refactoring in the future.<br class="gmail_msg">

            >> ><br class="gmail_msg">

            >> > Here is a thread where this has been discussed

            before:<br class="gmail_msg">

            >> > <a moz-do-not-send="true"

href="http://clang-developers.42468.n3.nabble.com/LLVM-Dev-meeting-Slides-amp-Minutes-from-the-Static-Analyzer-BoF-td4048418.html"

              rel="noreferrer" class="gmail_msg" target="_blank">http://clang-developers.42468.n3.nabble.com/LLVM-Dev-meeting-Slides-amp-Minutes-from-the-Static-Analyzer-BoF-td4048418.html</a><br

              class="gmail_msg">

            >><br class="gmail_msg">

            >> I've tried to summarize this thread:<br

              class="gmail_msg">

            >><br class="gmail_msg">

            >> --- Slides & Minutes from the Static Analyzer

            BoF ---<br class="gmail_msg">

            >><br class="gmail_msg">

            >> * Cross Translation Unit Static Analysis: The

            proposed approach is to generate summaries for each TU, and

            to teach the analyses to emit diagnostics based on the

            summaries. I don't know the current status of this, or what

            using summaries does to false-positive rates.<br

              class="gmail_msg">

            >><br class="gmail_msg">

            >> * A Clang IR: Modeling things like temporaries is

            hard, so is dealing with language changes. One proposal is

            to create an alternate IR for C, then lower it to LLVM IR.

            The argument is that analyses would be easier to perform on

            the "C/C++" IR, because CodeGen/StaticAnalyzer would operate

            on the same structures. There are concerns raised about the

            complexity of this undertaking<br class="gmail_msg">

            >><br class="gmail_msg">

            >> * A Rich Type System in LLVM IR: An alternate

            proposal is to add a rich, language-agnostic type system to

            LLVM IR, with pointers back to frontend AST's. Under this

            proposal, I'm assuming LLVM would be responsible for

            generating diagnostics. There were concerns raised about

            having optimizations preserve the type information, about

            AST node lifetime issues, and about layering violations (i.e

            LLVM shouldn't be dealing with frontend stuff).<br

              class="gmail_msg">

            >><br class="gmail_msg">

            >> ---<br class="gmail_msg">

            >><br class="gmail_msg">

            >> ISTM that work on improving C++ support in the

            clang CFG can proceed in parallel to work on reporting

            sanitizer diagnostics from the middle-end. I'd like to know

            what others think about this.<br class="gmail_msg">

            >><br class="gmail_msg">

            >> Just to recap, the approach I have in mind is

            different from the options discussed in the thread Anna

            linked to. It's advantages are that;<br class="gmail_msg">

            >><br class="gmail_msg">

            >> * There is a clear path towards reporting issues

            that arise when LTO is enabled.<br class="gmail_msg">

            >><br class="gmail_msg">

            >> * We can design it s.t there aren't false

            positives, with the caveat that we will annoy users about

            bugs in dead code.<br class="gmail_msg">

            >><br class="gmail_msg">

            >> * It's extensible, in case we ever want to explore

            statically reporting diagnostics from other sanitizers.<br

              class="gmail_msg">

            >><br class="gmail_msg">

            >> * The classes of UB bugs we'd be able to diagnose

            statically would always be the same as the ones we diagnose

            at runtime, without writing much (or any) extra code.<br

              class="gmail_msg">

            >><br class="gmail_msg">

            >> * It's relatively simple to implement.<br

              class="gmail_msg">

            >><br class="gmail_msg">

            >> The main disadvantage are that;<br

              class="gmail_msg">

            >><br class="gmail_msg">

            >> * It constitutes a layering violation, because we'd

            need to teach LLVM about the names of some ubsan diagnostic

            handlers. IMO this isn't too onerous, we already do this

            with builtins.<br class="gmail_msg">

            >><br class="gmail_msg">

            >> * It's awkward to have llvm issue diagnostics,

            because this is usually done by Sema/clang-tidy/the static

            analyzer. I think this is something we could get used to.<br

              class="gmail_msg">

            ><br class="gmail_msg">

            > I'm not sure this last bulletpoint quite captures the

            difficulties of having diagnostics powered by optimizers.

            The concern is generally that this makes diagnostics

            unreliable and surprising to users because what would appear

            to be unrelated changes (especially to the degree of

            changing the optimization level/tweaks, or maybe changing

            architecture/target) cause diagnostics to appear/disappear.

            This is the sort of behavior that's been seen with some of

            GCC's diagnostics that are optimizer-powered, and something

            Clang's done a fair bit to avoid.<br class="gmail_msg">

            <br class="gmail_msg">

            That's fair, I should have examined that last point in more

            detail.<br class="gmail_msg">

            <br class="gmail_msg">

            I think we could minimize the confusion by making it clear

            that the warnings are dependent on optimizations. In part,

            this would mean writing good documentation. We'd also need

            to find the right language for the warnings. Something like:

            "warning: foo.cpp:N:M: At the -OX optimization level,

            undefined behavior due to <...> was detected

            [-Woptimizer]". We could teach clang to emit Fixits in some

            cases, or even notes which explain that the code could be

            removed by the optimizer.<br class="gmail_msg">

          </blockquote>

          <div><br>

            Fixits, notes, and other high precision source information

            (even in the case of a basic diagnostic - highlighting the

            expression range, etc) would be pretty challenging for a

            feature like this, btw. The source locations used by LLVM to

            report backend diagnostics (LLVM has some - for things like

            instruction selection/inline asm issues, etc - as well as

            the optimization report diagnostics) are from debug info,

            only line/col. Stitching that back to any part of the AST

            would be pretty ambiguous in most/many cases I'd imagine.</div>

        </div>

      </div>

    </blockquote>

    <br>

    The source locations clang uses for inline asm diagnostics are not

    from debug info; we serialize a clang::SourceLocation into the

    metadata.  See <a class="moz-txt-link-freetext" href="http://llvm.org/docs/LangRef.html#inline-asm-metadata">http://llvm.org/docs/LangRef.html#inline-asm-metadata</a>

    .  We could (and probably should) extend this to other diagnostics.<br>

    <br>

    Granted, that isn't really the problem here.  The static analyzer is

    very carefully designed to generate diagnostics which are useful; it

    shows the assumptions it makes, and a dynamic path through the

    function.  With diagnostics coming from the optimizer, we have no

    way to explain why a function has undefined behavior: we throw away

    the CFG in the process of optimizing the code.<br>

    <br>

    -Eli<br>

    <pre class="moz-signature" cols="72">-- 

Employee of Qualcomm Innovation Center, Inc.

Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project</pre>

  </body>

</html>