[cfe-dev] [analyzer] C++ and inlining.

Sun Dec 3 20:52:29 PST 2017

Thanks!

I had the "this" pointer leaking to D::D in mind but thought that the
offsetof trick would involve UB at some point. It seems that as long
there's is an actual boolean where D::D looks for it this is probably
fine, even though I'm having a hard time with proving it from the
standard wording.

Alexey

On Mon, Dec 4, 2017 at 1:16 AM, Artem Dergachev <noqnoqneo at gmail.com> wrote:
> In this particular example this happens because we may potentially have the
> constructor of D defined as
>
> D::D() {
>   bool *haveXptr = ((void *)this) - (offsetof(C, d) - offsetof(C, x));
>   *haveXptr = !*haveXptr;
> }
>
> which is valid code that changes haveX between initialization and
> assignment.
>
> However, an external function is unlikely to contain the 'this' pointer
> (pointer to variable 'c'). At least in your code generation example, where
> 'c' is a local variable and the pointer to it never escapes to the global
> scope.
>
> Generally, in the analyzer when we don't know if something can happen, we
> often prefer to assume the worst, much like in codegen. In this particular
> example i'd be for toning down the invalidation (as in
> http://lists.llvm.org/pipermail/cfe-dev/2017-November/056103.html), but
> there may be more reasons for invalidating stuff, some of which sound much
> more convincing.
>
>
> On 12/3/17 12:56 PM, Alexey Salmin wrote:
>>
>> Speaking of the code sample from the first e-mail: does anyone has an
>> explanation why it is the non-inlined constructor being a special
>> case? If you replace with an arbitrary non-inlined function call
>> between the haveX initialization and the "if" condition the analysis
>> seem to work fine. Compare assembly in [1] with the "D d;" line
>> commented out and active.
>>
>> [1] https://godbolt.org/g/jmSKT9
>>
>> Alexey
>>
>> On Tue, Nov 28, 2017 at 6:19 PM, Artem Dergachev via cfe-dev
>> <cfe-dev at lists.llvm.org> wrote:
>>>
>>>
>>> On 11/26/17 3:17 PM, Gábor Horváth wrote:
>>>>
>>>> Hi Artem!
>>>>
>>>>
>>>> On 26 November 2017 at 12:10, Artem Dergachev via cfe-dev
>>>> <cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>> wrote:
>>>>
>>>>      I've observed a lot of false positives on WebKit that resulted
>>>>      from our inability to inline relatively simple functions in C++,
>>>>      specifically constructors and operator new(), due to unimplemented
>>>>      modeling. So i wanted to document, and eventually fix, a variety
>>>>      of language feature modeling we're currently lacking. Note that
>>>>      WebKit, like LLVM, is not a typical C++ codebase; it uses many
>>>>      language constructs that regular projects don't use, and uses very
>>>>      little standard library headers.
>>>>
>>>>
>>>> Great news!
>>>>
>>>>
>>>>      ---
>>>>
>>>>      Normally we expect the analyzer to work even without any inlining,
>>>>      by conservatively invalidating the program state, which eliminates
>>>>      the clearly false assumptions but sometimes causes infeasible
>>>>      paths when the invalidation was too aggressive and we start
>>>>      denoting the same value with a different symbol after invalidation
>>>>      and assume contradictory stuff on the old and the new symbol. The
>>>>      false positives resulting from aggressive invalidation are usually
>>>>      treated as less scary than the ones suppressed by invalidation,
>>>>      because they can be easily suppressed (with assertions or const
>>>>      qualifiers etc.) without loss of coverage. However, in C++ there
>>>>      are a lot of implicit function calls which cause massive
>>>>      frustration when evaluated conservatively. For example,
>>>>
>>>>        class C {
>>>>          bool haveX;
>>>>          class D d;
>>>>
>>>>        public:
>>>>          C(int *x): haveX(x != 0) {
>>>>            if (haveX)
>>>>              *x = 1; // null dereference??
>>>>          }
>>>>        };
>>>>
>>>>      In this case, 'haveX' should assume that x is non-null. The code
>>>>      eagerly splits states into {x == 0} and {x != 0}, which seems
>>>>      reasonable. However, after 'haveX' have been initialized, the
>>>>      default constructor for field 'd' kicks in. If only this
>>>>      constructor is not inlined, or any of its callees are not inlined,
>>>>      value stored in 'haveX' would be invalidated on both paths. In
>>>>      particular, the path on which the original 'haveX' is false but
>>>>      the invalidated 'haveX' is true have now opened up.
>>>>
>>>>
>>>> I observed the very same pattern on Ericsson internal codebases, so
>>>> sorting this out is definitely a huge win for many projects (not just
>>>> LLVM
>>>> and WebKit).
>>>>
>>>>
>>>>      ---
>>>>
>>>>      Inlining of the constructor itself is disabled in many cases for
>>>>      many reasons. In particular, we are currently only trying to
>>>>      inline the constructor when the target this-region is a DeclRegion
>>>>      (variable, member variable, Objective-C++ instance var), and the
>>>>      destructor is non-trivial. This cuts away a lot of cases:
>>>>
>>>>
>>>> It is not clear for my why the non-trivial destructor is a requirement
>>>> here. Do yo have any info on that?
>>>>
>>>>
>>>>      * Constructing into temporaries is disabled when destructor is
>>>>      non-trivial. At least, we should be able to inline those when the
>>>>      destructor is present at all, so that it would be evaluated
>>>>      conservatively. One thing to note here is that our CFG has been
>>>>      recently fixed, so we're much closer to fixing this properly
>>>>      nowadays. However, CFG alone is not enough to figure out which
>>>>      destructor needs to be called; for instance, if a
>>>>      lifetime-extended temporary is initialized with an operator?:
>>>>      expression, we'd need path-sensitive information to figure out
>>>>      which object to destroy.
>>>>
>>>>      * Temporaries are also special because our pass-by-value is not
>>>>      always working correctly. In particular, when we inline 'foo(c)',
>>>>      where variable 'c' is of 'class C', we first copy 'c' into a
>>>>      temporary region, and then trivial-copy it into the stack
>>>>      variable-region of the function parameter, while we should be
>>>>      constructing directly into the parameter region. We cannot
>>>>      construct directly into the parameter region because the stack
>>>>      frame has not yet been constructed, because arguments are not yet
>>>>      computed. More reading on the subject, even if a bit outdated, is
>>>>      at
>>>>
>>>>
>>>> http://clang-developers.42468.n3.nabble.com/Inlining-temporary-destructors-in-the-analyzer-td4040851.html
>>>>
>>>>
>>>> <http://clang-developers.42468.n3.nabble.com/Inlining-temporary-destructors-in-the-analyzer-td4040851.html>
>>>>      This is a hard problem, but i wasn't noticing many instances of it
>>>>      yet.
>>>>
>>>>
>>>> I am wondering what the analyzer does with copy elision? Do we model
>>>> that
>>>> somehow?
>>>>
>>>>
>>>>      * Base object region constructors are disabled when destructor is
>>>>      non-trivial. This sounds like an accidental omission.
>>>>
>>>>
>>>> Nice catch.
>>>
>>>
>>> Actually not so much. This bailout only fires for CK_Complete constructor
>>> kinds, and base-object constructors aren't "complete" in this sense, so
>>> they
>>> work fine.
>>>
>>> This might still affect a complete constructor that follows placement-new
>>> into a base-object region (i'd be shocked to see code that actually does
>>> it).
>>>
>>>>      * Construction into ElementRegion is disabled regardless of
>>>>      destructors. In particular, mass array constructions are disabled.
>>>>      There is a special AST for this case, which emulates the loop
>>>>      through the array (or return value of operator new[]) with a loop
>>>>      counter variable and all. We have no support for this whole
>>>>      language construct. Note, however, that ElementRegion is much more
>>>>      than array element; it is often used for representing casts, and
>>>>      in particular it appears in return values of a conservatively
>>>>      evaluated operator new() (i.e. element{SymRegion}) and is likely
>>>>      to appear in placement-new() return values for the same reason. So
>>>>      we should discriminate between these two cases.
>>>>
>>>>
>>>> Do you know why casts are represented this way? Is this something that
>>>> we
>>>> might want to change in the future? I think one of the reasons why
>>>> construction into ElementRegion is disabled because arrays are populated
>>>> lazily by the analyzer. And this does not work well with code that
>>>> relies on
>>>> the fact that the number of constructor/destructor invocations is the
>>>> same
>>>> as the number of elements in the array.
>>>>
>>>>
>>>>      * Constructing into return value of operator new() is disabled
>>>>      completely anyway, because there's a modeling problem that causes
>>>>      us to be unable to connect the constructor with the correct
>>>>      this-region. The CFG part of this problem was fixed by Jordan in
>>>>      2014 by adding the CFGNewAllocator element, so we now actually
>>>>      call the operator and the constructor in the correct order, but we
>>>>      still need to pass the operator new's return value to the
>>>>      constructor. Note how pointless it is to enable it, or even inline
>>>>      a custom operator new, until construction into ElementRegion is
>>>> fixed.
>>>>
>>>>      ---
>>>>
>>>>      Speaking of inlining operator new():
>>>>
>>>>      * For correct modeling in the core, it seems that the only thing
>>>>      that remains is to actually use the return value of operator new()
>>>>      and construct the new object *into* it. Sounds easy.
>>>>
>>>>      * I’m also immediately noticing the unimplemented path diagnostics
>>>>      within the inlined operator new(), which crash with
>>>>      llvm_unreachable(“not yet implemented”) whenever a visitor fires
>>>>      within it (eg. to indicate that memory allocated there leaks).
>>>>
>>>>      * MallocChecker keeps being surprised with various non-symbolic
>>>>      stuff that may come out from operator new().
>>>>      https://reviews.llvm.org/D37189 <https://reviews.llvm.org/D37189>
>>>>      doesn’t cover all cases; in particular, void *operator new(size_t)
>>>>      { static int x; return &x; } seems to instantly crash after
>>>> inlining.
>>>>
>>>>      ---
>>>>
>>>>      From now on i'd like to experiment with, first of all, disabling
>>>>      the DeclRegion bailout when possible. Then i'd try to inline
>>>>      operator new, pass its return value to the respective constructor,
>>>>      and inline said constructor. I'm not sure if i'd be able to dig
>>>>      into the temporaries and pass-by-copy problems soon.
>>>>
>>>>      And i believe that when it comes to C++ pure language constructs,
>>>>      i listed most of the problems i'm aware of, with the exception of,
>>>>      well, *exceptions* (which also need CFG work). Of course we could
>>>>      do better modeling the standard library as well.
>>>>
>>>>
>>>> Thank you for collecting these problems!
>>>>
>>>>      _______________________________________________
>>>>      cfe-dev mailing list
>>>>      cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>
>>>>      http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>>>      <http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev>
>>>>
>>>>
>>> _______________________________________________
>>> cfe-dev mailing list
>>> cfe-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>
>