[cfe-dev] [libcxx] optimizing shared_ptr atomics in destructors
Craig, Ben via cfe-dev
cfe-dev at lists.llvm.org
Mon Jul 18 11:09:28 PDT 2016
I will put the code up for review, including the pair of benchmarks that
I have authored.
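As a rough sketch of the measurement (not the actual pair of benchmarks that
will go up with the review), the uncontended loop being timed is essentially a
create/destroy cycle of make_shared<int>, something along these lines:

#include <chrono>
#include <cstdio>
#include <memory>

int main() {
    constexpr long kIters = 10000000;
    const auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < kIters; ++i) {
        auto p = std::make_shared<int>(static_cast<int>(i));
        // p is destroyed here: one strong decrement, then __release_weak().
    }
    const auto stop = std::chrono::steady_clock::now();
    const auto ns =
        std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
    std::printf("%.1f ns per make_shared<int> create + destroy\n",
                static_cast<double>(ns) / kIters);
}

A real harness would also need to keep the optimizer from eliding the
allocation; this only shows the shape of the measurement.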
For armv7a, armv8, and aarch64, I used the following reduced .cpp code to
generate assembly. I haven't figured out the libc++ cross-compiling setup
yet, and I'm not well set up to run benchmarks on those platforms.
struct __shared_weak_count {
    long __shared_weak_owners_;
    void __release_weak() noexcept;
    void __on_zero_shared_weak() noexcept;
};

void
__shared_weak_count::__release_weak() noexcept {
    if (__atomic_load_n(&__shared_weak_owners_, 2 /*_AO_Acquire*/) == 0)
        __on_zero_shared_weak();
    else if (__atomic_add_fetch(&__shared_weak_owners_, -1, 4 /*_AO_Acq_Rel*/) == -1)
        __on_zero_shared_weak();
}
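For readability, here is the same logic spelled with std::atomic instead of
the raw builtins (2 is __ATOMIC_ACQUIRE and 4 is __ATOMIC_ACQ_REL). This is a
sketch of the semantics, not the libc++ source:

#include <atomic>

struct shared_weak_count_sketch {
    // Biased count: 0 means this is the last weak reference, which is what
    // the "== 0" fast path and the "== -1" post-decrement check above rely on.
    std::atomic<long> __shared_weak_owners_{0};

    void __on_zero_shared_weak() noexcept { /* would free the control block */ }

    void __release_weak() noexcept {
        // Fast path: we hold the last weak reference, so no other thread can
        // legally touch this object; skip the read-modify-write entirely.
        if (__shared_weak_owners_.load(std::memory_order_acquire) == 0)
            __on_zero_shared_weak();
        // Slow path: fetch_sub returns the previous value, so "previous == 0"
        // is the same condition as "__atomic_add_fetch(..., -1, ...) == -1".
        else if (__shared_weak_owners_.fetch_sub(1, std::memory_order_acq_rel) == 0)
            __on_zero_shared_weak();
    }
};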
ARMv7a assembly, with notes:
_ZN19__shared_weak_count14__release_weakEv:
        .fnstart
@ BB#0:                                 @ %entry
        ldr     r1, [r0]                @bcraig note: it would be nice to combine this load with the ldrex
        dmb     ish                     @bcraig note: ... and this barrier with the one in BB#2
        cmp     r1, #0
        it      eq                      @bcraig note: unsure of why this is here...
        beq     _ZN19__shared_weak_count21__on_zero_shared_weakEv
        dmb     ish
.LBB0_1:                                @ %atomicrmw.start
                                        @ =>This Inner Loop Header: Depth=1
        ldrex   r1, [r0]
        subs    r2, r1, #1
        strex   r3, r2, [r0]
        cmp     r3, #0
        bne     .LBB0_1
@ BB#2:                                 @ %atomicrmw.end
        cmp     r1, #1
        dmb     ish
        it      ne
        bxne    lr
        b       _ZN19__shared_weak_count21__on_zero_shared_weakEv
AArch64 assembly, with notes:
_ZN19__shared_weak_count14__release_weakEv:
        .fnstart
@ BB#0:                                 @ %entry
        lda     r1, [r0]                @bcraig note: it would be nice to combine this with the ldaex
        cbz     r1, .LBB0_3
.LBB0_1:                                @ %atomicrmw.start
                                        @ =>This Inner Loop Header: Depth=1
        ldaex   r1, [r0]
        subs    r2, r1, #1
        stlex   r3, r2, [r0]
        cmp     r3, #0
        bne     .LBB0_1
@ BB#2:                                 @ %atomicrmw.end
        cmp     r1, #1
        bne     .LBB0_4
.LBB0_3:                                @ %if.then5
        b       _ZN19__shared_weak_count21__on_zero_shared_weakEv
.LBB0_4:                                @ %if.end6
        bx      lr
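To spell out the folding mentioned in the assembly notes above (and in the
quoted discussion below) at the source level: the hope is that the acquire
load and the decrement collapse into one LL/SC-style loop, with the zero check
as the only extra work inside it. A sketch with std::atomic (not a patch; note
that in the raced case this version also skips the final store, unlike the
diff below):

#include <atomic>

void release_weak_folded(std::atomic<long>& owners,
                         void (*on_zero_shared_weak)()) noexcept {
    long old = owners.load(std::memory_order_relaxed);
    while (old != 0 &&
           !owners.compare_exchange_weak(old, old - 1,
                                         std::memory_order_acq_rel,
                                         std::memory_order_relaxed)) {
        // compare_exchange_weak refreshes `old` on failure; retry until we
        // either observe 0 or successfully publish the decrement.
    }
    if (old == 0) {
        // Last weak reference observed: no store was performed, only the check.
        std::atomic_thread_fence(std::memory_order_acquire);
        on_zero_shared_weak();
    }
}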
On 7/18/2016 12:29 PM, JF Bastien wrote:
> On Mon, Jul 18, 2016 at 8:31 AM, Craig, Ben <ben.craig at codeaurora.org> wrote:
>
> Currently, when the last shared_ptr to an object is destroyed,
> libc++ performs two atomic decrements, one for the "strong" shared
> count, and one for the "weak" shared count. I think we can do
> better than this in the uncontended case, but I would like some
> feedback for this optimization, particularly on the ARM side.
>
> Here's the code change...
> diff --git a/src/memory.cpp b/src/memory.cpp
> index 08f2259..b459eb1 100644
> --- a/src/memory.cpp
> +++ b/src/memory.cpp
> @@ -30,12 +30,12 @@ increment(T& t) _NOEXCEPT
> return __libcpp_atomic_add(&t, 1, _AO_Relaxed);
> }
>
> template <class T>
> inline T
> decrement(T& t) _NOEXCEPT
> {
> return __libcpp_atomic_add(&t, -1, _AO_Acq_Rel);
> }
>
> } // namespace
> @@ -96,7 +96,9 @@ __shared_weak_count::__release_shared() _NOEXCEPT
> void
> __shared_weak_count::__release_weak() _NOEXCEPT
> {
> - if (decrement(__shared_weak_owners_) == -1)
> + if (__libcpp_atomic_load(&__shared_weak_owners_, _AO_Acquire) == 0)
> + __on_zero_shared_weak();
> + else if (decrement(__shared_weak_owners_) == -1)
> __on_zero_shared_weak();
> }
>
> The general idea is that if the current thread is destroying the
> last weak reference, then no other thread can legally be accessing
> this object. Given that, we can avoid an expensive atomic store.
> On x86_64, a quick-and-dirty benchmark is showing an 8%
> improvement in performance for the combination of make_shared<int>
> and the accompanying destruction. I don't have performance
> numbers for other architectures at this point. That 8% is pretty
> promising though, as the atomic operation improvements are showing
> through, despite being measured along with a heap allocation and
> deallocation.
>
>
> Do you have a repo with this benchmark?
>
>
> Note that this optimization wouldn't be safe for the strong count,
> as the last strong count decrement can still contend with a
> weak_ptr::lock() call.
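For context, weak_ptr::lock() is essentially an increment-if-not-zero loop on
the strong count, so a concurrent lock() can always race with the final strong
decrement. A rough sketch of that loop, assuming the owners-minus-one
convention rather than the exact libc++ code:

#include <atomic>

// Returns true if a new strong reference was acquired.
bool lock_strong_sketch(std::atomic<long>& shared_owners) noexcept {
    long old = shared_owners.load(std::memory_order_relaxed);
    do {
        if (old == -1)      // -1: no strong owners remain, lock() must fail
            return false;
    } while (!shared_owners.compare_exchange_weak(old, old + 1,
                                                  std::memory_order_acq_rel,
                                                  std::memory_order_relaxed));
    return true;
}

A thread in this loop can bump the count back up between a plain load that saw
the "last" owner and a skipped decrement, which is why the strong-count release
has to stay a real atomic read-modify-write.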
>
> This comes at the cost of adding an extra load acquire for all but
> the last decrement (and sometimes even the last decrement). On
> x86, this is really cheap (just a regular mov). Currently with
> aarch64 and 32-bit armv8, you get an extra lda, and with armv7 you
> get extra barriers.
>
> I would hope / expect that on LL/SC architectures, the first
> acquire load could be folded with the locked load in the atomic
> add. The check and branch (inside the ll / sc loop) would then be
> the only overhead. Is this a reasonable optimization to hope for
> in the future on the compiler front?
>
>
> What do you mean exactly, could you provide assembly? I think I
> understand (sounds clever & doable), but assembly is easier :-)
> That can be benchmarked as well.
>
>
> Also, I'm being a bit conservative here by making my atomic load
> an acquire operation. It might be safe to make the operation
> relaxed, but that seems risky to me, as __on_zero_shared_weak may
> end up touching unsynchronized data in those cases.
>
>
> I haven't thought enough about shared_ptr to convince myself either
> way. Would be good to benchmark to see if it's even worth proving.
--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project