[cfe-dev] [libcxx] optimizing shared_ptr atomics in destructors
Craig, Ben via cfe-dev
cfe-dev at lists.llvm.org
Mon Jul 18 11:09:28 PDT 2016
I will put the code up for review, including the pair of benchmarks that
I have authored.
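As a rough sketch of the measurement (not the actual pair of benchmarks that
will go up with the review), the uncontended loop being timed is essentially a
create/destroy cycle of make_shared<int>, something along these lines:

#include <chrono>
#include <cstdio>
#include <memory>

int main() {
    constexpr long kIters = 10000000;
    const auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < kIters; ++i) {
        auto p = std::make_shared<int>(static_cast<int>(i));
        // p is destroyed here: one strong decrement, then __release_weak().
    }
    const auto stop = std::chrono::steady_clock::now();
    const auto ns =
        std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
    std::printf("%.1f ns per make_shared<int> create + destroy\n",
                static_cast<double>(ns) / kIters);
}

A real harness would also need to keep the optimizer from eliding the
allocation; this only shows the shape of the measurement.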
For armv7a, armv8, and aarch64, I used the following reduced .cpp code to
generate assembly. I haven't figured out the libc++ cross-compiling setup
yet, and I'm not well set up to run benchmarks on those platforms.
struct __shared_weak_count {
    long __shared_weak_owners_;
    void __release_weak() noexcept;
    void __on_zero_shared_weak() noexcept;
};

void
__shared_weak_count::__release_weak() noexcept {
    if (__atomic_load_n(&__shared_weak_owners_, 2 /*_AO_Acquire*/) == 0)
        __on_zero_shared_weak();
    else if (__atomic_add_fetch(&__shared_weak_owners_, -1, 4 /*_AO_Acq_Rel*/) == -1)
        __on_zero_shared_weak();
}
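For readability, here is the same logic spelled with std::atomic instead of
the raw builtins (2 is __ATOMIC_ACQUIRE and 4 is __ATOMIC_ACQ_REL). This is a
sketch of the semantics, not the libc++ source:

#include <atomic>

struct shared_weak_count_sketch {
    // Biased count: 0 means this is the last weak reference, which is what
    // the "== 0" fast path and the "== -1" post-decrement check above rely on.
    std::atomic<long> __shared_weak_owners_{0};

    void __on_zero_shared_weak() noexcept { /* would free the control block */ }

    void __release_weak() noexcept {
        // Fast path: we hold the last weak reference, so no other thread can
        // legally touch this object; skip the read-modify-write entirely.
        if (__shared_weak_owners_.load(std::memory_order_acquire) == 0)
            __on_zero_shared_weak();
        // Slow path: fetch_sub returns the previous value, so "previous == 0"
        // is the same condition as "__atomic_add_fetch(..., -1, ...) == -1".
        else if (__shared_weak_owners_.fetch_sub(1, std::memory_order_acq_rel) == 0)
            __on_zero_shared_weak();
    }
};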
ARMv7a assembly, with notes:
_ZN19__shared_weak_count14__release_weakEv:
        .fnstart
@ BB#0:                                 @ %entry
        ldr     r1, [r0]                @bcraig note: it would be nice to combine this load with the ldrex
        dmb     ish                     @bcraig note: ... and this barrier with the one in BB#2
        cmp     r1, #0
        it      eq                      @bcraig note: unsure of why this is here...
        beq     _ZN19__shared_weak_count21__on_zero_shared_weakEv
        dmb     ish
.LBB0_1:                                @ %atomicrmw.start
                                        @ =>This Inner Loop Header: Depth=1
        ldrex   r1, [r0]
        subs    r2, r1, #1
        strex   r3, r2, [r0]
        cmp     r3, #0
        bne     .LBB0_1
@ BB#2:                                 @ %atomicrmw.end
        cmp     r1, #1
        dmb     ish
        it      ne
        bxne    lr
        b       _ZN19__shared_weak_count21__on_zero_shared_weakEv
AArch64 assembly, with notes:
_ZN19__shared_weak_count14__release_weakEv:
        .fnstart
@ BB#0:                                 @ %entry
        lda     r1, [r0]                @bcraig note: it would be nice to combine this with the ldaex
        cbz     r1, .LBB0_3
.LBB0_1:                                @ %atomicrmw.start
                                        @ =>This Inner Loop Header: Depth=1
        ldaex   r1, [r0]
        subs    r2, r1, #1
        stlex   r3, r2, [r0]
        cmp     r3, #0
        bne     .LBB0_1
@ BB#2:                                 @ %atomicrmw.end
        cmp     r1, #1
        bne     .LBB0_4
.LBB0_3:                                @ %if.then5
        b       _ZN19__shared_weak_count21__on_zero_shared_weakEv
.LBB0_4:                                @ %if.end6
        bx      lr
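To spell out the folding mentioned in the assembly notes above (and in the
quoted discussion below) at the source level: the hope is that the acquire
load and the decrement collapse into one LL/SC-style loop, with the zero check
as the only extra work inside it. A sketch with std::atomic (not a patch; note
that in the raced case this version also skips the final store, unlike the
diff below):

#include <atomic>

void release_weak_folded(std::atomic<long>& owners,
                         void (*on_zero_shared_weak)()) noexcept {
    long old = owners.load(std::memory_order_relaxed);
    while (old != 0 &&
           !owners.compare_exchange_weak(old, old - 1,
                                         std::memory_order_acq_rel,
                                         std::memory_order_relaxed)) {
        // compare_exchange_weak refreshes `old` on failure; retry until we
        // either observe 0 or successfully publish the decrement.
    }
    if (old == 0) {
        // Last weak reference observed: no store was performed, only the check.
        std::atomic_thread_fence(std::memory_order_acquire);
        on_zero_shared_weak();
    }
}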
On 7/18/2016 12:29 PM, JF Bastien wrote:
> On Mon, Jul 18, 2016 at 8:31 AM, Craig, Ben <ben.craig at codeaurora.org> wrote:
>
> Currently, when the last shared_ptr to an object is destroyed,
> libc++ performs two atomic decrements, one for the "strong" shared
> count, and one for the "weak" shared count. I think we can do
> better than this in the uncontended case, but I would like some
> feedback for this optimization, particularly on the ARM side.
>
> Here's the code change...
> diff --git a/src/memory.cpp b/src/memory.cpp
> index 08f2259..b459eb1 100644
> --- a/src/memory.cpp
> +++ b/src/memory.cpp
> @@ -30,12 +30,12 @@ increment(T& t) _NOEXCEPT
> return __libcpp_atomic_add(&t, 1, _AO_Relaxed);
> }
>
> template <class T>
> inline T
> decrement(T& t) _NOEXCEPT
> {
> return __libcpp_atomic_add(&t, -1, _AO_Acq_Rel);
> }
>
> } // namespace
> @@ -96,7 +96,9 @@ __shared_weak_count::__release_shared() _NOEXCEPT
> void
> __shared_weak_count::__release_weak() _NOEXCEPT
> {
> - if (decrement(__shared_weak_owners_) == -1)
> + if (__libcpp_atomic_load(&__shared_weak_owners_, _AO_Acquire) == 0)
> + __on_zero_shared_weak();
> + else if (decrement(__shared_weak_owners_) == -1)
> __on_zero_shared_weak();
> }
>
> The general idea is that if the current thread is destroying the
> last weak reference, then no other thread can legally be accessing
> this object. Given that, we can avoid an expensive atomic store.
> On x86_64, a quick-and-dirty benchmark is showing an 8%
> improvement in performance for the combination of make_shared<int>
> and the accompanying destruction. I don't have performance
> numbers for other architectures at this point. That 8% is pretty
> promising though, as the atomic operation improvements are showing
> through, despite being measured along with a heap allocation and
> deallocation.
>
>
> Do you have a repo with this benchmark?
>
>
> Note that this optimization wouldn't be safe for the strong count,
> as the last strong count decrement can still contend with a
> weak_ptr::lock() call.
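For context, weak_ptr::lock() is essentially an increment-if-not-zero loop on
the strong count, so a concurrent lock() can always race with the final strong
decrement. A rough sketch of that loop, assuming the owners-minus-one
convention rather than the exact libc++ code:

#include <atomic>

// Returns true if a new strong reference was acquired.
bool lock_strong_sketch(std::atomic<long>& shared_owners) noexcept {
    long old = shared_owners.load(std::memory_order_relaxed);
    do {
        if (old == -1)      // -1: no strong owners remain, lock() must fail
            return false;
    } while (!shared_owners.compare_exchange_weak(old, old + 1,
                                                  std::memory_order_acq_rel,
                                                  std::memory_order_relaxed));
    return true;
}

A thread in this loop can bump the count back up between a plain load that saw
the "last" owner and a skipped decrement, which is why the strong-count release
has to stay a real atomic read-modify-write.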
>
> This comes at the cost of adding an extra load acquire for all but
> the last decrement (and sometimes even the last decrement). On
> x86, this is really cheap (just a regular mov). Currently with
> aarch64 and 32-bit armv8, you get an extra lda, and with armv7 you
> get extra barriers.
>
> I would hope / expect that on LL/SC architectures, the first
> acquire load could be folded with the locked load in the atomic
> add. The check and branch (inside the ll / sc loop) would then be
> the only overhead. Is this a reasonable optimization to hope for
> in the future on the compiler front?
>
>
> What do you mean exactly, could you provide assembly? I think I
> understand (sounds clever & doable), but assembly is easier :-)
> That can be benchmarked as well.
>
>
> Also, I'm being a bit conservative here by making my atomic load
> an acquire operation. It might be safe to make the operation
> relaxed, but that seems risky to me, as __on_zero_shared_weak may
> end up touching unsynchronized data in those cases.
>
>
> I haven't thought enough about shared_ptr to convince myself either
> way. Would be good to benchmark to see if it's even worth proving.
--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project