[cfe-dev] [libcxx] optimizing shared_ptr atomics in destructors
Craig, Ben via cfe-dev
cfe-dev at lists.llvm.org
Tue Jul 19 08:12:40 PDT 2016
Relevant review:
https://reviews.llvm.org/D22470
I just updated it with an extra benchmark. The performance of weak_ptr
decrements on x86 did get significantly worse, but I think that's a fair
trade-off. For every weak_ptr in the wild, there are likely ten
shared_ptrs that should be unique_ptrs that will benefit from this change.
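[Editor's note: the benchmark itself isn't reproduced in the thread; a minimal sketch of the kind of loop that would exercise this path (function name and iteration count are my own, not taken from D22470):]

```cpp
#include <chrono>
#include <memory>

// Time the make_shared<int> / destroy round trip the patch targets.
// Destroying the last shared_ptr runs the strong decrement, the weak
// decrement (the one the patch turns into a load), and the deallocation.
double ns_per_iteration(int iters) {
    using clock = std::chrono::steady_clock;
    auto start = clock::now();
    for (int i = 0; i < iters; ++i) {
        std::shared_ptr<int> p = std::make_shared<int>(i);
        (void)p; // p is destroyed at the end of each loop body
    }
    auto stop = clock::now();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
    return static_cast<double>(ns.count()) / iters;
}
```

Such a loop measures the heap allocation and deallocation together with the atomics, which is why an 8% overall improvement is notable.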
On 7/18/2016 12:29 PM, JF Bastien wrote:
> On Mon, Jul 18, 2016 at 8:31 AM, Craig, Ben <ben.craig at codeaurora.org
> <mailto:ben.craig at codeaurora.org>> wrote:
>
> Currently, when the last shared_ptr to an object is destroyed,
> libc++ performs two atomic decrements, one for the "strong" shared
> count, and one for the "weak" shared count. I think we can do
> better than this in the uncontended case, but I would like some
> feedback for this optimization, particularly on the ARM side.
>
> Here's the code change...
> diff --git a/src/memory.cpp b/src/memory.cpp
> index 08f2259..b459eb1 100644
> --- a/src/memory.cpp
> +++ b/src/memory.cpp
> @@ -30,12 +30,12 @@ increment(T& t) _NOEXCEPT
>      return __libcpp_atomic_add(&t, 1, _AO_Relaxed);
>  }
>
>  template <class T>
>  inline T
>  decrement(T& t) _NOEXCEPT
>  {
>      return __libcpp_atomic_add(&t, -1, _AO_Acq_Rel);
>  }
>
>  } // namespace
> @@ -96,7 +96,9 @@ __shared_weak_count::__release_shared() _NOEXCEPT
>  void
>  __shared_weak_count::__release_weak() _NOEXCEPT
>  {
> -    if (decrement(__shared_weak_owners_) == -1)
> +    if (__libcpp_atomic_load(&__shared_weak_owners_, _AO_Acquire) == 0)
> +        __on_zero_shared_weak();
> +    else if (decrement(__shared_weak_owners_) == -1)
>          __on_zero_shared_weak();
>  }
>
>     The general idea is that if the current thread is destroying the
>     last weak reference, then no other thread can legally be accessing
>     this object. Given that, we can avoid an expensive atomic
>     read-modify-write and use a plain load instead.
> On x86_64, a quick-and-dirty benchmark is showing an 8%
> improvement in performance for the combination of make_shared<int>
> and the accompanying destruction. I don't have performance
> numbers for other architectures at this point. That 8% is pretty
> promising though, as the atomic operation improvements are showing
> through, despite being measured along with a heap allocation and
> deallocation.
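[Editor's note: the shape of the change can be modeled outside libc++'s internals with std::atomic. A simplified sketch, using libc++'s convention that a stored count of 0 means one outstanding reference, so the decrement that yields -1 released the last one:]

```cpp
#include <atomic>

// Simplified model of __shared_weak_count's weak-count release;
// not the actual libc++ class, just the counting logic.
struct shared_weak_model {
    std::atomic<long> shared_weak_owners{0}; // 0 == one reference outstanding
    bool destroyed = false;

    void on_zero_shared_weak() { destroyed = true; }

    // Patched path: if this thread holds the only remaining reference, an
    // acquire load suffices; no other thread can touch the count anymore.
    void release_weak() {
        if (shared_weak_owners.load(std::memory_order_acquire) == 0)
            on_zero_shared_weak();
        else if (shared_weak_owners.fetch_add(-1, std::memory_order_acq_rel) - 1 == -1)
            on_zero_shared_weak();
    }
};
```

(fetch_add returns the old value, so `- 1` reconstructs the new value that libc++'s `decrement` compares against -1.)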
>
>
> Do you have a repo with this benchmark?
>
>
> Note that this optimization wouldn't be safe for the strong count,
> as the last strong count decrement can still contend with a
> weak_ptr::lock() call.
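[Editor's note: the hazard is that weak_ptr::lock() revives the strong count with a compare-exchange, so another thread can be operating on the counter right up to the point where a plain load would observe "last reference". A simplified model of that lock loop (my own sketch; libc++'s convention of -1 meaning no owners left):]

```cpp
#include <atomic>

std::atomic<long> shared_owners{0}; // 0 == one strong owner outstanding

// weak_ptr::lock(), modeled: try to bump the strong count unless the
// object is already gone (-1). Because this CAS can run concurrently with
// the final strong decrement, the "load first, skip the RMW" shortcut is
// unsafe for the strong count: a lock() racing between the load and the
// destruction could hand out a shared_ptr to a destroyed object.
bool try_lock() {
    long owners = shared_owners.load(std::memory_order_relaxed);
    while (owners != -1) {
        if (shared_owners.compare_exchange_weak(owners, owners + 1,
                                                std::memory_order_acq_rel,
                                                std::memory_order_relaxed))
            return true; // acquired a new strong reference
    }
    return false; // object already destroyed
}
```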
>
>     This comes at the cost of adding an extra load acquire for all but
>     the last decrement (and sometimes even the last decrement). On
>     x86, this is really cheap (just a regular mov). Currently on
>     aarch64 you get an extra ldar (lda on 32-bit armv8), and with armv7
>     you get extra barriers.
>
> I would hope / expect that on LL/SC architectures, the first
> acquire load could be folded with the locked load in the atomic
> add. The check and branch (inside the ll / sc loop) would then be
> the only overhead. Is this a reasonable optimization to hope for
> in the future on the compiler front?
>
>
> What do you mean exactly, could you provide assembly? I think I
> understand (sounds clever & doable), but assembly is easier :-)
> That can be benchmarked as well.
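[Editor's note: the hoped-for folding can at least be expressed at the C++ level. An illustrative sketch only; whether an LL/SC target actually reuses the acquire load as the load-linked (e.g. ldaxr on aarch64) and keeps the zero test as a branch inside the loop is up to the compiler:]

```cpp
#include <atomic>

// Returns true when the caller released the last reference and should run
// __on_zero_shared_weak(). The initial acquire load doubles as the first
// iteration's load; compare_exchange_weak plays the store-conditional and
// reloads `v` on failure, mirroring an ll/sc retry loop.
bool release_weak_folded(std::atomic<long>& owners) {
    long v = owners.load(std::memory_order_acquire);
    while (v != 0) {
        if (owners.compare_exchange_weak(v, v - 1, std::memory_order_acq_rel,
                                         std::memory_order_acquire))
            return false; // decrement committed; not the last reference
    }
    return true; // v == 0: this was the only reference left
}
```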
>
>
> Also, I'm being a bit conservative here by making my atomic load
> an acquire operation. It might be safe to make the operation
> relaxed, but that seems risky to me, as __on_zero_shared_weak may
> end up touching unsynchronized data in those cases.
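[Editor's note: the concern is that the thread running __on_zero_shared_weak() must observe every write other threads made before their decrements; the decrement's release half pairs with this load's acquire half. A minimal illustration of that pairing, with names of my own invention:]

```cpp
#include <atomic>
#include <thread>

int payload = 0;             // stands in for data the control block guards
std::atomic<long> owners{1}; // two weak references outstanding (libc++ style)

void releasing_thread() {
    payload = 42;                                    // write before releasing
    owners.fetch_add(-1, std::memory_order_acq_rel); // release half publishes it
}

int observed = -1;

void destroying_thread() {
    // Spin until the acquire load sees we hold the only reference left.
    // Acquire pairs with the releasing thread's release, so its earlier
    // write to payload is guaranteed visible here.
    while (owners.load(std::memory_order_acquire) != 0) { }
    observed = payload; // safe: happens-after the other thread's write
}
```

With a relaxed load instead, nothing would order `payload` before `observed = payload`, which is the unsynchronized-data risk Ben describes.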
>
>
> I haven't thought enough about shared_ptr to convince myself either
> way. Would be good to benchmark to see if it's even worth proving.
--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project