[cfe-dev] [libcxx] optimizing shared_ptr atomics in destructors
JF Bastien via cfe-dev
cfe-dev at lists.llvm.org
Mon Jul 18 10:29:36 PDT 2016
On Mon, Jul 18, 2016 at 8:31 AM, Craig, Ben <ben.craig at codeaurora.org>
wrote:
> Currently, when the last shared_ptr to an object is destroyed, libc++
> performs two atomic decrements, one for the "strong" shared count, and one
> for the "weak" shared count. I think we can do better than this in the
> uncontended case, but I would like some feedback for this optimization,
> particularly on the ARM side.
>
> Here's the code change...
> diff --git a/src/memory.cpp b/src/memory.cpp
> index 08f2259..b459eb1 100644
> --- a/src/memory.cpp
> +++ b/src/memory.cpp
> @@ -30,12 +30,12 @@ increment(T& t) _NOEXCEPT
> return __libcpp_atomic_add(&t, 1, _AO_Relaxed);
> }
>
> template <class T>
> inline T
> decrement(T& t) _NOEXCEPT
> {
> return __libcpp_atomic_add(&t, -1, _AO_Acq_Rel);
> }
>
> } // namespace
> @@ -96,7 +96,9 @@ __shared_weak_count::__release_shared() _NOEXCEPT
> void
> __shared_weak_count::__release_weak() _NOEXCEPT
> {
> - if (decrement(__shared_weak_owners_) == -1)
> + if (__libcpp_atomic_load(&__shared_weak_owners_, _AO_Acquire) == 0)
> + __on_zero_shared_weak();
> + else if (decrement(__shared_weak_owners_) == -1)
> __on_zero_shared_weak();
> }
>
> The general idea is that if the current thread is destroying the last weak
> reference, then no other thread can legally be accessing this object.
> Given that, we can avoid an expensive atomic read-modify-write. On x86_64, a
> quick-and-dirty benchmark is showing an 8% improvement in performance for
> the combination of make_shared<int> and the accompanying destruction. I
> don't have performance numbers for other architectures at this point. That
> 8% is pretty promising though, as the atomic operation improvements are
> showing through, despite being measured along with a heap allocation and
> deallocation.
>
Do you have a repo with this benchmark?
> Note that this optimization wouldn't be safe for the strong count, as the
> last strong count decrement can still contend with a weak_ptr::lock() call.
>
> This comes at the cost of adding an extra load acquire for all but the
> last decrement (and sometimes even the last decrement). On x86, this is
> really cheap (just a regular mov). Currently, aarch64 gets an extra ldar,
> 32-bit armv8 gets an extra lda, and armv7 gets extra barriers.
>
> I would hope / expect that on LL/SC architectures, the first acquire load
> could be folded with the locked load in the atomic add. The check and
> branch (inside the ll / sc loop) would then be the only overhead. Is this
> a reasonable optimization to hope for in the future on the compiler front?
>
What do you mean, exactly? Could you provide assembly? I think I understand
(sounds clever & doable), but assembly is easier :-)
That can be benchmarked as well.
> Also, I'm being a bit conservative here by making my atomic load an acquire
> operation. It might be safe to make the operation relaxed, but that seems
> risky to me, as __on_zero_shared_weak may end up touching unsynchronized
> data in those cases.
I haven't thought enough about shared_ptr to convince myself either way. It
would be good to benchmark it first, to see if the difference is even worth
proving correct.