[cfe-dev] [libcxx] optimizing shared_ptr atomics in destructors

Mon Jul 18 10:29:36 PDT 2016

On Mon, Jul 18, 2016 at 8:31 AM, Craig, Ben <ben.craig at codeaurora.org>
wrote:

> Currently, when the last shared_ptr to an object is destroyed, libc++
> performs two atomic decrements, one for the "strong" shared count, and one
> for the "weak" shared count.  I think we can do better than this in the
> uncontended case, but I would like some feedback for this optimization,
> particularly on the ARM side.
>
> Here's the code change...
> diff --git a/src/memory.cpp b/src/memory.cpp
> index 08f2259..b459eb1 100644
> --- a/src/memory.cpp
> +++ b/src/memory.cpp
> @@ -30,12 +30,12 @@ increment(T& t) _NOEXCEPT
>      return __libcpp_atomic_add(&t, 1, _AO_Relaxed);
>  }
>
>  template <class T>
>  inline T
>  decrement(T& t) _NOEXCEPT
>  {
>      return __libcpp_atomic_add(&t, -1, _AO_Acq_Rel);
>  }
>
>  }  // namespace
> @@ -96,7 +96,9 @@ __shared_weak_count::__release_shared() _NOEXCEPT
>  void
>  __shared_weak_count::__release_weak() _NOEXCEPT
>  {
> -    if (decrement(__shared_weak_owners_) == -1)
> +    if (__libcpp_atomic_load(&__shared_weak_owners_, _AO_Acquire) == 0)
> +        __on_zero_shared_weak();
> +    else if (decrement(__shared_weak_owners_) == -1)
>          __on_zero_shared_weak();
>  }
>
> The general idea is that if the current thread is destroying the last weak
> reference, then no other thread can legally be accessing this object.
> Given that, we can avoid an expensive atomic store. On x86_64, a
> quick-and-dirty benchmark is showing an 8% improvement in performance for
> the combination of make_shared<int> and the accompanying destruction.  I
> don't have performance numbers for other architectures at this point.  That
> 8% is pretty promising though, as the atomic operation improvements are
> showing through, despite being measured along with a heap allocation and
> deallocation.
>

Do you have a repo with this benchmark?
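
For concreteness, a single-threaded microbenchmark of this kind might look roughly like the sketch below. This is not the benchmark behind the 8% figure; the helper name ns_per_make_shared_cycle and the timing approach are assumptions made for illustration.

```cpp
#include <chrono>
#include <cstddef>
#include <memory>

// Hypothetical helper, not from the original patch: average nanoseconds for
// one make_shared<int> plus the destruction of its only owner.
// `iterations` must be greater than zero.
double ns_per_make_shared_cycle(std::size_t iterations) {
    volatile int sink = 0;  // discourages the optimizer from dropping the loop body
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        auto p = std::make_shared<int>(static_cast<int>(i));
        sink = *p;
        // p goes out of scope here; in libc++ that releases the strong count
        // and then the weak count, which is the path the patch changes.
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

Numbers from something like this include the heap allocation and deallocation, so it only makes sense to compare builds with and without the patch on the same machine.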

> Note that this optimization wouldn't be safe for the strong count, as the
> last strong count decrement can still contend with a weak_ptr::lock() call.
>
> This comes at the cost of adding an extra load acquire for all but the
> last decrement (and sometimes even the last decrement).  On x86, this is
> really cheap (just a regular mov).  Currently with aarch64 and 32-bit
> armv8, you get an extra lda, and with armv7 you get extra barriers.
>
> I would hope / expect that on LL/SC architectures, the first acquire load
> could be folded with the locked load in the atomic add.  The check and
> branch (inside the ll / sc loop) would then be the only overhead.  Is this
> a reasonable optimization to hope for in the future on the compiler front?
>

What do you mean exactly, could you provide assembly? I think I understand
(sounds clever & doable), but assembly is easier :-)
That can be benchmarked as well.
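
In lieu of assembly, here is how I read the folded form at the source level: one loop that gives up early when it observes the last owner, and otherwise tries to publish the decrement. This is only a sketch. The function name release_weak_folded is made up, and a compare-exchange loop is an approximation of what a single LL/SC loop would do, not the code a compiler emits today.

```cpp
#include <atomic>

// Hypothetical source-level model of the "folded" release.
// The count is zero-based, as in libc++: 0 means one weak owner remains.
// Returns true when the caller should run the equivalent of
// __on_zero_shared_weak().
bool release_weak_folded(std::atomic<long>& weak_owners) {
    long current = weak_owners.load(std::memory_order_acquire);
    for (;;) {
        if (current == 0) {
            return true;  // last owner: skip the store entirely
        }
        // On failure, compare_exchange_weak reloads `current` and we retry,
        // so a racing decrement to 0 is picked up by the check above.
        if (weak_owners.compare_exchange_weak(current, current - 1,
                                              std::memory_order_acq_rel,
                                              std::memory_order_acquire)) {
            return false;  // other owners remain
        }
    }
}
```

With two owners (count 1), the first call decrements to 0 and returns false; the second sees 0 and returns true without writing.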

> Also, I'm being a bit conservative here by making my atomic load an acquire
> operation.  It might be safe to make the operation relaxed, but that seems
> risky to me, as __on_zero_shared_weak may end up touching unsynchronized
> data in those cases.

I haven't thought enough about shared_ptr to convince myself either way.
Would be good to benchmark to see if it's even worth proving.
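
To make the acquire-load fast path easy to experiment with outside libc++, it can be modeled with std::atomic roughly as follows. ControlBlockModel, rmw_count and destroyed are hypothetical names for this sketch; the real code lives in __shared_weak_count and calls __on_zero_shared_weak().

```cpp
#include <atomic>

// Hypothetical stand-in for libc++'s __shared_weak_count, for experiments only.
// The count is zero-based, as in libc++: 0 means one weak owner remains.
struct ControlBlockModel {
    std::atomic<long> weak_owners{0};
    std::atomic<int> rmw_count{0};  // instrumentation: read-modify-writes performed
    bool destroyed = false;         // set where __on_zero_shared_weak() would run

    void add_weak() { weak_owners.fetch_add(1, std::memory_order_relaxed); }

    void release_weak() {
        // Fast path from the patch: if we hold the last weak reference, no
        // other thread may legally touch the block, so skip the decrement.
        if (weak_owners.load(std::memory_order_acquire) == 0) {
            destroyed = true;
            return;
        }
        rmw_count.fetch_add(1, std::memory_order_relaxed);
        // fetch_sub returns the previous value, so a previous value of 0
        // matches the patch's "decrement(...) == -1" test on the new value.
        if (weak_owners.fetch_sub(1, std::memory_order_acq_rel) == 0) {
            destroyed = true;
        }
    }
};
```

Swapping the acquire load for a relaxed one in this model is one way to measure whether the weaker ordering buys anything before trying to prove it safe.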