[cfe-dev] [libcxx] optimizing shared_ptr atomics in destructors
Craig, Ben via cfe-dev
cfe-dev at lists.llvm.org
Mon Jul 18 08:31:46 PDT 2016
Currently, when the last shared_ptr to an object is destroyed, libc++
performs two atomic decrements, one for the "strong" shared count, and
one for the "weak" shared count. I think we can do better than this in
the uncontended case, but I would like some feedback on this
optimization, particularly on the ARM side.
Here's the code change...
diff --git a/src/memory.cpp b/src/memory.cpp
index 08f2259..b459eb1 100644
--- a/src/memory.cpp
+++ b/src/memory.cpp
@@ -30,12 +30,12 @@ increment(T& t) _NOEXCEPT
     return __libcpp_atomic_add(&t, 1, _AO_Relaxed);
 }
 
 template <class T>
 inline T
 decrement(T& t) _NOEXCEPT
 {
     return __libcpp_atomic_add(&t, -1, _AO_Acq_Rel);
 }
 
 } // namespace
@@ -96,7 +96,9 @@ __shared_weak_count::__release_shared() _NOEXCEPT
 void
 __shared_weak_count::__release_weak() _NOEXCEPT
 {
-    if (decrement(__shared_weak_owners_) == -1)
+    if (__libcpp_atomic_load(&__shared_weak_owners_, _AO_Acquire) == 0)
+        __on_zero_shared_weak();
+    else if (decrement(__shared_weak_owners_) == -1)
         __on_zero_shared_weak();
 }
The general idea is that if the current thread is releasing the last
weak reference, then no other thread can legally be accessing this
object any more. Given that, we can skip the expensive atomic
read-modify-write (the locked decrement) and do only a plain load on
that path. On x86_64, a quick-and-dirty benchmark shows an 8%
performance improvement for the combination of make_shared<int> and the
accompanying destruction. I don't have performance numbers for other
architectures at this point. That 8% is promising, though, since the
savings from the atomic operation show through even when measured
alongside a heap allocation and deallocation.
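For concreteness, here is a minimal sketch of the sort of micro-benchmark
I mean; the iteration count, timing code, and the empty-asm trick to keep
the allocation from being optimized away are my own choices here, not an
exact reproduction of the harness I used:

#include <chrono>
#include <cstdio>
#include <memory>

int main()
{
    constexpr long kIters = 10000000;
    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < kIters; ++i)
    {
        // make_shared<int> allocates the object and control block together;
        // the shared_ptr goes out of scope at the end of the iteration,
        // exercising both the strong and weak count decrements.
        auto p = std::make_shared<int>(42);
        asm volatile("" : : "r"(p.get()) : "memory"); // defeat dead-code elimination
    }
    auto stop = std::chrono::steady_clock::now();
    long long ns = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
    std::printf("%lld ns total, %.1f ns per iteration\n", ns, (double)ns / kIters);
}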
Note that this optimization wouldn't be safe for the strong count, as
the last strong count decrement can still contend with a
weak_ptr::lock() call.
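To make that hazard concrete, here is a small, purely illustrative
example (the names are mine): even as the apparent last shared_ptr is
being destroyed, a thread holding only a weak_ptr can come in through
lock() and try to take a new strong reference, so the strong-count
release cannot use a plain load as a shortcut.

#include <memory>
#include <thread>

int main()
{
    auto sp = std::make_shared<int>(1);
    std::weak_ptr<int> wp = sp;

    std::thread t([&wp] {
        // This may run concurrently with the reset() below. lock() must
        // atomically either acquire a new strong reference or observe the
        // count as expired; a non-atomic "load, then decide" on the strong
        // count in the releasing thread would race with this call.
        if (std::shared_ptr<int> sp2 = wp.lock())
            *sp2 += 1;
    });

    sp.reset(); // possibly the last strong reference going away
    t.join();
}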
This comes at the cost of adding an extra load-acquire for all but the
last decrement (and sometimes even for the last decrement). On x86, this
is really cheap (just a regular mov). On AArch64 and 32-bit ARMv8 you
currently get an extra load-acquire instruction (ldar / lda), and on
ARMv7 you get extra barriers.
I would hope / expect that on LL/SC architectures the initial acquire
load could be folded into the load-exclusive that begins the atomic add.
The check and branch (inside the LL/SC loop) would then be the only
overhead. Is this a reasonable optimization to hope for on the compiler
front in the future?
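To illustrate the code shape I have in mind, here is a hedged sketch that
rewrites the same logic as an explicit load / compare-exchange loop (the
free function and its parameters are made up for the example; this is not
a proposed patch). On an LL/SC target the initial acquire load and the
load-exclusive that begins the RMW could in principle be the same
instruction, leaving only the compare and branch as extra work:

#include <atomic>

// Illustrative only: the same fast-path-on-zero logic expressed as a CAS loop.
void release_weak(std::atomic<long>& weak_owners,
                  void (*on_zero_shared_weak)())
{
    long observed = weak_owners.load(std::memory_order_acquire);
    while (true)
    {
        if (observed == 0)
        {
            // We hold the last weak reference; no other thread can reach this
            // control block, so skip the atomic decrement entirely.
            on_zero_shared_weak();
            return;
        }
        // Otherwise try to publish observed - 1. On failure, observed is
        // refreshed and we loop, possibly taking the fast path above.
        if (weak_owners.compare_exchange_weak(observed, observed - 1,
                                              std::memory_order_acq_rel,
                                              std::memory_order_acquire))
            return;
    }
}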
Also, I'm being a bit conservative here by making my atomic load an
acquire operation. It might be safe to make that load relaxed, but that
seems risky to me, since __on_zero_shared_weak may then end up touching
data that hasn't been synchronized with the threads that released their
weak references earlier.
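If the ordering ever did need to be weakened, one familiar shape (again
just a sketch, not something I'm claiming is correct for libc++'s control
block) would be a relaxed fast-path load with an explicit acquire fence on
the zero branch, so the common not-last releases pay only a plain load
while the destruction path keeps its ordering:

#include <atomic>

// Sketch only: relaxed fast-path load, acquire fence confined to the branch
// that actually runs __on_zero_shared_weak(). Whether this is sufficient for
// the real control block is exactly the question raised above.
void release_weak_relaxed(std::atomic<long>& weak_owners,
                          void (*on_zero_shared_weak)())
{
    if (weak_owners.load(std::memory_order_relaxed) == 0)
    {
        std::atomic_thread_fence(std::memory_order_acquire);
        on_zero_shared_weak();
    }
    else if (weak_owners.fetch_sub(1, std::memory_order_acq_rel) == 0)
    {
        // fetch_sub returns the previous value; 0 here means we just took the
        // count to -1, matching the decrement(...) == -1 check in the patch.
        on_zero_shared_weak();
    }
}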
--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project