[cfe-dev] [libcxx] optimizing shared_ptr atomics in destructors
Craig, Ben via cfe-dev
cfe-dev at lists.llvm.org
Mon Jul 18 08:31:46 PDT 2016
Currently, when the last shared_ptr to an object is destroyed, libc++
performs two atomic decrements, one for the "strong" shared count, and
one for the "weak" shared count. I think we can do better than this in
the uncontended case, but I would like some feedback on this
optimization, particularly on the ARM side.
Here's the code change...
diff --git a/src/memory.cpp b/src/memory.cpp
index 08f2259..b459eb1 100644
--- a/src/memory.cpp
+++ b/src/memory.cpp
@@ -30,12 +30,12 @@ increment(T& t) _NOEXCEPT
     return __libcpp_atomic_add(&t, 1, _AO_Relaxed);
 }
 
 template <class T>
 inline T
 decrement(T& t) _NOEXCEPT
 {
     return __libcpp_atomic_add(&t, -1, _AO_Acq_Rel);
 }
 
 } // namespace
@@ -96,7 +96,9 @@ __shared_weak_count::__release_shared() _NOEXCEPT
 void
 __shared_weak_count::__release_weak() _NOEXCEPT
 {
-    if (decrement(__shared_weak_owners_) == -1)
+    if (__libcpp_atomic_load(&__shared_weak_owners_, _AO_Acquire) == 0)
+        __on_zero_shared_weak();
+    else if (decrement(__shared_weak_owners_) == -1)
         __on_zero_shared_weak();
 }
The general idea is that if the current thread is releasing the last
weak reference, then no other thread can legally be accessing this
object any more. Given that, we can skip the expensive atomic
read-modify-write (the locked decrement) and do only a plain load on
that path. On x86_64, a quick-and-dirty benchmark shows an 8%
performance improvement for the combination of make_shared<int> and the
accompanying destruction. I don't have performance numbers for other
architectures at this point. That 8% is promising, though, since the
savings from the atomic operation show through even when measured
alongside a heap allocation and deallocation.
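For concreteness, here is a minimal sketch of the sort of micro-benchmark
I mean; the iteration count, timing code, and the empty-asm trick to keep
the allocation from being optimized away are my own choices here, not an
exact reproduction of the harness I used:

#include <chrono>
#include <cstdio>
#include <memory>

int main()
{
    constexpr long kIters = 10000000;
    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < kIters; ++i)
    {
        // make_shared<int> allocates the object and control block together;
        // the shared_ptr goes out of scope at the end of the iteration,
        // exercising both the strong and weak count decrements.
        auto p = std::make_shared<int>(42);
        asm volatile("" : : "r"(p.get()) : "memory"); // defeat dead-code elimination
    }
    auto stop = std::chrono::steady_clock::now();
    long long ns = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
    std::printf("%lld ns total, %.1f ns per iteration\n", ns, (double)ns / kIters);
}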
Note that this optimization wouldn't be safe for the strong count, as
the last strong count decrement can still contend with a
weak_ptr::lock() call.
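To make that hazard concrete, here is a small, purely illustrative
example (the names are mine): even as the apparent last shared_ptr is
being destroyed, a thread holding only a weak_ptr can come in through
lock() and try to take a new strong reference, so the strong-count
release cannot use a plain load as a shortcut.

#include <memory>
#include <thread>

int main()
{
    auto sp = std::make_shared<int>(1);
    std::weak_ptr<int> wp = sp;

    std::thread t([&wp] {
        // This may run concurrently with the reset() below. lock() must
        // atomically either acquire a new strong reference or observe the
        // count as expired; a non-atomic "load, then decide" on the strong
        // count in the releasing thread would race with this call.
        if (std::shared_ptr<int> sp2 = wp.lock())
            *sp2 += 1;
    });

    sp.reset(); // possibly the last strong reference going away
    t.join();
}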
This comes at the cost of adding an extra load-acquire for all but the
last decrement (and sometimes even for the last decrement). On x86, this
is really cheap (just a regular mov). On AArch64 and 32-bit ARMv8 you
currently get an extra load-acquire instruction (ldar / lda), and on
ARMv7 you get extra barriers.
I would hope / expect that on LL/SC architectures the initial acquire
load could be folded into the load-exclusive that begins the atomic add.
The check and branch (inside the LL/SC loop) would then be the only
overhead. Is this a reasonable optimization to hope for on the compiler
front in the future?
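To illustrate the code shape I have in mind, here is a hedged sketch that
rewrites the same logic as an explicit load / compare-exchange loop (the
free function and its parameters are made up for the example; this is not
a proposed patch). On an LL/SC target the initial acquire load and the
load-exclusive that begins the RMW could in principle be the same
instruction, leaving only the compare and branch as extra work:

#include <atomic>

// Illustrative only: the same fast-path-on-zero logic expressed as a CAS loop.
void release_weak(std::atomic<long>& weak_owners,
                  void (*on_zero_shared_weak)())
{
    long observed = weak_owners.load(std::memory_order_acquire);
    while (true)
    {
        if (observed == 0)
        {
            // We hold the last weak reference; no other thread can reach this
            // control block, so skip the atomic decrement entirely.
            on_zero_shared_weak();
            return;
        }
        // Otherwise try to publish observed - 1. On failure, observed is
        // refreshed and we loop, possibly taking the fast path above.
        if (weak_owners.compare_exchange_weak(observed, observed - 1,
                                              std::memory_order_acq_rel,
                                              std::memory_order_acquire))
            return;
    }
}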
Also, I'm being a bit conservative here by making my atomic load an
acquire operation. It might be safe to make that load relaxed, but that
seems risky to me, since __on_zero_shared_weak may then end up touching
data that hasn't been synchronized with the threads that released their
weak references earlier.
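If the ordering ever did need to be weakened, one familiar shape (again
just a sketch, not something I'm claiming is correct for libc++'s control
block) would be a relaxed fast-path load with an explicit acquire fence on
the zero branch, so the common not-last releases pay only a plain load
while the destruction path keeps its ordering:

#include <atomic>

// Sketch only: relaxed fast-path load, acquire fence confined to the branch
// that actually runs __on_zero_shared_weak(). Whether this is sufficient for
// the real control block is exactly the question raised above.
void release_weak_relaxed(std::atomic<long>& weak_owners,
                          void (*on_zero_shared_weak)())
{
    if (weak_owners.load(std::memory_order_relaxed) == 0)
    {
        std::atomic_thread_fence(std::memory_order_acquire);
        on_zero_shared_weak();
    }
    else if (weak_owners.fetch_sub(1, std::memory_order_acq_rel) == 0)
    {
        // fetch_sub returns the previous value; 0 here means we just took the
        // count to -1, matching the decrement(...) == -1 check in the patch.
        on_zero_shared_weak();
    }
}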
--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project