[libcxx-commits] [PATCH] D117366: [libcxx][libcxxabi][docs] Document the various places we use threading in libcxx and libcxxabi

Daniel McIntosh via Phabricator via libcxx-commits <libcxx-commits at lists.llvm.org>
Wed Jan 19 09:03:46 PST 2022


DanielMcIntosh-IBM added a comment.

@Quuxplusone I've addressed the comments of yours that were easy to address. The rest will take me some time to get to, as I have other items I need to work on at the moment.



================
Comment at: libcxx/docs/DesignDocs/InternalThreadSynchronization.rst:20-23
+* the `shared pointer overloads <https://en.cppreference.com/w/cpp/memory/shared_ptr/atomic>`__ for
+  ``std::atomic_...`` cannot be easily implemented without using locks. (``std::atomic_...``
+  is part of the `Atomic Operations Library <https://en.cppreference.com/w/cpp/atomic>`__ rather
+  than the C++11 Thread Support Library)
----------------
Quuxplusone wrote:
> Also `struct Big { int a[100]; }; std::atomic<Big>`, right? My ill-informed impression is that `atomic<shared_ptr>` requires DCAS (double-wide compare-and-swap) in hardware, but `atomic<Big>` requires arbitrarily wide atomics that don't exist //anywhere//.
> 
> `std::atomic<std::shared_ptr<T>>` is standard C++20. `std::atomic<Big>` is standard C++11.
First, this is about the standalone `atomic_load(const shared_ptr<_Tp>* __p)`/`atomic_store(shared_ptr<_Tp>* __p, shared_ptr<_Tp> __r)`/etc. overloads, which are standard C++11 and already implemented in libcxx, unlike `atomic<shared_ptr>`.
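
For concreteness, a hedged usage sketch of those C++11 free functions (variable names are mine):

```cpp
#include <memory>

std::shared_ptr<int> g_sp = std::make_shared<int>(0); // accessed from multiple threads

void writer() {
  // C++11 standalone overload from <memory>; libcxx currently implements
  // these with locks rather than lock-free atomics.
  std::atomic_store(&g_sp, std::make_shared<int>(42));
}

std::shared_ptr<int> reader() {
  return std::atomic_load(&g_sp); // atomically copies g_sp, bumping the refcount
}
```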

Second, yes, implementing `atomic<Big>` directly using builtin atomics would require arbitrarily wide atomics. They kind of exist through the [__c11_atomic builtins](https://clang.llvm.org/docs/LanguageExtensions.html#c11-atomic-builtins) and/or the [gcc __atomic builtins](https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html). Without these, `atomic<Big>` can still be implemented indirectly using spin-locks or regular locks.
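
As a minimal sketch of the indirect approach (my own illustrative code, not libc++'s actual implementation): the spin-lock lives next to the value, so it works for a `T` of any size and needs no thread support library:

```cpp
#include <atomic>

// Illustrative lock-based atomic for an arbitrarily large T.
template <class T>
class locked_atomic {
  mutable std::atomic_flag lock_ = ATOMIC_FLAG_INIT; // embedded spin-lock
  T value_;

  void lock() const {
    while (lock_.test_and_set(std::memory_order_acquire)) { /* spin */ }
  }
  void unlock() const { lock_.clear(std::memory_order_release); }

public:
  explicit locked_atomic(T v) : value_(v) {}

  T load() const {
    lock();
    T result = value_; // protected plain copy of the large object
    unlock();
    return result;
  }

  void store(T v) {
    lock();
    value_ = v;
    unlock();
  }
};
```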

Technically, the standalone shared pointer overloads could also use spin-locks, but at present they don't. I assumed this is because, unlike `std::atomic<Big>` and `std::atomic<std::shared_ptr<T>>`, they need to share the mutex(es) or spin-lock(s) across all `shared_ptr`s, whereas `atomic<Big>` and `atomic<shared_ptr>` can put a bool or mutex inside the atomic object, making it unique to a specific `Big` or `shared_ptr`. However, looking at the description of rGd77851e8374c5a48de6e7694196b714abd673d84, it seems it was just because of a "pathological distrust of spin locks".
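
To illustrate the sharing constraint (names and sizes are made up; libc++'s actual table differs): because `atomic_load(&sp)` receives only a `shared_ptr*`, the lock can't live inside the object, so a fixed pool of mutexes has to be selected by hashing the object's address:

```cpp
#include <cstdint>
#include <mutex>

static std::mutex sp_locks[16]; // one pool shared by every shared_ptr in the program

static std::mutex& lock_for(const void* p) {
  // Hash the address down to an index; distinct shared_ptrs may collide
  // and share a mutex, which is why this pool must be program-wide state.
  return sp_locks[(reinterpret_cast<std::uintptr_t>(p) >> 4) % 16];
}
```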


================
Comment at: libcxx/docs/DesignDocs/InternalThreadSynchronization.rst:106
+
+The guard object consists of a 'guard byte', an 'init byte', 2 unused bytes, and sometimes a 4 byte
+thread id. If the guard byte is non-zero, then initialization has been completed. Beyond that, if
----------------
Quuxplusone wrote:
> "`sometimes`"? It would be useful to say when and why.
The thread id is only used to detect recursive initialization (in a multi-threaded environment) and isn't particularly relevant to this discussion, but I can include a short blurb on it if you think that would be appropriate.
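
For reference, the layout described in the doc is roughly the following (field names invented for exposition; the real code manipulates the bytes directly):

```cpp
#include <cstdint>

struct guard_object_sketch {
  std::uint8_t  guard_byte; // non-zero once initialization has completed
  std::uint8_t  init_byte;  // tracks an in-progress initialization
  std::uint8_t  unused[2];
  std::uint32_t thread_id;  // only written to detect recursive initialization
                            // in multi-threaded environments
};
static_assert(sizeof(guard_object_sketch) == 8, "one 64-bit guard word");
```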


================
Comment at: libcxx/docs/DesignDocs/InternalThreadSynchronization.rst:113-117
+implementation protects the init byte using a single condition-variable & mutex pair common to all
+SLVs. Contrary to what the Itanium ABI suggests, we do not hold on to the mutex, (or any other
+mutex) after returning from ``__cxa_guard_acquire()``. Instead, we only hold onto the mutex while
+we're reading from or writing to the init byte - once a thread has started the initialization
+(``__cxa_guard_acquire()`` finished and returned 1), it no longer holds the mutex. Any other
----------------
Quuxplusone wrote:
> > Contrary to what the Itanium ABI suggests, we do not hold on to the mutex (or any other mutex) after returning from `__cxa_guard_acquire()`.
> 
> I think this is a narrower-than-useful definition of "mutex." When we return from `__cxa_guard_acquire` we're definitely "holding" //something//, which we must remember to release later, and which (as long as we hold it) will block any other thread from successfully completing a `__cxa_guard_acquire` on the same //thing//; instead those threads will go to sleep until the //thing// is available again. We //could// implement this //thing// as a std::mutex, but we don't (because std::mutex is large?) so instead we implement it as ____.
> 
> You don't describe how the three primitives are implemented when `_LIBCXXABI_USE_FUTEX` //is// defined.
That's a fair point, but in the context of this discussion, where the focus is on when and where we use the base threading support library (e.g. pthreads), I think it makes sense to limit the definition of a "mutex" to objects we acquire/release through that library (i.e. `__libcpp_mutex_t` and/or `__libcpp_recursive_mutex_t`). With a broader definition of mutex/lock, it becomes much less clear whether an operation requires the thread support library. For similar reasons, I've also excluded primitive spin-locks (which can be implemented with atomic operations alone) from the definition of a lock/mutex. If you think it's necessary, I can include a brief discussion of this choice at the beginning of the document, but I think it's pretty clear as-is.

I glossed over what happens when `_LIBCXXABI_USE_FUTEX` is defined because it doesn't appear to be in use at the moment and doesn't rely on `__threading_support`, but I'll add a small blurb explaining exactly that.
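
For that blurb: the futex path talks to the kernel directly, which is why it sidesteps `__threading_support` entirely. A rough Linux-only illustration (not libcxxabi's actual code):

```cpp
#include <climits>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// Sleep until *addr no longer holds `expected` (the kernel re-checks atomically).
static void futex_wait(int* addr, int expected) {
  syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, expected, nullptr, nullptr, 0);
}

// Wake every thread currently sleeping on addr.
static void futex_wake_all(int* addr) {
  syscall(SYS_futex, addr, FUTEX_WAKE_PRIVATE, INT_MAX, nullptr, nullptr, 0);
}
```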


================
Comment at: libcxx/docs/DesignDocs/InternalThreadSynchronization.rst:120
+acquire the mutex, then determine from the init byte that initialization has already been started
+by another thread, and wait on the condition variable (temporarily relinquishing the mutex, but not
+yet returning from ``__cxa_guard_acquire``).
----------------
Quuxplusone wrote:
> IIUC: There is only one std::condition_variable, global to the whole program. Any thread which is waiting to attempt initialization of //any// static local (waiting [its turn to climb the glass hill](https://quuxplusone.github.io/blog/2020/10/23/once-flag/)) will be blocked on that std::condition_variable. When anyone succeeds at their initialization attempt, they notify_all on the condition variable (because they want to unblock all the waiters associated with their own static local). When anyone fails at their initialization attempt, they notify_all on the condition variable (because they want to unblock just one waiter associated with their own static local, but the condition variable's waiter list might be a mix of threads associated with many different static locals so they don't know who to wake).
> This is all gotten from your description; I didn't try to correlate it with the actual code. Did I get the right impression?
Yes, that is correct.
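
In sketch form (my simplification, not the actual libcxxabi code), that scheme looks like:

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>

enum : std::uint8_t { UNSTARTED = 0, IN_PROGRESS = 1, DONE = 2 };

static std::mutex guard_mut;             // one mutex shared by every SLV
static std::condition_variable guard_cv; // one condvar shared by every SLV

int guard_acquire_sketch(std::uint8_t* init_byte) {
  std::unique_lock<std::mutex> lk(guard_mut);
  // Wait our turn while some other thread runs the initializer.
  guard_cv.wait(lk, [&] { return *init_byte != IN_PROGRESS; });
  if (*init_byte == DONE)
    return 0;                // already initialized, nothing to do
  *init_byte = IN_PROGRESS;  // claim the initialization for this thread;
  return 1;                  // the mutex is released on return
}

void guard_release_sketch(std::uint8_t* init_byte, bool success) {
  {
    std::lock_guard<std::mutex> lk(guard_mut);
    *init_byte = success ? DONE : UNSTARTED;
  }
  // The waiter list may mix threads waiting on different SLVs, so wake
  // everyone and let each re-check its own init byte.
  guard_cv.notify_all();
}
```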


================
Comment at: libcxx/docs/DesignDocs/InternalThreadSynchronization.rst:151-157
+`P0718R2 <https://wg21.link/P0718R2>`__ introduces another option for performing atomic operations
+on ``shared_ptr``\s to C++20 with ``std::atomic<std::shared_ptr<T>>`` (`See cppreference
+<https://en.cppreference.com/w/cpp/memory/shared_ptr/atomic2>`__). This will still need some
+form of lock, but unlike with the standalone ``std::atomic_...`` functions, it will be easy to use
+a lock specific to that particular pointer-to-shared_ptr instead of sharing the locks. This may
+make the existing techniques used by ``std::atomic<T>`` that don't rely on threading primitives,
+such as a rudimentary spin-lock, a more attractive implementation option.
----------------
Quuxplusone wrote:
> I don't particularly understand this section. My best guess is that you're speculating that `std::atomic<shared_ptr<T>>` might hold a data member of type `std::mutex`; but I expect/hope that we'd never do that, because it's (not standard-mandated but) really important for QoI that `sizeof(std::atomic<T>) == sizeof(T)`, and I don't see any reason for `shared_ptr` to be the unique exception to that rule. It would also just be super confusing if `atomic_foo(shared_ptr*)` and `atomic<shared_ptr>` used two different synchronization mechanisms.
When I wrote this, I was looking at `__cxx_atomic_lock_impl`, which does not have `sizeof(std::atomic<T>) == sizeof(T)` and uses a spin-lock as I described; I was suggesting `std::atomic<std::shared_ptr<T>>` might do the same.

Looking at it again more carefully, I see that `__cxx_atomic_lock_impl` is only used when `_LIBCPP_ATOMIC_ONLY_USE_BUILTINS` is defined (right now this only happens for freestanding implementations).

It seems I have also misunderstood the reason these overloads don't use a lock-free implementation (it has more to do with `shared_ptr`'s non-trivial constructors than anything else). I'll update this whole section with a new description when I get the chance.
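
For reference, the C++20 facility from P0718R2 reads like this:

```cpp
#include <atomic>
#include <memory>

std::atomic<std::shared_ptr<int>> g_asp; // C++20, per P0718R2

void writer() { g_asp.store(std::make_shared<int>(42)); }

std::shared_ptr<int> reader() { return g_asp.load(); }
```

Because any lock can live inside `g_asp` itself, it need not be shared program-wide the way the standalone overloads' locks must be.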


================
Comment at: libcxx/docs/DesignDocs/InternalThreadSynchronization.rst:250-251
+local instance of ``__cxa_eh_globals``, similar to what we do when ``_LIBCXXABI_HAS_NO_THREADS`` is
+defined. When the threading library is available or becomes available there are a couple reasonable
+choices of implementation.
+
----------------
Quuxplusone wrote:
> This sounds like a "different people do different things" cop-out. Are you trying to describe what //we actually do// (if so, you need only describe the actual implementation, not the choices we didn't make)? Or is your target audience an implementor of a new platform, where you think they //should// do something different (if so, you need only describe the thing you think they should do)?
> 
As I discuss below, I wanted to document both the current implementation, which doesn't support loading the thread library in the middle of exception handling, and how support for that could be added. You're right, though, that this wording isn't great, and I'll update it when I get the chance.
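
Roughly what I have in mind for those two behaviours (the names and the availability check are hypothetical; the real `__cxa_eh_globals` getter differs):

```cpp
// Simplified stand-in for __cxa_eh_globals; fields abbreviated.
struct eh_globals_sketch {
  void*    caughtExceptions   = nullptr;
  unsigned uncaughtExceptions = 0;
};

// Hypothetical stub: whether the thread library has been loaded.
static bool threads_available() { return true; }

eh_globals_sketch* get_globals_sketch() {
  if (!threads_available()) {
    // No thread library yet: hand out one static instance, as when
    // _LIBCXXABI_HAS_NO_THREADS is defined.
    static eh_globals_sketch single_threaded;
    return &single_threaded;
  }
  // Thread library present: every thread gets its own copy. Note a thread
  // that saw the static instance earlier now gets a different object, which
  // is exactly why loading mid-exception-handling is problematic.
  static thread_local eh_globals_sketch per_thread;
  return &per_thread;
}
```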


================
Comment at: libcxx/docs/DesignDocs/InternalThreadSynchronization.rst:257-259
+loaded while exception handling is ongoing. However, cxa_exception was not written with this
+behaviour in mind, so this could produce unexpected results and may become unsafe if
+cxa_exception changes.
----------------
Quuxplusone wrote:
> This sounds very much like "However, cxa_exception was not written with this behaviour in mind, so don't do that." — and then this whole paragraph can be removed!
Unfortunately, this is also probably the best and least disruptive option at the current time.

The only other option I can think of is described in the next paragraph, and implementing that well would require a lot more work (which I gloss over completely with the statement "This approach does require some method of determining whether the current thread is the main thread, however that can be accomplished in a very platform agnostic manner that is outside the scope of this document."). In the long term, I personally think this second approach will be better for us because it allows the threading library to be loaded in the middle of exception handling, but it will be much more disruptive to libc++ and the broader LLVM community. Given the push-back I've already received from Louis in D110349, I suspect that unless there is a demonstrable performance benefit on other platforms, I will have a hard time getting it approved.

I don't have the resources to invest in implementing and performance-testing the second option on the off chance that it is better and that I can convince the LLVM community to switch. (Especially since there are some messy situations around the user spawning threads without using `std::thread`, in which case what we decide is the 'main' thread might not actually be the main thread. That turns out to be a non-issue, but it would require extensive comments and documentation.) I did, however, still want to document this as an option, since the AIX team (for whom this is most relevant) has yet to look at this in detail.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D117366/new/

https://reviews.llvm.org/D117366


