[libcxx-commits] [PATCH] D117366: [libcxx][libcxxabi][docs] Document the various places we use threading in libcxx and libcxxabi

Arthur O'Dwyer via Phabricator via libcxx-commits libcxx-commits at lists.llvm.org
Sun Jan 16 11:08:18 PST 2022


Quuxplusone added inline comments.


================
Comment at: libcxx/docs/DesignDocs/InternalThreadSynchronization.rst:17-19
+* `"If multiple threads attempt to initialize the same static local variable concurrently, the
+  initialization occurs exactly once"
+  <https://en.cppreference.com/w/cpp/language/storage_duration#Static_local_variables>`__
----------------
`Thread-safe static initialization` is the usual term for this.


================
Comment at: libcxx/docs/DesignDocs/InternalThreadSynchronization.rst:20-23
+* the `shared pointer overloads <https://en.cppreference.com/w/cpp/memory/shared_ptr/atomic>`__ for
+  ``std::atomic_...`` cannot be easily implemented without using locks. (``std::atomic_...``
+  is part of the `Atomic Operations Library <https://en.cppreference.com/w/cpp/atomic>`__ rather
+  than the C++11 Thread Support Library)
----------------
Also `struct Big { int a[100]; }; std::atomic<Big>`, right? My ill-informed impression is that `atomic<shared_ptr>` requires DCAS (double-wide compare-and-swap) in hardware, but `atomic<Big>` requires arbitrarily wide atomics that don't exist //anywhere//.

`std::atomic<std::shared_ptr<T>>` is standard C++20. `std::atomic<Big>` is standard C++11.
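
(For illustration — both declarations below are well-formed in the standard version noted, but the `Big` one cannot be lock-free on any mainstream hardware; names are made up:)

```
#include <atomic>
#include <cstdio>
#include <memory>

struct Big { int a[100]; };  // trivially copyable, so std::atomic<Big> is valid C++11

int main() {
  std::atomic<Big> big;
  std::printf("atomic<Big> lock-free? %d\n", big.is_lock_free());  // prints 0 on mainstream targets
#if __cplusplus >= 202002L
  std::atomic<std::shared_ptr<int>> sp;  // well-formed only since C++20 (P0718R2)
  std::printf("atomic<shared_ptr> lock-free? %d\n", sp.is_lock_free());
#endif
}
```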


================
Comment at: libcxx/docs/DesignDocs/InternalThreadSynchronization.rst:56-63
+When ``_LIBCXXABI_HAS_NO_THREADS`` is defined, both ``__cxa_get_globals()`` and 
+``__cxa_get_globals_fast()`` call a third function that simply returns a pointer to a static local
+instance of ``__cxa_eh_globals``. Without multiple threads, none of the above complexity is
+required.
+
+When ``_LIBCXXABI_HAS_NO_THREADS`` is not defined, but ``HAS_THREAD_LOCAL`` is defined, we take a
+very similar tactic, except instead of a static local copy of ``__cxa_eh_globals``, we use a static
----------------
I think it would make sense to explain the basic functionality in terms of `thread_local` first. Then the no-threads version is just "that, but replace `thread_local` with `static`"; and the no-thread-local version is "that, but simulate `thread_local` by hand in the following way...". (Assuming I got that right, of course. My eyes glazed over at the first paragraph. I'm hypothesizing that it will become more comprehensible once the basic functionality is separated from the mechanics of `thread_local`-simulation.)
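
(A minimal sketch of the restructuring being suggested — the members of `__cxa_eh_globals` are stand-ins and the macro spellings are taken from the quoted text, so treat this as illustrative only:)

```
// Hypothetical restructuring, starting from the thread_local case and then
// deriving the other two configurations from it.
struct eh_globals_sketch {        // stand-in for __cxa_eh_globals
  void*    caughtExceptions   = nullptr;
  unsigned uncaughtExceptions = 0;
};

#if defined(HAS_THREAD_LOCAL)
// Baseline: one instance per thread, constructed on first use.
static eh_globals_sketch* get_globals_sketch() {
  static thread_local eh_globals_sketch g;
  return &g;
}
#elif defined(_LIBCXXABI_HAS_NO_THREADS)
// "That, but replace thread_local with static": one instance for the program.
static eh_globals_sketch* get_globals_sketch() {
  static eh_globals_sketch g;
  return &g;
}
#else
// "That, but simulate thread_local by hand": allocate one instance per thread
// and stash it in thread-specific storage (e.g. a pthread_key_t), which is
// where the complexity in the current text lives.
#endif
```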



================
Comment at: libcxx/docs/DesignDocs/InternalThreadSynchronization.rst:80-84
+facet in the ``locale::__imp::facets_`` vector. For reasons described `here
+<https://en.cppreference.com/w/cpp/locale/locale/id/id>`__, ``locale::id::__id_`` is 0 initialized
+at first, and only truly intialized when the corresponding facet is first added to a locale (the
+initialization occurs via the first call to ``locale::id::__get()``, which invokes
+``locale::id::__init()``).
----------------
`intialized`
"For reasons described [on a different website not run by us]" is a risky game. ;) The usual term for what they're describing is "Static Initialization Order Fiasco": we need `(what object?)->__id_` to get initialized before the first time it's used, but the first time it's used might be inside the constructor of `std::cout`, which might run before the constructor of `(what object?)` because static objects' initialization order is not generally controllable.
We actually //do// control initialization order in a couple of places in the standard library, via `__attribute__((init_priority(SOME-HIGH-NUMBER)))` a.k.a. `_LIBCPP_INIT_PRIORITY_MAX`; do we avoid the attribute here only for historical reasons?
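
(For reference, a tiny illustration of the attribute in question; the priority value and names are made up:)

```
// GCC/Clang extension: smaller init_priority values are constructed earlier,
// and an object without the attribute gets the default (lowest) priority,
// so it is constructed after any object carrying an explicit one.
struct Registry { Registry() {} };

__attribute__((init_priority(101))) Registry early_registry;  // constructed first
Registry late_registry;                                        // default priority: constructed later
```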


================
Comment at: libcxx/docs/DesignDocs/InternalThreadSynchronization.rst:97
+----------------------
+Unlike C, in C++, static local variables (SLVs) are initialized the first time excecution passes
+through the declaration. As a result, we need to keep track of whether an SLV has been initialized
----------------
`excecution`, `responsability`, `gaurd`, `aquired`
I'd strongly prefer just spelling out `static local [variable]` on each reference, instead of introducing an idiosyncratic acronym `SLV`.
"Unlike C, C++ permits dynamic initialization of static local variables. Dynamic initializers for static locals run the first time execution passes through the declaration. So C++ needs to keep track of whether a static local has already been initialized, and avoid races if multiple threads attempt to perform the initialization at once. The Itanium C++ ABI defines three runtime functions to help with this: each thread should call `__cxa_guard_acquire` before attempting initialization, and either `__cxa_guard_abort` or `__cxa_guard_release` after initialization fails (via an exception) or succeeds, respectively."
And then we can get into how libc++abi implements those three functions in detail.
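
(To make the protocol concrete, here is roughly the code a compiler emits for a dynamically initialized static local under the Itanium ABI; `Widget`/`make_widget` are made up and the guard-object signature is paraphrased, so this is a sketch rather than the documented implementation:)

```
#include <cstdint>
#include <new>

// Itanium C++ ABI runtime hooks (signatures paraphrased; the real parameter
// is a pointer to a 64-bit guard object).
extern "C" int  __cxa_guard_acquire(std::uint64_t* guard);
extern "C" void __cxa_guard_release(std::uint64_t* guard);
extern "C" void __cxa_guard_abort(std::uint64_t* guard);

struct Widget { int value = 42; };
Widget make_widget() { return Widget{}; }

// Roughly what the compiler generates for:
//   Widget& get() { static Widget w = make_widget(); return w; }
Widget& get() {
  static std::uint64_t guard_w;  // zero-initialized guard object
  alignas(Widget) static unsigned char storage_w[sizeof(Widget)];
  if (*reinterpret_cast<unsigned char*>(&guard_w) == 0) {  // fast path: guard byte set => done
    if (__cxa_guard_acquire(&guard_w)) {                   // returns 1: this thread must initialize
      try {
        ::new (storage_w) Widget(make_widget());
        __cxa_guard_release(&guard_w);                     // success: publish "initialized"
      } catch (...) {
        __cxa_guard_abort(&guard_w);                       // failure: let another thread retry
        throw;
      }
    }
  }
  return *reinterpret_cast<Widget*>(storage_w);
}
```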


================
Comment at: libcxx/docs/DesignDocs/InternalThreadSynchronization.rst:106
+
+The guard object consists of a 'guard byte', an 'init byte', 2 unused bytes, and sometimes a 4 byte
+thread id. If the guard byte is non-zero, then initialization has been completed. Beyond that, if
----------------
"`sometimes`"? It would be useful to say when and why.


================
Comment at: libcxx/docs/DesignDocs/InternalThreadSynchronization.rst:113-117
+implementation protects the init byte using a single condition-variable & mutex pair common to all
+SLVs. Contrary to what the Itanium ABI suggests, we do not hold on to the mutex, (or any other
+mutex) after returning from ``__cxa_guard_acquire()``. Instead, we only hold onto the mutex while
+we're reading from or writing to the init byte - once a thread has started the initialization
+(``__cxa_guard_acquire()`` finished and returned 1), it no longer holds the mutex. Any other
----------------
> Contrary to what the Itanium ABI suggests, we do not hold on to the mutex (or any other mutex) after returning from `__cxa_guard_acquire()`.

I think this is a narrower-than-useful definition of "mutex." When we return from `__cxa_guard_acquire` we're definitely "holding" //something//, which we must remember to release later, and which (as long as we hold it) will block any other thread from successfully completing a `__cxa_guard_acquire` on the same //thing//; instead those threads will go to sleep until the //thing// is available again. We //could// implement this //thing// as a std::mutex, but we don't (because std::mutex is large?) so instead we implement it as ____.

You don't describe how the three primitives are implemented when `_LIBCXXABI_USE_FUTEX` //is// defined.


================
Comment at: libcxx/docs/DesignDocs/InternalThreadSynchronization.rst:120
+acquire the mutex, then determine from the init byte that initialization has already been started
+by another thread, and wait on the condition variable (temporarily relinquishing the mutex, but not
+yet returning from ``__cxa_guard_acquire``).
----------------
IIUC: There is only one std::condition_variable, global to the whole program. Any thread which is waiting to attempt initialization of //any// static local (waiting [its turn to climb the glass hill](https://quuxplusone.github.io/blog/2020/10/23/once-flag/)) will be blocked on that std::condition_variable. When anyone succeeds at their initialization attempt, they notify_all on the condition variable (because they want to unblock all the waiters associated with their own static local). When anyone fails at their initialization attempt, they notify_all on the condition variable (because they want to unblock just one waiter associated with their own static local, but the condition variable's waiter list might be a mix of threads associated with many different static locals so they don't know who to wake).
This is all gotten from your description; I didn't try to correlate it with the actual code. Did I get the right impression?
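
(If that reading is right, the core of it looks roughly like this — one program-wide mutex/condition_variable pair, with the per-local state confined to the guard and init bytes; identifiers are invented and the real code in cxa_guard_impl.h differs in detail:)

```
#include <condition_variable>
#include <cstdint>
#include <mutex>

// One pair shared by every static local in the program.
static std::mutex guard_mut_sketch;
static std::condition_variable guard_cv_sketch;

// Returns 1 if the caller must perform the initialization, 0 if it is already done.
int guard_acquire_sketch(std::uint8_t* guard_byte, std::uint8_t* init_byte) {
  std::unique_lock<std::mutex> lk(guard_mut_sketch);
  for (;;) {
    if (*guard_byte != 0)
      return 0;                // another thread already finished the initialization
    if (*init_byte == 0) {
      *init_byte = 1;          // claim the initialization; the mutex is dropped on return
      return 1;
    }
    guard_cv_sketch.wait(lk);  // someone is initializing this (or any other) local
  }
}

void guard_release_sketch(std::uint8_t* guard_byte, std::uint8_t* init_byte) {
  {
    std::lock_guard<std::mutex> lk(guard_mut_sketch);
    *guard_byte = 1;           // publish "initialized"
    *init_byte = 0;
  }
  guard_cv_sketch.notify_all();  // waiters for *any* static local may be parked here
}

void guard_abort_sketch(std::uint8_t* init_byte) {
  {
    std::lock_guard<std::mutex> lk(guard_mut_sketch);
    *init_byte = 0;            // allow another thread to try again
  }
  guard_cv_sketch.notify_all();  // can't target just "our" waiters, so wake everyone
}
```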


================
Comment at: libcxx/docs/DesignDocs/InternalThreadSynchronization.rst:128
+
+Shared pointer overloads of ``std::atomic_...``
+-----------------------------------------------
----------------
Throughout: If you mean `shared_ptr`, please say `shared_ptr`. "Shared //pointer//" means a pointer (`T*`) that is shared.


================
Comment at: libcxx/docs/DesignDocs/InternalThreadSynchronization.rst:151-157
+`P0718R2 <https://wg21.link/P0718R2>`__ introduces another option for performing atomic operations
+on ``shared_ptr``\s to C++20 with ``std::atomic<std::shared_ptr<T>>`` (`See cppreference
+<https://en.cppreference.com/w/cpp/memory/shared_ptr/atomic2>`__). This will still need some
+form of lock, but unlike with the standalone ``std::atomic_...`` functions, it will be easy to use
+a lock specific to that particular pointer-to-shared_ptr instead of sharing the locks. This may
+make the existing techniques used by ``std::atomic<T>`` that don't rely on threading primatives,
+such as a rudimentary spin-lock, a more attractive implementation option.
----------------
I don't particularly understand this section. My best guess is that you're speculating that `std::atomic<shared_ptr<T>>` might hold a data member of type `std::mutex`; but I expect/hope that we'd never do that, because it's (not standard-mandated but) really important for QoI that `sizeof(std::atomic<T>) == sizeof(T)`, and I don't see any reason for `shared_ptr` to be the unique exception to that rule. It would also just be super confusing if `atomic_foo(shared_ptr*)` and `atomic<shared_ptr>` used two different synchronization mechanisms.
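
(The QoI property being appealed to — true for libc++ on common targets, though not required by the standard:)

```
#include <atomic>

// Holds for libc++ (and most implementations) on mainstream targets;
// the standard does not mandate it.
static_assert(sizeof(std::atomic<int>)   == sizeof(int),   "");
static_assert(sizeof(std::atomic<void*>) == sizeof(void*), "");
```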


================
Comment at: libcxx/docs/DesignDocs/InternalThreadSynchronization.rst:175-179
+To avoid generating the same sequence every time the two-argument form of ``std::random_shuffle``
+is called, ``std::__rs_default`` needs to preserve some state after destruction. It does this by
+using a random number engine with static storage duration (specifically a static local instance of
+``std::mt19937`` in ``__rs_default::operator()()``). However, the Mersenne Twister engine is not
+thread safe, so ``__rs_default`` needs to protect it using a mutex.
----------------
I'd say:
> `std::random_shuffle` depends on a global pseudorandom number generator (PRNG), similar to `std::rand()`. Calls to the global PRNG must be synchronized to avoid data races. The dylib provides a function named `__rs_get()` which returns an object of type `__rs_default`. `__rs_default::operator()` calls the `operator()` of a static local `std::mt19937`, which is protected by the global mutex `__rs_mut`. `__rs_default`'s constructor locks `__rs_mut`; `__rs_default`'s destructor unlocks `__rs_mut`. Note that this means libc++'s `std::random_shuffle` is not reentrant.
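
(A rough sketch of that pattern, with invented names; the real `__rs_default` is a bit more involved — it reference-counts copies of itself so the mutex is released only when the last copy goes away:)

```
#include <mutex>
#include <random>

static std::mutex rs_mut_sketch;  // stand-in for the global __rs_mut

class rs_default_sketch {         // stand-in for __rs_default
public:
  rs_default_sketch()  { rs_mut_sketch.lock(); }    // held for the object's entire lifetime
  ~rs_default_sketch() { rs_mut_sketch.unlock(); }
  rs_default_sketch(const rs_default_sketch&) = delete;

  unsigned operator()() {
    static std::mt19937 engine;   // the shared PRNG state with static storage duration
    return engine();              // safe only while rs_mut_sketch is held
  }
};

unsigned rs_next_sketch() {       // analogue of going through __rs_get()
  rs_default_sketch g;            // locks; nesting this on one thread would deadlock,
  return g();                     // which is why random_shuffle is not reentrant
}
```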


================
Comment at: libcxx/docs/DesignDocs/InternalThreadSynchronization.rst:250-251
+local instance of ``__cxa_eh_globals``, similar to what we do when ``_LIBCXXABI_HAS_NO_THREADS`` is
+defined. When the threading library is available or becomes available there are a couple reasonable
+choices of implementation.
+
----------------
This sounds like a "different people do different things" cop-out. Are you trying to describe what //we actually do// (if so, you need only describe the actual implementation, not the choices we didn't make)? Or is your target audience an implementor of a new platform, where you think they //should// do something different (if so, you need only describe the thing you think they should do)?



================
Comment at: libcxx/docs/DesignDocs/InternalThreadSynchronization.rst:257-259
+loaded while exception handling is ongoing. However, cxa_exception was not written with this
+behaviour in mind, so this could produce unexpected results and may become unsafe if
+cxa_exception changes.
----------------
This sounds very much like "However, cxa_exception was not written with this behaviour in mind, so don't do that." — and then this whole paragraph can be removed!


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D117366/new/

https://reviews.llvm.org/D117366


