[libcxx-commits] [PATCH] D114119: [libcxx] Fix potential lost wake-up in counting semaphore

Jan Kokemüller via Phabricator via libcxx-commits libcxx-commits at lists.llvm.org
Tue Dec 14 12:14:15 PST 2021


jiixyj added a comment.

I've been studying the libcxx atomic "futex table" implementation in
`src/atomic.cpp` and `include/atomic`, and I have some theories about why
there might still be deadlocks, at least under macOS.

The "futex table" in `src/atomic.cpp` contains 256 platform native futexes
(something like Linux/OpenBSD/NetBSD `futex`, macOS `__ulock_*`, FreeBSD
`umtx`, Windows `WaitOnAddress`). This table is needed because some platforms
don't natively support waiting on atomics of every size (8/16/32/64 bits). On
Linux, for example, futexes are always 32 bits wide. On these platforms,
calling `atomic_wait` on an atomic of an unsupported size waits on one of the
table futexes "under the hood".

Each futex of the table has an associated "waiter count" (`__contention_state`)
which tracks the number of waiters on that futex. This count is used to avoid
calling the platform wake function needlessly.
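
The table lookup can be sketched roughly like this (the names, the hash, and
the layout are hypothetical; libc++'s actual code in `src/atomic.cpp` differs
in its details):

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Rough sketch of the 256-entry futex table. Each slot pairs a "backing"
// futex word with a waiter count (the role of __contention_state).
constexpr std::size_t kTableSize = 256;

struct Entry {
    std::atomic<std::uint64_t> platform_state{0};   // the backing futex word
    std::atomic<std::uint64_t> contention_state{0}; // waiters on this slot
};

inline Entry g_table[kTableSize];

// Map the address of a user atomic to a table slot. Shifting first drops the
// low bits, which are identical for suitably aligned objects.
inline Entry& entry_for(const volatile void* addr) {
    auto p = reinterpret_cast<std::uintptr_t>(addr);
    return g_table[(p >> 2) % kTableSize];
}
```

The important property is only that the mapping is deterministic: every waiter
and waker for a given atomic address agrees on the same slot (unrelated atomics
may collide on a slot, which is harmless but causes spurious wakeups).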

If the platform _does_ support the size of the atomic natively then the
wait/notify syscalls (e.g. `futex`) are called directly on the atomic instead
of using one from the table. However, the "waiter count" of the table futex
is still used to minimize native wake up calls.

The part of `__libcpp_atomic_wait_backoff_impl` where the waiting actually
happens looks like this:

  auto const __monitor = __libcpp_atomic_monitor(__a);
  if(__test_fn())
      return true;
  __libcpp_atomic_wait(__a, __monitor);

...so the current value of the futex (the table one if the size is not
natively supported, the actual atomic otherwise) is saved. Then the test
function is called. If the test is unsuccessful (in the case of semaphores,
the atomic is "0"), we wait on the futex, passing it the `__monitor` value to
be free of races.

This will work absolutely fine if one of the table futexes is used. Waking one
of them up will always increase the futex value by one, so the futex wait will
instantly return if the current futex value is different from `__monitor`
(which indicates that a wakeup must have happened _after_ the call to
`__test_fn` but before going to sleep). There is a corner case when there are
~4 billion wakeups in this short interval. This is mitigated by waiting for at
most 2s when the native platform futex size is only 32 bit (explained by
Olivier Giroux here: https://github.com/ogiroux/atomic_wait/issues/3).

Now, if the native atomic/futex is used (which is the case on macOS for
`counting_semaphore`!), deadlocks are _much_ more likely to occur, because now
the futex is no longer a simple counter that always increments, but instead
holds the "real" value (in the case of `counting_semaphore`, the semaphore
value itself). There might be executions like the following:

  - T1 calls acquire()
  - T2 calls try_acquire()
  - T3 calls release() two times
  - semaphore starts with "0"
  
  T1                        T2                        T3              sem
  ------------------------------------------------------------------------
                                                                       0
  ------------------------------------------------------------------------
  acquire():
  sees that sema=0
  goes into
  __libcpp_atomic_wait_backoff_impl                                    0
  ------------------------------------------------------------------------
                                                      release()        1
  ------------------------------------------------------------------------
  __monitor=1                                                          1
  ------------------------------------------------------------------------
                            try_acquire():
                            acquires one
                            successfully,
                            returns true                               0
  ------------------------------------------------------------------------
  __test_fn():
  returns false
  since sem==0                                                         0
  ------------------------------------------------------------------------
                                                      release()        1
  ------------------------------------------------------------------------
  calls
  __libcpp_atomic_wait(__a, 1);
  --> deadlock!                                                        1
  ------------------------------------------------------------------------

Here, `__libcpp_atomic_wait(__a, 0);` should have been called, because
`__test_fn()` made a decision to sleep based on the value "0" of `sem`. But
instead, the value of `__monitor` is used, which is "1", leading to this
ABA-style problem where T1 sleeps although the semaphore is unlocked.
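
The interleaving above can be replayed deterministically in a single thread.
Here `would_block` is a hypothetical stand-in for the futex syscall's
compare-and-block step, and `try_acquire` mirrors the semaphore's test
function; the final check shows T1 blocking even though the semaphore is "1":

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// The futex blocks iff the current value still equals the monitor snapshot.
bool would_block(const std::atomic<std::ptrdiff_t>& a, std::ptrdiff_t monitor) {
    return a.load(std::memory_order_relaxed) == monitor;
}

// The semaphore's test function: take one token if there is one.
bool try_acquire(std::atomic<std::ptrdiff_t>& a) {
    auto old = a.load(std::memory_order_relaxed);
    return old != 0 && a.compare_exchange_strong(old, old - 1,
                                                 std::memory_order_acquire,
                                                 std::memory_order_relaxed);
}

bool lost_wakeup_replay() {
    std::atomic<std::ptrdiff_t> sem{0};
    sem.fetch_add(1, std::memory_order_release);        // T3: release(), sem = 1
    auto monitor = sem.load(std::memory_order_acquire); // T1: __monitor = 1
    bool t2_got_it = try_acquire(sem);                  // T2: try_acquire(), sem = 0
    bool t1_test = try_acquire(sem);                    // T1: __test_fn() -> false
    sem.fetch_add(1, std::memory_order_release);        // T3: release(), sem = 1
    // T1 now calls wait with the stale monitor ("1") and misses the release.
    return t2_got_it && !t1_test && would_block(sem, monitor);
}
```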

I think to solve this problem you could require the "test function" to save its
current understanding of the atomic into an in/out argument so that the atomic
wait function knows which value to use for its second argument. In case of the
semaphores, it could look like this:

  patch
       _LIBCPP_AVAILABILITY_SYNC _LIBCPP_INLINE_VISIBILITY
       void acquire()
       {
  -        auto const __test_fn = [this]() -> bool {
  -            auto __old = __a.load(memory_order_relaxed);
  -            return (__old != 0) && __a.compare_exchange_strong(__old, __old - 1, memory_order_acquire, memory_order_relaxed);
  +        auto const __test_fn = [this](ptrdiff_t& __old) -> bool {
  +            while (true) {
  +                if (__old == 0)
  +                    return false;
  +                if (__a.compare_exchange_weak(__old, __old - 1, memory_order_acquire, memory_order_relaxed))
  +                    return true;
  +            }
           };
  -        __cxx_atomic_wait(&__a.__a_, __test_fn);
  +        __cxx_atomic_wait_fn(&__a.__a_, __test_fn, memory_order_relaxed);
       }

For semaphores, the resulting `__old` is always "0" when "false" is returned,
so the second argument to `futex` is always "0", which should be correct.
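
As a standalone sketch of the proposed interface (names are made up and this
sits outside libc++'s internals), the in/out test function for `acquire()`
would behave like this:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// `old` carries the caller's view of the atomic in, and on failure carries
// out the exact value the decision to sleep was based on. The caller then
// passes that value, not a separately loaded monitor, to the futex wait.
bool acquire_test(std::atomic<std::ptrdiff_t>& a, std::ptrdiff_t& old) {
    while (true) {
        if (old == 0)
            return false;  // caller must wait with `old` (== 0)
        if (a.compare_exchange_weak(old, old - 1,
                                    std::memory_order_acquire,
                                    std::memory_order_relaxed))
            return true;   // token taken
        // The failed CAS reloaded `old`; loop and re-check.
    }
}
```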

I've been hacking on a patch which is very much WIP but I could post it if
there's interest (and if this analysis even makes sense!).

- - -

The second problem is related to the "waiter count" (`__contention_state`)
that is used to minimize platform wake calls. The one thing that must never
happen is that the wait call goes through but the corresponding wake call is
skipped. I think this is currently not guaranteed, at least according to the
C11 memory model. In practice this problem is probably highly unlikely to
appear, but I think it's still worth fixing to be correct beyond doubt.

Currently, the `__contention_state` is incremented/decremented around the
futex wait call like this (in `src/atomic.cpp`):

  __cxx_atomic_fetch_add(__contention_state, __cxx_contention_t(1), memory_order_seq_cst);
  // We sleep as long as the monitored value hasn't changed.
  __libcpp_platform_wait_on_address(__platform_state, __old_value);
  __cxx_atomic_fetch_sub(__contention_state, __cxx_contention_t(1), memory_order_release);

...and the wake is done like this:

  if(0 != __cxx_atomic_load(__contention_state, memory_order_seq_cst))
      // We only call 'wake' if we consumed a contention bit here.
      __libcpp_platform_wake_by_address(__platform_state, __notify_one);

To visualize the problem I modeled a simple example with `herd7`
(http://diy.inria.fr/doc/herd.html). The example is a simple
`atomic_notify/atomic_wait` execution with two threads "P0" and "P1". "P0" sets
an atomic variable from "0" to "1" and calls `atomic_notify`, and "P1" just
calls `atomic_wait(atomic, 0)`. So "P1" must never block for long (if it blocks
at all):

  C fut
  
  {}
  
  P0 (atomic_int* atomic, atomic_int* futex, atomic_int* waiters, int* do_notify) {
    atomic_store_explicit(atomic, 1, memory_order_relaxed);
  
    // `atomic_notify(atomic)`
    {
      int tmp = atomic_fetch_add_explicit(futex, 1, memory_order_release);
  
      *do_notify = atomic_load_explicit(waiters, memory_order_seq_cst) != 0;
    }
  }
  
  P1 (atomic_int* atomic, atomic_int* futex, atomic_int* waiters, int* monitor, int *do_block) {
    int expected = 0;
  
    // `atomic_wait(atomic, expected)`
    {
      *monitor = atomic_load_explicit(futex, memory_order_acquire);
  
      int actual = atomic_load_explicit(atomic, memory_order_relaxed);
      if (actual == expected) {
        int tmp = atomic_fetch_add_explicit(waiters, 1, memory_order_seq_cst);
  
  
        int success = atomic_compare_exchange_strong_explicit(futex,
          monitor, *monitor, memory_order_relaxed, memory_order_relaxed);
        *do_block = success != 0;
      }
    }
  }
  
  exists (1:actual=0)

`atomic` is the atomic for `atomic_notify/atomic_wait`, `futex` is the
"backing" futex from the table, and `waiters` corresponds to
`__contention_state`.

I used `herd7 -c11 -cat rc11.cat -show prop -graph columns -unshow hb,eco,co
-showinitwrites false -oneinit false -xscale 4.0 -yscale 0.8 -evince
sem.litmus` to generate graphs.

Since `herd7` does not support "read-compare-block" operations (like `futex`) I
modelled the wait with a relaxed, strong "read-compare-exchange" and looked at
the return value: if and only if it is true, the `futex` call would have
blocked. I also elided the `fetch_sub` operation after the futex wait because
it isn't relevant. I think `memory_order_release` for the `fetch_sub` is still
important to avoid compiler reorderings.

`herd7` finds an execution (https://ibb.co/zrNFJNZ) where the waker reads "0" from `waiters` (so this
read is ordered before the `fetch_add` in the sequential consistency order
"psc") but the "read-compare-block" operation of the futex wait call reads "0"
from `futex` even though `futex` was increased by the waker right before the
`atomic_load_explicit`! Then, the waiter would wait (because it read "0" and
this is the same as `expected`) and the waker _wouldn't_ wake (because it read
a waiter count of "0"). This really is mind-bending to me and I think the
explanation is that the sequential consistency order "psc" does not necessarily
imply "happens-before", so "P1" is free to read an "older" value of `futex`.

To fix this, a "read-don't-modify-write"/"load release" operation (similar to
https://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf) can be used to read
the `__contention_state` variable from the waker:

  if(0 != __cxx_atomic_fetch_add(__contention_state, 0, memory_order_release))
      // We only call 'wake' if we consumed a contention bit here.
      __libcpp_platform_wake_by_address(__platform_state, __notify_one);

The `seq_cst` `fetch_add` in the waiter can then be weakened to `acquire`:

  __cxx_atomic_fetch_add(__contention_state, __cxx_contention_t(1), memory_order_acquire);
  // We sleep as long as the monitored value hasn't changed.
  __libcpp_platform_wait_on_address(__platform_state, __old_value);
  __cxx_atomic_fetch_sub(__contention_state, __cxx_contention_t(1), memory_order_release);
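
In plain C++, the waker-side check boils down to this small sketch (the
`should_wake` helper name is made up): a `fetch_add` of 0 reads the waiter
count as a "read-don't-modify-write", so unlike an ordinary load it
participates in the modification order of `__contention_state` and can carry
release ordering.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Waker side: only issue the (potentially expensive) platform wake syscall
// if the RMW read observes at least one registered waiter.
inline bool should_wake(std::atomic<std::uint64_t>& contention_state) {
    return contention_state.fetch_add(0, std::memory_order_release) != 0;
}
```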

Here is the `herd7` model of the "read-don't-modify-write" version:

  C fut
  
  {}
  
  P0 (atomic_int* atomic, atomic_int* futex, atomic_int* waiters, int* do_notify) {
    atomic_store_explicit(atomic, 1, memory_order_relaxed);
  
    // `atomic_notify(atomic)`
    {
      int tmp = atomic_fetch_add_explicit(futex, 1, memory_order_release);
  
      *do_notify = atomic_fetch_add_explicit(waiters, 0, memory_order_release) != 0;
    }
  }
  
  P1 (atomic_int* atomic, atomic_int* futex, atomic_int* waiters, int* monitor, int *do_block) {
    int expected = 0;
  
    // `atomic_wait(atomic, expected)`
    {
      *monitor = atomic_load_explicit(futex, memory_order_acquire);
  
      int actual = atomic_load_explicit(atomic, memory_order_relaxed);
      if (actual == expected) {
        int tmp = atomic_fetch_add_explicit(waiters, 1, memory_order_acquire);
  
  
        int success = atomic_compare_exchange_strong_explicit(futex,
          monitor, *monitor, memory_order_relaxed, memory_order_relaxed);
        *do_block = success != 0;
      }
    }
  }
  
  exists (1:actual=0)

This model disallows the execution where `do_block` is "1" and `do_notify` is
"0".


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D114119/new/

https://reviews.llvm.org/D114119


