<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href="https://github.com/llvm/llvm-project/issues/109290">109290</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            [libc++] Incorrect memory order in atomic wait
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            libc++
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          aharrison24
      </td>
    </tr>
</table>

<pre>
    I think I've found an issue in `__libcpp_atomic_wait`, where an atomic operation with an overly-relaxed memory order argument could theoretically result in a lost wakeup. For what it's worth, the equivalent part of libstdc++'s atomic wait implementation uses stronger memory ordering, which I believe to be correct.

To be clear, I don't think this is a problem in practice on x86 or ARM, but according to the C++ standard it looks like a bug.

### Background

The atomic wait implementation in libc++ uses platform wait primitives such as futexes (Linux) and `__ulock_wait`/`__ulock_wake` (macOS) to perform wait and notify operations. Because the platform wait primitives usually only work for values of a particular width (32-bit [on Linux](https://github.com/llvm/llvm-project/blob/7e56a092781b094307155457f129df7deda411ae/libcxx/include/__atomic/contention_t.h#L23-L24)), you can't always wait on the user's value directly. The library instead does a laundering trick where internally it waits on a proxy value of the correct size. To do this, it uses the address of the user's atomic value to look up an entry in a '[contention table](https://github.com/llvm/llvm-project/blob/7e56a092781b094307155457f129df7deda411ae/libcxx/src/atomic.cpp#L127-L128)'. Each entry in the table contains [two atomic values of an appropriate size for the platform](https://github.com/llvm/llvm-project/blob/7e56a092781b094307155457f129df7deda411ae/libcxx/src/atomic.cpp#L122-L123); one representing a monotonically increasing counter which is used as the proxy for the user's value, and one holding a count of the number of threads currently waiting on the variable. The latter is used in a nifty optimisation which allows the platform `notify` call to be elided when there are no threads waiting on the value.

### The key parts of the algorithm

The algorithm that does the contention tracking is split across a few functions:
[__libcpp_contention_wait](https://github.com/llvm/llvm-project/blob/7e56a092781b094307155457f129df7deda411ae/libcxx/src/atomic.cpp#L151), [__libcpp_atomic_notify](https://github.com/llvm/llvm-project/blob/7e56a092781b094307155457f129df7deda411ae/libcxx/src/atomic.cpp#L163), [__libcpp_contention_notify](https://github.com/llvm/llvm-project/blob/7e56a092781b094307155457f129df7deda411ae/libcxx/src/atomic.cpp#L138)


But after inlining and removal of non-relevant parts, it essentially boils down to this pseudocode:

```c++
std::atomic<uint32_t> platform_state;   // Monotonically increasing count
std::atomic<uint32_t> contention_state; // Waiter count

// -----------------------------
// Notifying thread
platform_state.fetch_add(1, mo_release);      // (1) <---- Here is the problem
if (0 != contention_state.load(mo_seq_cst)) { // (2)
  platform_notify_all(platform_state);
}

// -----------------------------
// Waiting thread
contention_state.fetch_add(1, mo_seq_cst);    // (3)
platform_wait(platform_state, old_value);     // (4)
contention_state.fetch_sub(1, mo_release);    // (5)
```

The line marked "Here is the problem" corresponds to [this line in `atomic.cpp`](https://github.com/llvm/llvm-project/blob/7e56a092781b094307155457f129df7deda411ae/libcxx/src/atomic.cpp#L166-L167).

### The problem

According to the C++ memory model, we should reason about code involving atomics in terms of "synchronizes-with" and "happens-before" relations, rather than potential reorderings. The contention tracking algorithm cannot be reasoned about *solely* in terms of synchronization between release-acquire pairs. It also requires the waiting thread and the notifying thread to agree on a single total order for operations `(1)`, `(2)`, `(3)` and `(4)`. The way to achieve that (according to the standard) is to make all four operations seq-cst.

Based on that, the `mo_release` on `(1)` looks insufficient. Informally, under the C++ memory model, I do not believe there is anything stopping `(1)` from 'reordering' with `(2)`. We need some sort of StoreLoad barrier to get the single total order property.

### A Herd7 simulation shows that a lost wakeup *could* occur under the C++ memory model

Inspired by @jiixyj's [Herd7](https://diy.inria.fr/doc/herd.html) simulation of a [similar contention tracking mechanism in the semaphore implementation](https://reviews.llvm.org/D114119#3193088), I thought I'd have a go here too. I'm certainly no expert with Herd, so treat this with a pinch of salt.

I've modelled the existing algorithm as:
#### `prog.litmus`
```c++
C prog

{
}

P0 (atomic_int* plat, atomic_int* waiters) {
  atomic_fetch_add_explicit(plat, 1, memory_order_release);
  int num_waiters = atomic_load_explicit(waiters, memory_order_seq_cst);
  int do_notify = (num_waiters != 0);
}

P1 (atomic_int* plat, atomic_int* waiters) {
  atomic_fetch_add_explicit(waiters, 1, memory_order_seq_cst);
  int do_wait = (atomic_load_explicit(plat, memory_order_seq_cst) == 0);

  // Deliberately left out - it's not relevant to the analysis
  //atomic_fetch_sub_explicit(waiters, 1, memory_order_release);
}

exists
(0:do_notify=0 /\ 1:do_wait=1)
```

It's not possible to model read-compare-wait operations in Herd. Instead I've modelled the platform wait as a sequentially consistent load; I'm genuinely unsure whether that's justified. I've found [some discussion on Stack Overflow](https://stackoverflow.com/a/56192407) but I don't find it fully convincing. In [D114119](https://reviews.llvm.org/D114119#3193088) it's modelled as a relaxed compare-exchange-strong. That looks like it might be overkill for this situation, but I can't say for sure. Guidance on this would be welcome.

The proposition at the bottom looks for the case where thread P1 enters a waiting state, but thread P0 decides *not* to call `notify`, which could result in deadlock. The `/\` token represents logical conjunction.

Executing this in Herd with:
```
mkdir -p output
rm -f output/prog.dot
herd7 -c11 -show prop -graph columns -o output -showinitwrites false -oneinit false prog.litmus
dot -Tpdf output/prog.dot > prog.pdf
```
shows that there is one possible execution exhibiting the lost wake-up behaviour. It corresponds to thread P0 running to the end without calling `notify`, while P1 decides that a 'wait' is necessary. Crucially, even though P0 has incremented `plat` 'before' checking `waiters`, P1 observes an *older* value in the modification order of `plat` and therefore decides to enter a waiting state.

In the Herd simulation, the issue can be solved by enforcing a total order on the operations involving `plat`. Upgrading the relaxed increment of `plat` in thread P0 to have `seq_cst` memory order results in Herd reporting *no remaining executions satisfying the proposition*.

### Would a compiler ever perform such a reordering?
I've not found good sources on this. My suspicion is that it's unlikely to happen in practice, but I believe the standard allows it.

### Could it happen on x86?
On x86, even a [fully relaxed RMW operation has sequentially consistent behaviour](https://stackoverflow.com/a/66957641). So no, it can't happen on x86.

### Could it happen on ARM architectures?
On AArch64, the 'notify' side of the algorithm [compiles down to](https://godbolt.org/z/vh8rh6MTa) a Load-linked/Store-conditional (LL/SC) loop for incrementing the platform counter, followed by a load-acquire (`LDAR`) for reading the number of waiters. The [`STLXR` in the LL/SC loop](https://developer.arm.com/documentation/ddi0602/2024-06/Base-Instructions/STLXR--Store-release-exclusive-register-), and the following `LDAR` instruction [both have acquire-release semantics](https://developer.arm.com/documentation/102336/0100/Load-Acquire-and-Store-Release-instructions), so they will not reorder.

For what it's worth, if the `LDAR` is relaxed to a plain `LDR` (i.e. `std::memory_order_relaxed`) then Herd shows that it can reorder with the `STLXR` and result in a lost wakeup. The relevant parts of the algorithm are modelled here:
```
AArch64 Contention_Tracking
{
0:X0=plat; 0:X1=count;
1:X0=plat; 1:X1=count;
}
 P0                 | P1;
 label1:            | label2:;
 LDXR W8, [X0]      | LDAXR W8, [X1];
 ADD W8, W8, #1     | ADD W8, W8, #1;
 STLXR W9, W8, [X0] | STLXR W9, W8, [X1];
 CBNZ W9, label1    | CBNZ W9, label2;
 LDR W2, [X1]       | LDAR W3, [X0];
exists
(0:X2=0 /\ 1:X3=0)
```
You can try this in the [online Herd sim](https://diy.inria.fr/www/). The proposition at the bottom corresponds to the notify side seeing no waiters (and therefore not calling the platform notify operation), and the waiter side not seeing the increment to the platform variable.

### Could it happen on more relaxed architectures?

It looks like it is [possible on POWER](https://stackoverflow.com/a/52636008), but I haven't done any further investigation.

### Potential solutions
I think the best solution is probably to upgrade `(1)` to `mo_seq_cst`. This is what [libstdc++ does](https://github.com/gcc-mirror/gcc/blob/4af196b2ebd662c5183f1998b0184985e85479b2/libstdc%2B%2B-v3/include/bits/atomic_wait.h#L234). On x86, even a [fully relaxed RMW operation has sequentially consistent behaviour](https://stackoverflow.com/a/66957641), so the change will only affect potential compiler reorderings. On ARM, [upgrading an RMW op from `release` to `seq_cst`](https://godbolt.org/z/d1z794sz8) turns an `ldxr` into a `ldaxr` in the LL/SC loop. This ought to be far cheaper than any sort of fence.

Other options would be to turn `(2)` into a read-don't-modify-write operation (`fetch_add(0, mo_acq_rel)`), or to insert `seq_cst` fences between `(1)` & `(2)` and between `(3)` & `(4)` (along with fully relaxing the other operations). Both look more expensive to me.

### Concrete proposal
Update this line to be `std::memory_order_seq_cst`:
https://github.com/llvm/llvm-project/blob/7e56a092781b094307155457f129df7deda411ae/libcxx/src/atomic.cpp#L166
</pre>