[libcxx-commits] [libcxx] [libc++] Improve performance of std::atomic_flag on Windows (PR #163524)
Roger Sanders via libcxx-commits
libcxx-commits at lists.llvm.org
Tue Oct 21 03:04:31 PDT 2025
RogerSanders wrote:
> > we'd still need a loop above WaitOnAddress checking the value, since as per C++20 the wait functions on atomics must not unblock spuriously, while WaitOnAddress can
>
> We already do that. Please check the header atomic_sync.h for the implementation of atomic/latch/barrier's wait functions. Basically, we check the predicate (not binary equality) in a while loop. If the wait time is < 4us, we spin; otherwise, we call platform-dependent code (e.g. futex wait on Linux). So spurious wakeups from the platform wait are already handled.
>
> > WaitOnAddress can wait on 1, 2, 4, or 8 bytes in both x86-64 and 32-bit x86, so it can implement most atomic::wait operations directly.
>
> I would suggest getting familiar with how libc++ dispatches the wait based on the type; see contention_t.h. Unfortunately the current trunk behaviour is not optimal. The way it works is that each platform defines a contention_t (e.g. int32_t on Linux); if the atomic value type happens to be the same, it takes the happy path and calls the platform wait directly. Otherwise, if the type does not match exactly (e.g. uint32_t), it goes through a global contention table, using a proxy atomic to call the platform code, which is inefficient. Unfortunately, allowing dispatch based on the size is an ABI break (in the sense that it changes the side effects and postconditions of a function). In short, I would highly recommend implementing this efficiently from the very beginning for Windows.
>
> I have a refactoring of the whole thing in progress #161086
>
> I would suggest waiting until we land that first, so you can get a better wait strategy from the very beginning without needing to fight with the ABI macros.
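To be concrete about the dispatch described in the quote above, here is a rough sketch of the two paths as I understand them. The names (`contention_t`, `platform_wait`, `proxy_for`) are illustrative stand-ins, not libc++'s actual identifiers, and the stub bodies exist only so the sketch compiles:

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

// Illustrative stand-ins only; these are not libc++'s real identifiers.
using contention_t = std::int64_t;  // e.g. int32_t on Linux, matching the futex width

void platform_wait(const void*, contention_t) { /* e.g. futex wait / WaitOnAddress in reality */ }

std::atomic<contention_t>& proxy_for(const void* address) {
    // Stand-in for the global contention table: pick a proxy atomic by hashing the address.
    static std::atomic<contention_t> table[256];
    return table[reinterpret_cast<std::uintptr_t>(address) % 256];
}

template <class T>
void atomic_wait_sketch(const std::atomic<T>& a, T old) {
    if constexpr (sizeof(T) == sizeof(contention_t)) {
        // Happy path: the value already has the platform's wait width,
        // so we can wait on its own address.
        contention_t expected;
        std::memcpy(&expected, &old, sizeof(expected));
        while (a.load(std::memory_order_acquire) == old)
            platform_wait(&a, expected);
    } else {
        // Mismatched type: wait on a proxy atomic from the global table,
        // re-checking the real value around each platform wait.
        std::atomic<contention_t>& proxy = proxy_for(&a);
        while (a.load(std::memory_order_acquire) == old) {
            contention_t ticket = proxy.load(std::memory_order_acquire);
            if (a.load(std::memory_order_acquire) != old)
                break;
            platform_wait(&proxy, ticket);
        }
    }
}
```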
I'm aware of the spin loop; I was using that point to illustrate why WaitOnAddress couldn't implement the functionality "directly". I did look into the contention_t type and saw the ABI limitations here, and I prepared this patch deliberately and specifically to solve 99% of the problem without causing an ABI break, believing (perhaps incorrectly) that it would be simpler and easier to get such a fix into the actual distribution in a timely manner.
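To illustrate what I mean, here is a minimal sketch of the kind of wrapper Windows needs so the C++20 no-spurious-unblock guarantee holds: the observed value is re-checked after every return, because WaitOnAddress itself may wake without the value changing. This is not this patch's actual code, and it assumes std::atomic<int> is lock-free with an object representation that is just the underlying int:

```cpp
#include <atomic>
#include <windows.h>
#pragma comment(lib, "Synchronization.lib")  // WaitOnAddress / WakeByAddress*

// Block until `a` no longer holds `old`.
void wait_while_equal(std::atomic<int>& a, int old) {
    // Short spin first, mirroring the existing spin policy, to catch brief waits cheaply.
    for (int i = 0; i < 64; ++i)
        if (a.load(std::memory_order_acquire) != old)
            return;
    while (a.load(std::memory_order_acquire) == old) {
        // WaitOnAddress sleeps only while the bytes at &a still compare equal to `old`,
        // but it may return without the value having changed, so we re-check in a loop.
        WaitOnAddress(&a, &old, sizeof(int), INFINITE);
    }
}

// The notifying side changes the value first, then wakes any waiters.
void set_and_notify(std::atomic<int>& a, int value) {
    a.store(value, std::memory_order_release);
    WakeByAddressAll(&a);
}
```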
I ran into this issue in a real project, and I'd like to see a pathway to get the fix deployed quickly, ideally even into a minor update for an existing major version, so that this project (and others in similar situations) can benefit from it being addressed in the near term.
When I was making this change, I could see the potential for a larger, ABI-breaking change here that exposes the ability to wait directly on memory addresses of different sizes. That sounds like what you've done, and it would certainly be a better end result overall. Even without it, though, a more targeted fix extracts 99% of the value. The current implementation is very poor: I was seeing a 3x real-world performance degradation with a simple case of synchronizing two threads, which was crippling for my purposes. This change brings that back to comparable timing with the MSVC STL implementation. Your changes will handle some edge cases better and shave off a few extra cycles, but most of the value can be obtained with a simpler change up front.
I suppose what I'm saying is that I see the value in the change you're advancing, but I think the order should be reversed. What about putting my simple change through first, since it is targeted, ABI-compatible, already passing all tests, and ready to go (perhaps with one or two minor commenting/naming changes as raised by other reviewers)? It could be merged into maintenance branches for current releases if agreed, and your change could then build on top of it for the next major release, adding support for waiting on addresses of 1, 2, or 4 bytes directly, without a surrogate address being required.
https://github.com/llvm/llvm-project/pull/163524