[clang] [compiler-rt] [libcxx] [Clang] Support atomic operations on _BitInt(N) (PR #204815)

Sat Jun 27 00:57:42 PDT 2026

https://github.com/xroche updated https://github.com/llvm/llvm-project/pull/204815

From 5c7fc853303e60e1c39d6ea49d5fa6ad445abc16 Mon Sep 17 00:00:00 2001
From: Xavier Roche <xavier.roche at algolia.com>
Date: Fri, 19 Jun 2026 13:59:45 +0200
Subject: [PATCH 1/9] [Clang][POC] Atomic operations on _BitInt(N)

_BitInt(N) was rejected by every atomic path: the _Atomic(...) type
specifier, the __c11_atomic_*/__atomic_* builtins, and transitively
std::atomic. Two blanket isBitIntType() checks disabled it, dating to the
type's introduction (the __atomic builtin half is D84049). __int128, the
closest analogue, was allowed at both sites.

Lift both rejections so _BitInt flows through the normal integer path.
Load, store, exchange, and compare-exchange are then correct at every width
through the existing canonicalizing store and the libcall fallback, and
bitwise and/or/xor read-modify-write stays exact on padded inline widths.

Arithmetic read-modify-write needs more care. For a width narrower than its
storage (e.g. _BitInt(37) in an i64), a single atomicrmw on the padded memory
integer carries into the padding bits, leaving a non-canonical value that
breaks a later compare-exchange and gives wrong signed min/max. A wide
arithmetic fetch (e.g. _BitInt(256)) hit an llvm_unreachable in the libcall
path, crashing the compiler. Both are fixed by emitting a compare-exchange loop
that computes the new value at width N via llvm::buildAtomicRMWValue and
writes back a canonical representation, reusing the existing
EmitAtomicCompareExchange helper, which selects the inline cmpxchg or the
__atomic_compare_exchange libcall by size. No-padding inline widths (64,
128) keep the direct atomicrmw fast path.
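
A minimal sketch of the loop described above, written in portable C11 rather
than LLVM IR. It emulates a signed _BitInt(37) in a 64-bit container. The names
`canon37` and `fetch_add37` are hypothetical, and treating sign extension as
the canonical form of the padding bits is an assumption of this sketch, not
code taken from the patch.

```c
#include <stdatomic.h>
#include <stdint.h>

#define VALUE_BITS 37 /* hypothetical value width N */

/* Sign-extend the low VALUE_BITS bits of x into a canonical 64-bit
 * container (assumed canonical form for a signed value). */
static uint64_t canon37(uint64_t x) {
    const uint64_t mask = (UINT64_C(1) << VALUE_BITS) - 1;
    const uint64_t sign = UINT64_C(1) << (VALUE_BITS - 1);
    x &= mask;                /* keep only the N value bits */
    return (x ^ sign) - sign; /* replicate bit N-1 into the padding */
}

/* fetch_add as a compare-exchange loop: the sum is reduced to width N and
 * written back in canonical form. Returns the previous value. */
static uint64_t fetch_add37(_Atomic uint64_t *p, uint64_t rhs) {
    /* A relaxed initial load is enough: the cmpxchg supplies the ordering. */
    uint64_t old = atomic_load_explicit(p, memory_order_relaxed);
    uint64_t desired;
    do {
        desired = canon37(old + rhs);
        /* On failure, old is refreshed with the current contents. */
    } while (!atomic_compare_exchange_weak_explicit(
        p, &old, desired, memory_order_seq_cst, memory_order_seq_cst));
    return old;
}
```

Starting from 2^36 - 1, `fetch_add37(&a, 1)` leaves the container at
0xFFFFFFF000000000 (the sign-extended minimum), whereas a plain 64-bit
`atomic_fetch_add` leaves 0x1000000000, whose padding bits no longer match the
sign bit.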

The libc++ bit-int.verify.cpp test only asserted that Clang rejected the type,
so it is removed as obsolete. It is not replaced with a positive test here: the
libc++ premerge matrix builds tests with a pinned clang that predates this
change. Whether libc++ should expose atomic _BitInt is a separate design
question for the P3666R4 discussion.

Verified against gcc-14: identical size and alignment for all widths, and
cross-compiler compare-exchange interop in both directions, confirming the
padding canonicalization matches.

Assisted-by: Claude (Anthropic)
Co-Authored-By: Claude Opus 4.6 <noreply at anthropic.com>
---
 clang/docs/LanguageExtensions.rst             |   9 +
 clang/docs/ReleaseNotes.rst                   |   6 +
 clang/lib/CodeGen/CGAtomic.cpp                | 211 ++++++++++++++++++
 clang/lib/Sema/SemaChecking.cpp               |   5 -
 clang/lib/Sema/SemaType.cpp                   |   2 -
 clang/test/CodeGen/atomic-bitint.c            |  90 ++++++++
 clang/test/Sema/builtins.c                    |   4 +-
 clang/test/SemaCXX/ext-int.cpp                |  10 +-
 libcxx/test/libcxx/atomics/bit-int.verify.cpp |  22 --
 9 files changed, 322 insertions(+), 37 deletions(-)
 create mode 100644 clang/test/CodeGen/atomic-bitint.c
 delete mode 100644 libcxx/test/libcxx/atomics/bit-int.verify.cpp

diff --git a/clang/docs/LanguageExtensions.rst b/clang/docs/LanguageExtensions.rst
index d79d82a175c68..5ff076d3e48ad 100644
--- a/clang/docs/LanguageExtensions.rst
+++ b/clang/docs/LanguageExtensions.rst
@@ -451,6 +451,15 @@ favor of the standard type.
 Note: the ABI for ``_BitInt(N)`` is still in the process of being stabilized,
 so this type should not yet be used in interfaces that require ABI stability.
 
+``_BitInt(N)`` may be used as an atomic type: ``_Atomic(_BitInt(N))``, the
+``__c11_atomic_*`` and ``__atomic_*`` builtins, and ``std::atomic`` all accept
+it for any width. Widths the target cannot operate on inline are lowered to the
+``__atomic_*`` libcalls. For a width whose representation has padding bits (``N``
+smaller than the type's storage size in bits, e.g. ``_BitInt(37)``), arithmetic
+read-modify-write operations are emitted as a compare-exchange loop that computes
+at width ``N``, so the result wraps modulo ``2**N`` and the padding bits stay
+canonical.
+
 C keywords supported in all language modes
 ------------------------------------------
 
diff --git a/clang/docs/ReleaseNotes.rst b/clang/docs/ReleaseNotes.rst
index 7f056abfbbe24..8692da8830dff 100644
--- a/clang/docs/ReleaseNotes.rst
+++ b/clang/docs/ReleaseNotes.rst
@@ -265,6 +265,12 @@ Non-comprehensive list of changes in this release
 - Added support for floating point and pointer values in most ``__atomic_``
   builtins.
 
+- Atomic operations on ``_BitInt(N)`` are now supported, including
+  ``_Atomic(_BitInt(N))``, the ``__c11_atomic_*`` / ``__atomic_*`` builtins, and
+  ``std::atomic``. Widths the target cannot operate on inline use the
+  ``__atomic_*`` libcalls; arithmetic read-modify-write on a width with padding
+  bits is emitted as a compare-exchange loop computing at the value width.
+
 - Added ``__builtin_stdc_rotate_left`` and ``__builtin_stdc_rotate_right``
   for bit rotation of unsigned integers including ``_BitInt`` types. Rotation
   counts are normalized modulo the bit-width and support negative values.
diff --git a/clang/lib/CodeGen/CGAtomic.cpp b/clang/lib/CodeGen/CGAtomic.cpp
index 270965b109943..66c059fd40e26 100644
--- a/clang/lib/CodeGen/CGAtomic.cpp
+++ b/clang/lib/CodeGen/CGAtomic.cpp
@@ -21,6 +21,7 @@
 #include "llvm/ADT/DenseMap.h"
 #include "llvm/IR/DataLayout.h"
 #include "llvm/IR/Intrinsics.h"
+#include "llvm/Transforms/Utils/LowerAtomic.h"
 
 using namespace clang;
 using namespace CodeGen;
@@ -558,6 +559,195 @@ static llvm::Value *EmitPostAtomicMinMax(CGBuilderTy &Builder,
   return Builder.CreateSelect(Cmp, OldVal, RHS, "newval");
 }
 
+/// Classify an atomic op as an arithmetic/bitwise read-modify-write (one that
+/// normally lowers to a single `atomicrmw`), mapping it to the matching
+/// `AtomicRMWInst::BinOp` and reporting whether the builtin returns the new
+/// value (`<op>_fetch`) rather than the old value (`fetch_<op>`). \p IsSigned
+/// selects signed vs unsigned min/max. Returns false for exchange, load, store,
+/// compare-exchange, and any non-RMW op, none of which need the _BitInt loop.
+static bool classifyBitIntRMW(AtomicExpr::AtomicOp Op, bool IsSigned,
+                              llvm::AtomicRMWInst::BinOp &BinOp,
+                              bool &ReturnsNew) {
+  using RMW = llvm::AtomicRMWInst;
+  switch (Op) {
+  case AtomicExpr::AO__c11_atomic_fetch_add:
+  case AtomicExpr::AO__hip_atomic_fetch_add:
+  case AtomicExpr::AO__opencl_atomic_fetch_add:
+  case AtomicExpr::AO__atomic_fetch_add:
+  case AtomicExpr::AO__scoped_atomic_fetch_add:
+    BinOp = RMW::Add, ReturnsNew = false;
+    return true;
+  case AtomicExpr::AO__atomic_add_fetch:
+  case AtomicExpr::AO__scoped_atomic_add_fetch:
+    BinOp = RMW::Add, ReturnsNew = true;
+    return true;
+  case AtomicExpr::AO__c11_atomic_fetch_sub:
+  case AtomicExpr::AO__hip_atomic_fetch_sub:
+  case AtomicExpr::AO__opencl_atomic_fetch_sub:
+  case AtomicExpr::AO__atomic_fetch_sub:
+  case AtomicExpr::AO__scoped_atomic_fetch_sub:
+    BinOp = RMW::Sub, ReturnsNew = false;
+    return true;
+  case AtomicExpr::AO__atomic_sub_fetch:
+  case AtomicExpr::AO__scoped_atomic_sub_fetch:
+    BinOp = RMW::Sub, ReturnsNew = true;
+    return true;
+  case AtomicExpr::AO__c11_atomic_fetch_and:
+  case AtomicExpr::AO__hip_atomic_fetch_and:
+  case AtomicExpr::AO__opencl_atomic_fetch_and:
+  case AtomicExpr::AO__atomic_fetch_and:
+  case AtomicExpr::AO__scoped_atomic_fetch_and:
+    BinOp = RMW::And, ReturnsNew = false;
+    return true;
+  case AtomicExpr::AO__atomic_and_fetch:
+  case AtomicExpr::AO__scoped_atomic_and_fetch:
+    BinOp = RMW::And, ReturnsNew = true;
+    return true;
+  case AtomicExpr::AO__c11_atomic_fetch_or:
+  case AtomicExpr::AO__hip_atomic_fetch_or:
+  case AtomicExpr::AO__opencl_atomic_fetch_or:
+  case AtomicExpr::AO__atomic_fetch_or:
+  case AtomicExpr::AO__scoped_atomic_fetch_or:
+    BinOp = RMW::Or, ReturnsNew = false;
+    return true;
+  case AtomicExpr::AO__atomic_or_fetch:
+  case AtomicExpr::AO__scoped_atomic_or_fetch:
+    BinOp = RMW::Or, ReturnsNew = true;
+    return true;
+  case AtomicExpr::AO__c11_atomic_fetch_xor:
+  case AtomicExpr::AO__hip_atomic_fetch_xor:
+  case AtomicExpr::AO__opencl_atomic_fetch_xor:
+  case AtomicExpr::AO__atomic_fetch_xor:
+  case AtomicExpr::AO__scoped_atomic_fetch_xor:
+    BinOp = RMW::Xor, ReturnsNew = false;
+    return true;
+  case AtomicExpr::AO__atomic_xor_fetch:
+  case AtomicExpr::AO__scoped_atomic_xor_fetch:
+    BinOp = RMW::Xor, ReturnsNew = true;
+    return true;
+  case AtomicExpr::AO__c11_atomic_fetch_nand:
+  case AtomicExpr::AO__atomic_fetch_nand:
+  case AtomicExpr::AO__scoped_atomic_fetch_nand:
+    BinOp = RMW::Nand, ReturnsNew = false;
+    return true;
+  case AtomicExpr::AO__atomic_nand_fetch:
+  case AtomicExpr::AO__scoped_atomic_nand_fetch:
+    BinOp = RMW::Nand, ReturnsNew = true;
+    return true;
+  case AtomicExpr::AO__c11_atomic_fetch_min:
+  case AtomicExpr::AO__hip_atomic_fetch_min:
+  case AtomicExpr::AO__opencl_atomic_fetch_min:
+  case AtomicExpr::AO__atomic_fetch_min:
+  case AtomicExpr::AO__scoped_atomic_fetch_min:
+    BinOp = IsSigned ? RMW::Min : RMW::UMin, ReturnsNew = false;
+    return true;
+  case AtomicExpr::AO__atomic_min_fetch:
+  case AtomicExpr::AO__scoped_atomic_min_fetch:
+    BinOp = IsSigned ? RMW::Min : RMW::UMin, ReturnsNew = true;
+    return true;
+  case AtomicExpr::AO__c11_atomic_fetch_max:
+  case AtomicExpr::AO__hip_atomic_fetch_max:
+  case AtomicExpr::AO__opencl_atomic_fetch_max:
+  case AtomicExpr::AO__atomic_fetch_max:
+  case AtomicExpr::AO__scoped_atomic_fetch_max:
+    BinOp = IsSigned ? RMW::Max : RMW::UMax, ReturnsNew = false;
+    return true;
+  case AtomicExpr::AO__atomic_max_fetch:
+  case AtomicExpr::AO__scoped_atomic_max_fetch:
+    BinOp = IsSigned ? RMW::Max : RMW::UMax, ReturnsNew = true;
+    return true;
+  case AtomicExpr::AO__atomic_fetch_uinc:
+  case AtomicExpr::AO__scoped_atomic_fetch_uinc:
+    BinOp = RMW::UIncWrap, ReturnsNew = false;
+    return true;
+  case AtomicExpr::AO__atomic_fetch_udec:
+  case AtomicExpr::AO__scoped_atomic_fetch_udec:
+    BinOp = RMW::UDecWrap, ReturnsNew = false;
+    return true;
+  default:
+    return false;
+  }
+}
+
+/// True for a `_BitInt(N)` whose value width N differs from its in-memory width
+/// (e.g. `_BitInt(37)` occupies 64 bits), so the high bits are padding.
+static bool hasBitIntPadding(QualType T, const ASTContext &C) {
+  if (const auto *BIT = T->getAs<BitIntType>())
+    return BIT->getNumBits() != C.getTypeSize(T);
+  return false;
+}
+
+/// Map a constant C ABI memory order to an llvm ordering. A non-constant order
+/// is handled conservatively with the strongest ordering.
+static llvm::AtomicOrdering atomicOrderOrSeqCst(llvm::Value *Order) {
+  auto *C = dyn_cast<llvm::ConstantInt>(Order);
+  if (!C || !llvm::isValidAtomicOrderingCABI(C->getZExtValue()))
+    return llvm::AtomicOrdering::SequentiallyConsistent;
+  switch ((llvm::AtomicOrderingCABI)C->getZExtValue()) {
+  case llvm::AtomicOrderingCABI::relaxed:
+    return llvm::AtomicOrdering::Monotonic;
+  case llvm::AtomicOrderingCABI::consume:
+  case llvm::AtomicOrderingCABI::acquire:
+    return llvm::AtomicOrdering::Acquire;
+  case llvm::AtomicOrderingCABI::release:
+    return llvm::AtomicOrdering::Release;
+  case llvm::AtomicOrderingCABI::acq_rel:
+    return llvm::AtomicOrdering::AcquireRelease;
+  case llvm::AtomicOrderingCABI::seq_cst:
+    return llvm::AtomicOrdering::SequentiallyConsistent;
+  }
+  llvm_unreachable("invalid CABI ordering");
+}
+
+/// Emit a `_BitInt(N)` atomic read-modify-write as a compare-exchange loop. A
+/// single `atomicrmw` on the padded memory integer would carry into / compare
+/// the padding bits, and no arbitrary-width `__atomic_fetch_*` libcall exists
+/// for wide widths. The loop computes the new value at width N and writes back
+/// a canonical (extended) representation via the existing cmpxchg helper, which
+/// also picks the inline-vs-libcall form by size.
+static RValue emitBitIntAtomicRMWLoop(CodeGenFunction &CGF, AtomicExpr *E,
+                                      Address Ptr, Address Val1,
+                                      QualType AtomicTy,
+                                      llvm::AtomicRMWInst::BinOp BinOp,
+                                      bool ReturnsNew, llvm::Value *Order) {
+  QualType ValTy = E->getValueType();
+  llvm::AtomicOrdering AO = atomicOrderOrSeqCst(Order);
+  llvm::AtomicOrdering Failure =
+      llvm::AtomicCmpXchgInst::getStrongestFailureOrdering(AO);
+
+  LValue AtomicLVal = CGF.MakeAddrLValue(Ptr, AtomicTy);
+  AtomicInfo Atomics(CGF, AtomicLVal);
+
+  llvm::Value *RHS =
+      CGF.EmitLoadOfScalar(CGF.MakeAddrLValue(Val1, ValTy), E->getExprLoc());
+
+  RValue OldRV = Atomics.EmitAtomicLoad(
+      AggValueSlot::ignored(), E->getExprLoc(),
+      /*AsValue=*/true, llvm::AtomicOrdering::Monotonic, E->isVolatile());
+  llvm::Value *Init = OldRV.getScalarVal();
+
+  llvm::BasicBlock *StartBB = CGF.Builder.GetInsertBlock();
+  llvm::BasicBlock *LoopBB = CGF.createBasicBlock("atomicrmw.start", CGF.CurFn);
+  llvm::BasicBlock *EndBB = CGF.createBasicBlock("atomicrmw.end", CGF.CurFn);
+  CGF.Builder.CreateBr(LoopBB);
+  CGF.Builder.SetInsertPoint(LoopBB);
+
+  llvm::PHINode *Old = CGF.Builder.CreatePHI(Init->getType(), 2);
+  Old->addIncoming(Init, StartBB);
+
+  // Compute at the value width via the canonical RMW lowering, so the result
+  // wraps mod 2^N and never touches the padding bits.
+  llvm::Value *New = llvm::buildAtomicRMWValue(BinOp, CGF.Builder, Old, RHS);
+
+  auto Res = Atomics.EmitAtomicCompareExchange(
+      RValue::get(Old), RValue::get(New), AO, Failure, /*IsWeak=*/true);
+  Old->addIncoming(Res.first.getScalarVal(), CGF.Builder.GetInsertBlock());
+  CGF.Builder.CreateCondBr(Res.second, EndBB, LoopBB);
+
+  CGF.Builder.SetInsertPoint(EndBB);
+  return RValue::get(ReturnsNew ? New : static_cast<llvm::Value *>(Old));
+}
+
 static void EmitAtomicOp(CodeGenFunction &CGF, AtomicExpr *E, Address Dest,
                          Address Ptr, Address Val1, Address Val2,
                          Address ExpectedResult, llvm::Value *IsWeak,
@@ -1109,6 +1299,27 @@ RValue CodeGenFunction::EmitAtomicExpr(AtomicExpr *E) {
   LValue AtomicVal = MakeAddrLValue(Ptr, AtomicTy);
   AtomicInfo Atomics(*this, AtomicVal);
 
+  // A `_BitInt(N)` read-modify-write whose value width has padding bits, or
+  // whose size forces a libcall, cannot use a single atomicrmw: the op would
+  // carry into / compare the padding bits, and no arbitrary-width
+  // __atomic_fetch_* libcall exists. Emit a compare-exchange loop instead.
+  // Bitwise and/or/xor are exact even with padding, so only the wide case needs
+  // the loop for them. load/store/exchange/compare_exchange keep their paths.
+  if (MemTy->isBitIntType()) {
+    llvm::AtomicRMWInst::BinOp BinOp;
+    bool RMWReturnsNew;
+    if (classifyBitIntRMW(E->getOp(), MemTy->isSignedIntegerType(), BinOp,
+                          RMWReturnsNew)) {
+      bool WideOrNonPow2 = (Size & (Size - 1)) != 0 || Size > 16;
+      bool Bitwise = BinOp == llvm::AtomicRMWInst::And ||
+                     BinOp == llvm::AtomicRMWInst::Or ||
+                     BinOp == llvm::AtomicRMWInst::Xor;
+      if (WideOrNonPow2 || (hasBitIntPadding(MemTy, getContext()) && !Bitwise))
+        return emitBitIntAtomicRMWLoop(*this, E, Ptr, Val1, AtomicTy, BinOp,
+                                       RMWReturnsNew, Order);
+    }
+  }
+
   Address OriginalVal1 = Val1;
   if (ShouldCastToIntPtrTy) {
     Ptr = Atomics.castToAtomicIntPointer(Ptr);
diff --git a/clang/lib/Sema/SemaChecking.cpp b/clang/lib/Sema/SemaChecking.cpp
index b8a3f48a32f24..874ce2bf1ce3a 100644
--- a/clang/lib/Sema/SemaChecking.cpp
+++ b/clang/lib/Sema/SemaChecking.cpp
@@ -5460,11 +5460,6 @@ ExprResult Sema::BuildAtomicExpr(SourceRange CallRange, SourceRange ExprRange,
                 ? 0
                 : 1);
 
-  if (ValType->isBitIntType()) {
-    Diag(Ptr->getExprLoc(), diag::err_atomic_builtin_bit_int_prohibit);
-    return ExprError();
-  }
-
   return AE;
 }
 
diff --git a/clang/lib/Sema/SemaType.cpp b/clang/lib/Sema/SemaType.cpp
index d2bb312feadc1..4a3506c281acf 100644
--- a/clang/lib/Sema/SemaType.cpp
+++ b/clang/lib/Sema/SemaType.cpp
@@ -10412,8 +10412,6 @@ QualType Sema::BuildAtomicType(QualType T, SourceLocation Loc) {
     else if (!T.isTriviallyCopyableType(Context) && getLangOpts().CPlusPlus)
       // Some other non-trivially-copyable type (probably a C++ class)
       DisallowedKind = 7;
-    else if (T->isBitIntType())
-      DisallowedKind = 8;
     else if (getLangOpts().C23 && T->isUndeducedAutoType())
       // _Atomic auto is prohibited in C23
       DisallowedKind = 9;
diff --git a/clang/test/CodeGen/atomic-bitint.c b/clang/test/CodeGen/atomic-bitint.c
new file mode 100644
index 0000000000000..358b530e8a792
--- /dev/null
+++ b/clang/test/CodeGen/atomic-bitint.c
@@ -0,0 +1,90 @@
+// RUN: %clang_cc1 -std=c23 -triple x86_64-unknown-linux-gnu -emit-llvm %s -o - | FileCheck %s
+//
+// Atomic operations on _BitInt(N). load/store/exchange/compare-exchange and
+// bitwise RMW lower directly; arithmetic RMW on a padded width and any RMW on a
+// width too wide for an inline atomicrmw lower to a compare-exchange loop that
+// computes at the value width.
+
+typedef _BitInt(37)          S37;
+typedef unsigned _BitInt(37) U37;
+typedef _BitInt(64)          S64;
+typedef _BitInt(128)         S128;
+typedef _BitInt(256)         S256;
+
+// CHECK-LABEL: @ld37(
+// CHECK: load atomic i64
+S37 ld37(_Atomic(S37) *p) { return __c11_atomic_load(p, __ATOMIC_SEQ_CST); }
+
+// CHECK-LABEL: @st37(
+// CHECK: store atomic i64
+void st37(_Atomic(S37) *p, S37 v) { __c11_atomic_store(p, v, __ATOMIC_SEQ_CST); }
+
+// CHECK-LABEL: @xchg37(
+// CHECK: atomicrmw xchg ptr {{.*}} i64
+S37 xchg37(_Atomic(S37) *p, S37 v) {
+  return __c11_atomic_exchange(p, v, __ATOMIC_SEQ_CST);
+}
+
+// CHECK-LABEL: @cas37(
+// CHECK: cmpxchg ptr {{.*}} i64
+_Bool cas37(_Atomic(S37) *p, S37 *e, S37 d) {
+  return __c11_atomic_compare_exchange_strong(p, e, d, __ATOMIC_SEQ_CST,
+                                              __ATOMIC_SEQ_CST);
+}
+
+// Bitwise RMW on a padded width keeps the direct atomicrmw: it is exact.
+// CHECK-LABEL: @and37(
+// CHECK: atomicrmw and ptr {{.*}} i64
+// CHECK-NOT: cmpxchg
+S37 and37(_Atomic(S37) *p, S37 v) {
+  return __c11_atomic_fetch_and(p, v, __ATOMIC_SEQ_CST);
+}
+
+// Arithmetic RMW on a padded width becomes a compare-exchange loop, not a bare
+// atomicrmw that would carry into the padding bits.
+// CHECK-LABEL: @add37(
+// CHECK: atomicrmw.start:
+// CHECK: cmpxchg weak ptr {{.*}} i64
+// CHECK-NOT: atomicrmw add
+S37 add37(_Atomic(S37) *p, S37 v) {
+  return __c11_atomic_fetch_add(p, v, __ATOMIC_SEQ_CST);
+}
+
+// Signed min is computed at the value width, so the sign bit is at bit N-1.
+// CHECK-LABEL: @min37(
+// CHECK: icmp sle i37
+// CHECK: select i1
+// CHECK: cmpxchg weak ptr {{.*}} i64
+U37 min37(_Atomic(S37) *p, S37 v) {
+  return __c11_atomic_fetch_min(p, v, __ATOMIC_SEQ_CST);
+}
+
+// No padding: direct atomicrmw, no loop.
+// CHECK-LABEL: @add64(
+// CHECK: atomicrmw add ptr {{.*}} i64
+// CHECK-NOT: cmpxchg
+S64 add64(_Atomic(S64) *p, S64 v) {
+  return __c11_atomic_fetch_add(p, v, __ATOMIC_SEQ_CST);
+}
+
+// CHECK-LABEL: @add128(
+// CHECK: atomicrmw add ptr {{.*}} i128
+S128 add128(_Atomic(S128) *p, S128 v) {
+  return __c11_atomic_fetch_add(p, v, __ATOMIC_SEQ_CST);
+}
+
+// Wide: no inline atomicrmw and no arbitrary-width __atomic_fetch_add libcall,
+// so the loop calls __atomic_compare_exchange.
+// CHECK-LABEL: @add256(
+// CHECK: call {{.*}}@__atomic_compare_exchange
+// CHECK-NOT: cmpxchg
+S256 add256(_Atomic(S256) *p, S256 v) {
+  return __c11_atomic_fetch_add(p, v, __ATOMIC_SEQ_CST);
+}
+
+// Wide bitwise also needs the loop: the wide path has no inline atomicrmw.
+// CHECK-LABEL: @or256(
+// CHECK: call {{.*}}@__atomic_compare_exchange
+S256 or256(_Atomic(S256) *p, S256 v) {
+  return __c11_atomic_fetch_or(p, v, __ATOMIC_SEQ_CST);
+}
diff --git a/clang/test/Sema/builtins.c b/clang/test/Sema/builtins.c
index b669ee68cdd95..57e0eefdb772b 100644
--- a/clang/test/Sema/builtins.c
+++ b/clang/test/Sema/builtins.c
@@ -281,7 +281,7 @@ void test_ei_i42i(_BitInt(42) *ptr, int value) {
   // expected-warning@+1 {{the semantics of this intrinsic changed with GCC version 4.4 - the newer semantics are provided here}}
   __sync_nand_and_fetch(ptr, value); // expected-error {{atomic memory operand must have a power-of-two size}}
 
-  __atomic_fetch_add(ptr, 1, 0); // expected-error {{argument to atomic builtin of type '_BitInt' is not supported}}
+  __atomic_fetch_add(ptr, 1, 0); // expect success: the GNU atomic builtins support _BitInt
 }
 
 void test_ei_i64i(_BitInt(64) *ptr, int value) {
@@ -289,7 +289,7 @@ void test_ei_i64i(_BitInt(64) *ptr, int value) {
   // expected-warning@+1 {{the semantics of this intrinsic changed with GCC version 4.4 - the newer semantics are provided here}}
   __sync_nand_and_fetch(ptr, value); // expect success
 
-  __atomic_fetch_add(ptr, 1, 0); // expected-error {{argument to atomic builtin of type '_BitInt' is not supported}}
+  __atomic_fetch_add(ptr, 1, 0); // expect success
 }
 
 void test_ei_ii42(int *ptr, _BitInt(42) value) {
diff --git a/clang/test/SemaCXX/ext-int.cpp b/clang/test/SemaCXX/ext-int.cpp
index 281ae3d3c1779..f62a07a84200e 100644
--- a/clang/test/SemaCXX/ext-int.cpp
+++ b/clang/test/SemaCXX/ext-int.cpp
@@ -121,13 +121,11 @@ _Complex _BitInt(3) Cmplx;
 // expected-error@+1{{'_Complex _BitInt' is invalid}}
 typedef _Complex _BitInt(3) Cmp;
 
-// Reject cases of _Atomic:
-// expected-error@+1{{_Atomic cannot be applied to integer type '_BitInt(4)'}}
-_Atomic _BitInt(4) TooSmallAtomic;
-// expected-error@+1{{_Atomic cannot be applied to integer type '_BitInt(9)'}}
+// _Atomic accepts any _BitInt width: small and non-power-of-2 included.
+// Sizes the target cannot lower inline use the __atomic_* libcalls.
+_Atomic _BitInt(4) SmallAtomic;
 _Atomic _BitInt(9) NotPow2Atomic;
-// expected-error@+1{{_Atomic cannot be applied to integer type '_BitInt(128)'}}
-_Atomic _BitInt(128) JustRightAtomic;
+_Atomic _BitInt(128) WideAtomic;
 
 // Test result types of Unary/Bitwise/Binary Operations:
 void Ops() {
diff --git a/libcxx/test/libcxx/atomics/bit-int.verify.cpp b/libcxx/test/libcxx/atomics/bit-int.verify.cpp
deleted file mode 100644
index 03880a1b6215c..0000000000000
--- a/libcxx/test/libcxx/atomics/bit-int.verify.cpp
+++ /dev/null
@@ -1,22 +0,0 @@
-//===----------------------------------------------------------------------===//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-
-// <atomic>
-
-// Make sure that `std::atomic` doesn't work with `_BitInt`. The intent is to
-// disable them for now until their behavior can be designed better later.
-// See https://reviews.llvm.org/D84049 for details.
-
-// UNSUPPORTED: c++03
-
-#include <atomic>
-
-void f() {
-  // expected-error@*:*1 {{_Atomic cannot be applied to integer type '_BitInt(32)'}}
-  std::atomic<_BitInt(32)> x(42);
-}

From 92e21eaa61bf0f26c3ba825545845f531d79c4d6 Mon Sep 17 00:00:00 2001
From: Xavier Roche <xavier.roche at algolia.com>
Date: Tue, 23 Jun 2026 11:53:49 +0200
Subject: [PATCH 2/9] [Clang][POC] Add C23/C2y Sema tests for atomic _BitInt(N)

C23 requires the type-generic atomic interfaces to accept _BitInt(N), so
_Atomic(_BitInt(N)) is well-formed at every width. Add a Sema acceptance
test covering the _Atomic specifier and the __c11_atomic_*/__atomic_*
builtins in C23 and C2y modes, and a -std=c2y run of the codegen test.

Assisted-by: Claude (Anthropic)
Co-Authored-By: Claude Opus 4.6 <noreply at anthropic.com>
---
 clang/test/CodeGen/atomic-bitint.c |  1 +
 clang/test/Sema/atomic-bitint.c    | 40 ++++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+)
 create mode 100644 clang/test/Sema/atomic-bitint.c

diff --git a/clang/test/CodeGen/atomic-bitint.c b/clang/test/CodeGen/atomic-bitint.c
index 358b530e8a792..9fa259776bf62 100644
--- a/clang/test/CodeGen/atomic-bitint.c
+++ b/clang/test/CodeGen/atomic-bitint.c
@@ -1,4 +1,5 @@
 // RUN: %clang_cc1 -std=c23 -triple x86_64-unknown-linux-gnu -emit-llvm %s -o - | FileCheck %s
+// RUN: %clang_cc1 -std=c2y -triple x86_64-unknown-linux-gnu -emit-llvm %s -o - | FileCheck %s
 //
 // Atomic operations on _BitInt(N). load/store/exchange/compare-exchange and
 // bitwise RMW lower directly; arithmetic RMW on a padded width and any RMW on a
diff --git a/clang/test/Sema/atomic-bitint.c b/clang/test/Sema/atomic-bitint.c
new file mode 100644
index 0000000000000..1466412bae732
--- /dev/null
+++ b/clang/test/Sema/atomic-bitint.c
@@ -0,0 +1,40 @@
+// RUN: %clang_cc1 %s -fsyntax-only -verify -triple x86_64-unknown-linux-gnu -std=c23
+// RUN: %clang_cc1 %s -fsyntax-only -verify -triple x86_64-unknown-linux-gnu -std=c2y
+//
+// C23 requires the type-generic atomic interfaces to accept _BitInt(N) for
+// every N, so _Atomic(_BitInt(N)) is well-formed at every width. Widths past
+// 128 are x86-only.
+
+// expected-no-diagnostics
+
+_Atomic(_BitInt(4))   a4;    // small
+_Atomic(_BitInt(9))   a9;    // non-power-of-two
+_Atomic(_BitInt(37))  a37;   // padded
+_Atomic(_BitInt(64))  a64;
+_Atomic(_BitInt(128)) a128;
+_Atomic(_BitInt(256)) a256;  // wider than any inline atomic
+
+// The _Atomic qualifier spelling is equally valid.
+_Atomic _BitInt(9) q9;
+
+static_assert(sizeof(_Atomic(_BitInt(37))) == 8);
+static_assert(sizeof(_Atomic(_BitInt(128))) == 16);
+static_assert(sizeof(_Atomic(_BitInt(256))) == 32);
+
+void c11_builtins(_Atomic(_BitInt(37)) *p, _BitInt(37) v, _BitInt(37) *e) {
+  (void)__c11_atomic_load(p, __ATOMIC_SEQ_CST);
+  __c11_atomic_store(p, v, __ATOMIC_SEQ_CST);
+  (void)__c11_atomic_exchange(p, v, __ATOMIC_SEQ_CST);
+  (void)__c11_atomic_compare_exchange_strong(p, e, v, __ATOMIC_SEQ_CST,
+                                             __ATOMIC_SEQ_CST);
+  (void)__c11_atomic_fetch_add(p, v, __ATOMIC_SEQ_CST);
+  (void)__c11_atomic_fetch_and(p, v, __ATOMIC_SEQ_CST);
+  (void)__c11_atomic_fetch_min(p, v, __ATOMIC_SEQ_CST);
+}
+
+// The GNU __atomic_* builtins take a plain _BitInt pointer.
+void gnu_builtins(_BitInt(37) *p, _BitInt(37) v) {
+  (void)__atomic_load_n(p, __ATOMIC_SEQ_CST);
+  __atomic_store_n(p, v, __ATOMIC_SEQ_CST);
+  (void)__atomic_fetch_add(p, v, __ATOMIC_SEQ_CST);
+}

From c0e011fb61086a5a0cae13c54abe21ab312e85c7 Mon Sep 17 00:00:00 2001
From: Xavier Roche <xavier.roche at algolia.com>
Date: Tue, 23 Jun 2026 13:51:48 +0200
Subject: [PATCH 3/9] [Clang][POC] Extend atomic _BitInt Sema test: width 4096
 + RISC-V

Add a _BitInt(4096) acceptance case and a riscv64 RUN line. The atomic
code imposes no width cap of its own, so the only limit is the target's
getMaxBitIntWidth(); x86 and RISC-V allow widths past 128, others cap at
128. Correct the comment that claimed wide widths were x86-only.

Assisted-by: Claude (Anthropic)
Co-Authored-By: Claude Opus 4.6 <noreply at anthropic.com>
---
 clang/test/Sema/atomic-bitint.c | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/clang/test/Sema/atomic-bitint.c b/clang/test/Sema/atomic-bitint.c
index 1466412bae732..fbb4c518438fb 100644
--- a/clang/test/Sema/atomic-bitint.c
+++ b/clang/test/Sema/atomic-bitint.c
@@ -1,18 +1,21 @@
 // RUN: %clang_cc1 %s -fsyntax-only -verify -triple x86_64-unknown-linux-gnu -std=c23
 // RUN: %clang_cc1 %s -fsyntax-only -verify -triple x86_64-unknown-linux-gnu -std=c2y
+// RUN: %clang_cc1 %s -fsyntax-only -verify -triple riscv64-unknown-linux-gnu -std=c23
 //
 // C23 requires the type-generic atomic interfaces to accept _BitInt(N) for
-// every N, so _Atomic(_BitInt(N)) is well-formed at every width. Widths past
-// 128 are x86-only.
+// every N, so _Atomic(_BitInt(N)) is well-formed at every width. The atomic
+// code imposes no width cap of its own; widths past 128 are available wherever
+// the target accepts _BitInt > 128 (x86 and RISC-V today).
 
 // expected-no-diagnostics
 
-_Atomic(_BitInt(4))   a4;    // small
-_Atomic(_BitInt(9))   a9;    // non-power-of-two
-_Atomic(_BitInt(37))  a37;   // padded
-_Atomic(_BitInt(64))  a64;
-_Atomic(_BitInt(128)) a128;
-_Atomic(_BitInt(256)) a256;  // wider than any inline atomic
+_Atomic(_BitInt(4))    a4;     // small
+_Atomic(_BitInt(9))    a9;     // non-power-of-two
+_Atomic(_BitInt(37))   a37;    // padded
+_Atomic(_BitInt(64))   a64;
+_Atomic(_BitInt(128))  a128;
+_Atomic(_BitInt(256))  a256;   // wider than any inline atomic
+_Atomic(_BitInt(4096)) a4096;  // far past the inline range
 
 // The _Atomic qualifier spelling is equally valid.
 _Atomic _BitInt(9) q9;

From 92e2068b4837befaa51c040c0c882ea7f10cd57f Mon Sep 17 00:00:00 2001
From: Xavier Roche <xavier.roche at algolia.com>
Date: Fri, 26 Jun 2026 21:21:00 +0200
Subject: [PATCH 4/9] [Clang][NFC] Address review: atomic _BitInt diagnostic
 list and cast

Drop the now-unreachable _BitInt ("integer") option from
err_atomic_specifier_bad_type and renumber the _Atomic-auto/C23 case to
fill the gap; the rendered text is unchanged. Use static_cast instead of
a C-style cast in atomicOrderOrSeqCst.

Assisted-by: Claude (Anthropic)
Co-Authored-By: Claude Opus 4.6 <noreply at anthropic.com>
---
 clang/include/clang/Basic/DiagnosticSemaKinds.td | 4 ++--
 clang/lib/CodeGen/CGAtomic.cpp                   | 2 +-
 clang/lib/Sema/SemaType.cpp                      | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/clang/include/clang/Basic/DiagnosticSemaKinds.td b/clang/include/clang/Basic/DiagnosticSemaKinds.td
index cde99dfb16ec5..414357c5a7c73 100644
--- a/clang/include/clang/Basic/DiagnosticSemaKinds.td
+++ b/clang/include/clang/Basic/DiagnosticSemaKinds.td
@@ -7475,8 +7475,8 @@ def err_func_def_incomplete_result : Error<
 def err_atomic_specifier_bad_type
     : Error<"_Atomic cannot be applied to "
             "%select{incomplete |array |function |reference |atomic |qualified "
-            "|sizeless ||integer |}0type "
-            "%1 %select{|||||||which is not trivially copyable||in C23}0">;
+            "|sizeless ||}0type "
+            "%1 %select{|||||||which is not trivially copyable|in C23}0">;
 def warn_atomic_member_access : Warning<
   "accessing a member of an atomic structure or union is undefined behavior">,
   InGroup<DiagGroup<"atomic-access">>, DefaultError;
diff --git a/clang/lib/CodeGen/CGAtomic.cpp b/clang/lib/CodeGen/CGAtomic.cpp
index 66c059fd40e26..820849f5974c0 100644
--- a/clang/lib/CodeGen/CGAtomic.cpp
+++ b/clang/lib/CodeGen/CGAtomic.cpp
@@ -683,7 +683,7 @@ static llvm::AtomicOrdering atomicOrderOrSeqCst(llvm::Value *Order) {
   auto *C = dyn_cast<llvm::ConstantInt>(Order);
   if (!C || !llvm::isValidAtomicOrderingCABI(C->getZExtValue()))
     return llvm::AtomicOrdering::SequentiallyConsistent;
-  switch ((llvm::AtomicOrderingCABI)C->getZExtValue()) {
+  switch (static_cast<llvm::AtomicOrderingCABI>(C->getZExtValue())) {
   case llvm::AtomicOrderingCABI::relaxed:
     return llvm::AtomicOrdering::Monotonic;
   case llvm::AtomicOrderingCABI::consume:
diff --git a/clang/lib/Sema/SemaType.cpp b/clang/lib/Sema/SemaType.cpp
index 4a3506c281acf..f76244d5f2871 100644
--- a/clang/lib/Sema/SemaType.cpp
+++ b/clang/lib/Sema/SemaType.cpp
@@ -10414,7 +10414,7 @@ QualType Sema::BuildAtomicType(QualType T, SourceLocation Loc) {
       DisallowedKind = 7;
     else if (getLangOpts().C23 && T->isUndeducedAutoType())
       // _Atomic auto is prohibited in C23
-      DisallowedKind = 9;
+      DisallowedKind = 8;
 
     if (DisallowedKind != -1) {
       Diag(Loc, diag::err_atomic_specifier_bad_type) << DisallowedKind << T;

>From ece57436ed0217f4c1ad4ca92f10182793ec109f Mon Sep 17 00:00:00 2001
From: Xavier Roche <xavier.roche at algolia.com>
Date: Fri, 26 Jun 2026 21:21:02 +0200
Subject: [PATCH 5/9] [Clang][NFC] Regenerate atomic-bitint.c checks with
 update_cc_test_checks

The prior spot-checks did not show the lowering. Regenerate complete
check lines so the compare-exchange loop, the width-N arithmetic, and the
sext/zext memory canonicalization are all visible.
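
As a rough illustration of what the regenerated checks make visible, the
sketch below models a signed 37-bit value held in a 64-bit memory integer in
plain C++. It is not Clang's codegen: the names sext37, zext37, is_canonical,
add_at_memory_width, and add_at_width_n are hypothetical helpers introduced
only for this example. It shows why an add done at the memory width can leave
a non-canonical representation, while an add done at width N followed by sign
extension cannot.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of _BitInt(37) stored in an i64 slot; not Clang code.
const unsigned kBits = 37;
const std::uint64_t kMask = (std::uint64_t{1} << kBits) - 1;
const std::uint64_t kSign = std::uint64_t{1} << (kBits - 1);

// Models "trunc i64 to i37" followed by "zext i37 to i64".
inline std::uint64_t zext37(std::uint64_t raw) { return raw & kMask; }

// Models "trunc i64 to i37" followed by "sext i37 to i64".
inline std::int64_t sext37(std::uint64_t raw) {
  const std::uint64_t v = raw & kMask;
  // Both operands fit in int64_t, so the subtraction cannot overflow.
  return static_cast<std::int64_t>(v ^ kSign) -
         static_cast<std::int64_t>(kSign);
}

// A signed value is canonical when its padding bits equal the sign extension.
inline bool is_canonical(std::int64_t x) {
  return sext37(static_cast<std::uint64_t>(x)) == x;
}

// Add at the 64-bit memory width, as a bare 64-bit atomicrmw add would.
inline std::int64_t add_at_memory_width(std::int64_t a, std::int64_t b) {
  return a + b;
}

// Add at width N: wrap mod 2^37, then sign-extend back to 64 bits.
inline std::int64_t add_at_width_n(std::int64_t a, std::int64_t b) {
  return sext37(static_cast<std::uint64_t>(a) + static_cast<std::uint64_t>(b));
}
```

With a equal to the largest 37-bit signed value and b equal to 1, the
memory-width add produces a positive 64-bit number whose padding no longer
matches bit 36, whereas the width-N add wraps to the most negative 37-bit
value and stays canonical.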

Assisted-by: Claude (Anthropic)
Co-Authored-By: Claude Opus 4.6 <noreply at anthropic.com>
---
 clang/test/CodeGen/atomic-bitint.c | 350 ++++++++++++++++++++++++++---
 1 file changed, 321 insertions(+), 29 deletions(-)

diff --git a/clang/test/CodeGen/atomic-bitint.c b/clang/test/CodeGen/atomic-bitint.c
index 9fa259776bf62..6476c26a0f0dd 100644
--- a/clang/test/CodeGen/atomic-bitint.c
+++ b/clang/test/CodeGen/atomic-bitint.c
@@ -1,3 +1,4 @@
+// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py UTC_ARGS: --version 6
 // RUN: %clang_cc1 -std=c23 -triple x86_64-unknown-linux-gnu -emit-llvm %s -o - | FileCheck %s
 // RUN: %clang_cc1 -std=c2y -triple x86_64-unknown-linux-gnu -emit-llvm %s -o - | FileCheck %s
 //
@@ -12,80 +13,371 @@ typedef _BitInt(64)          S64;
 typedef _BitInt(128)         S128;
 typedef _BitInt(256)         S256;
 
-// CHECK-LABEL: @ld37(
-// CHECK: load atomic i64
+// CHECK-LABEL: define dso_local i64 @ld37(
+// CHECK-SAME: ptr noundef [[P:%.*]]) #[[ATTR0:[0-9]+]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[RETVAL:%.*]] = alloca i37, align 8
+// CHECK-NEXT:    [[P_ADDR:%.*]] = alloca ptr, align 8
+// CHECK-NEXT:    [[ATOMIC_TEMP:%.*]] = alloca i64, align 8
+// CHECK-NEXT:    store ptr [[P]], ptr [[P_ADDR]], align 8
+// CHECK-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[P_ADDR]], align 8
+// CHECK-NEXT:    [[TMP1:%.*]] = load atomic i64, ptr [[TMP0]] seq_cst, align 8
+// CHECK-NEXT:    store i64 [[TMP1]], ptr [[ATOMIC_TEMP]], align 8
+// CHECK-NEXT:    [[TMP2:%.*]] = load i64, ptr [[ATOMIC_TEMP]], align 8
+// CHECK-NEXT:    [[LOADEDV:%.*]] = trunc i64 [[TMP2]] to i37
+// CHECK-NEXT:    store i37 [[LOADEDV]], ptr [[RETVAL]], align 8
+// CHECK-NEXT:    [[TMP3:%.*]] = load i37, ptr [[RETVAL]], align 8
+// CHECK-NEXT:    [[COERCE_VAL_II:%.*]] = zext i37 [[TMP3]] to i64
+// CHECK-NEXT:    ret i64 [[COERCE_VAL_II]]
+//
 S37 ld37(_Atomic(S37) *p) { return __c11_atomic_load(p, __ATOMIC_SEQ_CST); }
 
-// CHECK-LABEL: @st37(
-// CHECK: store atomic i64
+// CHECK-LABEL: define dso_local void @st37(
+// CHECK-SAME: ptr noundef [[P:%.*]], i64 noundef [[V_COERCE:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[V:%.*]] = alloca i64, align 8
+// CHECK-NEXT:    [[P_ADDR:%.*]] = alloca ptr, align 8
+// CHECK-NEXT:    [[V_ADDR:%.*]] = alloca i64, align 8
+// CHECK-NEXT:    [[DOTATOMICTMP:%.*]] = alloca i64, align 8
+// CHECK-NEXT:    store i64 [[V_COERCE]], ptr [[V]], align 8
+// CHECK-NEXT:    [[TMP0:%.*]] = load i64, ptr [[V]], align 8
+// CHECK-NEXT:    [[V1:%.*]] = trunc i64 [[TMP0]] to i37
+// CHECK-NEXT:    store ptr [[P]], ptr [[P_ADDR]], align 8
+// CHECK-NEXT:    [[STOREDV:%.*]] = sext i37 [[V1]] to i64
+// CHECK-NEXT:    store i64 [[STOREDV]], ptr [[V_ADDR]], align 8
+// CHECK-NEXT:    [[TMP1:%.*]] = load ptr, ptr [[P_ADDR]], align 8
+// CHECK-NEXT:    [[TMP2:%.*]] = load i64, ptr [[V_ADDR]], align 8
+// CHECK-NEXT:    [[LOADEDV:%.*]] = trunc i64 [[TMP2]] to i37
+// CHECK-NEXT:    [[STOREDV2:%.*]] = sext i37 [[LOADEDV]] to i64
+// CHECK-NEXT:    store i64 [[STOREDV2]], ptr [[DOTATOMICTMP]], align 8
+// CHECK-NEXT:    [[TMP3:%.*]] = load i64, ptr [[DOTATOMICTMP]], align 8
+// CHECK-NEXT:    store atomic i64 [[TMP3]], ptr [[TMP1]] seq_cst, align 8
+// CHECK-NEXT:    ret void
+//
 void st37(_Atomic(S37) *p, S37 v) { __c11_atomic_store(p, v, __ATOMIC_SEQ_CST); }
 
-// CHECK-LABEL: @xchg37(
-// CHECK: atomicrmw xchg ptr {{.*}} i64
+// CHECK-LABEL: define dso_local i64 @xchg37(
+// CHECK-SAME: ptr noundef [[P:%.*]], i64 noundef [[V_COERCE:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[RETVAL:%.*]] = alloca i37, align 8
+// CHECK-NEXT:    [[V:%.*]] = alloca i64, align 8
+// CHECK-NEXT:    [[P_ADDR:%.*]] = alloca ptr, align 8
+// CHECK-NEXT:    [[V_ADDR:%.*]] = alloca i64, align 8
+// CHECK-NEXT:    [[DOTATOMICTMP:%.*]] = alloca i64, align 8
+// CHECK-NEXT:    [[ATOMIC_TEMP:%.*]] = alloca i64, align 8
+// CHECK-NEXT:    store i64 [[V_COERCE]], ptr [[V]], align 8
+// CHECK-NEXT:    [[TMP0:%.*]] = load i64, ptr [[V]], align 8
+// CHECK-NEXT:    [[V1:%.*]] = trunc i64 [[TMP0]] to i37
+// CHECK-NEXT:    store ptr [[P]], ptr [[P_ADDR]], align 8
+// CHECK-NEXT:    [[STOREDV:%.*]] = sext i37 [[V1]] to i64
+// CHECK-NEXT:    store i64 [[STOREDV]], ptr [[V_ADDR]], align 8
+// CHECK-NEXT:    [[TMP1:%.*]] = load ptr, ptr [[P_ADDR]], align 8
+// CHECK-NEXT:    [[TMP2:%.*]] = load i64, ptr [[V_ADDR]], align 8
+// CHECK-NEXT:    [[LOADEDV:%.*]] = trunc i64 [[TMP2]] to i37
+// CHECK-NEXT:    [[STOREDV2:%.*]] = sext i37 [[LOADEDV]] to i64
+// CHECK-NEXT:    store i64 [[STOREDV2]], ptr [[DOTATOMICTMP]], align 8
+// CHECK-NEXT:    [[TMP3:%.*]] = load i64, ptr [[DOTATOMICTMP]], align 8
+// CHECK-NEXT:    [[TMP4:%.*]] = atomicrmw xchg ptr [[TMP1]], i64 [[TMP3]] seq_cst, align 8
+// CHECK-NEXT:    store i64 [[TMP4]], ptr [[ATOMIC_TEMP]], align 8
+// CHECK-NEXT:    [[TMP5:%.*]] = load i64, ptr [[ATOMIC_TEMP]], align 8
+// CHECK-NEXT:    [[LOADEDV3:%.*]] = trunc i64 [[TMP5]] to i37
+// CHECK-NEXT:    store i37 [[LOADEDV3]], ptr [[RETVAL]], align 8
+// CHECK-NEXT:    [[TMP6:%.*]] = load i37, ptr [[RETVAL]], align 8
+// CHECK-NEXT:    [[COERCE_VAL_II:%.*]] = zext i37 [[TMP6]] to i64
+// CHECK-NEXT:    ret i64 [[COERCE_VAL_II]]
+//
 S37 xchg37(_Atomic(S37) *p, S37 v) {
   return __c11_atomic_exchange(p, v, __ATOMIC_SEQ_CST);
 }
 
-// CHECK-LABEL: @cas37(
-// CHECK: cmpxchg ptr {{.*}} i64
+// CHECK-LABEL: define dso_local zeroext i1 @cas37(
+// CHECK-SAME: ptr noundef [[P:%.*]], ptr noundef [[E:%.*]], i64 noundef [[D_COERCE:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[D:%.*]] = alloca i64, align 8
+// CHECK-NEXT:    [[P_ADDR:%.*]] = alloca ptr, align 8
+// CHECK-NEXT:    [[E_ADDR:%.*]] = alloca ptr, align 8
+// CHECK-NEXT:    [[D_ADDR:%.*]] = alloca i64, align 8
+// CHECK-NEXT:    [[DOTATOMICTMP:%.*]] = alloca i64, align 8
+// CHECK-NEXT:    [[CMPXCHG_BOOL:%.*]] = alloca i8, align 1
+// CHECK-NEXT:    store i64 [[D_COERCE]], ptr [[D]], align 8
+// CHECK-NEXT:    [[TMP0:%.*]] = load i64, ptr [[D]], align 8
+// CHECK-NEXT:    [[D1:%.*]] = trunc i64 [[TMP0]] to i37
+// CHECK-NEXT:    store ptr [[P]], ptr [[P_ADDR]], align 8
+// CHECK-NEXT:    store ptr [[E]], ptr [[E_ADDR]], align 8
+// CHECK-NEXT:    [[STOREDV:%.*]] = sext i37 [[D1]] to i64
+// CHECK-NEXT:    store i64 [[STOREDV]], ptr [[D_ADDR]], align 8
+// CHECK-NEXT:    [[TMP1:%.*]] = load ptr, ptr [[P_ADDR]], align 8
+// CHECK-NEXT:    [[TMP2:%.*]] = load ptr, ptr [[E_ADDR]], align 8
+// CHECK-NEXT:    [[TMP3:%.*]] = load i64, ptr [[D_ADDR]], align 8
+// CHECK-NEXT:    [[LOADEDV:%.*]] = trunc i64 [[TMP3]] to i37
+// CHECK-NEXT:    [[STOREDV2:%.*]] = sext i37 [[LOADEDV]] to i64
+// CHECK-NEXT:    store i64 [[STOREDV2]], ptr [[DOTATOMICTMP]], align 8
+// CHECK-NEXT:    [[TMP4:%.*]] = load i64, ptr [[TMP2]], align 8
+// CHECK-NEXT:    [[TMP5:%.*]] = load i64, ptr [[DOTATOMICTMP]], align 8
+// CHECK-NEXT:    [[TMP6:%.*]] = cmpxchg ptr [[TMP1]], i64 [[TMP4]], i64 [[TMP5]] seq_cst seq_cst, align 8
+// CHECK-NEXT:    [[TMP7:%.*]] = extractvalue { i64, i1 } [[TMP6]], 0
+// CHECK-NEXT:    [[TMP8:%.*]] = extractvalue { i64, i1 } [[TMP6]], 1
+// CHECK-NEXT:    br i1 [[TMP8]], label %[[CMPXCHG_CONTINUE:.*]], label %[[CMPXCHG_STORE_EXPECTED:.*]]
+// CHECK:       [[CMPXCHG_STORE_EXPECTED]]:
+// CHECK-NEXT:    store i64 [[TMP7]], ptr [[TMP2]], align 8
+// CHECK-NEXT:    br label %[[CMPXCHG_CONTINUE]]
+// CHECK:       [[CMPXCHG_CONTINUE]]:
+// CHECK-NEXT:    [[STOREDV3:%.*]] = zext i1 [[TMP8]] to i8
+// CHECK-NEXT:    store i8 [[STOREDV3]], ptr [[CMPXCHG_BOOL]], align 1
+// CHECK-NEXT:    [[TMP9:%.*]] = load i8, ptr [[CMPXCHG_BOOL]], align 1
+// CHECK-NEXT:    [[LOADEDV4:%.*]] = icmp ne i8 [[TMP9]], 0
+// CHECK-NEXT:    ret i1 [[LOADEDV4]]
+//
 _Bool cas37(_Atomic(S37) *p, S37 *e, S37 d) {
   return __c11_atomic_compare_exchange_strong(p, e, d, __ATOMIC_SEQ_CST,
                                               __ATOMIC_SEQ_CST);
 }
 
 // Bitwise RMW on a padded width keeps the direct atomicrmw: it is exact.
-// CHECK-LABEL: @and37(
-// CHECK: atomicrmw and ptr {{.*}} i64
-// CHECK-NOT: cmpxchg
+// CHECK-LABEL: define dso_local i64 @and37(
+// CHECK-SAME: ptr noundef [[P:%.*]], i64 noundef [[V_COERCE:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[RETVAL:%.*]] = alloca i37, align 8
+// CHECK-NEXT:    [[V:%.*]] = alloca i64, align 8
+// CHECK-NEXT:    [[P_ADDR:%.*]] = alloca ptr, align 8
+// CHECK-NEXT:    [[V_ADDR:%.*]] = alloca i64, align 8
+// CHECK-NEXT:    [[DOTATOMICTMP:%.*]] = alloca i64, align 8
+// CHECK-NEXT:    [[ATOMIC_TEMP:%.*]] = alloca i64, align 8
+// CHECK-NEXT:    store i64 [[V_COERCE]], ptr [[V]], align 8
+// CHECK-NEXT:    [[TMP0:%.*]] = load i64, ptr [[V]], align 8
+// CHECK-NEXT:    [[V1:%.*]] = trunc i64 [[TMP0]] to i37
+// CHECK-NEXT:    store ptr [[P]], ptr [[P_ADDR]], align 8
+// CHECK-NEXT:    [[STOREDV:%.*]] = sext i37 [[V1]] to i64
+// CHECK-NEXT:    store i64 [[STOREDV]], ptr [[V_ADDR]], align 8
+// CHECK-NEXT:    [[TMP1:%.*]] = load ptr, ptr [[P_ADDR]], align 8
+// CHECK-NEXT:    [[TMP2:%.*]] = load i64, ptr [[V_ADDR]], align 8
+// CHECK-NEXT:    [[LOADEDV:%.*]] = trunc i64 [[TMP2]] to i37
+// CHECK-NEXT:    [[STOREDV2:%.*]] = sext i37 [[LOADEDV]] to i64
+// CHECK-NEXT:    store i64 [[STOREDV2]], ptr [[DOTATOMICTMP]], align 8
+// CHECK-NEXT:    [[TMP3:%.*]] = load i64, ptr [[DOTATOMICTMP]], align 8
+// CHECK-NEXT:    [[TMP4:%.*]] = atomicrmw and ptr [[TMP1]], i64 [[TMP3]] seq_cst, align 8
+// CHECK-NEXT:    store i64 [[TMP4]], ptr [[ATOMIC_TEMP]], align 8
+// CHECK-NEXT:    [[TMP5:%.*]] = load i64, ptr [[ATOMIC_TEMP]], align 8
+// CHECK-NEXT:    [[LOADEDV3:%.*]] = trunc i64 [[TMP5]] to i37
+// CHECK-NEXT:    store i37 [[LOADEDV3]], ptr [[RETVAL]], align 8
+// CHECK-NEXT:    [[TMP6:%.*]] = load i37, ptr [[RETVAL]], align 8
+// CHECK-NEXT:    [[COERCE_VAL_II:%.*]] = zext i37 [[TMP6]] to i64
+// CHECK-NEXT:    ret i64 [[COERCE_VAL_II]]
+//
 S37 and37(_Atomic(S37) *p, S37 v) {
   return __c11_atomic_fetch_and(p, v, __ATOMIC_SEQ_CST);
 }
 
 // Arithmetic RMW on a padded width becomes a compare-exchange loop, not a bare
 // atomicrmw that would carry into the padding bits.
-// CHECK-LABEL: @add37(
-// CHECK: atomicrmw.start:
-// CHECK: cmpxchg weak ptr {{.*}} i64
-// CHECK-NOT: atomicrmw add
+// CHECK-LABEL: define dso_local i64 @add37(
+// CHECK-SAME: ptr noundef [[P:%.*]], i64 noundef [[V_COERCE:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*]]:
+// CHECK-NEXT:    [[RETVAL:%.*]] = alloca i37, align 8
+// CHECK-NEXT:    [[V:%.*]] = alloca i64, align 8
+// CHECK-NEXT:    [[P_ADDR:%.*]] = alloca ptr, align 8
+// CHECK-NEXT:    [[V_ADDR:%.*]] = alloca i64, align 8
+// CHECK-NEXT:    [[DOTATOMICTMP:%.*]] = alloca i64, align 8
+// CHECK-NEXT:    store i64 [[V_COERCE]], ptr [[V]], align 8
+// CHECK-NEXT:    [[TMP0:%.*]] = load i64, ptr [[V]], align 8
+// CHECK-NEXT:    [[V1:%.*]] = trunc i64 [[TMP0]] to i37
+// CHECK-NEXT:    store ptr [[P]], ptr [[P_ADDR]], align 8
+// CHECK-NEXT:    [[STOREDV:%.*]] = sext i37 [[V1]] to i64
+// CHECK-NEXT:    store i64 [[STOREDV]], ptr [[V_ADDR]], align 8
+// CHECK-NEXT:    [[TMP1:%.*]] = load ptr, ptr [[P_ADDR]], align 8
+// CHECK-NEXT:    [[TMP2:%.*]] = load i64, ptr [[V_ADDR]], align 8
+// CHECK-NEXT:    [[LOADEDV:%.*]] = trunc i64 [[TMP2]] to i37
+// CHECK-NEXT:    [[STOREDV2:%.*]] = sext i37 [[LOADEDV]] to i64
+// CHECK-NEXT:    store i64 [[STOREDV2]], ptr [[DOTATOMICTMP]], align 8
+// CHECK-NEXT:    [[TMP3:%.*]] = load i64, ptr [[DOTATOMICTMP]], align 8
+// CHECK-NEXT:    [[LOADEDV3:%.*]] = trunc i64 [[TMP3]] to i37
+// CHECK-NEXT:    [[ATOMIC_LOAD:%.*]] = load atomic i64, ptr [[TMP1]] monotonic, align 8
+// CHECK-NEXT:    [[LOADEDV4:%.*]] = trunc i64 [[ATOMIC_LOAD]] to i37
+// CHECK-NEXT:    br label %[[ATOMICRMW_START:.*]]
+// CHECK:       [[ATOMICRMW_START]]:
+// CHECK-NEXT:    [[TMP4:%.*]] = phi i37 [ [[LOADEDV4]], %[[ENTRY]] ], [ [[LOADEDV7:%.*]], %[[ATOMICRMW_START]] ]
+// CHECK-NEXT:    [[NEW:%.*]] = add i37 [[TMP4]], [[LOADEDV3]]
+// CHECK-NEXT:    [[STOREDV5:%.*]] = sext i37 [[TMP4]] to i64
+// CHECK-NEXT:    [[STOREDV6:%.*]] = sext i37 [[NEW]] to i64
+// CHECK-NEXT:    [[TMP5:%.*]] = cmpxchg weak ptr [[TMP1]], i64 [[STOREDV5]], i64 [[STOREDV6]] seq_cst seq_cst, align 8
+// CHECK-NEXT:    [[TMP6:%.*]] = extractvalue { i64, i1 } [[TMP5]], 0
+// CHECK-NEXT:    [[TMP7:%.*]] = extractvalue { i64, i1 } [[TMP5]], 1
+// CHECK-NEXT:    [[LOADEDV7]] = trunc i64 [[TMP6]] to i37
+// CHECK-NEXT:    br i1 [[TMP7]], label %[[ATOMICRMW_END:.*]], label %[[ATOMICRMW_START]]
+// CHECK:       [[ATOMICRMW_END]]:
+// CHECK-NEXT:    store i37 [[TMP4]], ptr [[RETVAL]], align 8
+// CHECK-NEXT:    [[TMP8:%.*]] = load i37, ptr [[RETVAL]], align 8
+// CHECK-NEXT:    [[COERCE_VAL_II:%.*]] = zext i37 [[TMP8]] to i64
+// CHECK-NEXT:    ret i64 [[COERCE_VAL_II]]
+//
 S37 add37(_Atomic(S37) *p, S37 v) {
   return __c11_atomic_fetch_add(p, v, __ATOMIC_SEQ_CST);
 }
 
 // Signed min is computed at the value width, so the sign bit is at bit N-1.
-// CHECK-LABEL: @min37(
-// CHECK: icmp sle i37
-// CHECK: select i1
-// CHECK: cmpxchg weak ptr {{.*}} i64
+// CHECK-LABEL: define dso_local i64 @min37(
+// CHECK-SAME: ptr noundef [[P:%.*]], i64 noundef [[V_COERCE:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*]]:
+// CHECK-NEXT:    [[RETVAL:%.*]] = alloca i37, align 8
+// CHECK-NEXT:    [[V:%.*]] = alloca i64, align 8
+// CHECK-NEXT:    [[P_ADDR:%.*]] = alloca ptr, align 8
+// CHECK-NEXT:    [[V_ADDR:%.*]] = alloca i64, align 8
+// CHECK-NEXT:    [[DOTATOMICTMP:%.*]] = alloca i64, align 8
+// CHECK-NEXT:    store i64 [[V_COERCE]], ptr [[V]], align 8
+// CHECK-NEXT:    [[TMP0:%.*]] = load i64, ptr [[V]], align 8
+// CHECK-NEXT:    [[V1:%.*]] = trunc i64 [[TMP0]] to i37
+// CHECK-NEXT:    store ptr [[P]], ptr [[P_ADDR]], align 8
+// CHECK-NEXT:    [[STOREDV:%.*]] = sext i37 [[V1]] to i64
+// CHECK-NEXT:    store i64 [[STOREDV]], ptr [[V_ADDR]], align 8
+// CHECK-NEXT:    [[TMP1:%.*]] = load ptr, ptr [[P_ADDR]], align 8
+// CHECK-NEXT:    [[TMP2:%.*]] = load i64, ptr [[V_ADDR]], align 8
+// CHECK-NEXT:    [[LOADEDV:%.*]] = trunc i64 [[TMP2]] to i37
+// CHECK-NEXT:    [[STOREDV2:%.*]] = sext i37 [[LOADEDV]] to i64
+// CHECK-NEXT:    store i64 [[STOREDV2]], ptr [[DOTATOMICTMP]], align 8
+// CHECK-NEXT:    [[TMP3:%.*]] = load i64, ptr [[DOTATOMICTMP]], align 8
+// CHECK-NEXT:    [[LOADEDV3:%.*]] = trunc i64 [[TMP3]] to i37
+// CHECK-NEXT:    [[ATOMIC_LOAD:%.*]] = load atomic i64, ptr [[TMP1]] monotonic, align 8
+// CHECK-NEXT:    [[LOADEDV4:%.*]] = trunc i64 [[ATOMIC_LOAD]] to i37
+// CHECK-NEXT:    br label %[[ATOMICRMW_START:.*]]
+// CHECK:       [[ATOMICRMW_START]]:
+// CHECK-NEXT:    [[TMP4:%.*]] = phi i37 [ [[LOADEDV4]], %[[ENTRY]] ], [ [[LOADEDV7:%.*]], %[[ATOMICRMW_START]] ]
+// CHECK-NEXT:    [[TMP5:%.*]] = icmp sle i37 [[TMP4]], [[LOADEDV3]]
+// CHECK-NEXT:    [[NEW:%.*]] = select i1 [[TMP5]], i37 [[TMP4]], i37 [[LOADEDV3]]
+// CHECK-NEXT:    [[STOREDV5:%.*]] = sext i37 [[TMP4]] to i64
+// CHECK-NEXT:    [[STOREDV6:%.*]] = sext i37 [[NEW]] to i64
+// CHECK-NEXT:    [[TMP6:%.*]] = cmpxchg weak ptr [[TMP1]], i64 [[STOREDV5]], i64 [[STOREDV6]] seq_cst seq_cst, align 8
+// CHECK-NEXT:    [[TMP7:%.*]] = extractvalue { i64, i1 } [[TMP6]], 0
+// CHECK-NEXT:    [[TMP8:%.*]] = extractvalue { i64, i1 } [[TMP6]], 1
+// CHECK-NEXT:    [[LOADEDV7]] = trunc i64 [[TMP7]] to i37
+// CHECK-NEXT:    br i1 [[TMP8]], label %[[ATOMICRMW_END:.*]], label %[[ATOMICRMW_START]]
+// CHECK:       [[ATOMICRMW_END]]:
+// CHECK-NEXT:    store i37 [[TMP4]], ptr [[RETVAL]], align 8
+// CHECK-NEXT:    [[TMP9:%.*]] = load i37, ptr [[RETVAL]], align 8
+// CHECK-NEXT:    [[COERCE_VAL_II:%.*]] = zext i37 [[TMP9]] to i64
+// CHECK-NEXT:    ret i64 [[COERCE_VAL_II]]
+//
 U37 min37(_Atomic(S37) *p, S37 v) {
   return __c11_atomic_fetch_min(p, v, __ATOMIC_SEQ_CST);
 }
 
 // No padding: direct atomicrmw, no loop.
-// CHECK-LABEL: @add64(
-// CHECK: atomicrmw add ptr {{.*}} i64
-// CHECK-NOT: cmpxchg
+// CHECK-LABEL: define dso_local i64 @add64(
+// CHECK-SAME: ptr noundef [[P:%.*]], i64 noundef [[V:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[P_ADDR:%.*]] = alloca ptr, align 8
+// CHECK-NEXT:    [[V_ADDR:%.*]] = alloca i64, align 8
+// CHECK-NEXT:    [[DOTATOMICTMP:%.*]] = alloca i64, align 8
+// CHECK-NEXT:    [[ATOMIC_TEMP:%.*]] = alloca i64, align 8
+// CHECK-NEXT:    store ptr [[P]], ptr [[P_ADDR]], align 8
+// CHECK-NEXT:    store i64 [[V]], ptr [[V_ADDR]], align 8
+// CHECK-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[P_ADDR]], align 8
+// CHECK-NEXT:    [[TMP1:%.*]] = load i64, ptr [[V_ADDR]], align 8
+// CHECK-NEXT:    store i64 [[TMP1]], ptr [[DOTATOMICTMP]], align 8
+// CHECK-NEXT:    [[TMP2:%.*]] = load i64, ptr [[DOTATOMICTMP]], align 8
+// CHECK-NEXT:    [[TMP3:%.*]] = atomicrmw add ptr [[TMP0]], i64 [[TMP2]] seq_cst, align 8
+// CHECK-NEXT:    store i64 [[TMP3]], ptr [[ATOMIC_TEMP]], align 8
+// CHECK-NEXT:    [[TMP4:%.*]] = load i64, ptr [[ATOMIC_TEMP]], align 8
+// CHECK-NEXT:    ret i64 [[TMP4]]
+//
 S64 add64(_Atomic(S64) *p, S64 v) {
   return __c11_atomic_fetch_add(p, v, __ATOMIC_SEQ_CST);
 }
 
-// CHECK-LABEL: @add128(
-// CHECK: atomicrmw add ptr {{.*}} i128
+// CHECK-LABEL: define dso_local i128 @add128(
+// CHECK-SAME: ptr noundef [[P:%.*]], i128 noundef [[V:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*:]]
+// CHECK-NEXT:    [[P_ADDR:%.*]] = alloca ptr, align 8
+// CHECK-NEXT:    [[V_ADDR:%.*]] = alloca i128, align 8
+// CHECK-NEXT:    [[DOTATOMICTMP:%.*]] = alloca i128, align 8
+// CHECK-NEXT:    [[ATOMIC_TEMP:%.*]] = alloca i128, align 16
+// CHECK-NEXT:    store ptr [[P]], ptr [[P_ADDR]], align 8
+// CHECK-NEXT:    store i128 [[V]], ptr [[V_ADDR]], align 8
+// CHECK-NEXT:    [[TMP0:%.*]] = load ptr, ptr [[P_ADDR]], align 8
+// CHECK-NEXT:    [[TMP1:%.*]] = load i128, ptr [[V_ADDR]], align 8
+// CHECK-NEXT:    store i128 [[TMP1]], ptr [[DOTATOMICTMP]], align 8
+// CHECK-NEXT:    [[TMP2:%.*]] = load i128, ptr [[DOTATOMICTMP]], align 8
+// CHECK-NEXT:    [[TMP3:%.*]] = atomicrmw add ptr [[TMP0]], i128 [[TMP2]] seq_cst, align 16
+// CHECK-NEXT:    store i128 [[TMP3]], ptr [[ATOMIC_TEMP]], align 16
+// CHECK-NEXT:    [[TMP4:%.*]] = load i128, ptr [[ATOMIC_TEMP]], align 16
+// CHECK-NEXT:    ret i128 [[TMP4]]
+//
 S128 add128(_Atomic(S128) *p, S128 v) {
   return __c11_atomic_fetch_add(p, v, __ATOMIC_SEQ_CST);
 }
 
 // Wide: no inline atomicrmw and no arbitrary-width __atomic_fetch_add libcall,
 // so the loop calls __atomic_compare_exchange.
-// CHECK-LABEL: @add256(
-// CHECK: call {{.*}}@__atomic_compare_exchange
-// CHECK-NOT: cmpxchg
+// CHECK-LABEL: define dso_local void @add256(
+// CHECK-SAME: ptr dead_on_unwind noalias writable sret(i256) align 8 [[AGG_RESULT:%.*]], ptr noundef [[P:%.*]], ptr noundef byval(i256) align 8 [[TMP0:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*]]:
+// CHECK-NEXT:    [[P_ADDR:%.*]] = alloca ptr, align 8
+// CHECK-NEXT:    [[V_ADDR:%.*]] = alloca i256, align 8
+// CHECK-NEXT:    [[DOTATOMICTMP:%.*]] = alloca i256, align 8
+// CHECK-NEXT:    [[ATOMIC_TEMP:%.*]] = alloca i256, align 8
+// CHECK-NEXT:    [[ATOMIC_TEMP1:%.*]] = alloca i256, align 8
+// CHECK-NEXT:    [[ATOMIC_TEMP2:%.*]] = alloca i256, align 8
+// CHECK-NEXT:    [[V:%.*]] = load i256, ptr [[TMP0]], align 8
+// CHECK-NEXT:    store ptr [[P]], ptr [[P_ADDR]], align 8
+// CHECK-NEXT:    store i256 [[V]], ptr [[V_ADDR]], align 8
+// CHECK-NEXT:    [[TMP1:%.*]] = load ptr, ptr [[P_ADDR]], align 8
+// CHECK-NEXT:    [[TMP2:%.*]] = load i256, ptr [[V_ADDR]], align 8
+// CHECK-NEXT:    store i256 [[TMP2]], ptr [[DOTATOMICTMP]], align 8
+// CHECK-NEXT:    [[TMP3:%.*]] = load i256, ptr [[DOTATOMICTMP]], align 8
+// CHECK-NEXT:    call void @__atomic_load(i64 noundef 32, ptr noundef [[TMP1]], ptr noundef [[ATOMIC_TEMP]], i32 noundef 0)
+// CHECK-NEXT:    [[TMP4:%.*]] = load i256, ptr [[ATOMIC_TEMP]], align 8
+// CHECK-NEXT:    br label %[[ATOMICRMW_START:.*]]
+// CHECK:       [[ATOMICRMW_START]]:
+// CHECK-NEXT:    [[TMP5:%.*]] = phi i256 [ [[TMP4]], %[[ENTRY]] ], [ [[TMP6:%.*]], %[[ATOMICRMW_START]] ]
+// CHECK-NEXT:    [[NEW:%.*]] = add i256 [[TMP5]], [[TMP3]]
+// CHECK-NEXT:    store i256 [[TMP5]], ptr [[ATOMIC_TEMP1]], align 8
+// CHECK-NEXT:    store i256 [[NEW]], ptr [[ATOMIC_TEMP2]], align 8
+// CHECK-NEXT:    [[CALL:%.*]] = call zeroext i1 @__atomic_compare_exchange(i64 noundef 32, ptr noundef [[TMP1]], ptr noundef [[ATOMIC_TEMP1]], ptr noundef [[ATOMIC_TEMP2]], i32 noundef 5, i32 noundef 5)
+// CHECK-NEXT:    [[TMP6]] = load i256, ptr [[ATOMIC_TEMP1]], align 8
+// CHECK-NEXT:    br i1 [[CALL]], label %[[ATOMICRMW_END:.*]], label %[[ATOMICRMW_START]]
+// CHECK:       [[ATOMICRMW_END]]:
+// CHECK-NEXT:    store i256 [[TMP5]], ptr [[AGG_RESULT]], align 8
+// CHECK-NEXT:    [[TMP7:%.*]] = load i256, ptr [[AGG_RESULT]], align 8
+// CHECK-NEXT:    store i256 [[TMP7]], ptr [[AGG_RESULT]], align 8
+// CHECK-NEXT:    ret void
+//
 S256 add256(_Atomic(S256) *p, S256 v) {
   return __c11_atomic_fetch_add(p, v, __ATOMIC_SEQ_CST);
 }
 
 // Wide bitwise also needs the loop: the wide path has no inline atomicrmw.
-// CHECK-LABEL: @or256(
-// CHECK: call {{.*}}@__atomic_compare_exchange
+// CHECK-LABEL: define dso_local void @or256(
+// CHECK-SAME: ptr dead_on_unwind noalias writable sret(i256) align 8 [[AGG_RESULT:%.*]], ptr noundef [[P:%.*]], ptr noundef byval(i256) align 8 [[TMP0:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*]]:
+// CHECK-NEXT:    [[P_ADDR:%.*]] = alloca ptr, align 8
+// CHECK-NEXT:    [[V_ADDR:%.*]] = alloca i256, align 8
+// CHECK-NEXT:    [[DOTATOMICTMP:%.*]] = alloca i256, align 8
+// CHECK-NEXT:    [[ATOMIC_TEMP:%.*]] = alloca i256, align 8
+// CHECK-NEXT:    [[ATOMIC_TEMP1:%.*]] = alloca i256, align 8
+// CHECK-NEXT:    [[ATOMIC_TEMP2:%.*]] = alloca i256, align 8
+// CHECK-NEXT:    [[V:%.*]] = load i256, ptr [[TMP0]], align 8
+// CHECK-NEXT:    store ptr [[P]], ptr [[P_ADDR]], align 8
+// CHECK-NEXT:    store i256 [[V]], ptr [[V_ADDR]], align 8
+// CHECK-NEXT:    [[TMP1:%.*]] = load ptr, ptr [[P_ADDR]], align 8
+// CHECK-NEXT:    [[TMP2:%.*]] = load i256, ptr [[V_ADDR]], align 8
+// CHECK-NEXT:    store i256 [[TMP2]], ptr [[DOTATOMICTMP]], align 8
+// CHECK-NEXT:    [[TMP3:%.*]] = load i256, ptr [[DOTATOMICTMP]], align 8
+// CHECK-NEXT:    call void @__atomic_load(i64 noundef 32, ptr noundef [[TMP1]], ptr noundef [[ATOMIC_TEMP]], i32 noundef 0)
+// CHECK-NEXT:    [[TMP4:%.*]] = load i256, ptr [[ATOMIC_TEMP]], align 8
+// CHECK-NEXT:    br label %[[ATOMICRMW_START:.*]]
+// CHECK:       [[ATOMICRMW_START]]:
+// CHECK-NEXT:    [[TMP5:%.*]] = phi i256 [ [[TMP4]], %[[ENTRY]] ], [ [[TMP6:%.*]], %[[ATOMICRMW_START]] ]
+// CHECK-NEXT:    [[NEW:%.*]] = or i256 [[TMP5]], [[TMP3]]
+// CHECK-NEXT:    store i256 [[TMP5]], ptr [[ATOMIC_TEMP1]], align 8
+// CHECK-NEXT:    store i256 [[NEW]], ptr [[ATOMIC_TEMP2]], align 8
+// CHECK-NEXT:    [[CALL:%.*]] = call zeroext i1 @__atomic_compare_exchange(i64 noundef 32, ptr noundef [[TMP1]], ptr noundef [[ATOMIC_TEMP1]], ptr noundef [[ATOMIC_TEMP2]], i32 noundef 5, i32 noundef 5)
+// CHECK-NEXT:    [[TMP6]] = load i256, ptr [[ATOMIC_TEMP1]], align 8
+// CHECK-NEXT:    br i1 [[CALL]], label %[[ATOMICRMW_END:.*]], label %[[ATOMICRMW_START]]
+// CHECK:       [[ATOMICRMW_END]]:
+// CHECK-NEXT:    store i256 [[TMP5]], ptr [[AGG_RESULT]], align 8
+// CHECK-NEXT:    [[TMP7:%.*]] = load i256, ptr [[AGG_RESULT]], align 8
+// CHECK-NEXT:    store i256 [[TMP7]], ptr [[AGG_RESULT]], align 8
+// CHECK-NEXT:    ret void
+//
 S256 or256(_Atomic(S256) *p, S256 v) {
   return __c11_atomic_fetch_or(p, v, __ATOMIC_SEQ_CST);
 }

>From 1c87c9994199bd82766b4d6d6f5edf392c225ba2 Mon Sep 17 00:00:00 2001
From: Xavier Roche <xavier.roche at algolia.com>
Date: Sat, 27 Jun 2026 09:23:04 +0200
Subject: [PATCH 6/9] [Clang] Carry the raw representation in the _BitInt
 atomic RMW loop

The compare-exchange loop for padded and wide _BitInt atomics formed its
cmpxchg expected value by re-canonicalizing the loaded value
(sign/zero-extending the truncated old value). An object whose padding bits
were non-canonical, e.g. one written through a union, never matched that
expected value, so the cmpxchg failed on every iteration and the
read-modify-write spun forever.

Reuse the existing EmitAtomicUpdate loop, which carries the raw loaded
representation as the expected value and writes back a canonical desired
value computed at value width N. Absent contention, the cmpxchg succeeds on
the first iteration regardless of the object's padding, and the value it
stores is canonical. See P0528.
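
The difference between the two loops can be sketched in plain C++ with
std::atomic<std::uint64_t> standing in for the padded memory integer. This is
a model under stated assumptions, not the code Clang emits: canonical,
fetch_add_canonical_expected, and fetch_add_raw_expected are hypothetical
names, the earlier loop is given a retry bound so the example terminates, and
a strong compare-exchange is used in both loops to keep the demonstration
deterministic.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical model of a signed _BitInt(37) in a 64-bit slot; not Clang code.
const unsigned kBits = 37;
const std::uint64_t kMask = (std::uint64_t{1} << kBits) - 1;
const std::uint64_t kSign = std::uint64_t{1} << (kBits - 1);

// Canonical in-memory form: low 37 bits sign-extended into the padding.
inline std::uint64_t canonical(std::uint64_t raw) {
  const std::uint64_t v = raw & kMask;
  return (v ^ kSign) - kSign; // unsigned wraparound performs the extension
}

// Earlier scheme: the expected operand is the re-canonicalized old value.
// Returns false after max_tries failures, standing in for "spins forever".
inline bool fetch_add_canonical_expected(std::atomic<std::uint64_t> &obj,
                                         std::uint64_t rhs, int max_tries) {
  std::uint64_t raw = obj.load();
  for (int i = 0; i < max_tries; ++i) {
    std::uint64_t expected = canonical(raw);
    const std::uint64_t desired = canonical(raw + rhs);
    if (obj.compare_exchange_strong(expected, desired))
      return true;
    raw = expected; // the failed compare-exchange reports the observed bits
  }
  return false;
}

// Revised scheme: the expected operand is the raw loaded representation and
// only the desired value is canonicalized. Returns the old value at width N.
inline std::uint64_t fetch_add_raw_expected(std::atomic<std::uint64_t> &obj,
                                            std::uint64_t rhs) {
  std::uint64_t raw = obj.load();
  std::uint64_t desired;
  do {
    desired = canonical(raw + rhs); // low 37 bits of the sum, sign-extended
  } while (!obj.compare_exchange_strong(raw, desired));
  return canonical(raw);
}
```

Seeding the object with the value 5 plus a stray padding bit (bit 40) makes
the first function exhaust its retries without modifying the object, while
the second succeeds at once and leaves the canonical value 6 behind.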

Assisted-by: Claude (Anthropic)
Co-Authored-By: Claude Opus 4.6 <noreply at anthropic.com>
---
 clang/lib/CodeGen/CGAtomic.cpp     |  48 +++++-------
 clang/test/CodeGen/atomic-bitint.c | 120 ++++++++++++++---------------
 2 files changed, 77 insertions(+), 91 deletions(-)

diff --git a/clang/lib/CodeGen/CGAtomic.cpp b/clang/lib/CodeGen/CGAtomic.cpp
index 820849f5974c0..0043c79b398ee 100644
--- a/clang/lib/CodeGen/CGAtomic.cpp
+++ b/clang/lib/CodeGen/CGAtomic.cpp
@@ -702,9 +702,13 @@ static llvm::AtomicOrdering atomicOrderOrSeqCst(llvm::Value *Order) {
 /// Emit a `_BitInt(N)` atomic read-modify-write as a compare-exchange loop. A
 /// single `atomicrmw` on the padded memory integer would carry into / compare
 /// the padding bits, and no arbitrary-width `__atomic_fetch_*` libcall exists
-/// for wide widths. The loop computes the new value at width N and writes back
-/// a canonical (extended) representation via the existing cmpxchg helper, which
-/// also picks the inline-vs-libcall form by size.
+/// for wide widths.
+///
+/// The update computes at value width N (so the result wraps mod 2^N and is
+/// independent of padding). EmitAtomicUpdate carries the raw loaded
+/// representation as the cmpxchg expected, so an object with non-canonical
+/// padding (e.g. written through a union) still converges instead of spinning
+/// forever; the desired it writes back is canonical. See P0528.
 static RValue emitBitIntAtomicRMWLoop(CodeGenFunction &CGF, AtomicExpr *E,
                                       Address Ptr, Address Val1,
                                       QualType AtomicTy,
@@ -712,8 +716,6 @@ static RValue emitBitIntAtomicRMWLoop(CodeGenFunction &CGF, AtomicExpr *E,
                                       bool ReturnsNew, llvm::Value *Order) {
   QualType ValTy = E->getValueType();
   llvm::AtomicOrdering AO = atomicOrderOrSeqCst(Order);
-  llvm::AtomicOrdering Failure =
-      llvm::AtomicCmpXchgInst::getStrongestFailureOrdering(AO);
 
   LValue AtomicLVal = CGF.MakeAddrLValue(Ptr, AtomicTy);
   AtomicInfo Atomics(CGF, AtomicLVal);
@@ -721,31 +723,17 @@ static RValue emitBitIntAtomicRMWLoop(CodeGenFunction &CGF, AtomicExpr *E,
   llvm::Value *RHS =
       CGF.EmitLoadOfScalar(CGF.MakeAddrLValue(Val1, ValTy), E->getExprLoc());
 
-  RValue OldRV = Atomics.EmitAtomicLoad(
-      AggValueSlot::ignored(), E->getExprLoc(),
-      /*AsValue=*/true, llvm::AtomicOrdering::Monotonic, E->isVolatile());
-  llvm::Value *Init = OldRV.getScalarVal();
-
-  llvm::BasicBlock *StartBB = CGF.Builder.GetInsertBlock();
-  llvm::BasicBlock *LoopBB = CGF.createBasicBlock("atomicrmw.start", CGF.CurFn);
-  llvm::BasicBlock *EndBB = CGF.createBasicBlock("atomicrmw.end", CGF.CurFn);
-  CGF.Builder.CreateBr(LoopBB);
-  CGF.Builder.SetInsertPoint(LoopBB);
-
-  llvm::PHINode *Old = CGF.Builder.CreatePHI(Init->getType(), 2);
-  Old->addIncoming(Init, StartBB);
-
-  // Compute at the value width via the canonical RMW lowering, so the result
-  // wraps mod 2^N and never touches the padding bits.
-  llvm::Value *New = llvm::buildAtomicRMWValue(BinOp, CGF.Builder, Old, RHS);
-
-  auto Res = Atomics.EmitAtomicCompareExchange(
-      RValue::get(Old), RValue::get(New), AO, Failure, /*IsWeak=*/true);
-  Old->addIncoming(Res.first.getScalarVal(), CGF.Builder.GetInsertBlock());
-  CGF.Builder.CreateCondBr(Res.second, EndBB, LoopBB);
-
-  CGF.Builder.SetInsertPoint(EndBB);
-  return RValue::get(ReturnsNew ? New : static_cast<llvm::Value *>(Old));
+  llvm::Value *Old = nullptr, *New = nullptr;
+  Atomics.EmitAtomicUpdate(
+      AO,
+      [&](RValue OldRV) {
+        Old = OldRV.getScalarVal();
+        New = llvm::buildAtomicRMWValue(BinOp, CGF.Builder, Old, RHS);
+        return RValue::get(New);
+      },
+      E->isVolatile());
+
+  return RValue::get(ReturnsNew ? New : Old);
 }
 
 static void EmitAtomicOp(CodeGenFunction &CGF, AtomicExpr *E, Address Dest,
diff --git a/clang/test/CodeGen/atomic-bitint.c b/clang/test/CodeGen/atomic-bitint.c
index 6476c26a0f0dd..bc1e165fd90e3 100644
--- a/clang/test/CodeGen/atomic-bitint.c
+++ b/clang/test/CodeGen/atomic-bitint.c
@@ -178,6 +178,7 @@ S37 and37(_Atomic(S37) *p, S37 v) {
 // CHECK-NEXT:    [[P_ADDR:%.*]] = alloca ptr, align 8
 // CHECK-NEXT:    [[V_ADDR:%.*]] = alloca i64, align 8
 // CHECK-NEXT:    [[DOTATOMICTMP:%.*]] = alloca i64, align 8
+// CHECK-NEXT:    [[ATOMIC_TEMP:%.*]] = alloca i64, align 8
 // CHECK-NEXT:    store i64 [[V_COERCE]], ptr [[V]], align 8
 // CHECK-NEXT:    [[TMP0:%.*]] = load i64, ptr [[V]], align 8
 // CHECK-NEXT:    [[V1:%.*]] = trunc i64 [[TMP0]] to i37
@@ -191,23 +192,23 @@ S37 and37(_Atomic(S37) *p, S37 v) {
 // CHECK-NEXT:    store i64 [[STOREDV2]], ptr [[DOTATOMICTMP]], align 8
 // CHECK-NEXT:    [[TMP3:%.*]] = load i64, ptr [[DOTATOMICTMP]], align 8
 // CHECK-NEXT:    [[LOADEDV3:%.*]] = trunc i64 [[TMP3]] to i37
-// CHECK-NEXT:    [[ATOMIC_LOAD:%.*]] = load atomic i64, ptr [[TMP1]] monotonic, align 8
-// CHECK-NEXT:    [[LOADEDV4:%.*]] = trunc i64 [[ATOMIC_LOAD]] to i37
-// CHECK-NEXT:    br label %[[ATOMICRMW_START:.*]]
-// CHECK:       [[ATOMICRMW_START]]:
-// CHECK-NEXT:    [[TMP4:%.*]] = phi i37 [ [[LOADEDV4]], %[[ENTRY]] ], [ [[LOADEDV7:%.*]], %[[ATOMICRMW_START]] ]
-// CHECK-NEXT:    [[NEW:%.*]] = add i37 [[TMP4]], [[LOADEDV3]]
-// CHECK-NEXT:    [[STOREDV5:%.*]] = sext i37 [[TMP4]] to i64
-// CHECK-NEXT:    [[STOREDV6:%.*]] = sext i37 [[NEW]] to i64
-// CHECK-NEXT:    [[TMP5:%.*]] = cmpxchg weak ptr [[TMP1]], i64 [[STOREDV5]], i64 [[STOREDV6]] seq_cst seq_cst, align 8
-// CHECK-NEXT:    [[TMP6:%.*]] = extractvalue { i64, i1 } [[TMP5]], 0
-// CHECK-NEXT:    [[TMP7:%.*]] = extractvalue { i64, i1 } [[TMP5]], 1
-// CHECK-NEXT:    [[LOADEDV7]] = trunc i64 [[TMP6]] to i37
-// CHECK-NEXT:    br i1 [[TMP7]], label %[[ATOMICRMW_END:.*]], label %[[ATOMICRMW_START]]
-// CHECK:       [[ATOMICRMW_END]]:
-// CHECK-NEXT:    store i37 [[TMP4]], ptr [[RETVAL]], align 8
-// CHECK-NEXT:    [[TMP8:%.*]] = load i37, ptr [[RETVAL]], align 8
-// CHECK-NEXT:    [[COERCE_VAL_II:%.*]] = zext i37 [[TMP8]] to i64
+// CHECK-NEXT:    [[ATOMIC_LOAD:%.*]] = load atomic i64, ptr [[TMP1]] seq_cst, align 8
+// CHECK-NEXT:    br label %[[ATOMIC_CONT:.*]]
+// CHECK:       [[ATOMIC_CONT]]:
+// CHECK-NEXT:    [[TMP4:%.*]] = phi i64 [ [[ATOMIC_LOAD]], %[[ENTRY]] ], [ [[TMP7:%.*]], %[[ATOMIC_CONT]] ]
+// CHECK-NEXT:    [[LOADEDV4:%.*]] = trunc i64 [[TMP4]] to i37
+// CHECK-NEXT:    [[NEW:%.*]] = add i37 [[LOADEDV4]], [[LOADEDV3]]
+// CHECK-NEXT:    [[STOREDV5:%.*]] = sext i37 [[NEW]] to i64
+// CHECK-NEXT:    store atomic i64 [[STOREDV5]], ptr [[ATOMIC_TEMP]] seq_cst, align 8
+// CHECK-NEXT:    [[TMP5:%.*]] = load i64, ptr [[ATOMIC_TEMP]], align 8
+// CHECK-NEXT:    [[TMP6:%.*]] = cmpxchg ptr [[TMP1]], i64 [[TMP4]], i64 [[TMP5]] seq_cst seq_cst, align 8
+// CHECK-NEXT:    [[TMP7]] = extractvalue { i64, i1 } [[TMP6]], 0
+// CHECK-NEXT:    [[TMP8:%.*]] = extractvalue { i64, i1 } [[TMP6]], 1
+// CHECK-NEXT:    br i1 [[TMP8]], label %[[ATOMIC_EXIT:.*]], label %[[ATOMIC_CONT]]
+// CHECK:       [[ATOMIC_EXIT]]:
+// CHECK-NEXT:    store i37 [[LOADEDV4]], ptr [[RETVAL]], align 8
+// CHECK-NEXT:    [[TMP9:%.*]] = load i37, ptr [[RETVAL]], align 8
+// CHECK-NEXT:    [[COERCE_VAL_II:%.*]] = zext i37 [[TMP9]] to i64
 // CHECK-NEXT:    ret i64 [[COERCE_VAL_II]]
 //
 S37 add37(_Atomic(S37) *p, S37 v) {
@@ -223,6 +224,7 @@ S37 add37(_Atomic(S37) *p, S37 v) {
 // CHECK-NEXT:    [[P_ADDR:%.*]] = alloca ptr, align 8
 // CHECK-NEXT:    [[V_ADDR:%.*]] = alloca i64, align 8
 // CHECK-NEXT:    [[DOTATOMICTMP:%.*]] = alloca i64, align 8
+// CHECK-NEXT:    [[ATOMIC_TEMP:%.*]] = alloca i64, align 8
 // CHECK-NEXT:    store i64 [[V_COERCE]], ptr [[V]], align 8
 // CHECK-NEXT:    [[TMP0:%.*]] = load i64, ptr [[V]], align 8
 // CHECK-NEXT:    [[V1:%.*]] = trunc i64 [[TMP0]] to i37
@@ -236,24 +238,24 @@ S37 add37(_Atomic(S37) *p, S37 v) {
 // CHECK-NEXT:    store i64 [[STOREDV2]], ptr [[DOTATOMICTMP]], align 8
 // CHECK-NEXT:    [[TMP3:%.*]] = load i64, ptr [[DOTATOMICTMP]], align 8
 // CHECK-NEXT:    [[LOADEDV3:%.*]] = trunc i64 [[TMP3]] to i37
-// CHECK-NEXT:    [[ATOMIC_LOAD:%.*]] = load atomic i64, ptr [[TMP1]] monotonic, align 8
-// CHECK-NEXT:    [[LOADEDV4:%.*]] = trunc i64 [[ATOMIC_LOAD]] to i37
-// CHECK-NEXT:    br label %[[ATOMICRMW_START:.*]]
-// CHECK:       [[ATOMICRMW_START]]:
-// CHECK-NEXT:    [[TMP4:%.*]] = phi i37 [ [[LOADEDV4]], %[[ENTRY]] ], [ [[LOADEDV7:%.*]], %[[ATOMICRMW_START]] ]
-// CHECK-NEXT:    [[TMP5:%.*]] = icmp sle i37 [[TMP4]], [[LOADEDV3]]
-// CHECK-NEXT:    [[NEW:%.*]] = select i1 [[TMP5]], i37 [[TMP4]], i37 [[LOADEDV3]]
-// CHECK-NEXT:    [[STOREDV5:%.*]] = sext i37 [[TMP4]] to i64
-// CHECK-NEXT:    [[STOREDV6:%.*]] = sext i37 [[NEW]] to i64
-// CHECK-NEXT:    [[TMP6:%.*]] = cmpxchg weak ptr [[TMP1]], i64 [[STOREDV5]], i64 [[STOREDV6]] seq_cst seq_cst, align 8
-// CHECK-NEXT:    [[TMP7:%.*]] = extractvalue { i64, i1 } [[TMP6]], 0
-// CHECK-NEXT:    [[TMP8:%.*]] = extractvalue { i64, i1 } [[TMP6]], 1
-// CHECK-NEXT:    [[LOADEDV7]] = trunc i64 [[TMP7]] to i37
-// CHECK-NEXT:    br i1 [[TMP8]], label %[[ATOMICRMW_END:.*]], label %[[ATOMICRMW_START]]
-// CHECK:       [[ATOMICRMW_END]]:
-// CHECK-NEXT:    store i37 [[TMP4]], ptr [[RETVAL]], align 8
-// CHECK-NEXT:    [[TMP9:%.*]] = load i37, ptr [[RETVAL]], align 8
-// CHECK-NEXT:    [[COERCE_VAL_II:%.*]] = zext i37 [[TMP9]] to i64
+// CHECK-NEXT:    [[ATOMIC_LOAD:%.*]] = load atomic i64, ptr [[TMP1]] seq_cst, align 8
+// CHECK-NEXT:    br label %[[ATOMIC_CONT:.*]]
+// CHECK:       [[ATOMIC_CONT]]:
+// CHECK-NEXT:    [[TMP4:%.*]] = phi i64 [ [[ATOMIC_LOAD]], %[[ENTRY]] ], [ [[TMP8:%.*]], %[[ATOMIC_CONT]] ]
+// CHECK-NEXT:    [[LOADEDV4:%.*]] = trunc i64 [[TMP4]] to i37
+// CHECK-NEXT:    [[TMP5:%.*]] = icmp sle i37 [[LOADEDV4]], [[LOADEDV3]]
+// CHECK-NEXT:    [[NEW:%.*]] = select i1 [[TMP5]], i37 [[LOADEDV4]], i37 [[LOADEDV3]]
+// CHECK-NEXT:    [[STOREDV5:%.*]] = sext i37 [[NEW]] to i64
+// CHECK-NEXT:    store atomic i64 [[STOREDV5]], ptr [[ATOMIC_TEMP]] seq_cst, align 8
+// CHECK-NEXT:    [[TMP6:%.*]] = load i64, ptr [[ATOMIC_TEMP]], align 8
+// CHECK-NEXT:    [[TMP7:%.*]] = cmpxchg ptr [[TMP1]], i64 [[TMP4]], i64 [[TMP6]] seq_cst seq_cst, align 8
+// CHECK-NEXT:    [[TMP8]] = extractvalue { i64, i1 } [[TMP7]], 0
+// CHECK-NEXT:    [[TMP9:%.*]] = extractvalue { i64, i1 } [[TMP7]], 1
+// CHECK-NEXT:    br i1 [[TMP9]], label %[[ATOMIC_EXIT:.*]], label %[[ATOMIC_CONT]]
+// CHECK:       [[ATOMIC_EXIT]]:
+// CHECK-NEXT:    store i37 [[LOADEDV4]], ptr [[RETVAL]], align 8
+// CHECK-NEXT:    [[TMP10:%.*]] = load i37, ptr [[RETVAL]], align 8
+// CHECK-NEXT:    [[COERCE_VAL_II:%.*]] = zext i37 [[TMP10]] to i64
 // CHECK-NEXT:    ret i64 [[COERCE_VAL_II]]
 //
 U37 min37(_Atomic(S37) *p, S37 v) {
@@ -309,7 +311,7 @@ S128 add128(_Atomic(S128) *p, S128 v) {
 // so the loop calls __atomic_compare_exchange.
 // CHECK-LABEL: define dso_local void @add256(
 // CHECK-SAME: ptr dead_on_unwind noalias writable sret(i256) align 8 [[AGG_RESULT:%.*]], ptr noundef [[P:%.*]], ptr noundef byval(i256) align 8 [[TMP0:%.*]]) #[[ATTR0]] {
-// CHECK-NEXT:  [[ENTRY:.*]]:
+// CHECK-NEXT:  [[ENTRY:.*:]]
 // CHECK-NEXT:    [[P_ADDR:%.*]] = alloca ptr, align 8
 // CHECK-NEXT:    [[V_ADDR:%.*]] = alloca i256, align 8
 // CHECK-NEXT:    [[DOTATOMICTMP:%.*]] = alloca i256, align 8
@@ -323,21 +325,19 @@ S128 add128(_Atomic(S128) *p, S128 v) {
 // CHECK-NEXT:    [[TMP2:%.*]] = load i256, ptr [[V_ADDR]], align 8
 // CHECK-NEXT:    store i256 [[TMP2]], ptr [[DOTATOMICTMP]], align 8
 // CHECK-NEXT:    [[TMP3:%.*]] = load i256, ptr [[DOTATOMICTMP]], align 8
-// CHECK-NEXT:    call void @__atomic_load(i64 noundef 32, ptr noundef [[TMP1]], ptr noundef [[ATOMIC_TEMP]], i32 noundef 0)
+// CHECK-NEXT:    call void @__atomic_load(i64 noundef 32, ptr noundef [[TMP1]], ptr noundef [[ATOMIC_TEMP]], i32 noundef 5)
+// CHECK-NEXT:    br label %[[ATOMIC_CONT:.*]]
+// CHECK:       [[ATOMIC_CONT]]:
 // CHECK-NEXT:    [[TMP4:%.*]] = load i256, ptr [[ATOMIC_TEMP]], align 8
-// CHECK-NEXT:    br label %[[ATOMICRMW_START:.*]]
-// CHECK:       [[ATOMICRMW_START]]:
-// CHECK-NEXT:    [[TMP5:%.*]] = phi i256 [ [[TMP4]], %[[ENTRY]] ], [ [[TMP6:%.*]], %[[ATOMICRMW_START]] ]
-// CHECK-NEXT:    [[NEW:%.*]] = add i256 [[TMP5]], [[TMP3]]
-// CHECK-NEXT:    store i256 [[TMP5]], ptr [[ATOMIC_TEMP1]], align 8
+// CHECK-NEXT:    [[NEW:%.*]] = add i256 [[TMP4]], [[TMP3]]
 // CHECK-NEXT:    store i256 [[NEW]], ptr [[ATOMIC_TEMP2]], align 8
-// CHECK-NEXT:    [[CALL:%.*]] = call zeroext i1 @__atomic_compare_exchange(i64 noundef 32, ptr noundef [[TMP1]], ptr noundef [[ATOMIC_TEMP1]], ptr noundef [[ATOMIC_TEMP2]], i32 noundef 5, i32 noundef 5)
-// CHECK-NEXT:    [[TMP6]] = load i256, ptr [[ATOMIC_TEMP1]], align 8
-// CHECK-NEXT:    br i1 [[CALL]], label %[[ATOMICRMW_END:.*]], label %[[ATOMICRMW_START]]
-// CHECK:       [[ATOMICRMW_END]]:
+// CHECK-NEXT:    call void @__atomic_store(i64 noundef 32, ptr noundef [[ATOMIC_TEMP1]], ptr noundef [[ATOMIC_TEMP2]], i32 noundef 5)
+// CHECK-NEXT:    [[CALL:%.*]] = call zeroext i1 @__atomic_compare_exchange(i64 noundef 32, ptr noundef [[TMP1]], ptr noundef [[ATOMIC_TEMP]], ptr noundef [[ATOMIC_TEMP1]], i32 noundef 5, i32 noundef 5)
+// CHECK-NEXT:    br i1 [[CALL]], label %[[ATOMIC_EXIT:.*]], label %[[ATOMIC_CONT]]
+// CHECK:       [[ATOMIC_EXIT]]:
+// CHECK-NEXT:    store i256 [[TMP4]], ptr [[AGG_RESULT]], align 8
+// CHECK-NEXT:    [[TMP5:%.*]] = load i256, ptr [[AGG_RESULT]], align 8
 // CHECK-NEXT:    store i256 [[TMP5]], ptr [[AGG_RESULT]], align 8
-// CHECK-NEXT:    [[TMP7:%.*]] = load i256, ptr [[AGG_RESULT]], align 8
-// CHECK-NEXT:    store i256 [[TMP7]], ptr [[AGG_RESULT]], align 8
 // CHECK-NEXT:    ret void
 //
 S256 add256(_Atomic(S256) *p, S256 v) {
@@ -347,7 +347,7 @@ S256 add256(_Atomic(S256) *p, S256 v) {
 // Wide bitwise also needs the loop: the wide path has no inline atomicrmw.
 // CHECK-LABEL: define dso_local void @or256(
 // CHECK-SAME: ptr dead_on_unwind noalias writable sret(i256) align 8 [[AGG_RESULT:%.*]], ptr noundef [[P:%.*]], ptr noundef byval(i256) align 8 [[TMP0:%.*]]) #[[ATTR0]] {
-// CHECK-NEXT:  [[ENTRY:.*]]:
+// CHECK-NEXT:  [[ENTRY:.*:]]
 // CHECK-NEXT:    [[P_ADDR:%.*]] = alloca ptr, align 8
 // CHECK-NEXT:    [[V_ADDR:%.*]] = alloca i256, align 8
 // CHECK-NEXT:    [[DOTATOMICTMP:%.*]] = alloca i256, align 8
@@ -361,21 +361,19 @@ S256 add256(_Atomic(S256) *p, S256 v) {
 // CHECK-NEXT:    [[TMP2:%.*]] = load i256, ptr [[V_ADDR]], align 8
 // CHECK-NEXT:    store i256 [[TMP2]], ptr [[DOTATOMICTMP]], align 8
 // CHECK-NEXT:    [[TMP3:%.*]] = load i256, ptr [[DOTATOMICTMP]], align 8
-// CHECK-NEXT:    call void @__atomic_load(i64 noundef 32, ptr noundef [[TMP1]], ptr noundef [[ATOMIC_TEMP]], i32 noundef 0)
+// CHECK-NEXT:    call void @__atomic_load(i64 noundef 32, ptr noundef [[TMP1]], ptr noundef [[ATOMIC_TEMP]], i32 noundef 5)
+// CHECK-NEXT:    br label %[[ATOMIC_CONT:.*]]
+// CHECK:       [[ATOMIC_CONT]]:
 // CHECK-NEXT:    [[TMP4:%.*]] = load i256, ptr [[ATOMIC_TEMP]], align 8
-// CHECK-NEXT:    br label %[[ATOMICRMW_START:.*]]
-// CHECK:       [[ATOMICRMW_START]]:
-// CHECK-NEXT:    [[TMP5:%.*]] = phi i256 [ [[TMP4]], %[[ENTRY]] ], [ [[TMP6:%.*]], %[[ATOMICRMW_START]] ]
-// CHECK-NEXT:    [[NEW:%.*]] = or i256 [[TMP5]], [[TMP3]]
-// CHECK-NEXT:    store i256 [[TMP5]], ptr [[ATOMIC_TEMP1]], align 8
+// CHECK-NEXT:    [[NEW:%.*]] = or i256 [[TMP4]], [[TMP3]]
 // CHECK-NEXT:    store i256 [[NEW]], ptr [[ATOMIC_TEMP2]], align 8
-// CHECK-NEXT:    [[CALL:%.*]] = call zeroext i1 @__atomic_compare_exchange(i64 noundef 32, ptr noundef [[TMP1]], ptr noundef [[ATOMIC_TEMP1]], ptr noundef [[ATOMIC_TEMP2]], i32 noundef 5, i32 noundef 5)
-// CHECK-NEXT:    [[TMP6]] = load i256, ptr [[ATOMIC_TEMP1]], align 8
-// CHECK-NEXT:    br i1 [[CALL]], label %[[ATOMICRMW_END:.*]], label %[[ATOMICRMW_START]]
-// CHECK:       [[ATOMICRMW_END]]:
+// CHECK-NEXT:    call void @__atomic_store(i64 noundef 32, ptr noundef [[ATOMIC_TEMP1]], ptr noundef [[ATOMIC_TEMP2]], i32 noundef 5)
+// CHECK-NEXT:    [[CALL:%.*]] = call zeroext i1 @__atomic_compare_exchange(i64 noundef 32, ptr noundef [[TMP1]], ptr noundef [[ATOMIC_TEMP]], ptr noundef [[ATOMIC_TEMP1]], i32 noundef 5, i32 noundef 5)
+// CHECK-NEXT:    br i1 [[CALL]], label %[[ATOMIC_EXIT:.*]], label %[[ATOMIC_CONT]]
+// CHECK:       [[ATOMIC_EXIT]]:
+// CHECK-NEXT:    store i256 [[TMP4]], ptr [[AGG_RESULT]], align 8
+// CHECK-NEXT:    [[TMP5:%.*]] = load i256, ptr [[AGG_RESULT]], align 8
 // CHECK-NEXT:    store i256 [[TMP5]], ptr [[AGG_RESULT]], align 8
-// CHECK-NEXT:    [[TMP7:%.*]] = load i256, ptr [[AGG_RESULT]], align 8
-// CHECK-NEXT:    store i256 [[TMP7]], ptr [[AGG_RESULT]], align 8
 // CHECK-NEXT:    ret void
 //
 S256 or256(_Atomic(S256) *p, S256 v) {

>From e79f0f4e3fbec4c77fcae7f42f2d9dc5f3e85c5a Mon Sep 17 00:00:00 2001
From: Xavier Roche <xavier.roche at algolia.com>
Date: Sat, 27 Jun 2026 09:34:49 +0200
Subject: [PATCH 7/9] [compiler-rt] Add runtime test for atomic _BitInt(N)

Single-threaded execution test for _Atomic(_BitInt(N)): per-op value
correctness on a padded inline width and on wide libcall widths, plus
dirty-padding convergence. An object with non-canonical padding (written
through a union) must not spin forever in the read-modify-write
compare-exchange loop. The IR-shape checks in
clang/test/CodeGen/atomic-bitint.c cannot witness non-termination.

Assisted-by: Claude (Anthropic)
Co-Authored-By: Claude Opus 4.6 <noreply at anthropic.com>
---
 .../test/builtins/Unit/atomic_bitint_test.c   | 91 +++++++++++++++++++
 1 file changed, 91 insertions(+)
 create mode 100644 compiler-rt/test/builtins/Unit/atomic_bitint_test.c
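
The dirty-padding hang described above can be modeled without _BitInt. What follows is a minimal sketch under stated assumptions, not the Clang lowering: the 37-bit value is assumed to live sign-extended in a 64-bit word, and the names canonicalize_s37 and fetch_add_s37, the canonicalize_expected switch, and the max_tries bound are all hypothetical, introduced only for illustration. It shows why a loop that compares against a re-canonicalized expected value cannot converge when memory holds non-canonical padding, while a loop that compares against the raw loaded word does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define VALUE_BITS 37

/* Sign-extend the low 37 bits into a canonical 64-bit storage word.
   Assumption: canonical storage for a signed 37-bit value is its
   sign extension to 64 bits. */
static uint64_t canonicalize_s37(uint64_t raw) {
  uint64_t mask = (UINT64_C(1) << VALUE_BITS) - 1;
  uint64_t sign = UINT64_C(1) << (VALUE_BITS - 1);
  uint64_t v = raw & mask;
  return (v ^ sign) - sign; /* unsigned wraparound performs the extension */
}

/* Hypothetical model of a fetch-add compare-exchange loop.
   Returns the number of compare-exchange attempts on success, or -1 if
   max_tries attempts all failed. On success, *old_out receives the
   previous value in canonical form. */
static int fetch_add_s37(_Atomic(uint64_t) *obj, uint64_t rhs,
                         bool canonicalize_expected, int max_tries,
                         uint64_t *old_out) {
  assert(max_tries > 0);
  uint64_t expected = atomic_load(obj); /* raw word, padding included */
  for (int tries = 1; tries <= max_tries; ++tries) {
    /* Compute the new value at width 37 and store it canonically. */
    uint64_t desired = canonicalize_s37(expected + rhs);
    /* The buggy variant compares against a cleaned-up expected value,
       which can never equal a memory word with dirty padding. */
    uint64_t cmp =
        canonicalize_expected ? canonicalize_s37(expected) : expected;
    if (atomic_compare_exchange_strong(obj, &cmp, desired)) {
      *old_out = canonicalize_s37(expected);
      return tries;
    }
    expected = cmp; /* on failure, cmp holds the current memory word */
  }
  return -1;
}
```

With the word seeded as ((uint64_t)1 << 40) | 5, the raw-expected variant succeeds on the first attempt and leaves a canonical 6 in memory; the canonicalizing variant fails every attempt until max_tries is exhausted, which is the non-termination an IR-shape check cannot observe.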

diff --git a/compiler-rt/test/builtins/Unit/atomic_bitint_test.c b/compiler-rt/test/builtins/Unit/atomic_bitint_test.c
new file mode 100644
index 0000000000000..33a745348a6f0
--- /dev/null
+++ b/compiler-rt/test/builtins/Unit/atomic_bitint_test.c
@@ -0,0 +1,91 @@
+// RUN: %clang_builtins -std=c23 %s %librt -o %t && %run %t
+// REQUIRES: librt_has_atomic
+//===-- atomic_bitint_test.c - Test atomic ops on _BitInt -----------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// Runtime checks for atomic read-modify-write on _BitInt(N). A padded width
+// (37) exercises the inline compare-exchange loop; a wide width (256) exercises
+// the __atomic_compare_exchange libcall loop. Each op is cross-checked against
+// the same operation done non-atomically, and the dirty-padding cases confirm
+// the loop converges (a re-canonicalized expected would spin forever).
+//
+//===----------------------------------------------------------------------===//
+
+#include <assert.h>
+#include <stdio.h>
+
+typedef signed _BitInt(37) S37;
+typedef unsigned _BitInt(37) U37;
+typedef signed _BitInt(256) S256; // no padding (exactly 32 bytes)
+typedef signed _BitInt(200) S200; // padded: 200 value bits in 32-byte storage
+
+// Each macro runs the atomic op and asserts the returned old value and the
+// resulting object both match the non-atomic computation at width N.
+#define CHECK_FETCH(T, init, op, rhs, expr)                                    \
+  do {                                                                         \
+    _Atomic(T) a = (init);                                                     \
+    T old = __c11_atomic_fetch_##op(&a, (rhs), __ATOMIC_SEQ_CST);              \
+    assert(old == (T)(init));                                                  \
+    assert((T)a == (T)(expr));                                                 \
+  } while (0)
+
+static void test_ops(void) {
+  CHECK_FETCH(S37, 100, add, 5, 105);
+  CHECK_FETCH(S37, 100, sub, 40, 60);
+  CHECK_FETCH(S37, -3, add, 1, -2);
+  CHECK_FETCH(U37, 7, add, 9, 16);
+  CHECK_FETCH(S37, 0x15, and, 0x13, 0x11);
+  CHECK_FETCH(S37, 0x10, or, 5, 0x15);
+  CHECK_FETCH(S37, 0x1F, xor, 0x15, 0x0A);
+  CHECK_FETCH(S37, -5, min, -7, -7);    // signed: -7 < -5
+  CHECK_FETCH(U37, 5, min, (U37)-1, 5); // unsigned: 5 < 2^37-1
+  CHECK_FETCH(S37, 3, max, 9, 9);
+  CHECK_FETCH(S37, 0x15, nand, 0x13, (S37) ~(0x15 & 0x13));
+  // Wide widths: the libcall loop (no padding, and padded).
+  CHECK_FETCH(S256, 100, add, 5, 105);
+  CHECK_FETCH(S256, 1, or, 0xFE, 0xFF);
+  CHECK_FETCH(S200, 100, add, 5, 105);
+}
+
+// Seed non-canonical padding through a union, then RMW. A loop that carried a
+// re-canonicalized expected would never match memory and hang here.
+static void test_dirty_padding(void) {
+  union {
+    _Atomic(S37) a;
+    unsigned long b;
+  } s;
+  s.b = ((unsigned long)1 << 40) | 5u; // value bits 5, padding bit 40 set
+  S37 old = __c11_atomic_fetch_add(&s.a, 1, __ATOMIC_SEQ_CST);
+  assert(old == 5 && (S37)s.a == 6);
+
+  union {
+    _Atomic(U37) a;
+    unsigned long b;
+  } u;
+  u.b = ((unsigned long)3 << 50) | 7u;
+  U37 uold = __c11_atomic_fetch_add(&u.a, 1, __ATOMIC_SEQ_CST);
+  assert(uold == 7 && (U37)u.a == 8);
+
+  // Wide padded width (libcall loop): _BitInt(200) has 56 padding bits in its
+  // 32-byte storage. Set the overlay at value level (endian-independent): low
+  // 200 bits = 5, padding bits 241, 243, 245, and 247 (0xAA << 240) dirtied.
+  union {
+    _Atomic(S200) a;
+    unsigned _BitInt(256) full;
+  } w;
+  w.full = (unsigned _BitInt(256))5 | ((unsigned _BitInt(256))0xAA << 240);
+  S200 wold = __c11_atomic_fetch_add(&w.a, 1, __ATOMIC_SEQ_CST);
+  assert(wold == 5 && (S200)w.a == 6);
+}
+
+int main(void) {
+  test_ops();
+  test_dirty_padding();
+  printf("PASS\n");
+  return 0;
+}

>From 88f329b57018f8646b65484f0d36e92c9e54fd27 Mon Sep 17 00:00:00 2001
From: Xavier Roche <xavier.roche at algolia.com>
Date: Sat, 27 Jun 2026 09:56:57 +0200
Subject: [PATCH 8/9] [Clang][test] Expand _BitInt atomic Sema and CodeGen
 coverage

Sema: add reject cases (non-_Atomic pointer, wrong arity, atomic _BitInt
bit-field) so that lifting the _BitInt rejection does not silently drop the
atomic-specific checks, plus an accept case for __atomic_add_fetch, which
returns the new value.
CodeGen: add an unsigned arithmetic RMW (zero-extended desired value), a
signed max, and an unsigned min. These exercise the zext path and the
icmp sgt/ule predicates that the existing functions did not cover.

Assisted-by: Claude (Anthropic)
Co-Authored-By: Claude Opus 4.6 <noreply at anthropic.com>
---
 clang/test/CodeGen/atomic-bitint.c | 140 +++++++++++++++++++++++++++++
 clang/test/Sema/atomic-bitint.c    |  15 +++-
 2 files changed, 152 insertions(+), 3 deletions(-)
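
For readers less familiar with the GNU builtin naming, the returns-old versus returns-new distinction mentioned above can be sketched on a plain int, which avoids any dependence on _BitInt support. The wrapper names demo_fetch_add and demo_add_fetch are hypothetical; only the two __atomic builtins are real.

```c
#include <assert.h>

/* __atomic_fetch_add returns the value the object held BEFORE the add. */
static int demo_fetch_add(int *p, int v) {
  return __atomic_fetch_add(p, v, __ATOMIC_SEQ_CST);
}

/* __atomic_add_fetch returns the value the object holds AFTER the add. */
static int demo_add_fetch(int *p, int v) {
  return __atomic_add_fetch(p, v, __ATOMIC_SEQ_CST);
}
```

Starting from 100, demo_fetch_add(&x, 5) yields 100 and leaves x at 105; a following demo_add_fetch(&x, 5) yields 110. The Sema accept case added here checks only that the second form type-checks on a _BitInt(37) pointer.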

diff --git a/clang/test/CodeGen/atomic-bitint.c b/clang/test/CodeGen/atomic-bitint.c
index bc1e165fd90e3..dda8f644f3fec 100644
--- a/clang/test/CodeGen/atomic-bitint.c
+++ b/clang/test/CodeGen/atomic-bitint.c
@@ -262,6 +262,146 @@ U37 min37(_Atomic(S37) *p, S37 v) {
   return __c11_atomic_fetch_min(p, v, __ATOMIC_SEQ_CST);
 }
 
+// Unsigned arithmetic RMW: the desired is zero-extended, not sign-extended.
+// CHECK-LABEL: define dso_local i64 @uadd37(
+// CHECK-SAME: ptr noundef [[P:%.*]], i64 noundef [[V_COERCE:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*]]:
+// CHECK-NEXT:    [[RETVAL:%.*]] = alloca i37, align 8
+// CHECK-NEXT:    [[V:%.*]] = alloca i64, align 8
+// CHECK-NEXT:    [[P_ADDR:%.*]] = alloca ptr, align 8
+// CHECK-NEXT:    [[V_ADDR:%.*]] = alloca i64, align 8
+// CHECK-NEXT:    [[DOTATOMICTMP:%.*]] = alloca i64, align 8
+// CHECK-NEXT:    [[ATOMIC_TEMP:%.*]] = alloca i64, align 8
+// CHECK-NEXT:    store i64 [[V_COERCE]], ptr [[V]], align 8
+// CHECK-NEXT:    [[TMP0:%.*]] = load i64, ptr [[V]], align 8
+// CHECK-NEXT:    [[V1:%.*]] = trunc i64 [[TMP0]] to i37
+// CHECK-NEXT:    store ptr [[P]], ptr [[P_ADDR]], align 8
+// CHECK-NEXT:    [[STOREDV:%.*]] = zext i37 [[V1]] to i64
+// CHECK-NEXT:    store i64 [[STOREDV]], ptr [[V_ADDR]], align 8
+// CHECK-NEXT:    [[TMP1:%.*]] = load ptr, ptr [[P_ADDR]], align 8
+// CHECK-NEXT:    [[TMP2:%.*]] = load i64, ptr [[V_ADDR]], align 8
+// CHECK-NEXT:    [[LOADEDV:%.*]] = trunc i64 [[TMP2]] to i37
+// CHECK-NEXT:    [[STOREDV2:%.*]] = zext i37 [[LOADEDV]] to i64
+// CHECK-NEXT:    store i64 [[STOREDV2]], ptr [[DOTATOMICTMP]], align 8
+// CHECK-NEXT:    [[TMP3:%.*]] = load i64, ptr [[DOTATOMICTMP]], align 8
+// CHECK-NEXT:    [[LOADEDV3:%.*]] = trunc i64 [[TMP3]] to i37
+// CHECK-NEXT:    [[ATOMIC_LOAD:%.*]] = load atomic i64, ptr [[TMP1]] seq_cst, align 8
+// CHECK-NEXT:    br label %[[ATOMIC_CONT:.*]]
+// CHECK:       [[ATOMIC_CONT]]:
+// CHECK-NEXT:    [[TMP4:%.*]] = phi i64 [ [[ATOMIC_LOAD]], %[[ENTRY]] ], [ [[TMP7:%.*]], %[[ATOMIC_CONT]] ]
+// CHECK-NEXT:    [[LOADEDV4:%.*]] = trunc i64 [[TMP4]] to i37
+// CHECK-NEXT:    [[NEW:%.*]] = add i37 [[LOADEDV4]], [[LOADEDV3]]
+// CHECK-NEXT:    [[STOREDV5:%.*]] = zext i37 [[NEW]] to i64
+// CHECK-NEXT:    store atomic i64 [[STOREDV5]], ptr [[ATOMIC_TEMP]] seq_cst, align 8
+// CHECK-NEXT:    [[TMP5:%.*]] = load i64, ptr [[ATOMIC_TEMP]], align 8
+// CHECK-NEXT:    [[TMP6:%.*]] = cmpxchg ptr [[TMP1]], i64 [[TMP4]], i64 [[TMP5]] seq_cst seq_cst, align 8
+// CHECK-NEXT:    [[TMP7]] = extractvalue { i64, i1 } [[TMP6]], 0
+// CHECK-NEXT:    [[TMP8:%.*]] = extractvalue { i64, i1 } [[TMP6]], 1
+// CHECK-NEXT:    br i1 [[TMP8]], label %[[ATOMIC_EXIT:.*]], label %[[ATOMIC_CONT]]
+// CHECK:       [[ATOMIC_EXIT]]:
+// CHECK-NEXT:    store i37 [[LOADEDV4]], ptr [[RETVAL]], align 8
+// CHECK-NEXT:    [[TMP9:%.*]] = load i37, ptr [[RETVAL]], align 8
+// CHECK-NEXT:    [[COERCE_VAL_II:%.*]] = zext i37 [[TMP9]] to i64
+// CHECK-NEXT:    ret i64 [[COERCE_VAL_II]]
+//
+U37 uadd37(_Atomic(U37) *p, U37 v) {
+  return __c11_atomic_fetch_add(p, v, __ATOMIC_SEQ_CST);
+}
+
+// Signed max computes at the value width with a signed compare.
+// CHECK-LABEL: define dso_local i64 @max37(
+// CHECK-SAME: ptr noundef [[P:%.*]], i64 noundef [[V_COERCE:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*]]:
+// CHECK-NEXT:    [[RETVAL:%.*]] = alloca i37, align 8
+// CHECK-NEXT:    [[V:%.*]] = alloca i64, align 8
+// CHECK-NEXT:    [[P_ADDR:%.*]] = alloca ptr, align 8
+// CHECK-NEXT:    [[V_ADDR:%.*]] = alloca i64, align 8
+// CHECK-NEXT:    [[DOTATOMICTMP:%.*]] = alloca i64, align 8
+// CHECK-NEXT:    [[ATOMIC_TEMP:%.*]] = alloca i64, align 8
+// CHECK-NEXT:    store i64 [[V_COERCE]], ptr [[V]], align 8
+// CHECK-NEXT:    [[TMP0:%.*]] = load i64, ptr [[V]], align 8
+// CHECK-NEXT:    [[V1:%.*]] = trunc i64 [[TMP0]] to i37
+// CHECK-NEXT:    store ptr [[P]], ptr [[P_ADDR]], align 8
+// CHECK-NEXT:    [[STOREDV:%.*]] = sext i37 [[V1]] to i64
+// CHECK-NEXT:    store i64 [[STOREDV]], ptr [[V_ADDR]], align 8
+// CHECK-NEXT:    [[TMP1:%.*]] = load ptr, ptr [[P_ADDR]], align 8
+// CHECK-NEXT:    [[TMP2:%.*]] = load i64, ptr [[V_ADDR]], align 8
+// CHECK-NEXT:    [[LOADEDV:%.*]] = trunc i64 [[TMP2]] to i37
+// CHECK-NEXT:    [[STOREDV2:%.*]] = sext i37 [[LOADEDV]] to i64
+// CHECK-NEXT:    store i64 [[STOREDV2]], ptr [[DOTATOMICTMP]], align 8
+// CHECK-NEXT:    [[TMP3:%.*]] = load i64, ptr [[DOTATOMICTMP]], align 8
+// CHECK-NEXT:    [[LOADEDV3:%.*]] = trunc i64 [[TMP3]] to i37
+// CHECK-NEXT:    [[ATOMIC_LOAD:%.*]] = load atomic i64, ptr [[TMP1]] seq_cst, align 8
+// CHECK-NEXT:    br label %[[ATOMIC_CONT:.*]]
+// CHECK:       [[ATOMIC_CONT]]:
+// CHECK-NEXT:    [[TMP4:%.*]] = phi i64 [ [[ATOMIC_LOAD]], %[[ENTRY]] ], [ [[TMP8:%.*]], %[[ATOMIC_CONT]] ]
+// CHECK-NEXT:    [[LOADEDV4:%.*]] = trunc i64 [[TMP4]] to i37
+// CHECK-NEXT:    [[TMP5:%.*]] = icmp sgt i37 [[LOADEDV4]], [[LOADEDV3]]
+// CHECK-NEXT:    [[NEW:%.*]] = select i1 [[TMP5]], i37 [[LOADEDV4]], i37 [[LOADEDV3]]
+// CHECK-NEXT:    [[STOREDV5:%.*]] = sext i37 [[NEW]] to i64
+// CHECK-NEXT:    store atomic i64 [[STOREDV5]], ptr [[ATOMIC_TEMP]] seq_cst, align 8
+// CHECK-NEXT:    [[TMP6:%.*]] = load i64, ptr [[ATOMIC_TEMP]], align 8
+// CHECK-NEXT:    [[TMP7:%.*]] = cmpxchg ptr [[TMP1]], i64 [[TMP4]], i64 [[TMP6]] seq_cst seq_cst, align 8
+// CHECK-NEXT:    [[TMP8]] = extractvalue { i64, i1 } [[TMP7]], 0
+// CHECK-NEXT:    [[TMP9:%.*]] = extractvalue { i64, i1 } [[TMP7]], 1
+// CHECK-NEXT:    br i1 [[TMP9]], label %[[ATOMIC_EXIT:.*]], label %[[ATOMIC_CONT]]
+// CHECK:       [[ATOMIC_EXIT]]:
+// CHECK-NEXT:    store i37 [[LOADEDV4]], ptr [[RETVAL]], align 8
+// CHECK-NEXT:    [[TMP10:%.*]] = load i37, ptr [[RETVAL]], align 8
+// CHECK-NEXT:    [[COERCE_VAL_II:%.*]] = zext i37 [[TMP10]] to i64
+// CHECK-NEXT:    ret i64 [[COERCE_VAL_II]]
+//
+S37 max37(_Atomic(S37) *p, S37 v) {
+  return __c11_atomic_fetch_max(p, v, __ATOMIC_SEQ_CST);
+}
+
+// Unsigned min computes at the value width with an unsigned compare.
+// CHECK-LABEL: define dso_local i64 @umin37(
+// CHECK-SAME: ptr noundef [[P:%.*]], i64 noundef [[V_COERCE:%.*]]) #[[ATTR0]] {
+// CHECK-NEXT:  [[ENTRY:.*]]:
+// CHECK-NEXT:    [[RETVAL:%.*]] = alloca i37, align 8
+// CHECK-NEXT:    [[V:%.*]] = alloca i64, align 8
+// CHECK-NEXT:    [[P_ADDR:%.*]] = alloca ptr, align 8
+// CHECK-NEXT:    [[V_ADDR:%.*]] = alloca i64, align 8
+// CHECK-NEXT:    [[DOTATOMICTMP:%.*]] = alloca i64, align 8
+// CHECK-NEXT:    [[ATOMIC_TEMP:%.*]] = alloca i64, align 8
+// CHECK-NEXT:    store i64 [[V_COERCE]], ptr [[V]], align 8
+// CHECK-NEXT:    [[TMP0:%.*]] = load i64, ptr [[V]], align 8
+// CHECK-NEXT:    [[V1:%.*]] = trunc i64 [[TMP0]] to i37
+// CHECK-NEXT:    store ptr [[P]], ptr [[P_ADDR]], align 8
+// CHECK-NEXT:    [[STOREDV:%.*]] = zext i37 [[V1]] to i64
+// CHECK-NEXT:    store i64 [[STOREDV]], ptr [[V_ADDR]], align 8
+// CHECK-NEXT:    [[TMP1:%.*]] = load ptr, ptr [[P_ADDR]], align 8
+// CHECK-NEXT:    [[TMP2:%.*]] = load i64, ptr [[V_ADDR]], align 8
+// CHECK-NEXT:    [[LOADEDV:%.*]] = trunc i64 [[TMP2]] to i37
+// CHECK-NEXT:    [[STOREDV2:%.*]] = zext i37 [[LOADEDV]] to i64
+// CHECK-NEXT:    store i64 [[STOREDV2]], ptr [[DOTATOMICTMP]], align 8
+// CHECK-NEXT:    [[TMP3:%.*]] = load i64, ptr [[DOTATOMICTMP]], align 8
+// CHECK-NEXT:    [[LOADEDV3:%.*]] = trunc i64 [[TMP3]] to i37
+// CHECK-NEXT:    [[ATOMIC_LOAD:%.*]] = load atomic i64, ptr [[TMP1]] seq_cst, align 8
+// CHECK-NEXT:    br label %[[ATOMIC_CONT:.*]]
+// CHECK:       [[ATOMIC_CONT]]:
+// CHECK-NEXT:    [[TMP4:%.*]] = phi i64 [ [[ATOMIC_LOAD]], %[[ENTRY]] ], [ [[TMP8:%.*]], %[[ATOMIC_CONT]] ]
+// CHECK-NEXT:    [[LOADEDV4:%.*]] = trunc i64 [[TMP4]] to i37
+// CHECK-NEXT:    [[TMP5:%.*]] = icmp ule i37 [[LOADEDV4]], [[LOADEDV3]]
+// CHECK-NEXT:    [[NEW:%.*]] = select i1 [[TMP5]], i37 [[LOADEDV4]], i37 [[LOADEDV3]]
+// CHECK-NEXT:    [[STOREDV5:%.*]] = zext i37 [[NEW]] to i64
+// CHECK-NEXT:    store atomic i64 [[STOREDV5]], ptr [[ATOMIC_TEMP]] seq_cst, align 8
+// CHECK-NEXT:    [[TMP6:%.*]] = load i64, ptr [[ATOMIC_TEMP]], align 8
+// CHECK-NEXT:    [[TMP7:%.*]] = cmpxchg ptr [[TMP1]], i64 [[TMP4]], i64 [[TMP6]] seq_cst seq_cst, align 8
+// CHECK-NEXT:    [[TMP8]] = extractvalue { i64, i1 } [[TMP7]], 0
+// CHECK-NEXT:    [[TMP9:%.*]] = extractvalue { i64, i1 } [[TMP7]], 1
+// CHECK-NEXT:    br i1 [[TMP9]], label %[[ATOMIC_EXIT:.*]], label %[[ATOMIC_CONT]]
+// CHECK:       [[ATOMIC_EXIT]]:
+// CHECK-NEXT:    store i37 [[LOADEDV4]], ptr [[RETVAL]], align 8
+// CHECK-NEXT:    [[TMP10:%.*]] = load i37, ptr [[RETVAL]], align 8
+// CHECK-NEXT:    [[COERCE_VAL_II:%.*]] = zext i37 [[TMP10]] to i64
+// CHECK-NEXT:    ret i64 [[COERCE_VAL_II]]
+//
+U37 umin37(_Atomic(U37) *p, U37 v) {
+  return __c11_atomic_fetch_min(p, v, __ATOMIC_SEQ_CST);
+}
+
 // No padding: direct atomicrmw, no loop.
 // CHECK-LABEL: define dso_local i64 @add64(
 // CHECK-SAME: ptr noundef [[P:%.*]], i64 noundef [[V:%.*]]) #[[ATTR0]] {
diff --git a/clang/test/Sema/atomic-bitint.c b/clang/test/Sema/atomic-bitint.c
index fbb4c518438fb..3bd4faf22e7be 100644
--- a/clang/test/Sema/atomic-bitint.c
+++ b/clang/test/Sema/atomic-bitint.c
@@ -7,8 +7,6 @@
 // code imposes no width cap of its own; widths past 128 are available wherever
 // the target accepts _BitInt > 128 (x86 and RISC-V today).
 
-// expected-no-diagnostics
-
 _Atomic(_BitInt(4))    a4;     // small
 _Atomic(_BitInt(9))    a9;     // non-power-of-two
 _Atomic(_BitInt(37))   a37;    // padded
@@ -35,9 +33,20 @@ void c11_builtins(_Atomic(_BitInt(37)) *p, _BitInt(37) v, _BitInt(37) *e) {
   (void)__c11_atomic_fetch_min(p, v, __ATOMIC_SEQ_CST);
 }
 
-// The GNU __atomic_* builtins take a plain _BitInt pointer.
+// The GNU __atomic_* builtins take a plain _BitInt pointer; the
+// __atomic_<op>_fetch forms return the new value.
 void gnu_builtins(_BitInt(37) *p, _BitInt(37) v) {
   (void)__atomic_load_n(p, __ATOMIC_SEQ_CST);
   __atomic_store_n(p, v, __ATOMIC_SEQ_CST);
   (void)__atomic_fetch_add(p, v, __ATOMIC_SEQ_CST);
+  (void)__atomic_add_fetch(p, v, __ATOMIC_SEQ_CST);
+}
+
+// Lifting the _BitInt rejection must not lose the atomic-specific checks.
+void rejects(_Atomic(_BitInt(37)) *ap, _BitInt(37) *p, _BitInt(37) v) {
+  (void)__c11_atomic_load(ap); // expected-error {{too few arguments to function call}}
+  (void)__c11_atomic_fetch_add(p, v, __ATOMIC_SEQ_CST); // expected-error {{must be a pointer to _Atomic}}
 }
+struct WithAtomicBitIntField {
+  _Atomic(_BitInt(5)) f : 3; // expected-error {{bit-field 'f' has non-integral type}}
+};

>From 9d349acf9dc9f5c74c4103c3ed60252ca95ff53e Mon Sep 17 00:00:00 2001
From: Xavier Roche <xavier.roche at algolia.com>
Date: Sat, 27 Jun 2026 09:56:58 +0200
Subject: [PATCH 9/9] [compiler-rt] Harden the _BitInt atomic runtime test

Use uint64_t for the dirty-padding overlay (unsigned long is 32-bit on
LLP64, where the padding-bit shift was undefined). Read the storage back
after a converged RMW to confirm the padding is canonicalized, and add
returns-new (__atomic_*_fetch) and non-seq_cst ordering coverage.

Assisted-by: Claude (Anthropic)
Co-Authored-By: Claude Opus 4.6 <noreply at anthropic.com>
---
 .../test/builtins/Unit/atomic_bitint_test.c   | 37 +++++++++++++++++--
 1 file changed, 33 insertions(+), 4 deletions(-)
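
The padding read-back added in this patch asserts that the bits above bit 36 are zero, which holds for the positive results used in the test. A more general canonical-form check might look like the sketch below. It assumes, based on the sext/zext stores in the CodeGen CHECK lines, that a signed 37-bit value is stored sign-extended and an unsigned one zero-extended in its 64-bit word. The helper names is_canonical_u37 and is_canonical_s37 are hypothetical. All shifts are done on uint64_t, which sidesteps the 32-bit unsigned long problem on LLP64.

```c
#include <stdbool.h>
#include <stdint.h>

enum { VALUE_BITS = 37, STORAGE_BITS = 64 };

/* Unsigned: every padding bit (bits 37..63) must be zero. */
static bool is_canonical_u37(uint64_t storage) {
  return (storage >> VALUE_BITS) == 0;
}

/* Signed: the sign bit (bit 36) and all 27 padding bits must agree,
   so the top 28 bits are either all zero or all one. */
static bool is_canonical_s37(uint64_t storage) {
  uint64_t high = storage >> (VALUE_BITS - 1);
  uint64_t all_ones =
      (UINT64_C(1) << (STORAGE_BITS - VALUE_BITS + 1)) - 1;
  return high == 0 || high == all_ones;
}
```

Under that assumption, a negative result such as -2 (all 64 bits set except bit 0) is canonical even though its padding is nonzero, so a test that produces negative values would need the signed form of the check rather than a comparison against zero.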

diff --git a/compiler-rt/test/builtins/Unit/atomic_bitint_test.c b/compiler-rt/test/builtins/Unit/atomic_bitint_test.c
index 33a745348a6f0..e0cc3aef61bc2 100644
--- a/compiler-rt/test/builtins/Unit/atomic_bitint_test.c
+++ b/compiler-rt/test/builtins/Unit/atomic_bitint_test.c
@@ -17,6 +17,7 @@
 //===----------------------------------------------------------------------===//
 
 #include <assert.h>
+#include <stdint.h>
 #include <stdio.h>
 
 typedef signed _BitInt(37) S37;
@@ -55,21 +56,25 @@ static void test_ops(void) {
 // Seed non-canonical padding through a union, then RMW. A loop that carried a
 // re-canonicalized expected would never match memory and hang here.
 static void test_dirty_padding(void) {
+  // uint64_t (not unsigned long, which is 32-bit on LLP64) so the padding bit
+  // is representable and the overlay matches the 8-byte atomic.
   union {
     _Atomic(S37) a;
-    unsigned long b;
+    uint64_t b;
   } s;
-  s.b = ((unsigned long)1 << 40) | 5u; // value bits 5, padding bit 40 set
+  s.b = ((uint64_t)1 << 40) | 5u; // value bits 5, padding bit 40 set
   S37 old = __c11_atomic_fetch_add(&s.a, 1, __ATOMIC_SEQ_CST);
   assert(old == 5 && (S37)s.a == 6);
+  assert((s.b >> 37) == 0); // padding canonicalized (positive value)
 
   union {
     _Atomic(U37) a;
-    unsigned long b;
+    uint64_t b;
   } u;
-  u.b = ((unsigned long)3 << 50) | 7u;
+  u.b = ((uint64_t)3 << 50) | 7u;
   U37 uold = __c11_atomic_fetch_add(&u.a, 1, __ATOMIC_SEQ_CST);
   assert(uold == 7 && (U37)u.a == 8);
+  assert((u.b >> 37) == 0); // padding canonicalized (zero-extended)
 
   // Wide padded width (libcall loop): _BitInt(200) has 56 padding bits in its
   // 32-byte storage. Set the overlay at value level (endian-independent): low
@@ -81,11 +86,35 @@ static void test_dirty_padding(void) {
   w.full = (unsigned _BitInt(256))5 | ((unsigned _BitInt(256))0xAA << 240);
   S200 wold = __c11_atomic_fetch_add(&w.a, 1, __ATOMIC_SEQ_CST);
   assert(wold == 5 && (S200)w.a == 6);
+  assert((w.full >> 200) == 0); // padding canonicalized (positive value)
+}
+
+// The __atomic_<op>_fetch builtins return the new value, not the old one.
+static void test_returns_new(void) {
+  S37 a = 100;
+  assert(__atomic_add_fetch(&a, 5, __ATOMIC_SEQ_CST) == 105);
+  assert(__atomic_sub_fetch(&a, 10, __ATOMIC_SEQ_CST) == 95);
+  U37 u = 0;
+  assert(__atomic_or_fetch(&u, 0xF, __ATOMIC_SEQ_CST) == 0xF);
+  S200 w = 100;
+  assert(__atomic_add_fetch(&w, 5, __ATOMIC_SEQ_CST) == 105);
+}
+
+// Each non-seq_cst ordering drives the loop's load/cmpxchg ordering.
+static void test_orderings(void) {
+  _Atomic(S37) a = 10;
+  (void)__c11_atomic_fetch_add(&a, 1, __ATOMIC_RELAXED);
+  (void)__c11_atomic_fetch_add(&a, 1, __ATOMIC_ACQUIRE);
+  (void)__c11_atomic_fetch_add(&a, 1, __ATOMIC_RELEASE);
+  (void)__c11_atomic_fetch_add(&a, 1, __ATOMIC_ACQ_REL);
+  assert((S37)a == 14);
 }
 
 int main(void) {
   test_ops();
   test_dirty_padding();
+  test_returns_new();
+  test_orderings();
   printf("PASS\n");
   return 0;
 }