[Mlir-commits] [mlir] [MLIR][GPU] Fix async.yield gpu.async.token lowering race (PR #190717)

Mon Apr 6 17:50:28 PDT 2026

llvmbot wrote:

@llvm/pr-subscribers-mlir-gpu

Author: Jared Hoberock (jaredhoberock)

<details>
<summary>Changes</summary>

This PR fixes the root cause of #170833 (flakiness of `Integration/GPU/CUDA/async.mlir` on the Tesla T4 mlir-nvidia buildbot).

In `gpu-to-llvm`, two patterns matched `async.yield` with the same benefit: the structural `ConvertYieldOpTypes` from `populateAsyncStructuralTypeConversionsAndLegality` (which just retypes operands), and `ConvertAsyncYieldToGpuRuntimeCallPattern` (which also creates and records an event on the stream backing each `gpu.async.token` operand). When the IR contained `gpu.launch_func`, the dialect-conversion framework picked the structural pattern, silently dropping the event record. The `async.execute` then yielded a stream pointer where its consumers expected an event, and the host await ended up calling `cuEventSynchronize` on a stream pointer. That call returns an error without waiting, so the host raced against the GPU.

On Tesla T4 the host outraced the kernel and read stale `[42, 42]`. On faster hardware the kernel typically finished in time and the bug went unnoticed.
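
The failure mode described above (a synchronize call that returns an error without waiting, and a host that then reads too early) can be sketched with a deterministic toy model. This is not the CUDA driver API: `Handle`, `toyEventSynchronize`, and `awaitAndRead` are hypothetical names, and the "kernel" doubling each element is an assumption chosen only so the toy values line up with the `[42, 42]` and `[84, 84]` mentioned in this PR.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Toy model only: a handle is either a stream or an event.
enum class HandleKind { Stream, Event };

struct Handle {
  HandleKind kind;
  std::function<void()> pendingWork; // "kernel" work that has not finished yet
};

enum class Status { Success, InvalidHandle };

// Stand-in for an event-synchronize call. It "waits" (runs the pending work
// to completion) only for a real event. For anything else it reports an
// error and returns immediately without waiting.
Status toyEventSynchronize(Handle &h) {
  if (h.kind != HandleKind::Event)
    return Status::InvalidHandle;
  if (h.pendingWork) {
    h.pendingWork();
    h.pendingWork = nullptr;
  }
  return Status::Success;
}

// Host-side await that drops the status and then reads the buffer.
std::vector<int> awaitAndRead(HandleKind yielded) {
  std::vector<int> buffer{42, 42};
  // Assumed kernel: doubles every element of the buffer.
  Handle h{yielded, [&buffer] {
             for (int &v : buffer)
               v *= 2;
           }};
  (void)toyEventSynchronize(h); // error is silently ignored
  return buffer;
}
```

In this model, `awaitAndRead(HandleKind::Stream)` returns the stale `{42, 42}` because the synchronize call bails out early, while `awaitAndRead(HandleKind::Event)` returns `{84, 84}`. On real hardware the outcome depends on timing, which is why the bug showed up only on some machines.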

The fix makes `ConvertAsyncYieldToGpuRuntimeCallPattern` the sole `async.yield` rewriter in `gpu-to-llvm`, eliminating the pattern competition. See the commit message for the three concrete code changes.
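
As a rough illustration of why removing one of the two competing patterns makes the outcome deterministic, here is a minimal toy sketch. It is not MLIR's `RewritePatternSet` or its conversion driver: `ToyPattern`, `pickPattern`, and `populate` are hypothetical, and the tie-breaking rule (first registered wins among equal benefits) is an assumption made for the sketch, not a statement about how MLIR orders patterns.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Toy stand-in for a rewrite pattern: a name, a benefit, and a matcher.
struct ToyPattern {
  std::string name;
  int benefit;
  std::function<bool(const std::string &)> matches;
};

// Hypothetical driver: picks the matching pattern with the highest benefit.
// Assumption: among equal benefits, the earliest-registered pattern wins.
std::string pickPattern(const std::vector<ToyPattern> &patterns,
                        const std::string &op) {
  assert(!op.empty() && "operation name must not be empty");
  const ToyPattern *best = nullptr;
  for (const ToyPattern &p : patterns)
    if (p.matches(op) && (!best || p.benefit > best->benefit))
      best = &p;
  return best ? best->name : "<none>";
}

// Mirrors the shape of the fix: a flag decides whether the structural
// yield pattern is registered next to the GPU-aware one.
std::vector<ToyPattern> populate(bool addStructuralYieldPattern) {
  auto isYield = [](const std::string &op) { return op == "async.yield"; };
  std::vector<ToyPattern> patterns;
  if (addStructuralYieldPattern)
    patterns.push_back({"ConvertYieldOpTypes", 1, isYield});
  patterns.push_back({"ConvertAsyncYieldToGpuRuntimeCallPattern", 1, isYield});
  return patterns;
}
```

With `populate(true)` the toy driver selects `ConvertYieldOpTypes` for `async.yield`. With `populate(false)` the GPU-aware pattern is the only candidate, so the result no longer depends on any tie-breaking rule.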

#190563 (mine) suppressed `CUDA_ERROR_CONTEXT_IS_DESTROYED` on four CUDA stream/event calls. Those errors no longer occur with this PR applied. I will roll back those four suppressions in a follow-up; the `cuModuleUnload` suppression should stay, since it covers a separate teardown-ordering issue.

## Testing

- New lit test `mlir/test/Conversion/GPUCommon/lower-async-to-gpu-runtime-calls.mlir` checks the IR shape directly. It fails on the unfixed compiler (the yield references the stream pointer) and passes with the fix. No GPU required.
- The `mlir/test/Conversion/GPUCommon/`, `mlir/test/Conversion/AsyncToLLVM/`, and `mlir/test/Dialect/Async/` lit suites all pass (37 tests).
- `mlir/test/Integration/GPU/CUDA/async.mlir` (with FileCheck re-enabled locally) produces `[84, 84]` and zero cleanup errors. The upstream integration test is still FileCheck-disabled from #190702; it can be re-enabled in a follow-up after this lands and bakes.
- A slow-kernel reproducer (the kernel does ~1M sentinel writes before the real result, forcing the race window open on fast hardware) on an RTX A5000: the unfixed compiler produces `[7, 7]` in 8 of 10 runs (the host reads mid-loop), and the fixed compiler produces `[84, 84]` in 10 of 10 runs.

---
Full diff: https://github.com/llvm/llvm-project/pull/190717.diff


4 Files Affected:

- (modified) mlir/include/mlir/Conversion/AsyncToLLVM/AsyncToLLVM.h (+7-1) 
- (modified) mlir/lib/Conversion/AsyncToLLVM/AsyncToLLVM.cpp (+4-2) 
- (modified) mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp (+18-9) 
- (added) mlir/test/Conversion/GPUCommon/lower-async-to-gpu-runtime-calls.mlir (+46) 


``````````diff

diff --git a/mlir/include/mlir/Conversion/AsyncToLLVM/AsyncToLLVM.h b/mlir/include/mlir/Conversion/AsyncToLLVM/AsyncToLLVM.h
index 60441f9faaa60..892814dfd16bc 100644
--- a/mlir/include/mlir/Conversion/AsyncToLLVM/AsyncToLLVM.h
+++ b/mlir/include/mlir/Conversion/AsyncToLLVM/AsyncToLLVM.h
@@ -30,9 +30,15 @@ class RewritePatternSet;
 /// the corresponding async.yield ops need to update their types accordingly to
 /// the TypeConverter, but otherwise don't care what type conversions are
 /// happening.
+///
+/// `addStructuralYieldPattern` controls whether the structural pattern for
+/// `async.yield` is registered. Callers that install their own (more
+/// specialized) `async.yield` rewriter should pass `false` to avoid pattern
+/// competition with that rewriter. The dynamic legality marker for
+/// `async.yield` is still installed regardless.
 void populateAsyncStructuralTypeConversionsAndLegality(
     TypeConverter &typeConverter, RewritePatternSet &patterns,
-    ConversionTarget &target);
+    ConversionTarget &target, bool addStructuralYieldPattern = true);
 
 } // namespace mlir
 
diff --git a/mlir/lib/Conversion/AsyncToLLVM/AsyncToLLVM.cpp b/mlir/lib/Conversion/AsyncToLLVM/AsyncToLLVM.cpp
index 29e6552231f9c..3cfd4cec3826f 100644
--- a/mlir/lib/Conversion/AsyncToLLVM/AsyncToLLVM.cpp
+++ b/mlir/lib/Conversion/AsyncToLLVM/AsyncToLLVM.cpp
@@ -1150,15 +1150,17 @@ class ConvertYieldOpTypes : public OpConversionPattern<async::YieldOp> {
 
 void mlir::populateAsyncStructuralTypeConversionsAndLegality(
     TypeConverter &typeConverter, RewritePatternSet &patterns,
-    ConversionTarget &target) {
+    ConversionTarget &target, bool addStructuralYieldPattern) {
   typeConverter.addConversion([&](TokenType type) { return type; });
   typeConverter.addConversion([&](ValueType type) {
     Type converted = typeConverter.convertType(type.getValueType());
     return converted ? ValueType::get(converted) : converted;
   });
 
-  patterns.add<ConvertExecuteOpTypes, ConvertAwaitOpTypes, ConvertYieldOpTypes>(
+  patterns.add<ConvertExecuteOpTypes, ConvertAwaitOpTypes>(
       typeConverter, patterns.getContext());
+  if (addStructuralYieldPattern)
+    patterns.add<ConvertYieldOpTypes>(typeConverter, patterns.getContext());
 
   target.addDynamicallyLegalOp<AwaitOp, ExecuteOp, async::YieldOp>(
       [&](Operation *op) { return typeConverter.isLegal(op); });
diff --git a/mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp b/mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
index 3e99c537d0e02..7bb829a9d3ef1 100644
--- a/mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
+++ b/mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
@@ -565,8 +565,16 @@ void GpuToLLVMConversionPass::runOnOperation() {
   // These aren't covered by the ConvertToLLVMPatternInterface right now.
   populateVectorToLLVMConversionPatterns(converter, patterns);
   populateFinalizeMemRefToLLVMConversionPatterns(converter, patterns);
-  populateAsyncStructuralTypeConversionsAndLegality(converter, patterns,
-                                                    target);
+  // Skip the structural async.yield pattern: we install our own
+  // ConvertAsyncYieldToGpuRuntimeCallPattern below, which handles
+  // gpu.async.token operands by recording an event on the underlying
+  // stream and yielding the event. Letting the structural pattern run
+  // would silently rewrite the yield's operands without recording an
+  // event, leaving the host await to call cuEventSynchronize on a
+  // stream pointer (a no-op that returns an error), causing a race
+  // between the host and the GPU.
+  populateAsyncStructuralTypeConversionsAndLegality(
+      converter, patterns, target, /*addStructuralYieldPattern=*/false);
   populateGpuToLLVMConversionPatterns(converter, patterns,
                                       kernelBarePtrCallConv,
                                       kernelIntersperseSizeCallConv);
@@ -834,16 +842,17 @@ static bool isGpuAsyncTokenType(Value value) {
   return isa<gpu::AsyncTokenType>(value.getType());
 }
 
-// Converts !gpu.async.token operands of `async.yield` to runtime calls. The
-// !gpu.async.token are lowered to stream within the async.execute region, but
-// are passed as events between them. For each !gpu.async.token operand, we
-// create an event and record it on the stream.
+// Converts `async.yield` to use the GPU runtime. For each !gpu.async.token
+// operand, the underlying stream is finalized: an event is created and
+// recorded on the stream, the event takes the stream's place in the yield,
+// and the stream is destroyed. Operands without !gpu.async.token type are
+// just retyped via the type converter -- this pattern is the sole rewriter
+// for `async.yield` in the gpu-to-llvm conversion (the structural pattern
+// from `populateAsyncStructuralTypeConversionsAndLegality` is intentionally
+// not registered here, see the call site in `runOnOperation`).
 LogicalResult ConvertAsyncYieldToGpuRuntimeCallPattern::matchAndRewrite(
     async::YieldOp yieldOp, OpAdaptor adaptor,
     ConversionPatternRewriter &rewriter) const {
-  if (llvm::none_of(yieldOp.getOperands(), isGpuAsyncTokenType))
-    return rewriter.notifyMatchFailure(yieldOp, "no gpu async token operand");
-
   Location loc = yieldOp.getLoc();
   SmallVector<Value, 4> newOperands(adaptor.getOperands());
   llvm::SmallDenseSet<Value> streams;
diff --git a/mlir/test/Conversion/GPUCommon/lower-async-to-gpu-runtime-calls.mlir b/mlir/test/Conversion/GPUCommon/lower-async-to-gpu-runtime-calls.mlir
new file mode 100644
index 0000000000000..cbb9df1ad8af3
--- /dev/null
+++ b/mlir/test/Conversion/GPUCommon/lower-async-to-gpu-runtime-calls.mlir
@@ -0,0 +1,46 @@
+// RUN: mlir-opt %s --gpu-to-llvm | FileCheck %s
+
+// Regression test for https://github.com/llvm/llvm-project/issues/170833.
+//
+// In `gpu-to-llvm`, an `async.yield` operand of type `!gpu.async.token`
+// must be lowered to an *event* recorded on the stream that produced it,
+// not to the stream pointer itself. Otherwise the host await later calls
+// `cuEventSynchronize` on a stream pointer (a no-op that returns an
+// error), and the host races against the GPU.
+//
+// The bug was that two patterns matched `async.yield` with the same
+// benefit: the structural rewriter from
+// `populateAsyncStructuralTypeConversionsAndLegality` (which only retypes
+// operands) and the GPU-aware rewriter (which also creates and records an
+// event). When the IR contained `gpu.launch_func` (so other patterns ran
+// alongside), the dialect-conversion framework picked the structural one
+// for the yield, dropping the event-record on the floor.
+
+module attributes {gpu.container_module} {
+
+  // CHECK-LABEL: llvm.func @yield_launch_token
+  // CHECK: %[[stream:.*]] = llvm.call @mgpuStreamCreate
+  // CHECK: gpu.launch_func {{.*}} @kmod::@kernel
+  // CHECK: %[[event:.*]] = llvm.call @mgpuEventCreate
+  // CHECK: llvm.call @mgpuEventRecord(%[[event]], %[[stream]])
+  // CHECK: llvm.call @mgpuStreamDestroy(%[[stream]])
+  // CHECK: async.yield %[[event]] : !llvm.ptr
+  func.func @yield_launch_token(%arg : memref<?xi32>) {
+    %c1 = arith.constant 1 : index
+    %t, %r = async.execute -> !async.value<!gpu.async.token> {
+      %0 = gpu.wait async
+      %1 = gpu.launch_func async [%0] @kmod::@kernel
+          blocks in (%c1, %c1, %c1) threads in (%c1, %c1, %c1)
+          args(%arg : memref<?xi32>)
+      async.yield %1 : !gpu.async.token
+    }
+    return
+  }
+
+  gpu.module @kmod [#nvvm.target] {
+    llvm.func @kernel(%a: !llvm.ptr, %b: !llvm.ptr, %c: i64, %d: i64, %e: i64)
+        attributes {gpu.kernel, nvvm.kernel} {
+      llvm.return
+    }
+  }
+}

``````````

</details>


https://github.com/llvm/llvm-project/pull/190717