[Mlir-commits] [mlir] [MLIR][GPU] Fix async.yield gpu.async.token lowering race (PR #190717)

Mon Apr 6 17:49:57 PDT 2026

https://github.com/jaredhoberock created https://github.com/llvm/llvm-project/pull/190717

Root cause of #170833 (flakiness of `Integration/GPU/CUDA/async.mlir` on the Tesla T4 mlir-nvidia buildbot).

In `gpu-to-llvm`, two patterns matched `async.yield` with the same benefit: the structural `ConvertYieldOpTypes` from `populateAsyncStructuralTypeConversionsAndLegality` (which just retypes operands), and `ConvertAsyncYieldToGpuRuntimeCallPattern` (which also creates and records an event on the stream backing each `gpu.async.token` operand). When the IR contained `gpu.launch_func`, the dialect-conversion framework picked the structural pattern, silently dropping the event record. The `async.execute` then yielded a stream pointer where its consumers expected an event, and the host await ended up calling `cuEventSynchronize` on a stream pointer. That call returns an error without waiting, so the host raced against the GPU.
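The failure mode above can be sketched as a deterministic toy model. Everything in it is hypothetical (`HandleKind`, `ToyDevice`, `toyEventSynchronize`, `hostAwaitAndRead`); none of it is the CUDA driver API or the MLIR runtime wrappers. It assumes the host does not act on the error from the failed synchronize call, and it turns the real race into a fixed outcome so the effect is easy to see: a synchronize call handed a stream instead of an event returns an error without waiting, and the host reads the buffer before the kernel has written it.

```cpp
#include <cassert>
#include <vector>

// Toy stand-ins for runtime handles; these are NOT the CUDA driver types.
enum class HandleKind { Stream, Event };

struct ToyDevice {
  std::vector<int> buffer{42, 42}; // initial (stale) contents
  bool kernelPending = true;       // a launched kernel has not finished yet

  // Completing the kernel writes the real result.
  void finishKernel() {
    if (kernelPending) {
      buffer = {84, 84};
      kernelPending = false;
    }
  }
};

// Models the behavior described above: a synchronize call that is handed
// something other than an event reports an error and returns immediately.
bool toyEventSynchronize(ToyDevice &dev, HandleKind handle) {
  if (handle != HandleKind::Event)
    return false;     // error, and crucially no waiting happened
  dev.finishKernel(); // a real wait lets the pending kernel complete
  return true;
}

// Host side of the await. Assumption of this sketch: the error is not acted
// upon, so the read proceeds whether or not the wait succeeded.
std::vector<int> hostAwaitAndRead(HandleKind yielded) {
  ToyDevice dev;
  bool ok = toyEventSynchronize(dev, yielded);
  (void)ok; // deliberately ignored, mirroring the failure mode
  assert(dev.buffer.size() == 2);
  return dev.buffer;
}
```

In this model, `hostAwaitAndRead(HandleKind::Stream)` returns the stale `{42, 42}`, while `hostAwaitAndRead(HandleKind::Event)` returns `{84, 84}`. On real hardware the first case is a race rather than a guaranteed stale read, which is why the bug only showed up on slower GPUs.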

On the Tesla T4 the host outraced the kernel and read the stale values `[42, 42]`. On faster hardware the kernel typically finished in time and the bug went unnoticed.

The fix makes `ConvertAsyncYieldToGpuRuntimeCallPattern` the sole `async.yield` rewriter in `gpu-to-llvm`, eliminating the pattern competition. See the commit message for the three concrete code changes.
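The shape of the fix can be illustrated with a toy pattern list. This is a sketch, not MLIR code: `ToyPattern`, `populateStructuralPatterns`, `populateGpuPatterns`, and `applyYieldPattern` are hypothetical names, and the tie-breaking rule (first registered pattern wins among equal benefits) is an assumption of the sketch, not a statement about how the dialect-conversion framework orders patterns. Only the `addStructuralYieldPattern` flag name and its default of `true` come from the patch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// A toy rewrite pattern for async.yield: a name, a benefit, and a label for
// what the yield carries after the rewrite.
struct ToyPattern {
  std::string name;
  int benefit;
  std::string yields;
};

// Mirrors the opt-out flag added by the patch: by default the structural
// yield pattern is registered, so existing callers keep their behavior.
void populateStructuralPatterns(std::vector<ToyPattern> &patterns,
                                bool addStructuralYieldPattern = true) {
  if (addStructuralYieldPattern)
    patterns.push_back({"structural-yield", 1, "stream"});
}

// The GPU-aware pattern records an event and yields it.
void populateGpuPatterns(std::vector<ToyPattern> &patterns) {
  patterns.push_back({"gpu-aware-yield", 1, "event"});
}

// Toy driver: pick the highest-benefit pattern; ties go to the pattern that
// was registered first (an assumption of this sketch).
std::string applyYieldPattern(const std::vector<ToyPattern> &patterns) {
  assert(!patterns.empty() && "no async.yield pattern registered");
  const ToyPattern *best = &patterns.front();
  for (const ToyPattern &p : patterns)
    if (p.benefit > best->benefit)
      best = &p;
  return best->yields;
}
```

With both patterns registered, `applyYieldPattern` returns `"stream"` in this model because the tie is resolved by order. Passing `false` for `addStructuralYieldPattern` leaves the GPU-aware pattern as the only candidate, so the result is `"event"` regardless of how ties would be broken.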

#190563 (mine) suppressed `CUDA_ERROR_CONTEXT_IS_DESTROYED` on four CUDA stream/event calls. Those errors no longer occur with this PR applied. I will roll back those four suppressions in a follow-up; the `cuModuleUnload` suppression should stay (it covers a separate teardown ordering issue).

## Testing

- New lit test `mlir/test/Conversion/GPUCommon/lower-async-to-gpu-runtime-calls.mlir` checks the IR shape directly. Fails on the unfixed compiler (yield references the stream pointer) and passes on the fix. No GPU required.
- `mlir/test/Conversion/GPUCommon/`, `mlir/test/Conversion/AsyncToLLVM/`, `mlir/test/Dialect/Async/` lit suites all green (37 tests).
- `mlir/test/Integration/GPU/CUDA/async.mlir` (FileCheck re-enabled locally) produces `[84, 84]` and zero cleanup errors. The upstream integration test is still FileCheck-disabled from #190702; it can be re-enabled in a follow-up after this lands and bakes.
- A slow-kernel reproducer (kernel does ~1M sentinel writes before the real result, forcing the race window open on fast hardware) on an RTX A5000: unfixed compiler produces `[7, 7]` 8/10 runs (host reads mid-loop), fixed compiler produces `[84, 84]` 10/10 runs.

From 10afec5852348762272ce61333aacd40444683af Mon Sep 17 00:00:00 2001
From: Jared Hoberock <jaredhoberock at gmail.com>
Date: Mon, 6 Apr 2026 19:37:27 -0500
Subject: [PATCH] [MLIR][GPU] Fix async.yield gpu.async.token lowering race

In gpu-to-llvm, two patterns matched async.yield with the same
benefit: the structural ConvertYieldOpTypes from
populateAsyncStructuralTypeConversionsAndLegality, which just retypes
operands, and ConvertAsyncYieldToGpuRuntimeCallPattern, which also
creates and records an event on the stream backing each
gpu.async.token operand. When the IR contained gpu.launch_func, the
dialect-conversion framework picked the structural pattern, silently
dropping the event record. The async.execute then yielded a stream
pointer where its consumers expected an event, so the host await
ended up calling cuEventSynchronize on a stream pointer. That call
returns an error without waiting, racing the host against the GPU.

This is the root cause of the flakiness tracked in #170833. On slow
hardware (e.g. the Tesla T4 mlir-nvidia buildbot) the host outraced
the kernel and read stale data; on faster hardware the kernel
typically finished in time and the bug went unnoticed.

Make ConvertAsyncYieldToGpuRuntimeCallPattern the sole async.yield
rewriter in gpu-to-llvm:

 - Add an addStructuralYieldPattern parameter to
   populateAsyncStructuralTypeConversionsAndLegality (default true,
   preserving out-of-tree callers).
 - Have gpu-to-llvm pass false so the structural yield pattern is
   not registered.
 - Drop the early bail-out in ConvertAsyncYieldToGpuRuntimeCallPattern
   so it handles every async.yield, retyping operands without
   gpu.async.token type and finalizing the stream behind operands
   with gpu.async.token type.

Add a regression test that exercises gpu-to-llvm on the smallest IR
that triggers the buggy pattern dispatch: a single async.execute
whose body runs a gpu.launch_func and yields the launch's
gpu.async.token. It fails on the unfixed compiler (the yield ends up
referencing the stream) and passes once the GPU-aware pattern is the
only rewriter.

Fixes #170833.
---
 .../mlir/Conversion/AsyncToLLVM/AsyncToLLVM.h |  8 +++-
 .../Conversion/AsyncToLLVM/AsyncToLLVM.cpp    |  6 ++-
 .../GPUCommon/GPUToLLVMConversion.cpp         | 27 +++++++----
 .../lower-async-to-gpu-runtime-calls.mlir     | 46 +++++++++++++++++++
 4 files changed, 75 insertions(+), 12 deletions(-)
 create mode 100644 mlir/test/Conversion/GPUCommon/lower-async-to-gpu-runtime-calls.mlir

diff --git a/mlir/include/mlir/Conversion/AsyncToLLVM/AsyncToLLVM.h b/mlir/include/mlir/Conversion/AsyncToLLVM/AsyncToLLVM.h
index 60441f9faaa60..892814dfd16bc 100644
--- a/mlir/include/mlir/Conversion/AsyncToLLVM/AsyncToLLVM.h
+++ b/mlir/include/mlir/Conversion/AsyncToLLVM/AsyncToLLVM.h
@@ -30,9 +30,15 @@ class RewritePatternSet;
 /// the corresponding async.yield ops need to update their types accordingly to
 /// the TypeConverter, but otherwise don't care what type conversions are
 /// happening.
+///
+/// `addStructuralYieldPattern` controls whether the structural pattern for
+/// `async.yield` is registered. Callers that install their own (more
+/// specialized) `async.yield` rewriter should pass `false` to avoid pattern
+/// competition with that rewriter. The dynamic legality marker for
+/// `async.yield` is still installed regardless.
 void populateAsyncStructuralTypeConversionsAndLegality(
     TypeConverter &typeConverter, RewritePatternSet &patterns,
-    ConversionTarget &target);
+    ConversionTarget &target, bool addStructuralYieldPattern = true);
 
 } // namespace mlir
 
diff --git a/mlir/lib/Conversion/AsyncToLLVM/AsyncToLLVM.cpp b/mlir/lib/Conversion/AsyncToLLVM/AsyncToLLVM.cpp
index 29e6552231f9c..3cfd4cec3826f 100644
--- a/mlir/lib/Conversion/AsyncToLLVM/AsyncToLLVM.cpp
+++ b/mlir/lib/Conversion/AsyncToLLVM/AsyncToLLVM.cpp
@@ -1150,15 +1150,17 @@ class ConvertYieldOpTypes : public OpConversionPattern<async::YieldOp> {
 
 void mlir::populateAsyncStructuralTypeConversionsAndLegality(
     TypeConverter &typeConverter, RewritePatternSet &patterns,
-    ConversionTarget &target) {
+    ConversionTarget &target, bool addStructuralYieldPattern) {
   typeConverter.addConversion([&](TokenType type) { return type; });
   typeConverter.addConversion([&](ValueType type) {
     Type converted = typeConverter.convertType(type.getValueType());
     return converted ? ValueType::get(converted) : converted;
   });
 
-  patterns.add<ConvertExecuteOpTypes, ConvertAwaitOpTypes, ConvertYieldOpTypes>(
+  patterns.add<ConvertExecuteOpTypes, ConvertAwaitOpTypes>(
       typeConverter, patterns.getContext());
+  if (addStructuralYieldPattern)
+    patterns.add<ConvertYieldOpTypes>(typeConverter, patterns.getContext());
 
   target.addDynamicallyLegalOp<AwaitOp, ExecuteOp, async::YieldOp>(
       [&](Operation *op) { return typeConverter.isLegal(op); });
diff --git a/mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp b/mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
index 3e99c537d0e02..7bb829a9d3ef1 100644
--- a/mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
+++ b/mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
@@ -565,8 +565,16 @@ void GpuToLLVMConversionPass::runOnOperation() {
   // These aren't covered by the ConvertToLLVMPatternInterface right now.
   populateVectorToLLVMConversionPatterns(converter, patterns);
   populateFinalizeMemRefToLLVMConversionPatterns(converter, patterns);
-  populateAsyncStructuralTypeConversionsAndLegality(converter, patterns,
-                                                    target);
+  // Skip the structural async.yield pattern: we install our own
+  // ConvertAsyncYieldToGpuRuntimeCallPattern below, which handles
+  // gpu.async.token operands by recording an event on the underlying
+  // stream and yielding the event. Letting the structural pattern run
+  // would silently rewrite the yield's operands without recording an
+  // event, leaving the host await to call cuEventSynchronize on a
+  // stream pointer (a no-op that returns an error), causing a race
+  // between the host and the GPU.
+  populateAsyncStructuralTypeConversionsAndLegality(
+      converter, patterns, target, /*addStructuralYieldPattern=*/false);
   populateGpuToLLVMConversionPatterns(converter, patterns,
                                       kernelBarePtrCallConv,
                                       kernelIntersperseSizeCallConv);
@@ -834,16 +842,17 @@ static bool isGpuAsyncTokenType(Value value) {
   return isa<gpu::AsyncTokenType>(value.getType());
 }
 
-// Converts !gpu.async.token operands of `async.yield` to runtime calls. The
-// !gpu.async.token are lowered to stream within the async.execute region, but
-// are passed as events between them. For each !gpu.async.token operand, we
-// create an event and record it on the stream.
+// Converts `async.yield` to use the GPU runtime. For each !gpu.async.token
+// operand, the underlying stream is finalized: an event is created and
+// recorded on the stream, the event takes the stream's place in the yield,
+// and the stream is destroyed. Operands without !gpu.async.token type are
+// just retyped via the type converter -- this pattern is the sole rewriter
+// for `async.yield` in the gpu-to-llvm conversion (the structural pattern
+// from `populateAsyncStructuralTypeConversionsAndLegality` is intentionally
+// not registered here, see the call site in `runOnOperation`).
 LogicalResult ConvertAsyncYieldToGpuRuntimeCallPattern::matchAndRewrite(
     async::YieldOp yieldOp, OpAdaptor adaptor,
     ConversionPatternRewriter &rewriter) const {
-  if (llvm::none_of(yieldOp.getOperands(), isGpuAsyncTokenType))
-    return rewriter.notifyMatchFailure(yieldOp, "no gpu async token operand");
-
   Location loc = yieldOp.getLoc();
   SmallVector<Value, 4> newOperands(adaptor.getOperands());
   llvm::SmallDenseSet<Value> streams;
diff --git a/mlir/test/Conversion/GPUCommon/lower-async-to-gpu-runtime-calls.mlir b/mlir/test/Conversion/GPUCommon/lower-async-to-gpu-runtime-calls.mlir
new file mode 100644
index 0000000000000..cbb9df1ad8af3
--- /dev/null
+++ b/mlir/test/Conversion/GPUCommon/lower-async-to-gpu-runtime-calls.mlir
@@ -0,0 +1,46 @@
+// RUN: mlir-opt %s --gpu-to-llvm | FileCheck %s
+
+// Regression test for https://github.com/llvm/llvm-project/issues/170833.
+//
+// In `gpu-to-llvm`, an `async.yield` operand of type `!gpu.async.token`
+// must be lowered to an *event* recorded on the stream that produced it,
+// not to the stream pointer itself. Otherwise the host await later calls
+// `cuEventSynchronize` on a stream pointer (a no-op that returns an
+// error), and the host races against the GPU.
+//
+// The bug was that two patterns matched `async.yield` with the same
+// benefit: the structural rewriter from
+// `populateAsyncStructuralTypeConversionsAndLegality` (which only retypes
+// operands) and the GPU-aware rewriter (which also creates and records an
+// event). When the IR contained `gpu.launch_func` (so other patterns ran
+// alongside), the dialect-conversion framework picked the structural one
+// for the yield, dropping the event-record on the floor.
+
+module attributes {gpu.container_module} {
+
+  // CHECK-LABEL: llvm.func @yield_launch_token
+  // CHECK: %[[stream:.*]] = llvm.call @mgpuStreamCreate
+  // CHECK: gpu.launch_func {{.*}} @kmod::@kernel
+  // CHECK: %[[event:.*]] = llvm.call @mgpuEventCreate
+  // CHECK: llvm.call @mgpuEventRecord(%[[event]], %[[stream]])
+  // CHECK: llvm.call @mgpuStreamDestroy(%[[stream]])
+  // CHECK: async.yield %[[event]] : !llvm.ptr
+  func.func @yield_launch_token(%arg : memref<?xi32>) {
+    %c1 = arith.constant 1 : index
+    %t, %r = async.execute -> !async.value<!gpu.async.token> {
+      %0 = gpu.wait async
+      %1 = gpu.launch_func async [%0] @kmod::@kernel
+          blocks in (%c1, %c1, %c1) threads in (%c1, %c1, %c1)
+          args(%arg : memref<?xi32>)
+      async.yield %1 : !gpu.async.token
+    }
+    return
+  }
+
+  gpu.module @kmod [#nvvm.target] {
+    llvm.func @kernel(%a: !llvm.ptr, %b: !llvm.ptr, %c: i64, %d: i64, %e: i64)
+        attributes {gpu.kernel, nvvm.kernel} {
+      llvm.return
+    }
+  }
+}