[Mlir-commits] [mlir] [MLIR][GPU] Fix async.yield gpu.async.token lowering race (PR #190717)
Jared Hoberock
llvmlistbot at llvm.org
Mon Apr 6 17:49:57 PDT 2026
https://github.com/jaredhoberock created https://github.com/llvm/llvm-project/pull/190717
This PR fixes the root cause of #170833 (flakiness of `Integration/GPU/CUDA/async.mlir` on the Tesla T4 mlir-nvidia buildbot).
In `gpu-to-llvm`, two patterns matched `async.yield` with the same benefit: the structural `ConvertYieldOpTypes` from `populateAsyncStructuralTypeConversionsAndLegality` (which just retypes operands), and `ConvertAsyncYieldToGpuRuntimeCallPattern` (which also creates and records an event on the stream backing each `gpu.async.token` operand). When the IR contained `gpu.launch_func`, the dialect-conversion framework picked the structural pattern, silently dropping the event record. The `async.execute` then yielded a stream pointer where its consumers expected an event, and the host await ended up calling `cuEventSynchronize` on a stream pointer. That call returns an error without waiting, so the host raced against the GPU.
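To make the pattern competition concrete, here is a minimal toy model in plain C++. It is a sketch under stated assumptions, not the dialect-conversion driver: `ToyYield`, `ToyPattern`, and `lowerYield` are hypothetical names, and breaking the benefit tie by registration order is an assumption made purely for illustration. What it shows is that once two rewriters with equal benefit both match, the outcome depends on which one the driver tries first.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Toy stand-ins only; none of these types are MLIR APIs.
struct ToyYield {
  bool hasGpuToken;    // the operand has gpu.async.token type
  std::string operand; // "stream" before lowering, "event" after a correct rewrite
  bool eventRecorded;  // whether an event was recorded on the stream
};

struct ToyPattern {
  std::string name;
  int benefit;
  std::function<bool(ToyYield &)> apply; // returns true if the pattern rewrote the op
};

// Structural rewriter: always matches and only retypes, so the operand stays a stream.
ToyPattern structuralPattern() {
  return {"structural", 1, [](ToyYield &) { return true; }};
}

// GPU-aware rewriter: records an event and yields it in place of the stream.
ToyPattern gpuAwarePattern() {
  return {"gpu-aware", 1, [](ToyYield &y) {
            if (!y.hasGpuToken)
              return false; // models the early bail-out on non-token yields
            y.operand = "event";
            y.eventRecorded = true;
            return true;
          }};
}

// ASSUMPTION: the toy driver tries patterns in descending benefit and keeps
// registration order on ties. The real driver's tie-breaking is not modelled.
std::string lowerYield(std::vector<ToyPattern> patterns, ToyYield &yield) {
  assert(!patterns.empty());
  std::stable_sort(patterns.begin(), patterns.end(),
                   [](const ToyPattern &a, const ToyPattern &b) {
                     return a.benefit > b.benefit;
                   });
  for (const ToyPattern &p : patterns)
    if (p.apply(yield))
      return p.name; // first successful pattern wins
  return "";
}
```

In this model, registering `structuralPattern()` ahead of `gpuAwarePattern()` leaves the yield holding a stream with no event recorded, while registering only `gpuAwarePattern()` always produces an event.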
On Tesla T4 the host outraced the kernel and read stale `[42, 42]`. On faster hardware the kernel typically finished in time and the bug went unnoticed.
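The dependence on hardware speed can be sketched with simulated time instead of a real GPU. This is a toy model only: `HandleKind`, `toyEventSynchronize`, and `hostReadAfterAwait` are hypothetical names and are not CUDA driver calls; the only behaviour taken from the description above is that synchronizing on a stream pointer returns an error without waiting. The values 42 (stale) and 84 (final) are the ones from the integration test.

```cpp
#include <cassert>

// Toy model only: simulated time, hypothetical names, no CUDA involved.
enum class HandleKind { Stream, Event };

struct SyncResult {
  bool ok;      // false models the error returned for a non-event handle
  int resumeAt; // simulated time at which the host continues
};

// Synchronizing on a real event blocks until the kernel is done;
// synchronizing on a stream pointer fails immediately without waiting.
SyncResult toyEventSynchronize(HandleKind kind, int now, int kernelDoneAt) {
  if (kind != HandleKind::Event)
    return {false, now};
  return {true, now > kernelDoneAt ? now : kernelDoneAt};
}

// The buffer holds 42 until the kernel finishes and 84 afterwards.
// hostLatency is the simulated time between the await and the host's read.
int hostReadAfterAwait(HandleKind yielded, int kernelDuration, int hostLatency) {
  assert(kernelDuration >= 0 && hostLatency >= 0);
  SyncResult sync = toyEventSynchronize(yielded, /*now=*/0, kernelDuration);
  int readAt = sync.resumeAt + hostLatency;
  return readAt >= kernelDuration ? 84 : 42;
}
```

With a stream handle, a kernel that takes longer than the host's latency yields 42, while a short kernel happens to finish first and yields 84, which is how the bug can hide on faster hardware. With an event handle the result is 84 regardless of kernel duration.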
The fix makes `ConvertAsyncYieldToGpuRuntimeCallPattern` the sole `async.yield` rewriter in `gpu-to-llvm`, eliminating the pattern competition. See the commit message for the three concrete code changes.
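As a rough illustration of what a sole rewriter has to do, the toy below lowers every operand of a yield in one place: operands that are not GPU tokens pass through retyped, and each GPU token has an event created and recorded on its stream before the stream is destroyed. `ToyOperand`, `ToyLowering`, and `lowerYieldSolePattern` are hypothetical names; the `mgpu*` strings are the runtime calls named in the regression test's CHECK lines. Deduplicating streams and destroying them only after all events are recorded are assumptions of this sketch, not statements about the actual pattern.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Toy stand-in for one async.yield operand; not an MLIR type.
struct ToyOperand {
  bool isGpuToken;   // true for !gpu.async.token operands
  std::string value; // converted value name; names a stream for token operands
};

struct ToyLowering {
  std::vector<std::string> calls;   // runtime calls emitted, in order
  std::vector<std::string> yielded; // final operands of the rewritten yield
};

// Handles every yield: no early bail-out when there are no token operands.
ToyLowering lowerYieldSolePattern(const std::vector<ToyOperand> &operands) {
  ToyLowering out;
  std::set<std::string> streams; // ASSUMPTION: each stream is destroyed once
  for (const ToyOperand &op : operands) {
    assert(!op.value.empty());
    if (!op.isGpuToken) {
      out.yielded.push_back(op.value); // retyped only, no runtime calls
      continue;
    }
    std::string event = "event_for_" + op.value;
    out.calls.push_back("mgpuEventCreate");
    out.calls.push_back("mgpuEventRecord(" + event + ", " + op.value + ")");
    out.yielded.push_back(event); // the event takes the stream's place
    streams.insert(op.value);
  }
  for (const std::string &s : streams)
    out.calls.push_back("mgpuStreamDestroy(" + s + ")");
  return out;
}
```

A yield with no token operands produces no runtime calls and simply forwards its operands, which is the job the structural pattern used to do in this pass.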
My earlier PR #190563 suppressed `CUDA_ERROR_CONTEXT_IS_DESTROYED` on four CUDA stream/event calls. Those errors no longer occur with this PR applied. I will roll back those four suppressions in a follow-up; the `cuModuleUnload` suppression should stay, since it covers a separate teardown-ordering issue.
## Testing
- New lit test `mlir/test/Conversion/GPUCommon/lower-async-to-gpu-runtime-calls.mlir` checks the IR shape directly. It fails on the unfixed compiler (the yield references the stream pointer) and passes with the fix. No GPU required.
- `mlir/test/Conversion/GPUCommon/`, `mlir/test/Conversion/AsyncToLLVM/`, `mlir/test/Dialect/Async/` lit suites all green (37 tests).
- `mlir/test/Integration/GPU/CUDA/async.mlir` (FileCheck re-enabled locally) produces `[84, 84]` and zero cleanup errors. The upstream integration test still has FileCheck disabled since #190702; it can be re-enabled in a follow-up once this change has landed and had time to bake.
- A slow-kernel reproducer (kernel does ~1M sentinel writes before the real result, forcing the race window open on fast hardware) on an RTX A5000: unfixed compiler produces `[7, 7]` 8/10 runs (host reads mid-loop), fixed compiler produces `[84, 84]` 10/10 runs.
>From 10afec5852348762272ce61333aacd40444683af Mon Sep 17 00:00:00 2001
From: Jared Hoberock <jaredhoberock at gmail.com>
Date: Mon, 6 Apr 2026 19:37:27 -0500
Subject: [PATCH] [MLIR][GPU] Fix async.yield gpu.async.token lowering race
In gpu-to-llvm, two patterns matched async.yield with the same
benefit: the structural ConvertYieldOpTypes from
populateAsyncStructuralTypeConversionsAndLegality, which just retypes
operands, and ConvertAsyncYieldToGpuRuntimeCallPattern, which also
creates and records an event on the stream backing each
gpu.async.token operand. When the IR contained gpu.launch_func, the
dialect-conversion framework picked the structural pattern, silently
dropping the event record. The async.execute then yielded a stream
pointer where its consumers expected an event, so the host await
ended up calling cuEventSynchronize on a stream pointer. That call
returns an error without waiting, racing the host against the GPU.
This is the root cause of the flakiness tracked in #170833. On slow
hardware (e.g. the Tesla T4 mlir-nvidia buildbot) the host outraced
the kernel and read stale data; on faster hardware the kernel
typically finished in time and the bug went unnoticed.
Make ConvertAsyncYieldToGpuRuntimeCallPattern the sole async.yield
rewriter in gpu-to-llvm:
- Add an addStructuralYieldPattern parameter to
  populateAsyncStructuralTypeConversionsAndLegality (default true,
  preserving out-of-tree callers).
- Have gpu-to-llvm pass false so the structural yield pattern is
  not registered.
- Drop the early bail-out in ConvertAsyncYieldToGpuRuntimeCallPattern
  so it handles every async.yield, retyping operands without
  gpu.async.token type and finalizing the stream behind operands
  with gpu.async.token type.
Add a regression test that exercises gpu-to-llvm on the smallest IR
that triggers the buggy pattern dispatch: a single async.execute
whose body runs a gpu.launch_func and yields the launch's
gpu.async.token. It fails on the unfixed compiler (the yield ends up
referencing the stream) and passes once the GPU-aware pattern is the
only rewriter.
Fixes #170833.
---
.../mlir/Conversion/AsyncToLLVM/AsyncToLLVM.h | 8 +++-
.../Conversion/AsyncToLLVM/AsyncToLLVM.cpp | 6 ++-
.../GPUCommon/GPUToLLVMConversion.cpp | 27 +++++++----
.../lower-async-to-gpu-runtime-calls.mlir | 46 +++++++++++++++++++
4 files changed, 75 insertions(+), 12 deletions(-)
create mode 100644 mlir/test/Conversion/GPUCommon/lower-async-to-gpu-runtime-calls.mlir
diff --git a/mlir/include/mlir/Conversion/AsyncToLLVM/AsyncToLLVM.h b/mlir/include/mlir/Conversion/AsyncToLLVM/AsyncToLLVM.h
index 60441f9faaa60..892814dfd16bc 100644
--- a/mlir/include/mlir/Conversion/AsyncToLLVM/AsyncToLLVM.h
+++ b/mlir/include/mlir/Conversion/AsyncToLLVM/AsyncToLLVM.h
@@ -30,9 +30,15 @@ class RewritePatternSet;
/// the corresponding async.yield ops need to update their types accordingly to
/// the TypeConverter, but otherwise don't care what type conversions are
/// happening.
+///
+/// `addStructuralYieldPattern` controls whether the structural pattern for
+/// `async.yield` is registered. Callers that install their own (more
+/// specialized) `async.yield` rewriter should pass `false` to avoid pattern
+/// competition with that rewriter. The dynamic legality marker for
+/// `async.yield` is still installed regardless.
void populateAsyncStructuralTypeConversionsAndLegality(
TypeConverter &typeConverter, RewritePatternSet &patterns,
- ConversionTarget &target);
+ ConversionTarget &target, bool addStructuralYieldPattern = true);
} // namespace mlir
diff --git a/mlir/lib/Conversion/AsyncToLLVM/AsyncToLLVM.cpp b/mlir/lib/Conversion/AsyncToLLVM/AsyncToLLVM.cpp
index 29e6552231f9c..3cfd4cec3826f 100644
--- a/mlir/lib/Conversion/AsyncToLLVM/AsyncToLLVM.cpp
+++ b/mlir/lib/Conversion/AsyncToLLVM/AsyncToLLVM.cpp
@@ -1150,15 +1150,17 @@ class ConvertYieldOpTypes : public OpConversionPattern<async::YieldOp> {
void mlir::populateAsyncStructuralTypeConversionsAndLegality(
TypeConverter &typeConverter, RewritePatternSet &patterns,
- ConversionTarget &target) {
+ ConversionTarget &target, bool addStructuralYieldPattern) {
typeConverter.addConversion([&](TokenType type) { return type; });
typeConverter.addConversion([&](ValueType type) {
Type converted = typeConverter.convertType(type.getValueType());
return converted ? ValueType::get(converted) : converted;
});
- patterns.add<ConvertExecuteOpTypes, ConvertAwaitOpTypes, ConvertYieldOpTypes>(
+ patterns.add<ConvertExecuteOpTypes, ConvertAwaitOpTypes>(
typeConverter, patterns.getContext());
+ if (addStructuralYieldPattern)
+ patterns.add<ConvertYieldOpTypes>(typeConverter, patterns.getContext());
target.addDynamicallyLegalOp<AwaitOp, ExecuteOp, async::YieldOp>(
[&](Operation *op) { return typeConverter.isLegal(op); });
diff --git a/mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp b/mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
index 3e99c537d0e02..7bb829a9d3ef1 100644
--- a/mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
+++ b/mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
@@ -565,8 +565,16 @@ void GpuToLLVMConversionPass::runOnOperation() {
// These aren't covered by the ConvertToLLVMPatternInterface right now.
populateVectorToLLVMConversionPatterns(converter, patterns);
populateFinalizeMemRefToLLVMConversionPatterns(converter, patterns);
- populateAsyncStructuralTypeConversionsAndLegality(converter, patterns,
- target);
+ // Skip the structural async.yield pattern: we install our own
+ // ConvertAsyncYieldToGpuRuntimeCallPattern below, which handles
+ // gpu.async.token operands by recording an event on the underlying
+ // stream and yielding the event. Letting the structural pattern run
+ // would silently rewrite the yield's operands without recording an
+ // event, leaving the host await to call cuEventSynchronize on a
+ // stream pointer (a no-op that returns an error), causing a race
+ // between the host and the GPU.
+ populateAsyncStructuralTypeConversionsAndLegality(
+ converter, patterns, target, /*addStructuralYieldPattern=*/false);
populateGpuToLLVMConversionPatterns(converter, patterns,
kernelBarePtrCallConv,
kernelIntersperseSizeCallConv);
@@ -834,16 +842,17 @@ static bool isGpuAsyncTokenType(Value value) {
return isa<gpu::AsyncTokenType>(value.getType());
}
-// Converts !gpu.async.token operands of `async.yield` to runtime calls. The
-// !gpu.async.token are lowered to stream within the async.execute region, but
-// are passed as events between them. For each !gpu.async.token operand, we
-// create an event and record it on the stream.
+// Converts `async.yield` to use the GPU runtime. For each !gpu.async.token
+// operand, the underlying stream is finalized: an event is created and
+// recorded on the stream, the event takes the stream's place in the yield,
+// and the stream is destroyed. Operands without !gpu.async.token type are
+// just retyped via the type converter -- this pattern is the sole rewriter
+// for `async.yield` in the gpu-to-llvm conversion (the structural pattern
+// from `populateAsyncStructuralTypeConversionsAndLegality` is intentionally
+// not registered here, see the call site in `runOnOperation`).
LogicalResult ConvertAsyncYieldToGpuRuntimeCallPattern::matchAndRewrite(
async::YieldOp yieldOp, OpAdaptor adaptor,
ConversionPatternRewriter &rewriter) const {
- if (llvm::none_of(yieldOp.getOperands(), isGpuAsyncTokenType))
- return rewriter.notifyMatchFailure(yieldOp, "no gpu async token operand");
-
Location loc = yieldOp.getLoc();
SmallVector<Value, 4> newOperands(adaptor.getOperands());
llvm::SmallDenseSet<Value> streams;
diff --git a/mlir/test/Conversion/GPUCommon/lower-async-to-gpu-runtime-calls.mlir b/mlir/test/Conversion/GPUCommon/lower-async-to-gpu-runtime-calls.mlir
new file mode 100644
index 0000000000000..cbb9df1ad8af3
--- /dev/null
+++ b/mlir/test/Conversion/GPUCommon/lower-async-to-gpu-runtime-calls.mlir
@@ -0,0 +1,46 @@
+// RUN: mlir-opt %s --gpu-to-llvm | FileCheck %s
+
+// Regression test for https://github.com/llvm/llvm-project/issues/170833.
+//
+// In `gpu-to-llvm`, an `async.yield` operand of type `!gpu.async.token`
+// must be lowered to an *event* recorded on the stream that produced it,
+// not to the stream pointer itself. Otherwise the host await later calls
+// `cuEventSynchronize` on a stream pointer (a no-op that returns an
+// error), and the host races against the GPU.
+//
+// The bug was that two patterns matched `async.yield` with the same
+// benefit: the structural rewriter from
+// `populateAsyncStructuralTypeConversionsAndLegality` (which only retypes
+// operands) and the GPU-aware rewriter (which also creates and records an
+// event). When the IR contained `gpu.launch_func` (so other patterns ran
+// alongside), the dialect-conversion framework picked the structural one
+// for the yield, dropping the event-record on the floor.
+
+module attributes {gpu.container_module} {
+
+ // CHECK-LABEL: llvm.func @yield_launch_token
+ // CHECK: %[[stream:.*]] = llvm.call @mgpuStreamCreate
+ // CHECK: gpu.launch_func {{.*}} @kmod::@kernel
+ // CHECK: %[[event:.*]] = llvm.call @mgpuEventCreate
+ // CHECK: llvm.call @mgpuEventRecord(%[[event]], %[[stream]])
+ // CHECK: llvm.call @mgpuStreamDestroy(%[[stream]])
+ // CHECK: async.yield %[[event]] : !llvm.ptr
+ func.func @yield_launch_token(%arg : memref<?xi32>) {
+ %c1 = arith.constant 1 : index
+ %t, %r = async.execute -> !async.value<!gpu.async.token> {
+ %0 = gpu.wait async
+ %1 = gpu.launch_func async [%0] @kmod::@kernel
+ blocks in (%c1, %c1, %c1) threads in (%c1, %c1, %c1)
+ args(%arg : memref<?xi32>)
+ async.yield %1 : !gpu.async.token
+ }
+ return
+ }
+
+ gpu.module @kmod [#nvvm.target] {
+ llvm.func @kernel(%a: !llvm.ptr, %b: !llvm.ptr, %c: i64, %d: i64, %e: i64)
+ attributes {gpu.kernel, nvvm.kernel} {
+ llvm.return
+ }
+ }
+}