[flang-commits] [flang] [flang][cuda] Fix CUFDeviceGlobal duplicate skip and CUFAddConstructor for empty gpu.module (PR #194290)

Sun Apr 26 22:25:11 PDT 2026

https://github.com/khaki3 created https://github.com/llvm/llvm-project/pull/194290

**Fix 1:**

```fortran
module m
  real, device :: a(3), b(3), c(3)
contains
  attributes(global) subroutine kernel()
    a(1) = 1.0
  end subroutine
end module
```

When a kernel references the global `a`, an earlier pass (`CUFDeviceFuncTransform`) clones it into the `gpu.module`. When `CUFDeviceGlobal` later processes all device globals, it finds that `a` already exists and hits a `break`, which exits the loop and skips `b` and `c` entirely.

Fix: Change `break` to `continue` in `CUFDeviceGlobal.cpp`.
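
To illustrate the difference outside of MLIR, here is a minimal standalone sketch. It is not the pass itself: `SymbolSet`, `cloneCandidates`, and the `skipDuplicates` flag are hypothetical names, and the `gpu.module` symbol table is modeled as a plain `std::set` of names.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for the gpu.module symbol table: a set of names.
using SymbolSet = std::set<std::string>;

// Copies every candidate that is not already present into `gpuSymbols`.
// skipDuplicates == true models the fixed loop (`continue`);
// skipDuplicates == false models the old loop (`break` at first duplicate).
SymbolSet cloneCandidates(const std::vector<std::string> &candidates,
                          SymbolSet gpuSymbols, bool skipDuplicates) {
  for (const std::string &name : candidates) {
    if (gpuSymbols.count(name)) {
      if (skipDuplicates)
        continue; // fixed behavior: move on to the next candidate
      break;      // old behavior: drop all remaining candidates
    }
    gpuSymbols.insert(name);
  }
  return gpuSymbols;
}

// Self-check mirroring the Fortran example: `a` is already in the gpu.module.
void checkCloneCandidates() {
  const std::vector<std::string> candidates{"a", "b", "c"};
  const SymbolSet before{"a"};
  const SymbolSet expectedAll{"a", "b", "c"};
  // Old loop: stops at `a`, so `b` and `c` never get cloned.
  assert(cloneCandidates(candidates, before, /*skipDuplicates=*/false) == before);
  // Fixed loop: skips `a` and still clones `b` and `c`.
  assert(cloneCandidates(candidates, before, /*skipDuplicates=*/true) == expectedAll);
}
```

In this model the old loop leaves the set at `{a}`, while the fixed loop ends with `{a, b, c}`.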


**Fix 2:**

```fortran
module kernels_m
  real, device :: dev_var
contains
  attributes(global) subroutine my_kernel()
  end subroutine
end module

program main
  use kernels_m
end program
```

When a compilation unit USEs a CUDA module but defines no kernels itself, a `gpu.module` without kernels is still created and registered with the CUDA runtime, causing `cudaErrorNoKernelImageForDevice` at runtime.

Fix: Return early from the constructor if the `gpu.module` has no kernels. Also skip variable registration for globals not in the `gpu.module`.
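
The two guards can be sketched in plain C++ as follows. This is only a model under stated assumptions: `DeviceFunc`, `hasKernel`, and `globalsToRegister` are hypothetical names, the `gpu.module` is reduced to a list of functions plus a set of symbol names, and the existing data-attribute check on each global is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical model of a function inside the gpu.module.
struct DeviceFunc {
  std::string name;
  bool isKernel;
};

// True when at least one function is a kernel. The patch does the
// equivalent check with an explicit loop over the gpu.func ops.
bool hasKernel(const std::vector<DeviceFunc> &funcs) {
  return std::any_of(funcs.begin(), funcs.end(),
                     [](const DeviceFunc &f) { return f.isKernel; });
}

// Returns the names of the globals that would be registered:
// none when the module has no kernel (early return), otherwise only
// those host-side globals that also exist in the gpu.module.
std::vector<std::string>
globalsToRegister(const std::vector<DeviceFunc> &funcs,
                  const std::vector<std::string> &hostGlobals,
                  const std::set<std::string> &gpuSymbols) {
  std::vector<std::string> result;
  if (!hasKernel(funcs))
    return result; // guard 1: nothing to register without kernels
  for (const std::string &name : hostGlobals) {
    if (!gpuSymbols.count(name))
      continue; // guard 2: global is not in the gpu.module
    result.push_back(name);
  }
  return result;
}

// Self-check: a module with only a non-kernel function registers nothing.
void checkGlobalsToRegister() {
  const std::vector<DeviceFunc> noKernel{{"not_a_kernel", false}};
  const std::vector<std::string> hostGlobals{"dev_var", "host_only"};
  const std::set<std::string> gpuSymbols{"dev_var"};
  assert(globalsToRegister(noKernel, hostGlobals, gpuSymbols).empty());
}
```

With a kernel present, only `dev_var` would be returned in this model, since `host_only` has no counterpart in the modeled `gpu.module`.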


From 67e26c6be0d2b911e27136de477e950b745b62a7 Mon Sep 17 00:00:00 2001
From: Kazuaki Matsumura <kmatsumura at nvidia.com>
Date: Sun, 26 Apr 2026 18:25:12 -0700
Subject: [PATCH 1/3] [flang][cuda] Fix break-instead-of-continue typo in
 CUFDeviceGlobal

CUFDeviceGlobal::prepareImplicitDeviceGlobals uses a loop to clone
candidate fir.global ops into the gpu.module. When it finds a global
that already exists in the gpu.module symbol table, it should skip
that candidate and continue to the next one. A typo used `break`
instead of `continue`, causing all remaining candidates to be skipped
after the first duplicate was found.

This resulted in missing device globals and cudaErrorIllegalAddress
at runtime when the kernel accessed unregistered constant memory.

Made-with: Cursor
---
 flang/lib/Optimizer/Transforms/CUDA/CUFDeviceGlobal.cpp | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/flang/lib/Optimizer/Transforms/CUDA/CUFDeviceGlobal.cpp b/flang/lib/Optimizer/Transforms/CUDA/CUFDeviceGlobal.cpp
index 10682314567b9..2046e78923498 100644
--- a/flang/lib/Optimizer/Transforms/CUDA/CUFDeviceGlobal.cpp
+++ b/flang/lib/Optimizer/Transforms/CUDA/CUFDeviceGlobal.cpp
@@ -155,7 +155,7 @@ class CUFDeviceGlobal : public fir::impl::CUFDeviceGlobalBase<CUFDeviceGlobal> {
     for (auto globalOp : candidates) {
       auto globalName{globalOp.getSymbol().getValue()};
       if (gpuSymTable.lookup<fir::GlobalOp>(globalName)) {
-        break;
+        continue;
       }
       gpuSymTable.insert(globalOp->clone());
     }

From 054e03b2aace51abe3e11c4ed7a21defbc4e215e Mon Sep 17 00:00:00 2001
From: Kazuaki Matsumura <kmatsumura at nvidia.com>
Date: Sun, 26 Apr 2026 18:25:29 -0700
Subject: [PATCH 2/3] [flang][cuda] Skip constructor generation for host TUs
 without kernels

CUFAddConstructor generates a __cudaFortranConstructor that registers
the GPU module binary. When a host translation unit USEs a module
containing device code but does not define any kernels itself, a
spurious gpu.module is still created. Attempting to register this
empty module causes cudaErrorNoKernelImageForDevice at runtime.

Add a check to skip GPU binary registration when the gpu.module
contains no kernel functions. Also guard variable registration to
only process globals that actually exist in the gpu.module symbol
table, preventing attempts to register host-only variables as
device symbols.

Made-with: Cursor
---
 .../Transforms/CUDA/CUFAddConstructor.cpp       | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/flang/lib/Optimizer/Transforms/CUDA/CUFAddConstructor.cpp b/flang/lib/Optimizer/Transforms/CUDA/CUFAddConstructor.cpp
index c1cb52d264afa..4e7e11d27b289 100644
--- a/flang/lib/Optimizer/Transforms/CUDA/CUFAddConstructor.cpp
+++ b/flang/lib/Optimizer/Transforms/CUDA/CUFAddConstructor.cpp
@@ -118,6 +118,20 @@ struct CUFAddConstructor
 
     auto gpuMod = symTab.lookup<mlir::gpu::GPUModuleOp>(cudaDeviceModuleName);
     if (gpuMod) {
+      bool hasKernel = false;
+      for (auto func : gpuMod.getOps<mlir::gpu::GPUFuncOp>()) {
+        if (func.isKernel()) {
+          hasKernel = true;
+          break;
+        }
+      }
+      if (!hasKernel) {
+        // No kernels means no GPU binary to register. This happens for host
+        // TUs that USE a kernel module but don't define any device code.
+        builder.create<mlir::LLVM::ReturnOp>(loc, mlir::ValueRange{});
+        return;
+      }
+
       auto llvmPtrTy = mlir::LLVM::LLVMPointerType::get(ctx);
       auto registeredMod = cuf::RegisterModuleOp::create(
           builder, loc, llvmPtrTy,
@@ -137,11 +151,14 @@ struct CUFAddConstructor
       }
 
       // Register variables
+      mlir::SymbolTable gpuSymTable(gpuMod);
       bool hasNonAllocManagedGlobal = false;
       for (fir::GlobalOp globalOp : mod.getOps<fir::GlobalOp>()) {
         auto attr = globalOp.getDataAttrAttr();
         if (!attr)
           continue;
+        if (!gpuSymTable.lookup(globalOp.getSymName()))
+          continue;
 
         bool isNonAllocManagedGlobal =
             attr.getValue() == cuf::DataAttribute::Managed &&

From a2289c5363253fb2a9f8b0d862377d6a8e1660c0 Mon Sep 17 00:00:00 2001
From: Kazuaki Matsumura <kmatsumura at nvidia.com>
Date: Sun, 26 Apr 2026 22:09:28 -0700
Subject: [PATCH 3/3] [flang][cuda] Add tests for CUFDeviceGlobal and
 CUFAddConstructor fixes

Add test case to cuda-device-global.f90 verifying that globals after a
duplicate are still cloned into the gpu.module.

Add test case to cuda-constructor-2.f90 verifying that when the
gpu.module has no kernels, the constructor returns early.

Update existing test sections to include kernels and globals in
gpu.module to match the corrected behavior.

Made-with: Cursor
---
 flang/test/Fir/CUDA/cuda-constructor-2.f90 | 50 ++++++++++++++++++++++
 flang/test/Fir/CUDA/cuda-device-global.f90 | 20 +++++++++
 2 files changed, 70 insertions(+)

diff --git a/flang/test/Fir/CUDA/cuda-constructor-2.f90 b/flang/test/Fir/CUDA/cuda-constructor-2.f90
index 15f2f307b3a57..26dcddfd19cb9 100644
--- a/flang/test/Fir/CUDA/cuda-constructor-2.f90
+++ b/flang/test/Fir/CUDA/cuda-constructor-2.f90
@@ -11,6 +11,16 @@ module attributes {dlti.dl_spec = #dlti.dl_spec<#dlti.dl_entry<!llvm.ptr, dense<
   }
 
   gpu.module @cuda_device_mod {
+    gpu.func @_QMmtestsPkernel() kernel {
+      gpu.return
+    }
+    fir.global @_QMmtestsEn(dense<[3, 4, 5, 6, 7]> : tensor<5xi32>) {data_attr = #cuf.cuda<device>} : !fir.array<5xi32>
+    fir.global @_QMmtestsEndev {data_attr = #cuf.cuda<device>} : !fir.box<!fir.heap<!fir.array<?xi32>>> {
+      %c0 = arith.constant 0 : index
+      %0 = fir.zero_bits !fir.heap<!fir.array<?xi32>>
+      %1 = fircg.ext_embox %0(%c0) {allocator_idx = 2 : i32} : (!fir.heap<!fir.array<?xi32>>, index) -> !fir.box<!fir.heap<!fir.array<?xi32>>>
+      fir.has_value %1 : !fir.box<!fir.heap<!fir.array<?xi32>>>
+    }
   }
 }
 
@@ -72,6 +82,16 @@ module attributes {dlti.dl_spec = #dlti.dl_spec<i8 = dense<8> : vector<2xi64>, i
     fir.has_value %2 : !fir.box<!fir.heap<!fir.array<?x?x?x?x?xf64>>>
   }
   gpu.module @cuda_device_mod {
+    gpu.func @_QMmPkernel() kernel {
+      gpu.return
+    }
+    fir.global @_QMmEa00 {data_attr = #cuf.cuda<managed>} : !fir.box<!fir.heap<!fir.array<?x?x?x?x?xf64>>> {
+      %c0 = arith.constant 0 : index
+      %0 = fir.zero_bits !fir.heap<!fir.array<?x?x?x?x?xf64>>
+      %1 = fir.shape %c0, %c0, %c0, %c0, %c0 : (index, index, index, index, index) -> !fir.shape<5>
+      %2 = fir.embox %0(%1) {allocator_idx = 3 : i32} : (!fir.heap<!fir.array<?x?x?x?x?xf64>>, !fir.shape<5>) -> !fir.box<!fir.heap<!fir.array<?x?x?x?x?xf64>>>
+      fir.has_value %2 : !fir.box<!fir.heap<!fir.array<?x?x?x?x?xf64>>>
+    }
   }
 }
 
@@ -101,6 +121,13 @@ module attributes {dlti.dl_spec = #dlti.dl_spec<#dlti.dl_entry<!llvm.ptr, dense<
   }
 
   gpu.module @cuda_device_mod {
+    gpu.func @_QMtestPkernel() kernel {
+      gpu.return
+    }
+    fir.global @_QMtestEmanx {data_attr = #cuf.cuda<managed>} : !fir.array<100xi32> {
+      %0 = fir.zero_bits !fir.array<100xi32>
+      fir.has_value %0 : !fir.array<100xi32>
+    }
   }
 }
 
@@ -113,3 +140,26 @@ module attributes {dlti.dl_spec = #dlti.dl_spec<#dlti.dl_entry<!llvm.ptr, dense<
 // CHECK: fir.address_of(@_QMtestEmanx.managed.ptr) : !fir.ref<!fir.llvm_ptr<i8>>
 // CHECK: fir.call @_FortranACUFRegisterManagedVariable
 // CHECK: fir.call @_FortranACUFInitModule
+
+// -----
+
+// Test that when the gpu.module has no kernels (e.g., host TU that USEs
+// a kernel module), the constructor returns early without registering.
+
+module attributes {dlti.dl_spec = #dlti.dl_spec<#dlti.dl_entry<!llvm.ptr, dense<64> : vector<4xi64>>, #dlti.dl_entry<!llvm.ptr<271>, dense<32> : vector<4xi64>>, #dlti.dl_entry<!llvm.ptr<270>, dense<32> : vector<4xi64>>, #dlti.dl_entry<f128, dense<128> : vector<2xi64>>, #dlti.dl_entry<f64, dense<64> : vector<2xi64>>, #dlti.dl_entry<f80, dense<128> : vector<2xi64>>, #dlti.dl_entry<f16, dense<16> : vector<2xi64>>, #dlti.dl_entry<i32, dense<32> : vector<2xi64>>, #dlti.dl_entry<i16, dense<16> : vector<2xi64>>, #dlti.dl_entry<i128, dense<128> : vector<2xi64>>, #dlti.dl_entry<i8, dense<8> : vector<2xi64>>, #dlti.dl_entry<!llvm.ptr<272>, dense<64> : vector<4xi64>>, #dlti.dl_entry<i64, dense<64> : vector<2xi64>>, #dlti.dl_entry<i1, dense<8> : vector<2xi64>>, #dlti.dl_entry<"dlti.endianness", "little">, #dlti.dl_entry<"dlti.stack_alignment", 128 : i64>>, fir.defaultkind = "a1c4d8i4l4r4", fir.kindmap = "", gpu.container_module, llvm.data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128", llvm.target_triple = "x86_64-unknown-linux-gnu"} {
+
+  fir.global @_QMkernels_mEdev_var {data_attr = #cuf.cuda<device>} : f32 {
+    %0 = arith.constant 0.0 : f32
+    fir.has_value %0 : f32
+  }
+
+  gpu.module @cuda_device_mod {
+    gpu.func @_QMkernels_mPnot_a_kernel() {
+      gpu.return
+    }
+  }
+}
+
+// CHECK: llvm.func internal @__cudaFortranConstructor()
+// CHECK-NEXT: llvm.call @_FortranACUFRegisterAllocator()
+// CHECK-NEXT: llvm.return
diff --git a/flang/test/Fir/CUDA/cuda-device-global.f90 b/flang/test/Fir/CUDA/cuda-device-global.f90
index 7edcf1a4b13c5..3be307a1bbf7e 100644
--- a/flang/test/Fir/CUDA/cuda-device-global.f90
+++ b/flang/test/Fir/CUDA/cuda-device-global.f90
@@ -65,3 +65,23 @@ module attributes {fir.defaultkind = "a1c4d8i4l4r4", fir.kindmap = "", gpu.conta
 // CHECK-LABEL: gpu.module @cuda_device_mod
 // CHECK: fir.global linkonce_odr @_QMvector_typesE.dt.v2real2
 
+// -----
+
+// Test that when a global already exists in the gpu.module, the pass
+// continues cloning the remaining candidates instead of stopping.
+
+module attributes {fir.defaultkind = "a1c4d8i4l4r4", fir.kindmap = "", gpu.container_module} {
+  fir.global @_QMmEa(dense<[1, 2, 3]> : tensor<3xi32>) {data_attr = #cuf.cuda<device>} : !fir.array<3xi32>
+  fir.global @_QMmEb(dense<[10, 20, 30]> : tensor<3xi32>) {data_attr = #cuf.cuda<device>} : !fir.array<3xi32>
+  fir.global @_QMmEc(dense<[100, 200, 300]> : tensor<3xi32>) {data_attr = #cuf.cuda<device>} : !fir.array<3xi32>
+
+  gpu.module @cuda_device_mod {
+    fir.global @_QMmEa(dense<[1, 2, 3]> : tensor<3xi32>) {data_attr = #cuf.cuda<device>} : !fir.array<3xi32>
+  }
+}
+
+// CHECK: gpu.module @cuda_device_mod
+// CHECK-DAG: fir.global @_QMmEa
+// CHECK-DAG: fir.global @_QMmEb
+// CHECK-DAG: fir.global @_QMmEc
+