[Mlir-commits] [flang] [mlir] [flang][acc] Fix collapse(force:N) + reduction producing redundant inner loop (PR #191312)

Thu Apr 9 15:18:42 PDT 2026

https://github.com/khaki3 created https://github.com/llvm/llvm-project/pull/191312

When `collapse(force:2) reduction(+:r)` is applied to a nested loop, Flang's lowering creates an outer `acc.loop` with N control variables covering all collapsed dimensions. However, the PFT evaluation walker still encounters the inner `do` constructs and calls `genOpenACCLoopFromDoConstruct` for each, generating redundant inner `acc.loop` ops that re-iterate dimensions already distributed by the outer.

**Root cause**: `processDoLoopBounds` absorbs N levels of loop bounds into the outer `acc.loop`, but nothing prevents the PFT walker from also generating separate `acc.loop` ops for those same inner `do` constructs.

**Fix**: Add `isInsideCollapsedACCLoop()` which checks if the current insertion point is inside a collapsed `acc.loop` with unconsumed dimensions. In `Bridge.cpp`, when this returns true, skip both `acc.loop` generation and normal loop generation for the inner `do` construct — just lower the body directly. The IV is already available from the parent `acc.loop`'s block argument.

>From f20c1ef90caa63bccb7115d457447aeee047a0f2 Mon Sep 17 00:00:00 2001
From: Kazuaki Matsumura <kmatsumura at nvidia.com>
Date: Tue, 7 Apr 2026 16:02:49 -0700
Subject: [PATCH 1/5] [OpenACC] Fix collapse(force:N) + reduction producing
 redundant inner loop

When Flang lowers `collapse(force:2) reduction(+:r)`, the outer acc.loop
gets N control variables for the collapsed dimensions, but the body
still contains an inner acc.loop that re-iterates one of those same
dimensions. When ACCComputeLowering converts these to SCF, the outer
becomes scf.parallel with N IVs and the inner becomes scf.for, making
each GPU thread iterate the full inner dimension instead of computing
a single (i,j) element.

Add a pre-pass in ACCComputeLowering that, before any acc.loop-to-SCF
conversion, walks collapsed acc.loop ops and eliminates redundant inner
acc.loop ops whose bounds match a collapsed dimension. The inner loop's
body is inlined using the outer's corresponding IV.

Fixes nv2113281.

Made-with: Cursor
---
 .../OpenACC/Transforms/ACCComputeLowering.cpp | 68 +++++++++++++++++++
 1 file changed, 68 insertions(+)

diff --git a/mlir/lib/Dialect/OpenACC/Transforms/ACCComputeLowering.cpp b/mlir/lib/Dialect/OpenACC/Transforms/ACCComputeLowering.cpp
index 9cc36312d3615..6d9d75eb15c24 100644
--- a/mlir/lib/Dialect/OpenACC/Transforms/ACCComputeLowering.cpp
+++ b/mlir/lib/Dialect/OpenACC/Transforms/ACCComputeLowering.cpp
@@ -255,6 +255,65 @@ assignKnownLaunchArgs<SerialOp>(SerialOp, DeviceType, RewriterBase &,
 // Loop conversion pattern
 //===----------------------------------------------------------------------===//
 
+/// When collapse(force:N) generates an outer acc.loop with N control
+/// variables, the body may still contain inner acc.loop ops that re-iterate
+/// dimensions already covered by the collapse.  Before converting the outer
+/// loop to scf.parallel, eliminate these redundant inner loops by inlining
+/// their bodies and replacing their IVs with the corresponding outer IVs.
+static void eliminateCollapsedInnerLoops(LoopOp outerLoop,
+                                         RewriterBase &rewriter) {
+  auto collapseVal = outerLoop.getCollapseValue();
+  if (!collapseVal || *collapseVal <= 1)
+    return;
+
+  unsigned numOuterIVs = outerLoop.getBody().getNumArguments();
+  if (numOuterIVs <= 1)
+    return;
+
+  SmallVector<LoopOp> innerLoopsToErase;
+
+  outerLoop.getBody().walk([&](LoopOp innerLoop) {
+    if (innerLoop == outerLoop)
+      return WalkResult::advance();
+    if (innerLoop.getBody().getNumArguments() != 1)
+      return WalkResult::advance();
+
+    // Check if the inner loop's bounds match any of the outer loop's
+    // collapsed dimensions (by comparing the original lb/ub/step values).
+    Value innerLB = innerLoop.getLowerbound()[0];
+    Value innerUB = innerLoop.getUpperbound()[0];
+    Value innerStep = innerLoop.getStep()[0];
+
+    for (unsigned ivIdx = 0; ivIdx < numOuterIVs; ++ivIdx) {
+      Value outerLB = outerLoop.getLowerbound()[ivIdx];
+      Value outerUB = outerLoop.getUpperbound()[ivIdx];
+      Value outerStep = outerLoop.getStep()[ivIdx];
+
+      if (innerLB != outerLB || innerUB != outerUB || innerStep != outerStep)
+        continue;
+
+      // Match found — replace the inner loop's IV with the outer's.
+      Value outerIV = outerLoop.getBody().getArgument(ivIdx);
+      Value innerIV = innerLoop.getBody().getArgument(0);
+
+      IRMapping mapping;
+      mapping.map(innerIV, outerIV);
+
+      rewriter.setInsertionPoint(innerLoop);
+      cloneACCRegionInto(&innerLoop.getRegion(), innerLoop->getBlock(),
+                         Block::iterator(innerLoop), mapping,
+                         innerLoop->getResults());
+
+      innerLoopsToErase.push_back(innerLoop);
+      return WalkResult::interrupt();
+    }
+    return WalkResult::advance();
+  });
+
+  for (LoopOp loop : innerLoopsToErase)
+    rewriter.eraseOp(loop);
+}
+
 class ACCLoopConversion : public OpRewritePattern<LoopOp> {
 public:
   ACCLoopConversion(MLIRContext *ctx, const ACCToGPUMappingPolicy &policy,
@@ -384,6 +443,15 @@ class ACCComputeLowering
 
     DefaultACCToGPUMappingPolicy policy;
 
+    // Pre-pass: eliminate redundant inner acc.loop ops that re-iterate
+    // dimensions already covered by a collapsed outer acc.loop.  This must
+    // happen before the greedy loop conversion because the inner loops
+    // would otherwise be converted to scf.for independently.
+    IRRewriter rewriter(context);
+    op.walk([&](LoopOp outerLoop) {
+      eliminateCollapsedInnerLoops(outerLoop, rewriter);
+    });
+
     // Part 1: Convert acc.loop to scf.parallel/scf.for while the parent
     // compute construct is still present (needed to determine conversion
     // strategy).

>From aae4335a3e186cc74641cee7dd61849f1d81e1d1 Mon Sep 17 00:00:00 2001
From: Kazuaki Matsumura <kmatsumura at nvidia.com>
Date: Tue, 7 Apr 2026 16:08:53 -0700
Subject: [PATCH 2/5] [OpenACC] Add test for collapse(force:N) inner loop
 elimination

Made-with: Cursor
---
 ...c-compute-lowering-collapse-reduction.mlir | 69 +++++++++++++++++++
 1 file changed, 69 insertions(+)
 create mode 100644 mlir/test/Dialect/OpenACC/acc-compute-lowering-collapse-reduction.mlir

diff --git a/mlir/test/Dialect/OpenACC/acc-compute-lowering-collapse-reduction.mlir b/mlir/test/Dialect/OpenACC/acc-compute-lowering-collapse-reduction.mlir
new file mode 100644
index 0000000000000..196b7c234428f
--- /dev/null
+++ b/mlir/test/Dialect/OpenACC/acc-compute-lowering-collapse-reduction.mlir
@@ -0,0 +1,69 @@
+// RUN: mlir-opt %s -acc-compute-lowering | FileCheck %s
+
+// Test that collapse(force:2) with a redundant inner acc.loop eliminates
+// the inner loop.  The outer acc.loop has 2 control variables (for the
+// collapsed dimensions) but the body contains an inner acc.loop that
+// re-iterates the second dimension.  The inner loop should be eliminated
+// and its IV replaced with the outer's corresponding IV.
+
+// CHECK-LABEL: func.func @collapse_force_with_redundant_inner
+// CHECK:       acc.compute_region
+// After conversion, the scf.parallel body should NOT contain an scf.for
+// CHECK:       scf.parallel
+// CHECK-NOT:   scf.for
+// CHECK:       scf.reduce
+func.func @collapse_force_with_redundant_inner(%buf: memref<1xf32>) {
+  %c0 = arith.constant 0 : index
+  %c1 = arith.constant 1 : index
+  %c100 = arith.constant 100 : index
+  %c200 = arith.constant 200 : index
+
+  %dev = acc.copyin varPtr(%buf : memref<1xf32>) -> memref<1xf32>
+  acc.parallel dataOperands(%dev : memref<1xf32>) {
+    acc.loop combined(parallel) control(%i : index, %j : index) = (%c1, %c1 : index, index) to (%c100, %c200 : index, index) step (%c1, %c1 : index, index) {
+      %vi = arith.index_cast %i : index to i32
+      %vj = arith.index_cast %j : index to i32
+      %sum = arith.addi %vi, %vj : i32
+      %valf = arith.sitofp %sum : i32 to f32
+      memref.store %valf, %dev[%c0] : memref<1xf32>
+      acc.loop control(%j_inner : index) = (%c1 : index) to (%c200 : index) step (%c1 : index) {
+        %vj2 = arith.index_cast %j_inner : index to i32
+        %valf2 = arith.sitofp %vj2 : i32 to f32
+        memref.store %valf2, %dev[%c0] : memref<1xf32>
+        acc.yield
+      } attributes {auto_ = [#acc.device_type<none>], inclusiveUpperbound = array<i1: true>}
+      acc.yield
+    } attributes {collapse = [2], collapseDeviceType = [#acc.device_type<none>], inclusiveUpperbound = array<i1: true, true>, independent = [#acc.device_type<none>]}
+    acc.yield
+  }
+  acc.copyout accPtr(%dev : memref<1xf32>) to varPtr(%buf : memref<1xf32>)
+  return
+}
+
+// -----
+
+// Verify that a collapsed loop without a redundant inner loop still works.
+// CHECK-LABEL: func.func @collapse_no_redundant_inner
+// CHECK:       acc.compute_region
+// CHECK:       scf.parallel
+// CHECK-NOT:   scf.for
+// CHECK:       scf.reduce
+func.func @collapse_no_redundant_inner(%buf: memref<1xf32>) {
+  %c0 = arith.constant 0 : index
+  %c1 = arith.constant 1 : index
+  %c10 = arith.constant 10 : index
+  %c20 = arith.constant 20 : index
+
+  %dev = acc.copyin varPtr(%buf : memref<1xf32>) -> memref<1xf32>
+  acc.parallel dataOperands(%dev : memref<1xf32>) {
+    acc.loop combined(parallel) control(%i : index, %j : index) = (%c1, %c1 : index, index) to (%c10, %c20 : index, index) step (%c1, %c1 : index, index) {
+      %vj = arith.index_cast %j : index to i32
+      %valf = arith.sitofp %vj : i32 to f32
+      memref.store %valf, %dev[%c0] : memref<1xf32>
+      acc.yield
+    } attributes {collapse = [2], collapseDeviceType = [#acc.device_type<none>], inclusiveUpperbound = array<i1: true, true>, independent = [#acc.device_type<none>]}
+    acc.yield
+  }
+  acc.copyout accPtr(%dev : memref<1xf32>) to varPtr(%buf : memref<1xf32>)
+  return
+}

>From fd9c9d7a204198da698e18dda5698753a422b912 Mon Sep 17 00:00:00 2001
From: Kazuaki Matsumura <kmatsumura at nvidia.com>
Date: Tue, 7 Apr 2026 16:37:45 -0700
Subject: [PATCH 3/5] [OpenACC] Revert Flang lowering change, keep
 ACCComputeLowering fix
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The Flang lowering approach (skipping inner acc.loop generation) doesn't
work because returning nullptr falls through to normal loop handling,
which generates a regular fir.do_loop — incorrect for collapsed loops
where the dimension should not be iterated at all.

The ACCComputeLowering pre-pass is the correct fix location: let Flang
generate the inner acc.loop (it handles IV privatization correctly),
then eliminate it at the MLIR level before SCF conversion by inlining
its body with the outer's collapsed IV.

Made-with: Cursor
---
 .../OpenACC/Transforms/ACCComputeLowering.cpp      | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/mlir/lib/Dialect/OpenACC/Transforms/ACCComputeLowering.cpp b/mlir/lib/Dialect/OpenACC/Transforms/ACCComputeLowering.cpp
index 6d9d75eb15c24..49674574fddfe 100644
--- a/mlir/lib/Dialect/OpenACC/Transforms/ACCComputeLowering.cpp
+++ b/mlir/lib/Dialect/OpenACC/Transforms/ACCComputeLowering.cpp
@@ -278,26 +278,20 @@ static void eliminateCollapsedInnerLoops(LoopOp outerLoop,
     if (innerLoop.getBody().getNumArguments() != 1)
       return WalkResult::advance();
 
-    // Check if the inner loop's bounds match any of the outer loop's
-    // collapsed dimensions (by comparing the original lb/ub/step values).
     Value innerLB = innerLoop.getLowerbound()[0];
     Value innerUB = innerLoop.getUpperbound()[0];
     Value innerStep = innerLoop.getStep()[0];
 
     for (unsigned ivIdx = 0; ivIdx < numOuterIVs; ++ivIdx) {
-      Value outerLB = outerLoop.getLowerbound()[ivIdx];
-      Value outerUB = outerLoop.getUpperbound()[ivIdx];
-      Value outerStep = outerLoop.getStep()[ivIdx];
-
-      if (innerLB != outerLB || innerUB != outerUB || innerStep != outerStep)
+      if (innerLB != outerLoop.getLowerbound()[ivIdx] ||
+          innerUB != outerLoop.getUpperbound()[ivIdx] ||
+          innerStep != outerLoop.getStep()[ivIdx])
         continue;
 
-      // Match found — replace the inner loop's IV with the outer's.
       Value outerIV = outerLoop.getBody().getArgument(ivIdx);
-      Value innerIV = innerLoop.getBody().getArgument(0);
 
       IRMapping mapping;
-      mapping.map(innerIV, outerIV);
+      mapping.map(innerLoop.getBody().getArgument(0), outerIV);
 
       rewriter.setInsertionPoint(innerLoop);
       cloneACCRegionInto(&innerLoop.getRegion(), innerLoop->getBlock(),

>From df3cf2c52caed8d587a715df5d914de98e2a487e Mon Sep 17 00:00:00 2001
From: Kazuaki Matsumura <kmatsumura at nvidia.com>
Date: Tue, 7 Apr 2026 17:37:58 -0700
Subject: [PATCH 4/5] [Flang][OpenACC] Skip inner do-loop generation for
 collapsed dimensions

When collapse(force:N) creates an outer acc.loop with N control
variables, the inner do-loops covered by the collapse should not
generate separate loops.  Previously, Flang's PFT evaluation walker
would call genOpenACCLoopFromDoConstruct for each inner do-loop,
generating a redundant acc.loop that re-iterated the same dimension.
This caused each GPU thread to iterate the full inner dimension
instead of computing a single element.

Fix by adding isInsideCollapsedACCLoop() which checks if we are inside
a collapsed acc.loop with unconsumed dimensions.  When true, Bridge.cpp
skips both the acc.loop and normal loop generation for the inner
do-construct, lowering only its body.  The IV is already available from
the parent acc.loop's block argument.

This replaces the previous ACCComputeLowering workaround that
eliminated the redundant inner loop at the MLIR level.

Fixes nv2113281.

Made-with: Cursor
---
 flang/include/flang/Lower/OpenACC.h           |  6 ++
 flang/lib/Lower/Bridge.cpp                    | 58 ++++++++++-------
 flang/lib/Lower/OpenACC.cpp                   | 16 +++++
 .../acc-loop-collapse-force-lowering.f90      | 16 +++--
 .../OpenACC/Transforms/ACCComputeLowering.cpp | 62 -------------------
 ...c-compute-lowering-collapse-reduction.mlir | 46 ++------------
 6 files changed, 68 insertions(+), 136 deletions(-)

diff --git a/flang/include/flang/Lower/OpenACC.h b/flang/include/flang/Lower/OpenACC.h
index 745c8686457f4..15b53d3b3aae9 100644
--- a/flang/include/flang/Lower/OpenACC.h
+++ b/flang/include/flang/Lower/OpenACC.h
@@ -110,6 +110,12 @@ getCollapseSizeAndForce(const Fortran::parser::AccClauseList &);
 /// Checks whether the current insertion point is inside OpenACC loop.
 bool isInOpenACCLoop(fir::FirOpBuilder &);
 
+/// Checks whether the current insertion point is inside a collapsed acc.loop
+/// and there are still collapsed dimensions to consume (i.e., this do-loop
+/// should not generate a separate loop because the parent already distributes
+/// this dimension).
+bool isInsideCollapsedACCLoop(fir::FirOpBuilder &);
+
 /// Checks whether the current insertion point is inside OpenACC compute
 /// construct.
 bool isInsideOpenACCComputeConstruct(fir::FirOpBuilder &);
diff --git a/flang/lib/Lower/Bridge.cpp b/flang/lib/Lower/Bridge.cpp
index 63867fab57678..365abeb0ae0dd 100644
--- a/flang/lib/Lower/Bridge.cpp
+++ b/flang/lib/Lower/Bridge.cpp
@@ -2493,6 +2493,18 @@ class FirConverter : public Fortran::lower::AbstractConverter {
     Fortran::lower::pft::Evaluation &eval = getEval();
     bool unstructuredContext = eval.lowerAsUnstructured();
 
+    // If this tightly-nested do-loop is inside a collapsed acc.loop and
+    // its dimension is already covered by the parent's control variables,
+    // skip generating any loop — just lower the body.  The IV value is
+    // already available from the parent acc.loop's block argument (stored
+    // into the private variable by buildACCLoopOp).
+    if (Fortran::lower::isInsideCollapsedACCLoop(*builder)) {
+      auto iter = eval.getNestedEvaluations().begin();
+      for (auto end = --eval.getNestedEvaluations().end(); iter != end; ++iter)
+        genFIR(*iter, unstructuredContext);
+      return;
+    }
+
     // Loops with induction variables inside OpenACC compute constructs
     // need special handling to ensure that the IVs are privatized.
     if (Fortran::lower::isInsideOpenACCComputeConstruct(*builder)) {
@@ -6273,29 +6285,29 @@ class FirConverter : public Fortran::lower::AbstractConverter {
   // calls does block management, possibly starting a new block, and possibly
   // generating a branch to end a block. So these calls may still be required
   // for that functionality.
-  void genFIR(const Fortran::parser::AssociateStmt &) {}       // nop
-  void genFIR(const Fortran::parser::BlockStmt &) {}           // nop
-  void genFIR(const Fortran::parser::CaseStmt &) {}            // nop
-  void genFIR(const Fortran::parser::ContinueStmt &) {}        // nop
-  void genFIR(const Fortran::parser::ElseIfStmt &) {}          // nop
-  void genFIR(const Fortran::parser::ElseStmt &) {}            // nop
-  void genFIR(const Fortran::parser::EndAssociateStmt &) {}    // nop
-  void genFIR(const Fortran::parser::EndBlockStmt &) {}        // nop
-  void genFIR(const Fortran::parser::EndDoStmt &) {}           // nop
-  void genFIR(const Fortran::parser::EndFunctionStmt &) {}     // nop
-  void genFIR(const Fortran::parser::EndIfStmt &) {}           // nop
-  void genFIR(const Fortran::parser::EndMpSubprogramStmt &) {} // nop
-  void genFIR(const Fortran::parser::EndProgramStmt &) {}      // nop
-  void genFIR(const Fortran::parser::EndSelectStmt &) {}       // nop
-  void genFIR(const Fortran::parser::EndSubroutineStmt &) {}   // nop
-  void genFIR(const Fortran::parser::EntryStmt &) {}           // nop
-  void genFIR(const Fortran::parser::IfStmt &) {}              // nop
-  void genFIR(const Fortran::parser::IfThenStmt &) {}          // nop
-  void genFIR(const Fortran::parser::NonLabelDoStmt &) {}      // nop
-  void genFIR(const Fortran::parser::OmpEndLoopDirective &) {} // nop
-  void genFIR(const Fortran::parser::SelectTypeStmt &) {}      // nop
-  void genFIR(const Fortran::parser::TypeGuardStmt &) {}       // nop
-  void genFIR(const Fortran::parser::ChangeTeamStmt &stmt) {}  // nop
+  void genFIR(const Fortran::parser::AssociateStmt &) {}         // nop
+  void genFIR(const Fortran::parser::BlockStmt &) {}             // nop
+  void genFIR(const Fortran::parser::CaseStmt &) {}              // nop
+  void genFIR(const Fortran::parser::ContinueStmt &) {}          // nop
+  void genFIR(const Fortran::parser::ElseIfStmt &) {}            // nop
+  void genFIR(const Fortran::parser::ElseStmt &) {}              // nop
+  void genFIR(const Fortran::parser::EndAssociateStmt &) {}      // nop
+  void genFIR(const Fortran::parser::EndBlockStmt &) {}          // nop
+  void genFIR(const Fortran::parser::EndDoStmt &) {}             // nop
+  void genFIR(const Fortran::parser::EndFunctionStmt &) {}       // nop
+  void genFIR(const Fortran::parser::EndIfStmt &) {}             // nop
+  void genFIR(const Fortran::parser::EndMpSubprogramStmt &) {}   // nop
+  void genFIR(const Fortran::parser::EndProgramStmt &) {}        // nop
+  void genFIR(const Fortran::parser::EndSelectStmt &) {}         // nop
+  void genFIR(const Fortran::parser::EndSubroutineStmt &) {}     // nop
+  void genFIR(const Fortran::parser::EntryStmt &) {}             // nop
+  void genFIR(const Fortran::parser::IfStmt &) {}                // nop
+  void genFIR(const Fortran::parser::IfThenStmt &) {}            // nop
+  void genFIR(const Fortran::parser::NonLabelDoStmt &) {}        // nop
+  void genFIR(const Fortran::parser::OmpEndLoopDirective &) {}   // nop
+  void genFIR(const Fortran::parser::SelectTypeStmt &) {}        // nop
+  void genFIR(const Fortran::parser::TypeGuardStmt &) {}         // nop
+  void genFIR(const Fortran::parser::ChangeTeamStmt &stmt) {}    // nop
   void genFIR(const Fortran::parser::EndChangeTeamStmt &stmt) {} // nop
 
   /// Generate FIR for Evaluation \p eval.
diff --git a/flang/lib/Lower/OpenACC.cpp b/flang/lib/Lower/OpenACC.cpp
index 2154f38dca568..e181f2afc2290 100644
--- a/flang/lib/Lower/OpenACC.cpp
+++ b/flang/lib/Lower/OpenACC.cpp
@@ -4546,6 +4546,22 @@ bool Fortran::lower::isInOpenACCLoop(fir::FirOpBuilder &builder) {
   return false;
 }
 
+bool Fortran::lower::isInsideCollapsedACCLoop(fir::FirOpBuilder &builder) {
+  auto parentLoop =
+      builder.getBlock()->getParent()->getParentOfType<mlir::acc::LoopOp>();
+  if (!parentLoop)
+    return false;
+  unsigned numParentIVs = parentLoop.getBody().getNumArguments();
+  if (numParentIVs <= 1)
+    return false;
+
+  // Check that there are still collapsed dimensions to consume:
+  // no inner acc.loop has been generated yet.
+  unsigned innerAccLoops = 0;
+  parentLoop.getBody().walk([&](mlir::acc::LoopOp) { ++innerAccLoops; });
+  return (innerAccLoops + 1 < numParentIVs);
+}
+
 bool Fortran::lower::isInsideOpenACCComputeConstruct(
     fir::FirOpBuilder &builder) {
   return mlir::isa_and_nonnull<ACC_COMPUTE_CONSTRUCT_OPS>(
diff --git a/flang/test/Lower/OpenACC/acc-loop-collapse-force-lowering.f90 b/flang/test/Lower/OpenACC/acc-loop-collapse-force-lowering.f90
index ca932c1b159ba..006f613dd2109 100644
--- a/flang/test/Lower/OpenACC/acc-loop-collapse-force-lowering.f90
+++ b/flang/test/Lower/OpenACC/acc-loop-collapse-force-lowering.f90
@@ -1,7 +1,7 @@
 ! RUN: bbc -fopenacc -emit-hlfir %s -o - | FileCheck %s
 
 ! Verify collapse(force:2) sinks prologue (between loops) and epilogue (after inner loop)
-! into the acc.loop region body.
+! into the acc.loop region body without generating a redundant inner acc.loop.
 
 subroutine collapse_force_sink(n, m)
   integer, intent(in) :: n, m
@@ -22,20 +22,18 @@ subroutine collapse_force_sink(n, m)
 
 ! CHECK: func.func @_QPcollapse_force_sink(
 ! CHECK: acc.parallel
-! Ensure outer acc.loop is combined(parallel)
+! Ensure outer acc.loop is combined(parallel) with 2 IVs and no inner acc.loop
 ! CHECK: acc.loop combined(parallel)
-! Prologue: constant 4.2 and an assign before inner loop
+! Prologue: constant 4.2 and an assign before inner loop body
 ! CHECK: arith.constant 4.200000e+00
 ! CHECK: hlfir.assign
-! Inner loop and its body include 2.0 add and an assign
-! CHECK: acc.loop
+! Inner loop body (no acc.loop wrapping it): 2.0 add and an assign
 ! CHECK: arith.constant 2.000000e+00
 ! CHECK: arith.addf
 ! CHECK: hlfir.assign
-! Epilogue: constant 7.3 and an assign after inner loop
+! Epilogue: constant 7.3 and an assign after inner loop body
 ! CHECK: arith.constant 7.300000e+00
 ! CHECK: hlfir.assign
-! And the outer acc.loop has collapse = [2]
+! The acc.loop has collapse = [2] and no inner acc.loop
 ! CHECK: } attributes {collapse = [2]
-
-
+! CHECK-NOT: acc.loop
diff --git a/mlir/lib/Dialect/OpenACC/Transforms/ACCComputeLowering.cpp b/mlir/lib/Dialect/OpenACC/Transforms/ACCComputeLowering.cpp
index 49674574fddfe..9cc36312d3615 100644
--- a/mlir/lib/Dialect/OpenACC/Transforms/ACCComputeLowering.cpp
+++ b/mlir/lib/Dialect/OpenACC/Transforms/ACCComputeLowering.cpp
@@ -255,59 +255,6 @@ assignKnownLaunchArgs<SerialOp>(SerialOp, DeviceType, RewriterBase &,
 // Loop conversion pattern
 //===----------------------------------------------------------------------===//
 
-/// When collapse(force:N) generates an outer acc.loop with N control
-/// variables, the body may still contain inner acc.loop ops that re-iterate
-/// dimensions already covered by the collapse.  Before converting the outer
-/// loop to scf.parallel, eliminate these redundant inner loops by inlining
-/// their bodies and replacing their IVs with the corresponding outer IVs.
-static void eliminateCollapsedInnerLoops(LoopOp outerLoop,
-                                         RewriterBase &rewriter) {
-  auto collapseVal = outerLoop.getCollapseValue();
-  if (!collapseVal || *collapseVal <= 1)
-    return;
-
-  unsigned numOuterIVs = outerLoop.getBody().getNumArguments();
-  if (numOuterIVs <= 1)
-    return;
-
-  SmallVector<LoopOp> innerLoopsToErase;
-
-  outerLoop.getBody().walk([&](LoopOp innerLoop) {
-    if (innerLoop == outerLoop)
-      return WalkResult::advance();
-    if (innerLoop.getBody().getNumArguments() != 1)
-      return WalkResult::advance();
-
-    Value innerLB = innerLoop.getLowerbound()[0];
-    Value innerUB = innerLoop.getUpperbound()[0];
-    Value innerStep = innerLoop.getStep()[0];
-
-    for (unsigned ivIdx = 0; ivIdx < numOuterIVs; ++ivIdx) {
-      if (innerLB != outerLoop.getLowerbound()[ivIdx] ||
-          innerUB != outerLoop.getUpperbound()[ivIdx] ||
-          innerStep != outerLoop.getStep()[ivIdx])
-        continue;
-
-      Value outerIV = outerLoop.getBody().getArgument(ivIdx);
-
-      IRMapping mapping;
-      mapping.map(innerLoop.getBody().getArgument(0), outerIV);
-
-      rewriter.setInsertionPoint(innerLoop);
-      cloneACCRegionInto(&innerLoop.getRegion(), innerLoop->getBlock(),
-                         Block::iterator(innerLoop), mapping,
-                         innerLoop->getResults());
-
-      innerLoopsToErase.push_back(innerLoop);
-      return WalkResult::interrupt();
-    }
-    return WalkResult::advance();
-  });
-
-  for (LoopOp loop : innerLoopsToErase)
-    rewriter.eraseOp(loop);
-}
-
 class ACCLoopConversion : public OpRewritePattern<LoopOp> {
 public:
   ACCLoopConversion(MLIRContext *ctx, const ACCToGPUMappingPolicy &policy,
@@ -437,15 +384,6 @@ class ACCComputeLowering
 
     DefaultACCToGPUMappingPolicy policy;
 
-    // Pre-pass: eliminate redundant inner acc.loop ops that re-iterate
-    // dimensions already covered by a collapsed outer acc.loop.  This must
-    // happen before the greedy loop conversion because the inner loops
-    // would otherwise be converted to scf.for independently.
-    IRRewriter rewriter(context);
-    op.walk([&](LoopOp outerLoop) {
-      eliminateCollapsedInnerLoops(outerLoop, rewriter);
-    });
-
     // Part 1: Convert acc.loop to scf.parallel/scf.for while the parent
     // compute construct is still present (needed to determine conversion
     // strategy).
diff --git a/mlir/test/Dialect/OpenACC/acc-compute-lowering-collapse-reduction.mlir b/mlir/test/Dialect/OpenACC/acc-compute-lowering-collapse-reduction.mlir
index 196b7c234428f..5d2f01362f3b0 100644
--- a/mlir/test/Dialect/OpenACC/acc-compute-lowering-collapse-reduction.mlir
+++ b/mlir/test/Dialect/OpenACC/acc-compute-lowering-collapse-reduction.mlir
@@ -1,18 +1,14 @@
 // RUN: mlir-opt %s -acc-compute-lowering | FileCheck %s
 
-// Test that collapse(force:2) with a redundant inner acc.loop eliminates
-// the inner loop.  The outer acc.loop has 2 control variables (for the
-// collapsed dimensions) but the body contains an inner acc.loop that
-// re-iterates the second dimension.  The inner loop should be eliminated
-// and its IV replaced with the outer's corresponding IV.
+// Test that a collapsed acc.loop with 2 IVs converts correctly to
+// scf.parallel with 2 dimensions and no redundant inner loops.
 
-// CHECK-LABEL: func.func @collapse_force_with_redundant_inner
+// CHECK-LABEL: func.func @collapse_two_ivs
 // CHECK:       acc.compute_region
-// After conversion, the scf.parallel body should NOT contain an scf.for
 // CHECK:       scf.parallel
 // CHECK-NOT:   scf.for
 // CHECK:       scf.reduce
-func.func @collapse_force_with_redundant_inner(%buf: memref<1xf32>) {
+func.func @collapse_two_ivs(%buf: memref<1xf32>) {
   %c0 = arith.constant 0 : index
   %c1 = arith.constant 1 : index
   %c100 = arith.constant 100 : index
@@ -26,40 +22,6 @@ func.func @collapse_force_with_redundant_inner(%buf: memref<1xf32>) {
       %sum = arith.addi %vi, %vj : i32
       %valf = arith.sitofp %sum : i32 to f32
       memref.store %valf, %dev[%c0] : memref<1xf32>
-      acc.loop control(%j_inner : index) = (%c1 : index) to (%c200 : index) step (%c1 : index) {
-        %vj2 = arith.index_cast %j_inner : index to i32
-        %valf2 = arith.sitofp %vj2 : i32 to f32
-        memref.store %valf2, %dev[%c0] : memref<1xf32>
-        acc.yield
-      } attributes {auto_ = [#acc.device_type<none>], inclusiveUpperbound = array<i1: true>}
-      acc.yield
-    } attributes {collapse = [2], collapseDeviceType = [#acc.device_type<none>], inclusiveUpperbound = array<i1: true, true>, independent = [#acc.device_type<none>]}
-    acc.yield
-  }
-  acc.copyout accPtr(%dev : memref<1xf32>) to varPtr(%buf : memref<1xf32>)
-  return
-}
-
-// -----
-
-// Verify that a collapsed loop without a redundant inner loop still works.
-// CHECK-LABEL: func.func @collapse_no_redundant_inner
-// CHECK:       acc.compute_region
-// CHECK:       scf.parallel
-// CHECK-NOT:   scf.for
-// CHECK:       scf.reduce
-func.func @collapse_no_redundant_inner(%buf: memref<1xf32>) {
-  %c0 = arith.constant 0 : index
-  %c1 = arith.constant 1 : index
-  %c10 = arith.constant 10 : index
-  %c20 = arith.constant 20 : index
-
-  %dev = acc.copyin varPtr(%buf : memref<1xf32>) -> memref<1xf32>
-  acc.parallel dataOperands(%dev : memref<1xf32>) {
-    acc.loop combined(parallel) control(%i : index, %j : index) = (%c1, %c1 : index, index) to (%c10, %c20 : index, index) step (%c1, %c1 : index, index) {
-      %vj = arith.index_cast %j : index to i32
-      %valf = arith.sitofp %vj : i32 to f32
-      memref.store %valf, %dev[%c0] : memref<1xf32>
       acc.yield
     } attributes {collapse = [2], collapseDeviceType = [#acc.device_type<none>], inclusiveUpperbound = array<i1: true, true>, independent = [#acc.device_type<none>]}
     acc.yield

>From 7dee2ddaa47760493abe48beee42fa483b9e24b9 Mon Sep 17 00:00:00 2001
From: Kazuaki Matsumura <kmatsumura at nvidia.com>
Date: Tue, 7 Apr 2026 17:44:01 -0700
Subject: [PATCH 5/5] [Flang] Restore original whitespace in Bridge.cpp nop
 functions

Made-with: Cursor
---
 flang/lib/Lower/Bridge.cpp | 46 +++++++++++++++++++-------------------
 1 file changed, 23 insertions(+), 23 deletions(-)

diff --git a/flang/lib/Lower/Bridge.cpp b/flang/lib/Lower/Bridge.cpp
index 365abeb0ae0dd..5bd862c74661b 100644
--- a/flang/lib/Lower/Bridge.cpp
+++ b/flang/lib/Lower/Bridge.cpp
@@ -6285,29 +6285,29 @@ class FirConverter : public Fortran::lower::AbstractConverter {
   // calls does block management, possibly starting a new block, and possibly
   // generating a branch to end a block. So these calls may still be required
   // for that functionality.
-  void genFIR(const Fortran::parser::AssociateStmt &) {}         // nop
-  void genFIR(const Fortran::parser::BlockStmt &) {}             // nop
-  void genFIR(const Fortran::parser::CaseStmt &) {}              // nop
-  void genFIR(const Fortran::parser::ContinueStmt &) {}          // nop
-  void genFIR(const Fortran::parser::ElseIfStmt &) {}            // nop
-  void genFIR(const Fortran::parser::ElseStmt &) {}              // nop
-  void genFIR(const Fortran::parser::EndAssociateStmt &) {}      // nop
-  void genFIR(const Fortran::parser::EndBlockStmt &) {}          // nop
-  void genFIR(const Fortran::parser::EndDoStmt &) {}             // nop
-  void genFIR(const Fortran::parser::EndFunctionStmt &) {}       // nop
-  void genFIR(const Fortran::parser::EndIfStmt &) {}             // nop
-  void genFIR(const Fortran::parser::EndMpSubprogramStmt &) {}   // nop
-  void genFIR(const Fortran::parser::EndProgramStmt &) {}        // nop
-  void genFIR(const Fortran::parser::EndSelectStmt &) {}         // nop
-  void genFIR(const Fortran::parser::EndSubroutineStmt &) {}     // nop
-  void genFIR(const Fortran::parser::EntryStmt &) {}             // nop
-  void genFIR(const Fortran::parser::IfStmt &) {}                // nop
-  void genFIR(const Fortran::parser::IfThenStmt &) {}            // nop
-  void genFIR(const Fortran::parser::NonLabelDoStmt &) {}        // nop
-  void genFIR(const Fortran::parser::OmpEndLoopDirective &) {}   // nop
-  void genFIR(const Fortran::parser::SelectTypeStmt &) {}        // nop
-  void genFIR(const Fortran::parser::TypeGuardStmt &) {}         // nop
-  void genFIR(const Fortran::parser::ChangeTeamStmt &stmt) {}    // nop
+  void genFIR(const Fortran::parser::AssociateStmt &) {}       // nop
+  void genFIR(const Fortran::parser::BlockStmt &) {}           // nop
+  void genFIR(const Fortran::parser::CaseStmt &) {}            // nop
+  void genFIR(const Fortran::parser::ContinueStmt &) {}        // nop
+  void genFIR(const Fortran::parser::ElseIfStmt &) {}          // nop
+  void genFIR(const Fortran::parser::ElseStmt &) {}            // nop
+  void genFIR(const Fortran::parser::EndAssociateStmt &) {}    // nop
+  void genFIR(const Fortran::parser::EndBlockStmt &) {}        // nop
+  void genFIR(const Fortran::parser::EndDoStmt &) {}           // nop
+  void genFIR(const Fortran::parser::EndFunctionStmt &) {}     // nop
+  void genFIR(const Fortran::parser::EndIfStmt &) {}           // nop
+  void genFIR(const Fortran::parser::EndMpSubprogramStmt &) {} // nop
+  void genFIR(const Fortran::parser::EndProgramStmt &) {}      // nop
+  void genFIR(const Fortran::parser::EndSelectStmt &) {}       // nop
+  void genFIR(const Fortran::parser::EndSubroutineStmt &) {}   // nop
+  void genFIR(const Fortran::parser::EntryStmt &) {}           // nop
+  void genFIR(const Fortran::parser::IfStmt &) {}              // nop
+  void genFIR(const Fortran::parser::IfThenStmt &) {}          // nop
+  void genFIR(const Fortran::parser::NonLabelDoStmt &) {}      // nop
+  void genFIR(const Fortran::parser::OmpEndLoopDirective &) {} // nop
+  void genFIR(const Fortran::parser::SelectTypeStmt &) {}      // nop
+  void genFIR(const Fortran::parser::TypeGuardStmt &) {}       // nop
+  void genFIR(const Fortran::parser::ChangeTeamStmt &stmt) {}  // nop
   void genFIR(const Fortran::parser::EndChangeTeamStmt &stmt) {} // nop
 
   /// Generate FIR for Evaluation \p eval.