[Mlir-commits] [mlir] [mlir][bufferization] Fix bug in bufferization of elementwise ops (PR #97209)

Matthias Springer llvmlistbot at llvm.org
Sun Jun 30 04:57:06 PDT 2024


https://github.com/matthias-springer created https://github.com/llvm/llvm-project/pull/97209

There is an optimization in One-Shot Bufferize for ops that bufferize to elementwise access: a buffer copy can sometimes be avoided. E.g.:
```
%0 = tensor.empty()
%1 = linalg.fill ins(...) outs(%0)
%2 = linalg.map ins(%1, ...) outs(%1)
```

In the above example, a buffer copy is not needed for %1, even though the same buffer is read and written through two different operands of the same op. That's because the op bufferizes to elementwise access.
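For reference, the bufferized IR would look roughly as follows (a sketch only, assuming an f32 element type and a static `tensor<32xf32>` shape; `%other` is a placeholder for the second input). The filled buffer is reused in place and no `memref.copy` is inserted:

```
%buf = memref.alloc() : memref<32xf32>
linalg.fill ins(%cst : f32) outs(%buf : memref<32xf32>)
// The same buffer appears as an input and as the output operand.
linalg.map { arith.addf } ins(%buf, %other : memref<32xf32>, memref<32xf32>) outs(%buf : memref<32xf32>)
```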

```c++
// Two equivalent operands of the same op are not conflicting if the op
// bufferizes to element-wise access. I.e., all loads at a position
// happen before all stores to the same position.
```
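In other words, if one were to expand such a map over a shared buffer into explicit loops, each iteration would load from a position before storing to that same position (an illustrative sketch, not actual bufferization output; `%buf` and `%other` are assumed `memref<32xf32>` values):

```
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%c32 = arith.constant 32 : index
scf.for %i = %c0 to %c32 step %c1 {
  // All loads at position %i ...
  %a = memref.load %buf[%i] : memref<32xf32>
  %b = memref.load %other[%i] : memref<32xf32>
  %r = arith.addf %a, %b : f32
  // ... happen before the store to the same position.
  memref.store %r, %buf[%i] : memref<32xf32>
}
```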

This optimization cannot be applied when op dominance cannot be used to rule out conflicts, e.g., when the `linalg.map` is inside a loop. In that case, the reads/writes happen multiple times (once per iteration), so it is not guaranteed that "all loads at a position happen before all stores to the same position."
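A minimal sketch of the problematic pattern (the regression test added in this patch is a fuller version of it; `%cst`, `%other`, `%lb`, `%ub` and `%step` are placeholders):

```
%0 = tensor.empty() : tensor<32xf32>
%1 = linalg.fill ins(%cst : f32) outs(%0 : tensor<32xf32>) -> tensor<32xf32>
scf.for %iv = %lb to %ub step %step {
  // Reads %1 and writes the buffer of %0. If linalg.fill had bufferized
  // in place, %1 and %0 would share a buffer, and the store of iteration
  // N would be visible to the load of iteration N+1.
  %2 = linalg.map { arith.subf } ins(%1, %other : tensor<32xf32>, tensor<32xf32>) outs(%0 : tensor<32xf32>)
}
```

Hence `linalg.fill` must bufferize out-of-place here, which is what the analysis decides with this fix.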

Fixes #90019.

From b91cc91f918e61adc9e55276fa43261c1ccceb23 Mon Sep 17 00:00:00 2001
From: Matthias Springer <mspringer at nvidia.com>
Date: Sun, 30 Jun 2024 13:50:08 +0200
Subject: [PATCH] [mlir][bufferization] Fix bug in bufferization of elementwise
 ops

There is an optimization in One-Shot Bufferize for ops that bufferize to elementwise access. In such cases, a buffer copy can sometimes be avoided. E.g.:
```
%0 = tensor.empty()
%1 = linalg.fill ins(...) outs(%0)
%2 = linalg.map ins(%1, ...) outs(%1)
```

In the above example, a buffer copy is not needed for %1, even though the same buffer is read and written through two different operands of the same op. That's because the op bufferizes to elementwise access.

```c++
// Two equivalent operands of the same op are not conflicting if the op
// bufferizes to element-wise access. I.e., all loads at a position
// happen before all stores to the same position.
```

This optimization cannot be applied when op dominance cannot be used to rule out conflicts, e.g., when the `linalg.map` is inside a loop. In that case, the reads/writes happen multiple times (once per iteration), so it is not guaranteed that "all loads at a position happen before all stores to the same position."

Fixes #90019.
---
 .../Transforms/OneShotAnalysis.cpp            | 32 +++++++++----------
 .../Linalg/one-shot-bufferize-analysis.mlir   | 28 ++++++++++++++++
 2 files changed, 44 insertions(+), 16 deletions(-)

diff --git a/mlir/lib/Dialect/Bufferization/Transforms/OneShotAnalysis.cpp b/mlir/lib/Dialect/Bufferization/Transforms/OneShotAnalysis.cpp
index d0b4e0dd4383e..975bfb4d41e0b 100644
--- a/mlir/lib/Dialect/Bufferization/Transforms/OneShotAnalysis.cpp
+++ b/mlir/lib/Dialect/Bufferization/Transforms/OneShotAnalysis.cpp
@@ -725,23 +725,23 @@ hasReadAfterWriteInterference(const DenseSet<OpOperand *> &usesRead,
                                      "mutually exclusive regions\n");
           continue;
         }
-      }
 
-      // Two equivalent operands of the same op are not conflicting if the op
-      // bufferizes to element-wise access. I.e., all loads at a position happen
-      // before all stores to the same position.
-      if (conflictingWritingOp == readingOp) {
-        if (auto bufferizableOp = options.dynCastBufferizableOp(readingOp)) {
-          if (bufferizableOp.bufferizesToElementwiseAccess(
-                  state, {uRead, uConflictingWrite})) {
-            if (hasEquivalentValueInReverseUseDefChain(
-                    state, uRead->get(), uConflictingWrite->get()) ||
-                hasEquivalentValueInReverseUseDefChain(
-                    state, uConflictingWrite->get(), uRead->get())) {
-              LLVM_DEBUG(
-                  llvm::dbgs()
-                  << "  no conflict: op bufferizes to element-wise access\n");
-              continue;
+        // Two equivalent operands of the same op are not conflicting if the op
+        // bufferizes to element-wise access. I.e., all loads at a position
+        // happen before all stores to the same position.
+        if (conflictingWritingOp == readingOp) {
+          if (auto bufferizableOp = options.dynCastBufferizableOp(readingOp)) {
+            if (bufferizableOp.bufferizesToElementwiseAccess(
+                    state, {uRead, uConflictingWrite})) {
+              if (hasEquivalentValueInReverseUseDefChain(
+                      state, uRead->get(), uConflictingWrite->get()) ||
+                  hasEquivalentValueInReverseUseDefChain(
+                      state, uConflictingWrite->get(), uRead->get())) {
+                LLVM_DEBUG(
+                    llvm::dbgs()
+                    << "  no conflict: op bufferizes to element-wise access\n");
+                continue;
+              }
             }
           }
         }
diff --git a/mlir/test/Dialect/Linalg/one-shot-bufferize-analysis.mlir b/mlir/test/Dialect/Linalg/one-shot-bufferize-analysis.mlir
index 2d79a80cddc2b..5b7c2baf9d84f 100644
--- a/mlir/test/Dialect/Linalg/one-shot-bufferize-analysis.mlir
+++ b/mlir/test/Dialect/Linalg/one-shot-bufferize-analysis.mlir
@@ -107,3 +107,31 @@ func.func @elementwise_no_conflict_4(%arg0: tensor<8x32x32x32xf32>, %arg1: tenso
   }
   return %r : tensor<8x32x32x32xf32>
 }
+
+// -----
+
+// CHECK-LABEL: func @elementwise_access_regression(
+//       CHECK:   linalg.fill {__inplace_operands_attr__ = ["none", "false"]}
+//       CHECK:   linalg.map
+//  CHECK-SAME:   {__inplace_operands_attr__ = ["true", "true", "true"]}
+//       CHECK:   linalg.map
+//  CHECK-SAME:   {__inplace_operands_attr__ = ["true", "true", "true"]}
+func.func private @f(%arg: tensor<32x1xf32>) -> ()
+func.func @elementwise_access_regression(%arg0: i32, %arg2: tensor<32x1xf32>, %arg3: tensor<32x1xf32>) {
+      %cst_0 = arith.constant 0.000000e+00 : f32
+      %c0_i32 = arith.constant 0 : i32
+      %c1_i32 = arith.constant 1 : i32
+      %0 = tensor.empty() : tensor<32x1xf32>
+
+      // This op must bufferize out-of-place so that the filled tensor is not
+      // overwritten by the ops inside of the loop.
+      %1 = linalg.fill ins(%cst_0 : f32) outs(%0 : tensor<32x1xf32>) -> tensor<32x1xf32>
+
+      scf.for %arg1 = %c0_i32 to %arg0 step %c1_i32 : i32 {
+        %2 = linalg.map { arith.subf } ins(%1, %arg2 : tensor<32x1xf32>, tensor<32x1xf32>) outs(%0 : tensor<32x1xf32>)
+        %3 = tensor.empty() : tensor<32x1xf32>
+        %4 = linalg.map { arith.subf } ins(%2, %arg3 : tensor<32x1xf32>, tensor<32x1xf32>) outs(%3 : tensor<32x1xf32>)
+        func.call @f(%4) : (tensor<32x1xf32>) -> ()
+      }
+      return
+}


