[all-commits] [llvm/llvm-project] a42a2c: Avoid buffer hoisting from parallel loops (#90735)

Fri May 3 23:35:59 PDT 2024

  Branch: refs/heads/main
  Home:   https://github.com/llvm/llvm-project
  Commit: a42a2ca19b2325fa6844d6b10e88eb53a3f2fde8
      https://github.com/llvm/llvm-project/commit/a42a2ca19b2325fa6844d6b10e88eb53a3f2fde8
  Author: Rafael Ubal <rubal at mathworks.com>
  Date:   2024-05-04 (Sat, 04 May 2024)

  Changed paths:
    M mlir/include/mlir/Dialect/SCF/IR/SCFOps.td
    M mlir/include/mlir/Interfaces/LoopLikeInterface.h
    M mlir/include/mlir/Interfaces/LoopLikeInterface.td
    M mlir/lib/Dialect/Bufferization/Transforms/BufferOptimizations.cpp
    M mlir/test/Dialect/Bufferization/Transforms/buffer-loop-hoisting.mlir

  Log Message:
  -----------
  Avoid buffer hoisting from parallel loops (#90735)

This change corrects an invalid behavior in pass
`--buffer-loop-hoisting`. The pass is in charge of extracting buffer
allocations (e.g., `memref.alloca`) from loop regions (e.g., `scf.for`)
when possible. This works OK for looks with sequential execution
semantics. However, a buffer allocated in the body of a parallel loop
may be concurrently accessed by multiple thread to store its local data.
Extracting such buffer from the loop causes all threads to wrongly share
the same memory region.

In the following example, dimension 1 of the input tensor is reversed.
Dimension 0 is traversed with a parallel loop.

```
func.func @f(%input: memref<2x3xf32>) -> memref<2x3xf32> {
  %c0 = index.constant 0
  %c1 = index.constant 1
  %c2 = index.constant 2
  %c3 = index.constant 3

  %output = memref.alloc() : memref<2x3xf32>
  scf.parallel (%index) = (%c0) to (%c2) step (%c1) {
    // Create subviews for working input and output slices
    %input_slice = memref.subview %input[%index, 2][1, 3][1, -1] : memref<2x3xf32> to memref<1x3xf32, strided<[3, -1], offset: ?>>
    %output_slice = memref.subview %output[%index, 0][1, 3][1, 1] : memref<2x3xf32> to memref<1x3xf32, strided<[3, 1], offset: ?>>

    // Copy the input slice into this temporary buffer. This intermediate
    // copy is unnecessary, but is used for illustration purposes.
    %temp = memref.alloc() : memref<1x3xf32>
    memref.copy %input_slice, %temp : memref<1x3xf32, strided<[3, -1], offset: ?>> to memref<1x3xf32>

    // Copy temporary buffer into output slice
    memref.copy %temp, %output_slice : memref<1x3xf32> to memref<1x3xf32, strided<[3, 1], offset: ?>>
    scf.reduce
  }

  return %output : memref<2x3xf32>
}
```

The patch submitted here prevents `%temp = memref.alloc() :
memref<1x3xf32>` from being hoisted when the containing op is
`scf.parallel` or `scf.forall`. A new op trait called
`HasParallelRegion` is introduced and assigned to these two ops to
indicate that their regions have parallel execution semantics.

@joker-eph @ftynse @nicolasvasilache @sabauma

To unsubscribe from these emails, change your notification settings at https://github.com/llvm/llvm-project/settings/notifications