[Mlir-commits] [mlir] [mlir]introduce UnrollScopeInterface and apply it to funcOp and gpu.launch Op. (PR #123904)

lonely eagle llvmlistbot at llvm.org
Wed Jan 22 08:16:36 PST 2025


linuxlonelyeagle wrote:

> Have you tried using `-canonicalize` and `-cse` passes? Generating `redundant SSA values` isn't problem in general as long as they are folded away. I think these two passes should do it.
> 
> Try writing a complete IR with OP that has side effect, can be store op, and use `-canonicalize -cse` and see what happens.

`-canonicalize` and `-cse` do not help here. `cse` would simply eliminate the `gpu.launch`. I don't think the key point of this problem has come across yet: when a loop iterates only once, the affine-loop-unroll pass hoists the IR out of the loop body.
Please see the changes in the code. Previously, the loop's `IV` was created directly in `funcOp`, which is why the redundant SSA value appears.
In this example, the `IV` should be created in `gpu.launch` instead of `funcOp`. With `UnrollScopeInterface`, the `IV` is now created in the nearest `unroll scope region` (in this case, the `region` of `gpu.launch`).

`%c0` is the SSA value that should be created inside `gpu.launch`. For readability, I ran `-gpu-kernel-outlining`.
Thanks for your reply; I think I have now described the problem this PR solves more clearly.
```
 // %c0 is the SSA value that should be created in gpu.launch. For visual convenience, I ran -gpu-kernel-outlining
 %c0 = arith.constant 0 : index
  gpu.launch_func  @main_kernel::@main_kernel blocks in (%c1, %c1, %c1) threads in (%c1, %c1, %c1)  args(%c0 : index)
```

```
root@e3f83748ef6b$ mlir-opt test-affine-loops-unroll.mlir -affine-loop-unroll="unroll-full" -gpu-kernel-outlining
#map = affine_map<(d0) -> (d0 + 1)>
#map1 = affine_map<(d0) -> (d0 + 2)>
#map2 = affine_map<(d0) -> (d0 + 3)>
module attributes {gpu.container_module} {
  func.func @main() {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    gpu.launch_func  @main_kernel::@main_kernel blocks in (%c1, %c1, %c1) threads in (%c1, %c1, %c1)  args(%c0 : index)
    return
  }
  gpu.module @main_kernel {
    gpu.func @main_kernel(%arg0: index) kernel attributes {known_block_size = array<i32: 1, 1, 1>, known_grid_size = array<i32: 1, 1, 1>} {
      %block_id_x = gpu.block_id  x
      %block_id_y = gpu.block_id  y
      %block_id_z = gpu.block_id  z
      %thread_id_x = gpu.thread_id  x
      %thread_id_y = gpu.thread_id  y
      %thread_id_z = gpu.thread_id  z
      %grid_dim_x = gpu.grid_dim  x
      %grid_dim_y = gpu.grid_dim  y
      %grid_dim_z = gpu.grid_dim  z
      %block_dim_x = gpu.block_dim  x
      %block_dim_y = gpu.block_dim  y
      %block_dim_z = gpu.block_dim  z
      %cst = arith.constant dense<0.000000e+00> : vector<2x4x2x2xf16>
      %0 = affine.for %arg1 = 0 to 2 iter_args(%arg2 = %cst) -> (vector<2x4x2x2xf16>) {
        %cst_0 = arith.constant dense<0.000000e+00> : vector<2x2xf16>
        %1 = vector.insert %cst_0, %arg2 [%arg1, %arg0] : vector<2x2xf16> into vector<2x4x2x2xf16>
        %2 = affine.apply #map(%arg0)
        %cst_1 = arith.constant dense<0.000000e+00> : vector<2x2xf16>
        %3 = vector.insert %cst_1, %arg2 [%arg1, %2] : vector<2x2xf16> into vector<2x4x2x2xf16>
        %4 = affine.apply #map1(%arg0)
        %cst_2 = arith.constant dense<0.000000e+00> : vector<2x2xf16>
        %5 = vector.insert %cst_2, %arg2 [%arg1, %4] : vector<2x2xf16> into vector<2x4x2x2xf16>
        %6 = affine.apply #map2(%arg0)
        %cst_3 = arith.constant dense<0.000000e+00> : vector<2x2xf16>
        %7 = vector.insert %cst_3, %arg2 [%arg1, %6] : vector<2x2xf16> into vector<2x4x2x2xf16>
        affine.yield %7 : vector<2x4x2x2xf16>
      }
      gpu.return
    }
  }
}

```
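For comparison, this is roughly the shape the IR should take with the interface in place (a hand-written sketch, not actual pass output; the block/thread argument names are made up): the `%c0` materialized by full unrolling lives inside the `gpu.launch` region, so `-gpu-kernel-outlining` no longer has to pass it into the kernel as an argument.

```
// Sketch only: with UnrollScopeInterface, the IV created by full
// unrolling is materialized in the nearest unroll scope region.
func.func @main() {
  %c1 = arith.constant 1 : index
  gpu.launch blocks(%bx, %by, %bz) in (%gx = %c1, %gy = %c1, %gz = %c1)
             threads(%tx, %ty, %tz) in (%bdx = %c1, %bdy = %c1, %bdz = %c1) {
    %c0 = arith.constant 0 : index  // created here, not in funcOp
    // ... unrolled loop body using %c0 ...
    gpu.terminator
  }
  return
}
```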

https://github.com/llvm/llvm-project/pull/123904


More information about the Mlir-commits mailing list