[Mlir-commits] [mlir] [mlir]introduce UnrollScopeInterface and apply it to funcOp and gpu.launch Op. (PR #123904)
lonely eagle
llvmlistbot@llvm.org
Wed Jan 22 08:16:36 PST 2025
linuxlonelyeagle wrote:
> Have you tried using `-canonicalize` and `-cse` passes? Generating `redundant SSA values` isn't a problem in general as long as they are folded away. I think these two passes should do it.
>
> Try writing a complete IR with an op that has side effects (a store op, for example), run `-canonicalize -cse`, and see what happens.
`-canonicalize` and `-cse` do not help here; `cse` will simply eliminate the `gpu.launch`. I don't think the key point of this problem has come across yet. The `affine-loop-unroll` pass promotes the body out of a loop when the loop iterates only once.
Please see the changes in the code. Originally, the loop's `IV` replacement was created directly in `funcOp`, which is what produces the redundant SSA value.
In this example, the `IV` should be created in `gpu.launch` instead of `funcOp`. With `UnrollScopeInterface`, the `IV` is now created in the nearest unroll scope region (in this case, the region of `gpu.launch`).
In the output below, `%c0` is the SSA value that should have been created inside `gpu.launch`. For readability, I also ran `-gpu-kernel-outlining`.
Thanks for your reply. I hope this describes the problem this PR addresses more clearly.
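To make the promotion behavior concrete, here is a minimal hand-written sketch (the `"test.use"` op is only a placeholder, not from the PR):

```mlir
// Before: the loop runs exactly once, so -affine-loop-unroll="unroll-full"
// promotes its body out of the loop.
affine.for %i = 0 to 1 {
  "test.use"(%i) : (index) -> ()
}

// After promotion: the IV is replaced by a constant. The question this PR
// answers is *where* that constant is created -- directly in funcOp, or in
// the nearest enclosing unroll scope region.
%c0 = arith.constant 0 : index
"test.use"(%c0) : (index) -> ()
```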
```mlir
// %c0 is the SSA value that should be created in gpu.launch. For visual convenience, I ran -gpu-kernel-outlining
%c0 = arith.constant 0 : index
gpu.launch_func @main_kernel::@main_kernel blocks in (%c1, %c1, %c1) threads in (%c1, %c1, %c1) args(%c0 : index)
```
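For contrast, once the constant is created inside the `gpu.launch` region, outlining should no longer need to forward it as a kernel argument. This is a hedged sketch of the expected shape, not the PR's literal output:

```mlir
// Expected shape after the fix: no index argument is passed to the kernel.
gpu.launch_func @main_kernel::@main_kernel blocks in (%c1, %c1, %c1) threads in (%c1, %c1, %c1)

gpu.module @main_kernel {
  gpu.func @main_kernel() kernel {
    // %c0 now lives inside the kernel, because it was created in the
    // gpu.launch region -- the nearest unroll scope.
    %c0 = arith.constant 0 : index
    // ... unrolled loop body as before ...
    gpu.return
  }
}
```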
```mlir
// mlir-opt test-affine-loops-unroll.mlir -affine-loop-unroll="unroll-full" -gpu-kernel-outlining (root@e3f83748ef6b)
#map = affine_map<(d0) -> (d0 + 1)>
#map1 = affine_map<(d0) -> (d0 + 2)>
#map2 = affine_map<(d0) -> (d0 + 3)>
module attributes {gpu.container_module} {
  func.func @main() {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    gpu.launch_func @main_kernel::@main_kernel blocks in (%c1, %c1, %c1) threads in (%c1, %c1, %c1) args(%c0 : index)
    return
  }
  gpu.module @main_kernel {
    gpu.func @main_kernel(%arg0: index) kernel attributes {known_block_size = array<i32: 1, 1, 1>, known_grid_size = array<i32: 1, 1, 1>} {
      %block_id_x = gpu.block_id x
      %block_id_y = gpu.block_id y
      %block_id_z = gpu.block_id z
      %thread_id_x = gpu.thread_id x
      %thread_id_y = gpu.thread_id y
      %thread_id_z = gpu.thread_id z
      %grid_dim_x = gpu.grid_dim x
      %grid_dim_y = gpu.grid_dim y
      %grid_dim_z = gpu.grid_dim z
      %block_dim_x = gpu.block_dim x
      %block_dim_y = gpu.block_dim y
      %block_dim_z = gpu.block_dim z
      %cst = arith.constant dense<0.000000e+00> : vector<2x4x2x2xf16>
      %0 = affine.for %arg1 = 0 to 2 iter_args(%arg2 = %cst) -> (vector<2x4x2x2xf16>) {
        %cst_0 = arith.constant dense<0.000000e+00> : vector<2x2xf16>
        %1 = vector.insert %cst_0, %arg2 [%arg1, %arg0] : vector<2x2xf16> into vector<2x4x2x2xf16>
        %2 = affine.apply #map(%arg0)
        %cst_1 = arith.constant dense<0.000000e+00> : vector<2x2xf16>
        %3 = vector.insert %cst_1, %arg2 [%arg1, %2] : vector<2x2xf16> into vector<2x4x2x2xf16>
        %4 = affine.apply #map1(%arg0)
        %cst_2 = arith.constant dense<0.000000e+00> : vector<2x2xf16>
        %5 = vector.insert %cst_2, %arg2 [%arg1, %4] : vector<2x2xf16> into vector<2x4x2x2xf16>
        %6 = affine.apply #map2(%arg0)
        %cst_3 = arith.constant dense<0.000000e+00> : vector<2x2xf16>
        %7 = vector.insert %cst_3, %arg2 [%arg1, %6] : vector<2x2xf16> into vector<2x4x2x2xf16>
        affine.yield %7 : vector<2x4x2x2xf16>
      }
      gpu.return
    }
  }
}
```
https://github.com/llvm/llvm-project/pull/123904