<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/62393>62393</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
mlir/affine: support loop fusion on affine.parallel loops
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
rohany
</td>
</tr>
</table>
<pre>
It would be nice if affine loop fusion passes could work on `affine.parallel` loops, rather than requiring a pass to convert `affine.for` loops to `affine.parallel` loops once fusion is done.
Here's a concrete use case I'm running into:
Consider a function that maps over a `1-D` memref, then reduces over the same memref:
```
func.func @mapandreduce(%input : memref<10xf32>, %output : memref<10xf32>) -> f32 {
%zero = arith.constant 0. : f32
%one = arith.constant 1. : f32
affine.for %i = 0 to 10 {
%0 = affine.load %input[%i] : memref<10xf32>
%2 = arith.addf %0, %one : f32
affine.store %2, %output[%i] : memref<10xf32>
}
%reduceval = affine.for %i = 0 to 10 iter_args(%sum = %zero) -> (f32) {
%0 = affine.load %input[%i] : memref<10xf32>
%1 = arith.addf %0, %sum : f32
affine.yield %1 : f32
}
return %reduceval : f32
}
```
In the `1-D` case, both fusion and parallelization succeed with the passes `affine-loop-fusion,affine-parallelize{parallel-reductions}`. In the `2-D` case, however, the same strategy does not work:
```
func.func @mapandreduce2(%input : memref<10x10xf32>, %output : memref<10x10xf32>) -> f32 {
%zero = arith.constant 0. : f32
%one = arith.constant 1. : f32
affine.for %i = 0 to 10 {
affine.for %j = 0 to 10 {
%0 = affine.load %input[%i, %j] : memref<10x10xf32>
%2 = arith.addf %0, %one : f32
affine.store %2, %output[%i, %j] : memref<10x10xf32>
}
}
%reduceval = affine.for %i = 0 to 10 iter_args(%sum = %zero) -> (f32) {
%inner = affine.for %j = 0 to 10 iter_args(%sum2 = %zero) -> (f32) {
%0 = affine.load %input[%i, %j] : memref<10x10xf32>
%1 = arith.addf %0, %sum2 : f32
affine.yield %1 : f32
}
%res = arith.addf %inner, %sum : f32
affine.yield %res : f32
}
return %reduceval : f32
}
```
In this case, the `affine-loop-fusion` pass successfully fuses the loops, but the parallelization analysis then fails to parallelize the fused loop nest. In contrast, the natural way of writing this operation (or the code a system might generate for it), shown below, already encodes both the parallelism and the reduction.
```
func.func @mapandreduce(%input : memref<10x10xf32>, %output : memref<10x10xf32>) -> f32 {
%zero = arith.constant 0. : f32
%one = arith.constant 1. : f32
affine.parallel (%i, %j) = (0, 0) to (10, 10) {
%0 = affine.load %input[%i, %j] : memref<10x10xf32>
%2 = arith.addf %0, %one : f32
affine.store %2, %output[%i, %j] : memref<10x10xf32>
}
%reduceval = affine.parallel (%i, %j) = (0, 0) to (10, 10) reduce ("addf") -> f32 {
%0 = affine.load %input[%i, %j] : memref<10x10xf32>
affine.yield %0 : f32
}
return %reduceval : f32
}
```
It would be great if the fusion pass could then fuse the above code into the code below. This seems like it should be able to reuse a large amount of the analysis that already exists in the fusion pass.
```
func.func @mapandreduce(%input : memref<10x10xf32>, %output : memref<10x10xf32>) -> f32 {
%zero = arith.constant 0. : f32
%one = arith.constant 1. : f32
%reduceval = affine.parallel (%i, %j) = (0, 0) to (10, 10) reduce ("addf") -> f32 {
%0 = affine.load %input[%i, %j] : memref<10x10xf32>
%2 = arith.addf %0, %one : f32
affine.store %2, %output[%i, %j] : memref<10x10xf32>
affine.yield %0 : f32
}
return %reduceval : f32
}
```
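As a sanity check on the requested transformation, here is a minimal Python reference model of the two forms above. It is only a sketch of the intended semantics, not MLIR and not anything the fusion pass produces; the function names `map_then_reduce` and `fused_map_reduce` are made up for illustration. It shows that a single loop nest doing both the map and the reduction gives the same `%output` contents and the same sum as the two separate nests.

```python
def map_then_reduce(inp):
    """Unfused form: one loop nest for the map, a second one for the reduction."""
    rows, cols = len(inp), len(inp[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            out[i][j] = inp[i][j] + 1.0
    total = 0.0
    for i in range(rows):
        for j in range(cols):
            total += inp[i][j]
    return out, total


def fused_map_reduce(inp):
    """Fused form: a single loop nest stores the mapped value and accumulates the sum."""
    rows, cols = len(inp), len(inp[0])
    out = [[0.0] * cols for _ in range(rows)]
    total = 0.0
    for i in range(rows):
        for j in range(cols):
            x = inp[i][j]
            out[i][j] = x + 1.0
            total += x
    return out, total


if __name__ == "__main__":
    # 10x10 input matching the memref<10x10xf32> shape used in the issue.
    data = [[float(i * 10 + j) for j in range(10)] for i in range(10)]
    assert map_then_reduce(data) == fused_map_reduce(data)
```

Note that this model accumulates sequentially in both versions, so the sums match exactly. An actual `affine.parallel` reduction may reassociate the floating-point additions, so results could differ in the last bits.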
</pre>