<table border="1" cellspacing="0" cellpadding="8">

    <tr>

        <th>Issue</th>

        <td>

            <a href=https://github.com/llvm/llvm-project/issues/62393>62393</a>

        </td>

    </tr>

    <tr>

        <th>Summary</th>

        <td>

            mlir/affine: support loop fusion on affine.parallel loops

        </td>

    </tr>

    <tr>

      <th>Labels</th>

      <td>

            new issue

      </td>

    </tr>

    <tr>

      <th>Assignees</th>

      <td>

      </td>

    </tr>

    <tr>

      <th>Reporter</th>

      <td>

          rohany

      </td>

    </tr>

</table>

<pre>

    It would be nice if affine loop fusion passes could work on `affine.parallel` loops, rather than requiring a pass to convert `affine.for` loops to `affine.parallel` loops once fusion is done.

Here's a concrete use case I'm running into:

Consider a function that maps over a `1-D` memref, then reduces over the same memref:

```

func.func @mapandreduce(%input : memref<10xf32>, %output : memref<10xf32>) -> f32 {

  %zero = arith.constant 0. : f32

  %one = arith.constant 1. : f32

  affine.for %i = 0 to 10 {

    %0 = affine.load %input[%i] : memref<10xf32>

    %2 = arith.addf %0, %one : f32

    affine.store %2, %output[%i] : memref<10xf32>

  }

  %reduceval = affine.for %i = 0 to 10 iter_args(%sum = %zero) -> (f32) {

    %0 = affine.load %input[%i] : memref<10xf32>

    %1 = arith.addf %0, %sum : f32

    affine.yield %1 : f32

  }

  return %reduceval : f32

}

```

In the 1-D case, fusion and parallelization both succeed with the passes `affine-loop-fusion,affine-parallelize{parallel-reductions}`. In the `2-D` case, however, the same strategy does not work:

```

func.func @mapandreduce2(%input : memref<10x10xf32>, %output : memref<10x10xf32>) -> f32 {

  %zero = arith.constant 0. : f32

  %one = arith.constant 1. : f32

  affine.for %i = 0 to 10 {

    affine.for %j = 0 to 10 {

      %0 = affine.load %input[%i, %j] : memref<10x10xf32>

      %2 = arith.addf %0, %one : f32

      affine.store %2, %output[%i, %j] : memref<10x10xf32>

    }

  }

  %reduceval = affine.for %i = 0 to 10 iter_args(%sum = %zero) -> (f32) {

    %inner = affine.for %j = 0 to 10 iter_args(%sum2 = %zero) -> (f32) {

      %0 = affine.load %input[%i, %j] : memref<10x10xf32>

      %1 = arith.addf %0, %sum2 : f32

      affine.yield %1 : f32

    }

    %res = arith.addf %inner, %sum : f32

    affine.yield %res : f32

  }

  return %reduceval : f32

}

```

In this case, the `affine-loop-fusion` pass successfully fuses the loops, but the `affine-parallelize` analysis then fails to parallelize the fused nest. In contrast, the natural way of writing this operation (or the way a system might generate code for it), shown below, already encodes both the parallelism and the reduction.

```

func.func @mapandreduce(%input : memref<10x10xf32>, %output : memref<10x10xf32>) -> f32 {

  %zero = arith.constant 0. : f32

  %one = arith.constant 1. : f32

  affine.parallel (%i, %j) = (0, 0) to (10, 10) {

    %0 = affine.load %input[%i, %j] : memref<10x10xf32>

    %2 = arith.addf %0, %one : f32

    affine.store %2, %output[%i, %j] : memref<10x10xf32>

  }

  %reduceval = affine.parallel (%i, %j) = (0, 0) to (10, 10) reduce ("addf") -> f32 {

    %0 = affine.load %input[%i, %j] : memref<10x10xf32>

    affine.yield %0 : f32

  }

  return %reduceval : f32

}

```

It would be great if the fusion pass could then fuse the above code into the code below. This seems like it should be able to reuse a large amount of the analysis that already exists in the fusion pass.

```

func.func @mapandreduce(%input : memref<10x10xf32>, %output : memref<10x10xf32>) -> f32 {

  %zero = arith.constant 0. : f32

  %one = arith.constant 1. : f32

  %reduceval = affine.parallel (%i, %j) = (0, 0) to (10, 10) reduce ("addf") -> f32 {

    %0 = affine.load %input[%i, %j] : memref<10x10xf32>

    %2 = arith.addf %0, %one : f32

    affine.store %2, %output[%i, %j] : memref<10x10xf32>

    affine.yield %0 : f32

  }

  return %reduceval : f32

}

```
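
A smaller 1-D variant of the same request might be easier to use as a first test case. The sketch below is hand-written, not the output of any existing pass. The function names `@mapandreduce1d` and `@mapandreduce1d_fused` are hypothetical, and the sketch assumes `%input` and `%output` do not alias.

```
// Hypothetical input: two sibling affine.parallel loops over the same
// iteration space, both reading %input.
func.func @mapandreduce1d(%input : memref<10xf32>, %output : memref<10xf32>) -> f32 {
  %one = arith.constant 1. : f32
  affine.parallel (%i) = (0) to (10) {
    %0 = affine.load %input[%i] : memref<10xf32>
    %2 = arith.addf %0, %one : f32
    affine.store %2, %output[%i] : memref<10xf32>
  }
  %reduceval = affine.parallel (%i) = (0) to (10) reduce ("addf") -> f32 {
    %0 = affine.load %input[%i] : memref<10xf32>
    affine.yield %0 : f32
  }
  return %reduceval : f32
}

// Hypothetical desired result: one affine.parallel that performs the map
// and yields the loaded value into the "addf" reduction.
func.func @mapandreduce1d_fused(%input : memref<10xf32>, %output : memref<10xf32>) -> f32 {
  %one = arith.constant 1. : f32
  %reduceval = affine.parallel (%i) = (0) to (10) reduce ("addf") -> f32 {
    %0 = affine.load %input[%i] : memref<10xf32>
    %2 = arith.addf %0, %one : f32
    affine.store %2, %output[%i] : memref<10xf32>
    affine.yield %0 : f32
  }
  return %reduceval : f32
}
```

Both loops cover the same iteration space and the second loop only reads `%input`, so under the no-aliasing assumption the fused form should compute the same `%output` contents and the same sum.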

</pre>
