<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href="https://github.com/llvm/llvm-project/issues/101451">101451</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
[mlir] Bufferization issue with redundant allocation and copy
</td>
</tr>
<tr>
<th>Labels</th>
<td>
mlir
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
yzhang93
</td>
</tr>
</table>
<pre>
There seems to be a bufferization issue that creates a redundant allocation and copy. The issue happens when there is no `tensor.extract_slice` of the loop argument in the loop body and the argument is used directly in `tensor.parallel_insert_slice`.
For example, the IR below doesn't have any redundant alloc ops or copies after bufferization.
```
%6 = scf.forall (%arg0, %arg1) = (0, 0) to (64, 64) step (64, 64) shared_outs(%arg2 = %5) -> (tensor<64x64xi32>) {
%extracted_slice_6 = tensor.extract_slice %arg2[%arg0, %arg1] [64, 64] [1, 1] : tensor<64x64xi32> to tensor<64x64xi32>
...
...
%unpack = tensor.unpack %14#1 inner_dims_pos = [0, 1] inner_tiles = [32, 32] into %extracted_slice_6 : tensor<2x2x32x32xi32> -> tensor<64x64xi32>
scf.forall.in_parallel {
tensor.parallel_insert_slice %unpack into %arg2[%arg0, %arg1] [64, 64] [1, 1] : tensor<64x64xi32> into tensor<64x64xi32>
}
} {mapping = [#gpu.block<y>, #gpu.block<x>]}
```
However, when the unpack result is written directly to the loop-invariant tensor instead of a slice of the shared_outs argument, unexpected allocations and copies appear.
Before bufferization:
```
%8 = scf.forall (%arg0, %arg1) = (0, 0) to (64, 64) step (64, 64) shared_outs(%arg2 = %7) -> (tensor<64x64xi32>) {
...
...
%unpack = tensor.unpack %18#2 inner_dims_pos = [0, 1] inner_tiles = [32, 32] into %7 : tensor<2x2x32x32xi32> -> tensor<64x64xi32>
scf.forall.in_parallel {
tensor.parallel_insert_slice %unpack into %arg2[%arg0, %arg1] [64, 64] [1, 1] : tensor<64x64xi32> into tensor<64x64xi32>
}
} {mapping = [#gpu.block<y>, #gpu.block<x>]}
```
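For comparison, a likely workaround (a sketch based on the working example above, not verified) would be to route the unpack destination through a `tensor.extract_slice` of the shared_outs argument `%arg2` rather than `%7`, so that the destination aliases the loop argument and bufferization can perform the write in place:
```
%8 = scf.forall (%arg0, %arg1) = (0, 0) to (64, 64) step (64, 64) shared_outs(%arg2 = %7) -> (tensor<64x64xi32>) {
  // Slice the shared_outs argument so the unpack destination aliases %arg2.
  %slice = tensor.extract_slice %arg2[%arg0, %arg1] [64, 64] [1, 1] : tensor<64x64xi32> to tensor<64x64xi32>
  ...
  %unpack = tensor.unpack %18#2 inner_dims_pos = [0, 1] inner_tiles = [32, 32] into %slice : tensor<2x2x32x32xi32> -> tensor<64x64xi32>
  scf.forall.in_parallel {
    tensor.parallel_insert_slice %unpack into %arg2[%arg0, %arg1] [64, 64] [1, 1] : tensor<64x64xi32> into tensor<64x64xi32>
  }
} {mapping = [#gpu.block<y>, #gpu.block<x>]}
```
With this pattern the destination of `tensor.unpack` matches the first example, which bufferizes without the extra alloc and copy.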
After bufferization:
```
%alloc_6 = memref.alloc() : memref<64x64xi32, 2 : i32>
linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%3 : memref<64x64xi32, #hal.descriptor_type<storage_buffer>>) outs(%alloc_6 : memref<64x64xi32, 2 : i32>) {
^bb0(%in: i32, %out: i32):
linalg.yield %in : i32
}
scf.forall (%arg0, %arg1) = (0, 0) to (64, 64) step (64, 64) {
...
...
linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%3 : memref<64x64xi32, #hal.descriptor_type<storage_buffer>>) outs(%alloc_14 : memref<64x64xi32, 2 : i32>) {
^bb0(%in: i32, %out: i32):
linalg.yield %in : i32
}
iree_linalg_ext.unpack %alloc_5 inner_dims_pos = [0, 1] inner_tiles = [32, 32] into %alloc_14 : (memref<2x2x32x32xi32, 1 : i32> memref<64x64xi32, 2 : i32>)
%subview_15 = memref.subview %alloc_6[%arg0, %arg1] [64, 64] [1, 1] : memref<64x64xi32, 2 : i32> to memref<64x64xi32, strided<[64, 1], offset: ?>, 2 : i32>
linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%alloc_14 : memref<64x64xi32, 2 : i32>) outs(%subview_15 : memref<64x64xi32, strided<[64, 1], offset: ?>, 2 : i32>) {
^bb0(%in: i32, %out: i32):
linalg.yield %in : i32
}
} {mapping = [#gpu.block<y>, #gpu.block<x>]}
linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%alloc_6 : memref<64x64xi32, 2 : i32>) outs(%3 : memref<64x64xi32, #hal.descriptor_type<storage_buffer>>) {
^bb0(%in: i32, %out: i32):
linalg.yield %in : i32
}
```
</pre>