[llvm] [mlir] [MLIR][AMDGPU] Adding dynamic size check to avoid subword buffer load (PR #135014)

Zhuoran Yin via llvm-commits llvm-commits at lists.llvm.org
Tue Apr 15 10:24:31 PDT 2025


================
@@ -149,6 +278,8 @@ struct AmdgpuTransferReadToLoadPass final
   void runOnOperation() override {
     RewritePatternSet patterns(&getContext());
     populateAmdgpuTransferReadToLoadPatterns(patterns);
-    walkAndApplyPatterns(getOperation(), std::move(patterns));
+    if (failed(applyPatternsGreedily(getOperation(), std::move(patterns)))) {
----------------
jerryyin wrote:

The resulting IR becomes much cleaner with the greedy rewriter; it folds the arith computations nicely into a constant. For example, previously with walkAndApplyPatterns I had:

```mlir
    %cst = arith.constant 0.000000e+00 : f32
    %base_buffer, %offset, %sizes:2, %strides:2 = memref.extract_strided_metadata %arg0 : memref<8x8xf32, #amdgpu.address_space<fat_raw_buffer>> -> memref<f32, #amdgpu.address_space<fat_raw_buffer>>, index, index, index, index, index
    %0 = affine.apply #map()[%arg1]
    %1 = affine.max #map1()[%strides#0, %sizes#0, %strides#1, %sizes#1]
    %c4 = arith.constant 4 : index
    %2 = arith.subi %1, %0 : index
    %3 = arith.cmpi ule, %2, %c4 : index
    %c4_0 = arith.constant 4 : index
    %4 = arith.muli %2, %c4_0 : index
    %c1 = arith.constant 1 : index
    %5 = arith.remui %4, %c1 : index
    %c0 = arith.constant 0 : index
    %6 = arith.cmpi ne, %5, %c0 : index
    %7 = arith.andi %3, %6 : i1
    %8 = scf.if %7 -> (vector<4xf32>) {
```

With the greedy rewriter, it figures out that for an fp32 load this always evaluates to false, so:

```mlir
    %false = arith.constant false
    %0 = scf.if %false -> (vector<4xf32>) {
```
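
For reference, here is a minimal sketch of how the pass body reads with the greedy driver, based on the quoted hunk above; the hunk is truncated after the `if`, so the `signalPassFailure()` call is my assumption about the elided part, not confirmed by the diff:

```cpp
void runOnOperation() override {
  RewritePatternSet patterns(&getContext());
  populateAmdgpuTransferReadToLoadPatterns(patterns);
  // Unlike walkAndApplyPatterns, the greedy driver also folds and DCEs the
  // ops created by the rewrite, which is what collapses the bounds
  // arithmetic above into `arith.constant false` for the fp32 case.
  if (failed(applyPatternsGreedily(getOperation(), std::move(patterns))))
    signalPassFailure(); // assumed failure handling; the quoted hunk cuts off here
}
```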

https://github.com/llvm/llvm-project/pull/135014
