[llvm] [AMDGPU] Use wider loop lowering type for LowerMemIntrinsics (PR #112332)

Wed Oct 16 02:47:21 PDT 2024

================
@@ -75,6 +75,13 @@ static cl::opt<size_t> InlineMaxBB(
     cl::desc("Maximum number of BBs allowed in a function after inlining"
              " (compile time constraint)"));
 
+// This default unroll factor is based on microbenchmarks on gfx1030.
+static cl::opt<unsigned> MemcpyLoopUnroll(
+    "amdgpu-memcpy-loop-unroll",
+    cl::desc("Unroll factor (affecting 4x32-bit operations) to use for memory "
+             "operations when lowering memcpy as a loop, must be a power of 2"),
----------------
ritter-x2a wrote:

Then something else must be broken: using 12xi32 as the loop lowering type (with the assertions removed) produces the following loop body on gfx1030:
```
.LBB0_1:                                ; %load-store-loop
                                        ; =>This Inner Loop Header: Depth=1
        s_clause 0x2
        flat_load_dwordx4 v[8:11], v[4:5]
        flat_load_dwordx4 v[12:15], v[4:5] offset:16
        flat_load_dwordx4 v[16:19], v[4:5] offset:32
        s_add_u32 s6, s6, 1
        s_addc_u32 s7, s7, 0
        v_add_co_u32 v4, vcc_lo, v4, 64
        v_cmp_lt_u64_e64 s4, s[6:7], 2
        v_add_co_ci_u32_e32 v5, vcc_lo, 0, v5, vcc_lo
        s_waitcnt vmcnt(2) lgkmcnt(2)
        flat_store_dwordx4 v[6:7], v[8:11]
        s_waitcnt vmcnt(1) lgkmcnt(2)
        flat_store_dwordx4 v[6:7], v[12:15] offset:16
        s_waitcnt vmcnt(0) lgkmcnt(2)
        flat_store_dwordx4 v[6:7], v[16:19] offset:32
        s_and_b32 vcc_lo, exec_lo, s4
        v_add_co_u32 v6, s4, v6, 64
        v_add_co_ci_u32_e64 v7, s4, 0, v7, s4
        s_cbranch_vccnz .LBB0_1
```
Notably, each iteration of this loop copies 12*4 = 48 bytes but increments the memory addresses by 64. I think the separate-const-offset-from-gep pass is responsible; I'll investigate further.
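
To make the mismatch concrete, here is a minimal sketch of the arithmetic read off the loop body above: three `dwordx4` accesses of 16 bytes each against a pointer increment of 64. The helper names (`bytesCopiedPerIteration`, `roundUpToPowerOf2`) are hypothetical and exist only for this illustration. The last check merely observes that 64 is the next power of two above 48; whether padding of that kind is what actually produces the stride is an assumption and has not been confirmed here.

```cpp
#include <cstdint>

// Bytes moved by one loop iteration: number of load/store pairs times
// the width of each access.
constexpr std::uint64_t bytesCopiedPerIteration(std::uint64_t numAccesses,
                                                std::uint64_t bytesPerAccess) {
  return numAccesses * bytesPerAccess;
}

// Smallest power of two that is >= n, for n > 0.
// Hypothetical helper for this illustration, not an LLVM API.
constexpr std::uint64_t roundUpToPowerOf2(std::uint64_t n) {
  std::uint64_t p = 1;
  while (p < n)
    p <<= 1;
  return p;
}

// Three flat_load_dwordx4 / flat_store_dwordx4 pairs at offsets 0, 16, 32.
constexpr std::uint64_t kCopied = bytesCopiedPerIteration(3, 16);
// Immediate added to both address registers (v_add_co_u32 ..., 64).
constexpr std::uint64_t kStride = 64;

static_assert(kCopied == 48, "12 x i32 is 48 bytes");
static_assert(kStride - kCopied == 16, "16 bytes are skipped per iteration");
// Assumption only: the stride matches the copy size padded to a power of two.
static_assert(roundUpToPowerOf2(kCopied) == kStride,
              "64 is the next power of two above 48");
```

If the stride really is 64, every iteration leaves a 16-byte gap in both source and destination, so the copy would be wrong for any length that needs more than one iteration.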


https://github.com/llvm/llvm-project/pull/112332