[llvm] [AMDGPU] Use wider loop lowering type for LowerMemIntrinsics (PR #112332)
Fabian Ritter via llvm-commits
llvm-commits at lists.llvm.org
Wed Oct 16 02:47:21 PDT 2024
================
@@ -75,6 +75,13 @@ static cl::opt<size_t> InlineMaxBB(
cl::desc("Maximum number of BBs allowed in a function after inlining"
" (compile time constraint)"));
+// This default unroll factor is based on microbenchmarks on gfx1030.
+static cl::opt<unsigned> MemcpyLoopUnroll(
+ "amdgpu-memcpy-loop-unroll",
+ cl::desc("Unroll factor (affecting 4x32-bit operations) to use for memory "
+ "operations when lowering memcpy as a loop, must be a power of 2"),
----------------
ritter-x2a wrote:
Then something else must be broken: using 12xi32 as the loop lowering type (with the assertions removed) gives me the following loop body on gfx1030:
```
.LBB0_1: ; %load-store-loop
; =>This Inner Loop Header: Depth=1
s_clause 0x2
flat_load_dwordx4 v[8:11], v[4:5]
flat_load_dwordx4 v[12:15], v[4:5] offset:16
flat_load_dwordx4 v[16:19], v[4:5] offset:32
s_add_u32 s6, s6, 1
s_addc_u32 s7, s7, 0
v_add_co_u32 v4, vcc_lo, v4, 64
v_cmp_lt_u64_e64 s4, s[6:7], 2
v_add_co_ci_u32_e32 v5, vcc_lo, 0, v5, vcc_lo
s_waitcnt vmcnt(2) lgkmcnt(2)
flat_store_dwordx4 v[6:7], v[8:11]
s_waitcnt vmcnt(1) lgkmcnt(2)
flat_store_dwordx4 v[6:7], v[12:15] offset:16
s_waitcnt vmcnt(0) lgkmcnt(2)
flat_store_dwordx4 v[6:7], v[16:19] offset:32
s_and_b32 vcc_lo, exec_lo, s4
v_add_co_u32 v6, s4, v6, 64
v_add_co_ci_u32_e64 v7, s4, 0, v7, s4
s_cbranch_vccnz .LBB0_1
```
Notably, each iteration of this loop copies 12 * 4 = 48 bytes (three dwordx4 accesses at offsets 0, 16, and 32) but increments both memory addresses by 64. I think the separate-const-offset-from-gep pass is responsible; I'll investigate further.
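To make the mismatch concrete, here is a small C++ sketch of the arithmetic. It only models the numbers read off the listing above (12 dwords accessed per iteration, pointer step of 64). All names in it are hypothetical, and it says nothing about which pass introduces the stride.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Values read off the loop body above; the names are hypothetical.
constexpr std::size_t kDwordBytes = 4;             // one 32-bit element
constexpr std::size_t kElementsPerIteration = 12;  // 12 x i32 lowering type
constexpr std::size_t kObservedStride = 64;        // v_add_co_u32 ..., 64

// Bytes actually loaded and stored by one loop iteration.
constexpr std::size_t bytesCopiedPerIteration() {
  return kElementsPerIteration * kDwordBytes;
}

static_assert(bytesCopiedPerIteration() == 48,
              "three dwordx4 accesses cover 48 bytes");

// Models the destination range walked by `iterations` loop iterations and
// counts the bytes inside that range that are never written.
std::size_t countUntouchedBytes(std::size_t iterations) {
  assert(kObservedStride >= bytesCopiedPerIteration());
  std::vector<bool> touched(iterations * kObservedStride, false);
  for (std::size_t i = 0; i < iterations; ++i) {
    const std::size_t base = i * kObservedStride;
    for (std::size_t off = 0; off < bytesCopiedPerIteration(); ++off)
      touched[base + off] = true;
  }
  std::size_t untouched = 0;
  for (bool t : touched)
    if (!t)
      ++untouched;
  return untouched;
}
```

Under these assumptions, every iteration leaves a 16-byte gap behind it, so `countUntouchedBytes(n)` grows as `16 * n`.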
https://github.com/llvm/llvm-project/pull/112332