[llvm] [LowerMemIntrinsics] Use i8 GEPs in memcpy/memmove lowering (PR #112707)

Fabian Ritter via llvm-commits llvm-commits at lists.llvm.org
Mon Oct 21 02:28:55 PDT 2024


================
@@ -9,146 +9,128 @@ define void @issue63986(i64 %0, i64 %idxprom) {
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; CHECK-NEXT:    v_lshlrev_b64 v[4:5], 6, v[2:3]
 ; CHECK-NEXT:    s_mov_b64 s[4:5], 0
-; CHECK-NEXT:  ; %bb.1: ; %loop-memcpy-expansion.preheader
-; CHECK-NEXT:    v_lshlrev_b64 v[6:7], 6, v[2:3]
-; CHECK-NEXT:    s_mov_b64 s[6:7], 0
-; CHECK-NEXT:  .LBB0_2: ; %loop-memcpy-expansion
+; CHECK-NEXT:  .LBB0_1: ; %loop-memcpy-expansion
 ; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
-; CHECK-NEXT:    v_mov_b32_e32 v9, s7
-; CHECK-NEXT:    v_mov_b32_e32 v8, s6
-; CHECK-NEXT:    flat_load_dwordx4 v[8:11], v[8:9]
-; CHECK-NEXT:    s_add_u32 s4, s4, 1
+; CHECK-NEXT:    v_mov_b32_e32 v7, s5
+; CHECK-NEXT:    v_mov_b32_e32 v6, s4
+; CHECK-NEXT:    flat_load_dwordx4 v[6:9], v[6:7]
----------------
ritter-x2a wrote:

I believe this is because I changed the iteration scheme from iterating at a "number of load/store pairs" granularity with stride 1 to a "number of bytes" granularity with a stride of the store size. As one consequence, the generated IR no longer computes the number of iterations. I think it would be valid for the optimizer to reach the same result from the old and the new IR via strength reduction, but it seems like the O3 pipeline arrives at different results.
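Roughly, the difference between the two schemes looks like this (just a sketch, not the exact IR the pass emits; names are made up and a 16-byte loop access type is assumed):

```llvm
; old scheme: the index counts load/store pairs (stride 1), the GEPs use the
; loop access type, and the preheader has to compute a trip count (size / 16)
loop.old:
  %i = phi i64 [ 0, %preheader ], [ %i.next, %loop.old ]
  %src.gep = getelementptr inbounds <4 x i32>, ptr %src, i64 %i
  %dst.gep = getelementptr inbounds <4 x i32>, ptr %dst, i64 %i
  %v = load <4 x i32>, ptr %src.gep
  store <4 x i32> %v, ptr %dst.gep
  %i.next = add i64 %i, 1
  %cont = icmp ult i64 %i.next, %trip.count
  br i1 %cont, label %loop.old, label %exit

; new scheme: the index counts bytes (stride = store size), the GEPs use i8,
; and the loop compares the byte offset against the size directly
loop.new:
  %off = phi i64 [ 0, %preheader ], [ %off.next, %loop.new ]
  %src.gep = getelementptr inbounds i8, ptr %src, i64 %off
  %dst.gep = getelementptr inbounds i8, ptr %dst, i64 %off
  %v = load <4 x i32>, ptr %src.gep
  store <4 x i32> %v, ptr %dst.gep
  %off.next = add i64 %off, 16
  %cont = icmp ult i64 %off.next, %size
  br i1 %cont, label %loop.new, label %exit
```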

To keep the iteration scheme as it was, I would have needed to multiply the loop index by the store size in each iteration to compute the byte-wise GEP offset, as sketched below. That seemed messier to me (and I'm not sure it would have avoided changes in this test case).
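Something like this in every iteration (again just a sketch, assuming a 16-byte access type):

```llvm
; hypothetical alternative: keep the stride-1 index and scale it to bytes
%byte.off = mul i64 %i, 16
%src.gep  = getelementptr inbounds i8, ptr %src, i64 %byte.off
%dst.gep  = getelementptr inbounds i8, ptr %dst, i64 %byte.off
```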

https://github.com/llvm/llvm-project/pull/112707


More information about the llvm-commits mailing list