[llvm] [LowerMemIntrinsics] Use i8 GEPs in memcpy/memmove lowering (PR #112707)

Tue Oct 22 05:02:02 PDT 2024

================
@@ -9,146 +9,128 @@ define void @issue63986(i64 %0, i64 %idxprom) {
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; CHECK-NEXT:    v_lshlrev_b64 v[4:5], 6, v[2:3]
 ; CHECK-NEXT:    s_mov_b64 s[4:5], 0
-; CHECK-NEXT:  ; %bb.1: ; %loop-memcpy-expansion.preheader
-; CHECK-NEXT:    v_lshlrev_b64 v[6:7], 6, v[2:3]
-; CHECK-NEXT:    s_mov_b64 s[6:7], 0
-; CHECK-NEXT:  .LBB0_2: ; %loop-memcpy-expansion
+; CHECK-NEXT:  .LBB0_1: ; %loop-memcpy-expansion
 ; CHECK-NEXT:    ; =>This Inner Loop Header: Depth=1
-; CHECK-NEXT:    v_mov_b32_e32 v9, s7
-; CHECK-NEXT:    v_mov_b32_e32 v8, s6
-; CHECK-NEXT:    flat_load_dwordx4 v[8:11], v[8:9]
-; CHECK-NEXT:    s_add_u32 s4, s4, 1
+; CHECK-NEXT:    v_mov_b32_e32 v7, s5
+; CHECK-NEXT:    v_mov_b32_e32 v6, s4
+; CHECK-NEXT:    flat_load_dwordx4 v[6:9], v[6:7]
----------------
ritter-x2a wrote:

The preheader is gone and the main loop is also a bit shorter, with 13 instead of 15 ISA instructions, in this test. In terms of runtime, this doesn't make much of a difference: I ran some memcpy/memmove microbenchmarks on gfx1030 with and without the change, and in almost all cases the runtime is identical. A couple of well-aligned dynamically-sized (but small) memcpy cases in global memory are a bit faster with the new lowering.

https://github.com/llvm/llvm-project/pull/112707