[llvm] [AMDGPU] Use wider loop lowering type for LowerMemIntrinsics (PR #112332)

Wed Oct 16 05:17:29 PDT 2024

================
@@ -75,6 +75,13 @@ static cl::opt<size_t> InlineMaxBB(
     cl::desc("Maximum number of BBs allowed in a function after inlining"
              " (compile time constraint)"));
 
+// This default unroll factor is based on microbenchmarks on gfx1030.
+static cl::opt<unsigned> MemcpyLoopUnroll(
+    "amdgpu-memcpy-loop-unroll",
+    cl::desc("Unroll factor (affecting 4x32-bit operations) to use for memory "
+             "operations when lowering memcpy as a loop, must be a power of 2"),
----------------
arsenm wrote:

So the GEPs are used incorrectly. You can always do the indexing in byte units; you don't need to preserve the type this way.
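
For illustration only, here is a minimal plain C++ analogy of byte-unit indexing. It is not the LowerMemIntrinsics code from this PR, and `Chunk` and `copyByteIndexed` are hypothetical names. The loop tracks one byte offset and applies it to `unsigned char` pointers, which roughly corresponds to a GEP with an `i8` source element type in IR; the wide access type only appears at the load and store.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical 16-byte chunk standing in for a 4x32-bit loop operand type.
struct Chunk {
  std::uint32_t lanes[4];
};

// Copies `size` bytes from `src` to `dst`. Every address is computed as
// base + byte offset, so the index never depends on the access type.
void copyByteIndexed(unsigned char *dst, const unsigned char *src,
                     std::size_t size) {
  std::size_t offset = 0; // always measured in bytes

  // Main loop: one wide chunk per iteration, offset advances by its size.
  for (; offset + sizeof(Chunk) <= size; offset += sizeof(Chunk)) {
    Chunk tmp;
    std::memcpy(&tmp, src + offset, sizeof(Chunk)); // wide load
    std::memcpy(dst + offset, &tmp, sizeof(Chunk)); // wide store
  }

  // Residual: the same byte offset carries over, with no re-scaling
  // between element types.
  for (; offset < size; ++offset)
    dst[offset] = src[offset];
}
```

Because the offset is already in bytes, switching the wide type (or falling through to the residual copy) does not require converting an index from one element size to another.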

https://github.com/llvm/llvm-project/pull/112332