[llvm] [AMDGPU] Use wider loop lowering type for LowerMemIntrinsics (PR #112332)

Wed Oct 23 05:30:20 PDT 2024

================
@@ -75,6 +75,13 @@ static cl::opt<size_t> InlineMaxBB(
     cl::desc("Maximum number of BBs allowed in a function after inlining"
              " (compile time constraint)"));
 
+// This default unroll factor is based on microbenchmarks on gfx1030.
+static cl::opt<unsigned> MemcpyLoopUnroll(
+    "amdgpu-memcpy-loop-unroll",
+    cl::desc("Unroll factor (affecting 4x32-bit operations) to use for memory "
+             "operations when lowering memcpy as a loop, must be a power of 2"),
----------------
ritter-x2a wrote:

I just rebased this PR on top of trunk, which now includes the merged #112707. In 734289baec944c8ccf495f012bf58d1bff168f3b, I removed the then-obsolete `StoreSize == AllocSize` assertions and added test checks for the cases where those assertions would have been violated.
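
For context on the option in the quoted hunk, here is a minimal standalone sketch of the idea, not the code from this PR. It assumes each main-loop iteration moves `UnrollFactor` 4x32-bit (16-byte) operations and that the remainder is copied separately. The names `isPowerOf2`, `loopChunkBytes`, and `chunkedCopy` are hypothetical and do not exist in LLVM, and the single residual copy is a simplification.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: true if V is a nonzero power of two, mirroring the
// "must be a power of 2" requirement in the option description.
constexpr bool isPowerOf2(unsigned V) { return V != 0 && (V & (V - 1)) == 0; }

// Hypothetical helper: bytes moved per main-loop iteration, assuming one
// 4x32-bit operation is 16 bytes wide.
constexpr std::size_t loopChunkBytes(unsigned UnrollFactor) {
  return std::size_t{16} * UnrollFactor;
}

// Copies Len bytes from Src to Dst: wide chunks in the main loop, then one
// residual copy for the tail. Returns the number of main-loop iterations.
std::size_t chunkedCopy(void *Dst, const void *Src, std::size_t Len,
                        unsigned UnrollFactor) {
  assert(isPowerOf2(UnrollFactor) && "unroll factor must be a power of 2");
  const std::size_t Chunk = loopChunkBytes(UnrollFactor);
  auto *D = static_cast<unsigned char *>(Dst);
  const auto *S = static_cast<const unsigned char *>(Src);

  std::size_t Iterations = 0;
  std::size_t Offset = 0;
  // Main loop: only full chunks are copied here.
  for (; Offset + Chunk <= Len; Offset += Chunk, ++Iterations)
    std::memcpy(D + Offset, S + Offset, Chunk);

  // Residual: the tail that does not fill a whole chunk (may be 0 bytes).
  std::memcpy(D + Offset, S + Offset, Len - Offset);
  return Iterations;
}
```

With an unroll factor of 2, a 100-byte copy in this sketch runs three 32-byte iterations and leaves a 4-byte residual.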

https://github.com/llvm/llvm-project/pull/112332