[llvm] [AMDGPU] Use wider loop lowering type for LowerMemIntrinsics (PR #112332)

Fabian Ritter via llvm-commits llvm-commits at lists.llvm.org
Tue Oct 15 23:31:28 PDT 2024


================
@@ -75,6 +75,13 @@ static cl::opt<size_t> InlineMaxBB(
     cl::desc("Maximum number of BBs allowed in a function after inlining"
              " (compile time constraint)"));
 
+// This default unroll factor is based on microbenchmarks on gfx1030.
+static cl::opt<unsigned> MemcpyLoopUnroll(
+    "amdgpu-memcpy-loop-unroll",
+    cl::desc("Unroll factor (affecting 4x32-bit operations) to use for memory "
+             "operations when lowering memcpy as a loop, must be a power of 2"),
----------------
ritter-x2a wrote:

When it is not a power of two, the `(DL.getTypeStoreSize(LoopOpType) == DL.getTypeAllocSize(LoopOpType))` assertions that this patch introduces in LowerMemIntrinsics.cpp fail. The allocation sizes of these non-standard types are rounded up to the next power of two in our backend (because of their default alignment).
Mismatched alloc and store sizes are a problem because
- the GEP-based offset computation in the memcpy/memmove lowering uses the alloc size, whereas
- the number of bytes actually read/written by each load/store is determined by the store size.

Is there a better way to enforce that constraint? I thought about changing the option to specify the exponent instead, to avoid representing invalid states, but that seems unintuitive from a usage perspective.

https://github.com/llvm/llvm-project/pull/112332

