[llvm] [AMDGPU][DAG] Enable ganging up of memcpy loads/stores for AMDGPU (PR #96185)

Tue Jun 25 07:28:54 PDT 2024

================
@@ -67,6 +67,9 @@ AMDGPUTargetLowering::AMDGPUTargetLowering(const TargetMachine &TM,
   MaxStoresPerMemcpy = MaxStoresPerMemcpyOptSize = ~0U;
   MaxStoresPerMemmove = MaxStoresPerMemmoveOptSize = ~0U;
 
+  // Enable ganging up loads and stores in the memcpy DAG lowering.
----------------
ritter-x2a wrote:

@arsenm I ran some benchmarks for the other address spaces; in all cases, enabling this optimization either improved performance or left it unchanged compared to the original memcpy implementation.
The exact parameter value chosen among {8, 16, 32, 64} does not make much of a difference for the other address spaces either. I have changed the value in the PR from 32 to 16, since 16 performed slightly better when copying from the generic to the global address space.

At least in my benchmarks on gfx1030, different parameter values for different address spaces seem unnecessary. 

Regarding test coverage, I can easily generate llc tests for various parameter combinations; the resulting test can, however, grow quite large. For all combinations of the following parameters, a single test covering memmove and memcpy with auto-generated llc check lines reaches ~60k lines (about 1.5x the length of the largest AMDGPU codegen test so far):
```
  dst_address_spaces = [0, 1, 3, 5]
  src_address_spaces = [0, 1, 3, 4, 5]
  dst_alignments = [1, 2, 4]
  src_alignments = [1, 2, 4]
  sizes = [16, 31, 32]
```
Would that be a useful addition to our testing?
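
For a sense of scale, the grid above can be expanded with a short script. The following is only a rough sketch of how such a generator might look, not the script used for the numbers above: the function naming scheme, the `i64` length type, and the non-volatile `i1 false` flag are assumptions made for illustration.

```python
from itertools import product

# Parameter grid copied from the comment above.
dst_address_spaces = [0, 1, 3, 5]
src_address_spaces = [0, 1, 3, 4, 5]
dst_alignments = [1, 2, 4]
src_alignments = [1, 2, 4]
sizes = [16, 31, 32]
intrinsics = ["memcpy", "memmove"]


def ptr_type(addrspace):
    """Spell an opaque pointer type; address space 0 is written as plain `ptr`."""
    return "ptr" if addrspace == 0 else "ptr addrspace(%d)" % addrspace


def make_test_function(intrinsic, dst_as, src_as, dst_align, src_align, size):
    """Return the IR text of one test function (hypothetical naming scheme)."""
    name = "%s_p%d_p%d_sz%d_align_%d_%d" % (
        intrinsic, dst_as, src_as, size, dst_align, src_align)
    # Assumption: 64-bit length operand and a non-volatile call.
    callee = "@llvm.%s.p%d.p%d.i64" % (intrinsic, dst_as, src_as)
    dst_ty, src_ty = ptr_type(dst_as), ptr_type(src_as)
    lines = [
        "define void @%s(%s %%dst, %s %%src) {" % (name, dst_ty, src_ty),
        "entry:",
        "  call void %s(%s align %d %%dst, %s align %d %%src, i64 %d, i1 false)"
        % (callee, dst_ty, dst_align, src_ty, src_align, size),
        "  ret void",
        "}",
    ]
    return "\n".join(lines)


def all_test_functions():
    """Expand the full grid into one IR function per parameter combination."""
    grid = product(intrinsics, dst_address_spaces, src_address_spaces,
                   dst_alignments, src_alignments, sizes)
    return [make_test_function(*combo) for combo in grid]


if __name__ == "__main__":
    functions = all_test_functions()
    print("; %d test functions" % len(functions))
    print("\n\n".join(functions))
```

With this grid the script emits 2 * 4 * 5 * 3 * 3 * 3 = 1080 functions. The check lines would then be produced separately, for example with `update_llc_test_checks.py`, which is where most of the test's length would come from.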

https://github.com/llvm/llvm-project/pull/96185