[llvm] [AMDGPU][DAG] Enable ganging up of memcpy loads/stores for AMDGPU (PR #96185)
Fabian Ritter via llvm-commits
llvm-commits at lists.llvm.org
Thu Jun 20 07:40:15 PDT 2024
================
@@ -67,6 +67,9 @@ AMDGPUTargetLowering::AMDGPUTargetLowering(const TargetMachine &TM,
MaxStoresPerMemcpy = MaxStoresPerMemcpyOptSize = ~0U;
MaxStoresPerMemmove = MaxStoresPerMemmoveOptSize = ~0U;
+ // Enable ganging up loads and stores in the memcpy DAG lowering.
----------------
ritter-x2a wrote:
> I mean it's a problem when things get added for only a single target out of fear of regressing anything else. Also I don't see what target property would correspond to a value here. It's in terms of store count, not even bytes or load width or anything. I'm saying this is a bad API
Another argument for that: the limit as implemented actually covers not only stores but loads and stores combined. So a value of 32 here means that packets of 16 loads and 16 stores are ganged up.
I take it you're suggesting that we improve the API before using it?
I think the corresponding hardware property for AMDGPU would be something like the number of memory accesses that can be started before the first one finishes.
On AArch64, it is the number of memory operations that can be merged into one (i.e., 2 for ldp and stp).
For architectures with traditional vector operations, it might be the preferred number of vector lanes.
> We'd probably want different numbers based on the address spaces involved
I'll do some benchmarks for other address spaces to find out, then.
https://github.com/llvm/llvm-project/pull/96185
More information about the llvm-commits mailing list