[llvm] [AMDGPU][PromoteAlloca] Support memsets to ptr allocas (PR #80678)

Pierre van Houtryve via llvm-commits llvm-commits at lists.llvm.org
Mon Feb 5 07:20:50 PST 2024


================
@@ -521,10 +521,18 @@ static Value *promoteAllocaUserToVector(
       // For memset, we don't need to know the previous value because we
       // currently only allow memsets that cover the whole alloca.
       Value *Elt = MSI->getOperand(1);
-      if (DL.getTypeStoreSize(VecEltTy) > 1) {
-        Value *EltBytes =
-            Builder.CreateVectorSplat(DL.getTypeStoreSize(VecEltTy), Elt);
-        Elt = Builder.CreateBitCast(EltBytes, VecEltTy);
+      const unsigned BytesPerElt = DL.getTypeStoreSize(VecEltTy);
+      if (BytesPerElt > 1) {
+        Value *EltBytes = Builder.CreateVectorSplat(BytesPerElt, Elt);
+
+        // If the element type of the vector is a pointer, we need to first cast
+        // to an integer, then use a PtrCast.
+        if (VecEltTy->isPointerTy()) {
----------------
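For context, a minimal sketch of the casting the comment in the hunk describes (the helper and its exact shape are hypothetical, not the patch's actual code; it assumes the usual llvm/IR headers and the same `DataLayout`/`IRBuilder` setup as the surrounding function):

```cpp
// Hypothetical helper (sketch only): materialize the per-element value for a
// memset that fills a vectorized alloca. The memset byte is splatted to the
// element's store size; pointer elements cannot be bitcast from the byte
// splat directly, so they go through an integer of matching width followed
// by an inttoptr.
static Value *materializeMemSetElt(IRBuilder<> &Builder, const DataLayout &DL,
                                   Value *ByteVal, Type *VecEltTy) {
  const unsigned BytesPerElt = DL.getTypeStoreSize(VecEltTy);
  if (BytesPerElt == 1)
    return ByteVal;

  // Splat the memset byte into a <BytesPerElt x i8> vector.
  Value *EltBytes = Builder.CreateVectorSplat(BytesPerElt, ByteVal);

  if (VecEltTy->isPointerTy()) {
    // Cast the byte splat to an integer of the same bit width, then to a
    // pointer of the element type.
    Type *IntTy = Builder.getIntNTy(BytesPerElt * 8);
    Value *AsInt = Builder.CreateBitCast(EltBytes, IntTy);
    return Builder.CreateIntToPtr(AsInt, VecEltTy);
  }

  // Non-pointer element types can be reinterpreted from the byte splat.
  return Builder.CreateBitCast(EltBytes, VecEltTy);
}
```
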
Pierre-vh wrote:

@mariusz-sikora-at-amd I'm not sure, no strong opinion. I was thinking of doing it by flattening arrays (e.g. [2 x [3 x float]] becomes [6 x float]). I think the tricky part is resolving the GEPs correctly; it might be a bigger refactoring than it looks at first glance.
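
Roughly, the flattening and GEP rewriting I have in mind would look something like this (untested sketch, all names made up; real code would also have to deal with dynamic indices, structs, and GEPs that don't start at the alloca, which is where it gets bigger):

```cpp
// Collapse a nested array type into a single-dimension array.
static ArrayType *flattenArrayType(ArrayType *ArrTy) {
  uint64_t NumElts = 1;
  Type *EltTy = ArrTy;
  while (auto *Inner = dyn_cast<ArrayType>(EltTy)) {
    NumElts *= Inner->getNumElements();
    EltTy = Inner->getElementType();
  }
  // e.g. [2 x [3 x float]] -> [6 x float]
  return ArrayType::get(EltTy, NumElts);
}

// Fold the constant indices of a GEP into the nested alloca type into a
// single flat index, assuming the GEP starts with a leading zero index.
static std::optional<uint64_t> getFlatIndex(GetElementPtrInst *GEP,
                                            ArrayType *ArrTy) {
  uint64_t FlatIdx = 0;
  Type *CurTy = ArrTy;
  // Operand 0 is the pointer, operand 1 the leading zero; walk the rest.
  for (unsigned I = 2, E = GEP->getNumOperands(); I != E; ++I) {
    auto *CurArrTy = dyn_cast<ArrayType>(CurTy);
    auto *Idx = dyn_cast<ConstantInt>(GEP->getOperand(I));
    if (!CurArrTy || !Idx)
      return std::nullopt;
    CurTy = CurArrTy->getElementType();
    // Row-major stride: product of the remaining inner dimensions.
    uint64_t Stride = 1;
    for (Type *T = CurTy; isa<ArrayType>(T);
         T = cast<ArrayType>(T)->getElementType())
      Stride *= cast<ArrayType>(T)->getNumElements();
    FlatIdx += Idx->getZExtValue() * Stride;
  }
  // e.g. gep [2 x [3 x float]], ptr %a, i64 0, i64 1, i64 2 -> flat index 5
  return FlatIdx;
}
```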

One alternative may be to have some kind of "alloca canonicalization" pass earlier that does the flattening for us to enable PromoteAlloca better.

@arsenm I haven't lost track of that, but I also haven't found the time for it yet :/
Last time I thought about it, my idea was to change the pass so that it collects allocas, sorts them by profitability (number of users, plus whether any uses are in loops), and then greedily promotes them one by one until it runs out of budget. Would that be good?
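
Something like this, very roughly (untested sketch; the cost and promotion hooks are placeholders, not the pass's real APIs):

```cpp
// Placeholder hooks, standing in for whatever the pass would actually use.
unsigned estimateVGPRCost(AllocaInst *AI);
bool tryPromoteAllocaToVector(AllocaInst *AI);

struct AllocaCandidate {
  AllocaInst *AI;
  unsigned Score;
};

static void promoteGreedily(ArrayRef<AllocaInst *> Allocas, LoopInfo &LI,
                            unsigned BudgetInRegs) {
  SmallVector<AllocaCandidate, 8> Candidates;
  for (AllocaInst *AI : Allocas) {
    unsigned Score = 0;
    for (User *U : AI->users()) {
      ++Score;
      // Weight uses inside loops more heavily.
      if (auto *I = dyn_cast<Instruction>(U))
        if (LI.getLoopFor(I->getParent()))
          Score += 4; // placeholder loop bonus
    }
    Candidates.push_back({AI, Score});
  }

  // Most profitable allocas first.
  llvm::stable_sort(Candidates,
                    [](const AllocaCandidate &A, const AllocaCandidate &B) {
                      return A.Score > B.Score;
                    });

  // Promote greedily until the register budget is exhausted.
  unsigned UsedRegs = 0;
  for (const AllocaCandidate &C : Candidates) {
    unsigned Cost = estimateVGPRCost(C.AI);
    if (UsedRegs + Cost > BudgetInRegs)
      break;
    if (tryPromoteAllocaToVector(C.AI))
      UsedRegs += Cost;
  }
}
```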

https://github.com/llvm/llvm-project/pull/80678

