[PATCH] D152706: [AMDGPU] Use SSAUpdater in PromoteAlloca

Mon Jun 26 11:12:56 PDT 2023

arsenm requested changes to this revision.
arsenm added inline comments.
This revision now requires changes to proceed.

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp:341-358
+  bool HasSeenBB = false;
+  BasicBlock *BB = I->getParent();
+  while (InsertIt != WorkList.end()) {
+    // -> Instruction is in the same basic block. If the new instruction comes
+    // before it, insert here.
+    //
+    // -> Instruction is in a different basic block: If we've already seen the
----------------
Pierre-vh wrote:
> arsenm wrote:
> > Pierre-vh wrote:
> > > arsenm wrote:
> > > > Pierre-vh wrote:
> > > > > arsenm wrote:
> > > > > > This whole thing feels wrong to me. You shouldn't need to figure out any ordering, much less by using comesBefore repeatedly
> > > > > I think that the issue was introduced because I'm looking through things such as bitcasts, which means that we can have IR like this:
> > > > > ```
> > > > > bb1:
> > > > >   %x = alloca ...
> > > > >   %cast.x = cast %x to ...
> > > > > 
> > > > > bb2: 
> > > > >   store ... %x
> > > > >   store ... %cast.x
> > > > > ```
> > > > > 
> > > > > Then, when I iterate the users of the alloca and build the worklist, the worklist may have the store to %cast.x before the store to %x, so I need to sort the worklist to use SSAAUpdater.
> > > > > 
> > > > > I'm not sure how `LoadStorePromoter` would help, I've looked at it but it seems to be for a different purpose
> > > > I applied your patch and not that many tests fail. The one I'm looking at is promote_memcpy_inline_aggr. This has the multiple loads and stores in the same block case the LoadAndStorePromoter comment says the base SSAUpdater does not handle. I think this is probably the correct tool, you just have the added complexity that we're coercing the type of the accesses to vector instead of reusing the original 
> > > `LoadAndStorePromoter` would need some modifications to work here or I would need to first rewrite all of the load/stores to use vectors then run promote alloca later to do the promotion. For instance, `LoadAndStorePromoter` initializes SSAUpdater with the type of the first load/store
> > > 
> > > I don't mind using it if you think it's better to first rewrite all the load/stores then run `LoadAndStoreUpdater`, but IMO it's really not necessary, we'd make things more complex (and less direct) just to avoid a 5 lines function (I heavily simplified it)
> > I think the real question is how does LoadAndStorePromoter deal with multiple loads and stores in the same block? I doubt it's using a worklist sort. Can you use whatever technique it's using?
> It uses a system of buckets
> ```
>   // First step: bucket up uses of the alloca by the block they occur in.
>   // This is important because we have to handle multiple defs/uses in a block
>   // ourselves: SSAUpdater is purely for cross-block references.
>   DenseMap<BasicBlock *, TinyPtrVector<Instruction *>> UsesByBlock;
> ```
> And if, when promoting an inst, it sees there's more than one inst in the bucket, it does a linear scan of the basic block to find & promote all users.
> 
> I used something similar at first but I just found it more complicated, when this case just a simple sorted worklist does the trick. The only important part really is that it sees users in a given basic block in program order. There's multiple ways to achieve that.
> 
> If you're more comfortable with that approach then I can use it too, and just simplify it as much as possible for our use case
I think that makes more sense, sorting isn't ideal

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D152706/new/

https://reviews.llvm.org/D152706