[PATCH] D109870: [AMDGPU] Enable the pass "amdgpu-replace-lds-use-with-pointer"

Mon Sep 20 21:40:29 PDT 2021

hsmhsm added inline comments.

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp:199
+    auto *EI = &(*(EBB.getFirstInsertionPt()));
+    BasicBlock::reverse_iterator RIT(EBB.getTerminator());
+    while (!isa<AllocaInst>(*RIT) && (&*RIT != EI))
----------------
rampitec wrote:
> hsmhsm wrote:
> > arsenm wrote:
> > > hsmhsm wrote:
> > > > hsmhsm wrote:
> > > > > arsenm wrote:
> > > > > > Why is this using a reverse iterator? Why not just forward iterate since that's where the allocas are?
> > > > > (At least in the failed openmp test case), what I noticed is that alloca at the beginning are not contiguous one after other, but there are in between addressspace casts like below (I am not sure, if there is a possibility of other in-between instructions too other than addressspace). So I went with the reverse iteration. But, actually if we are able to successfully forward iterate, that would very neat.
> > > > > 
> > > > > alloca
> > > > > addresss-space-cast 
> > > > > alloca
> > > > > addresss-space-cast 
> > > > > ....
> > > > Another possibility is as below:
> > > > 
> > > > Since we only care about allocas inserted at the beginning of the block, and since we need to split the block before any call (since called functions is where the initialized pointers are accessed), and since it is safe to assume that all entry point allocas do appear before any call, we can forward traverse until first call, and split the block at that point.
> > > Can we fix the insertion of these allocas to not intersperse addrspacecasts?
> > Let me check if we can handle it.
> I though the idea was to fix the alloca insertion, not to move them. Even if we want to cluster them it does not seem to be a right place to do it here as it is a more general thing. Really I'd prefer to fix FE and produce a canonical IR.
Let me clarify and re-iterate few points w.r.t alloca (which you probably know), and to clarify the requirements of this patch w.r.t alloca, and then come to a common conclusion based on it.

1.  ALLOCA:
=========

--  LLVM has a notion of static and dynamic allocas. Static allocas are those which appear in the *entry* basic block and their size is known at compile time. All other are dynamic allocas, including those which appear in the *entry* basic block but their size is not known.
--  Both front-end and llvm opt passes usually (almost always) make sure that static allocas are inserted at the beginning or near beginning of the entry block, but their contiguity (continuous sequence of static allocas) is not guaranteed. For example, FE may insert in-between  addressspace casts, there could be gap between the FE inserted static allocas and inlined static allocas. Nevertheless, they do appear before any call  in the entry block, at the beginning or near beginning of the entry block.
--  Above is the general expected static alloca placement for better LLVM optimization, and all LLVM passes should respect this static alloca placement and they do.

2.  ALLOCA HANDLING IN CURRENT PATCH:
====================================

-- Naturally this patch should respect the placement of static allocas in the entry block and should split the entry block after all static allocas in the entry block. Note that this patch does not care about dynamic allocas.
-- Now, the question is how to find the last static alloca in the entry block  so that entry block split happens after it. There are two approaches.
   --- Forward iteration of entry block
   --- Backward iteration of entry block
-- Since the contiguity of static allocas are not guaranteed even though they are placed at the beginning or near beginning of the entry block, there is no neat way of forward iteration, however, backward iteration is kind of neat.
-- By noticing above, I had implemented the backward iteration first. But, Matt has asked me - why not forward iteration.
-- Hence I ended-up with the forward iterated static alloca clustering business as implemented in this patch.

3. WHAT IS NEXT?:
=============================

-- Go with backward iteration of entry block until we find very first static alloca in the backward direction.
-- Go with forward iteration. If we want to do this, we need a clustering work-around as implemented here. However, this work-around implementation should go as a separate work-around patch which we can revert later (as me and Matt discussed offline) when it is ensured that all static allocas are contiguous.

PS: Fixing the contiguity of static allocas in the entry block by changing FE or elsewhere OR handling AMDGPU backend issues (correctness issues) due to dynamic allocas is outside the scope of this patch, that we should take them as separate tasks. And, I do not have a luxury of keep on holding this patch until these issues are being investigated and fixed.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109870/new/

https://reviews.llvm.org/D109870