[PATCH] D109870: [AMDGPU] Enable the pass "amdgpu-replace-lds-use-with-pointer"

Tue Sep 21 00:24:50 PDT 2021

hsmhsm marked an inline comment as done.
hsmhsm added inline comments.

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp:199
+    auto *EI = &(*(EBB.getFirstInsertionPt()));
+    BasicBlock::reverse_iterator RIT(EBB.getTerminator());
+    while (!isa<AllocaInst>(*RIT) && (&*RIT != EI))
----------------
hsmhsm wrote:
> JonChesterfield wrote:
> > hsmhsm wrote:
> > > hsmhsm wrote:
> > > > rampitec wrote:
> > > > > hsmhsm wrote:
> > > > > > arsenm wrote:
> > > > > > > hsmhsm wrote:
> > > > > > > > hsmhsm wrote:
> > > > > > > > > arsenm wrote:
> > > > > > > > > > Why is this using a reverse iterator? Why not just forward iterate since that's where the allocas are?
> > > > > > > > > (At least in the failed openmp test case), what I noticed is that alloca at the beginning are not contiguous one after other, but there are in between addressspace casts like below (I am not sure, if there is a possibility of other in-between instructions too other than addressspace). So I went with the reverse iteration. But, actually if we are able to successfully forward iterate, that would very neat.
> > > > > > > > > 
> > > > > > > > > alloca
> > > > > > > > > addresss-space-cast 
> > > > > > > > > alloca
> > > > > > > > > addresss-space-cast 
> > > > > > > > > ....
> > > > > > > > Another possibility is as below:
> > > > > > > > 
> > > > > > > > Since we only care about allocas inserted at the beginning of the block, and since we need to split the block before any call (since called functions is where the initialized pointers are accessed), and since it is safe to assume that all entry point allocas do appear before any call, we can forward traverse until first call, and split the block at that point.
> > > > > > > Can we fix the insertion of these allocas to not intersperse addrspacecasts?
> > > > > > Let me check if we can handle it.
> > > > > I though the idea was to fix the alloca insertion, not to move them. Even if we want to cluster them it does not seem to be a right place to do it here as it is a more general thing. Really I'd prefer to fix FE and produce a canonical IR.
> > > > Let me clarify and re-iterate few points w.r.t alloca (which you probably know), and to clarify the requirements of this patch w.r.t alloca, and then come to a common conclusion based on it.
> > > > 
> > > > 1.  ALLOCA:
> > > > =========
> > > > 
> > > > --  LLVM has a notion of static and dynamic allocas. Static allocas are those which appear in the *entry* basic block and their size is known at compile time. All other are dynamic allocas, including those which appear in the *entry* basic block but their size is not known.
> > > > --  Both front-end and llvm opt passes usually (almost always) make sure that static allocas are inserted at the beginning or near beginning of the entry block, but their contiguity (continuous sequence of static allocas) is not guaranteed. For example, FE may insert in-between  addressspace casts, there could be gap between the FE inserted static allocas and inlined static allocas. Nevertheless, they do appear before any call  in the entry block, at the beginning or near beginning of the entry block.
> > > > --  Above is the general expected static alloca placement for better LLVM optimization, and all LLVM passes should respect this static alloca placement and they do.
> > > > 
> > > > 2.  ALLOCA HANDLING IN CURRENT PATCH:
> > > > ====================================
> > > > 
> > > > -- Naturally this patch should respect the placement of static allocas in the entry block and should split the entry block after all static allocas in the entry block. Note that this patch does not care about dynamic allocas.
> > > > -- Now, the question is how to find the last static alloca in the entry block  so that entry block split happens after it. There are two approaches.
> > > >    --- Forward iteration of entry block
> > > >    --- Backward iteration of entry block
> > > > -- Since the contiguity of static allocas are not guaranteed even though they are placed at the beginning or near beginning of the entry block, there is no neat way of forward iteration, however, backward iteration is kind of neat.
> > > > -- By noticing above, I had implemented the backward iteration first. But, Matt has asked me - why not forward iteration.
> > > > -- Hence I ended-up with the forward iterated static alloca clustering business as implemented in this patch.
> > > > 
> > > > 3. WHAT IS NEXT?:
> > > > =============================
> > > > 
> > > > -- Go with backward iteration of entry block until we find very first static alloca in the backward direction.
> > > > -- Go with forward iteration. If we want to do this, we need a clustering work-around as implemented here. However, this work-around implementation should go as a separate work-around patch which we can revert later (as me and Matt discussed offline) when it is ensured that all static allocas are contiguous.
> > > > 
> > > > PS: Fixing the contiguity of static allocas in the entry block by changing FE or elsewhere OR handling AMDGPU backend issues (correctness issues) due to dynamic allocas is outside the scope of this patch, that we should take them as separate tasks. And, I do not have a luxury of keep on holding this patch until these issues are being investigated and fixed.
> > > With the above clarification, I will again stick to backward iteration of entry block and get rid of entering into unnecessary business of alloca clustering (from this patch perspective). 
> > > --  Both front-end and llvm opt passes usually (almost always) make sure that static allocas are inserted at the beginning or near beginning of the entry block, but their contiguity (continuous sequence of static allocas) is not guaranteed. For example, FE may insert in-between  addressspace casts, there could be gap between the FE inserted static allocas and inlined static allocas. Nevertheless, they do appear before any call  in the entry block, at the beginning or near beginning of the entry block.
> > > --  Above is the general expected static alloca placement for better LLVM optimization, and all LLVM passes should respect this static alloca placement and they do.
> > 
> > If this is an invariant that passes maintain, which essentially means the IR verifier rejects other forms and preferably it is documented, then we can rely on it here. If however it is merely a feature of the test cases you've looked at then we cannot.
> > 
> > If unsure, please verify (by walking over the entry basic block and looking) that your required precondition is met before applying the transform. Given this is not a lowering pass, it's a transform that trades off cycles for memory, it is not justified to risk breaking applications in exchange for not dynamically checking the assumptions.
> >> If this is an invariant that passes maintain, which essentially means the IR verifier rejects other forms and preferably it is documented, then we can rely on it here. If however it is merely a feature of the test cases you've looked at then we cannot.
> 
> This is *not* something which is required from correctness perspective, which is something required from *better optimization* perspective. So, IR verifier cannot reject other forms as you mention above. And it is documented here https://llvm.org/docs/Frontend/PerformanceTips.html#use-of-allocas after related discussion happened here https://lists.llvm.org/pipermail/llvm-dev/2015-September/090168.html
> 
> >> If unsure, please verify (by walking over the entry basic block and looking) that your required precondition is met before applying the transform. Given this is not a lowering pass, it's a transform that trades off cycles for memory, it is not justified to risk breaking applications in exchange for not dynamically checking the assumptions.
> 
> As I clarified in my previous lengthy comment, all I need to do here is just to respect static alloca placement (from better optimization perspective). I do not understand why I need to do what you tell above. If application breaks because of dynamic allocas or something else, then it is altogether a different issue and not related to breaking of functionality due to this patch. Let's root cause and fix them when we encounter them. Why should I overcomplicate this patch by adding unrelated stuff into this patch?
Also [if you have not yet read it], please go through the reply's to the related alloca question that I had recently posted to llvm-dev. Search for the message - Alloca, Inlining and Entry Basic Block.

Though, I have not yet replied to it.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109870/new/

https://reviews.llvm.org/D109870