[PATCH] D109594: [AMDGPU] Initialize LDS pointers after alloca, but before call.

Mon Sep 13 09:20:35 PDT 2021

hsmhsm added a comment.

In D109594#2997132 <https://reviews.llvm.org/D109594#2997132>, @JonChesterfield wrote:

> In D109594#2997110 <https://reviews.llvm.org/D109594#2997110>, @foad wrote:
>
>> It's a QOI thing. You want to try hard to leave allocas in the entry block if possible, because LLVM convention is that allocas in the entry block are static...
>
> That sounds right. However, this transform, by moving calls out of the entry block, will itself have that effect if those calls are inlined.
>
> Alternatives are to emit the stores (probably as relaxed atomic) in the entry block, such that every lane executes it but we don't split the CFG, or to add a fairly late pass that hoists alloca into the entry bb.
>
> I'd be inclined to do both. Hoisting 'dynamic' alloca into entry will fix some miscompilation (I haven't looked recently, but ~ 6 months ago alloca outside of entry was an error in the backend) and/or make things faster. Emitting the store from all lanes instead of branching means, well, less branching, but also we don't rearrange the entry block into a CFG.
>
> If an atomic store of a uniform value is better expressed as masking off all lanes but one, I suspect we're better off doing that transform once exec is available for manipulation. Somewhere in MIR.

Often, (probably because of too many things on the one's plate) we tend to forget what we discussed and finalized in the past except one who actually works on it.

After lengthy discussion of possible ways of initializing LDS pointers at the entry block of kernels (including relaxed atomic one) we finalized about 0th lane from each wave to do initialization (which is nothing but the stores that you indicated above). So I did not take the route of relaxed atomic approach.

Yes store to LDS pointers should happen within the entry basic block of the kernel, before loading back them within non-kernel functions (that is before any call within kernel)

Now, we need to do these stores after aloca, otherwise aloca will be moved out of entry basic block into the newly split block which will have adverse side effect.

By the time this pass is run, inlining pass is already run, but, I am still safely checking for the possibility of alloca being living somewhere after call (in the entry block) and avoid running this pass in that case, which is a safe bet from the correctness point of view.

As Jay mentioned, though in theory it is true that alloca can live anywhere within the function, it is better and safe to cluster them at the beginning of the entry block and that is what usually happen most of the time. Please refer - (1) https://lists.llvm.org/pipermail/llvm-dev/2015-September/090168.html   (2) https://llvm.org/docs/Frontend/PerformanceTips.html#use-of-allocas

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109594/new/

https://reviews.llvm.org/D109594