[llvm] [AMDGPU] Improve isBasicBlockPrologue to only add necessary instructions (PR #113303)

Wed Oct 30 11:11:51 PDT 2024

alex-t wrote:

> Thanks for doing this! But frankly speaking, this is making the implementation more confusing. I think most of the weirdness in the implementation is caused by the way the target hook is defined. Querying whether each instruction is block prologue seems not the right way to go.
> 
> The basic idea behind the block prologue is that we have a special instruction that setup the exec for the block, all the instructions before this specific instruction are prologue instructions. So the prologue searching process should work like forward-iterating the instructions until we see the first instruction that modifies exec, which would be the last prologue instruction. There are ways to speed up the searching process to avoid visiting all the instructions in the worst case. For example, we can check the instruction types that could possibly in the prologue. If we found an instruction that is **NOT** in the list of possible prologue instructions before we see an exec setup instruction, then we should return and report that there is no prologue instruction at all.
> 
> Following the idea, maybe we should define the hook as `skipBlockPrologue(MachineBasicBlock, beginIterator)` which would checking for prologue instruction from the `beginIterator`, and return the iterator right after the last prologue instruction (if there is no prologue instruction, just return the beginIterator). Then the implementation would be a simple forward iteration over the instructions with no recursion. With the returned iterator, I think the caller should be able to do whatever they want on the prologue instructions. Sounds reasonable?

The current implementation aims to check if the instruction in question produces operands for another prologue instruction, which causes the recursion. The possible user of the register defined by the instruction might be not only the last instruction in the prologue (i.e. EXEC modification) but, for example, WWM-reload using SGPR as offset.

Also, please note, that the latest version of the algorithm returns false immediately as the next instruction is not a "prologue".
So, we never walk along all the instructions in the block. In the worst case, we walk along the prologue sequence, which usually is not very long.

I agree that iterating through the whole prologue on each query is inconvenient. To change that we'd have to change all the call sites in LLVM core code. Most are looking for the insertion point for the concrete basic block walking from the block beginning.
Such a change would affect all targets but we could create a hook that does the same walk by default calling isBasicBlockPrologue but does something smart for AMDGPU.
The problem is that the TargetInstrInfo is a stateless class so we cannot store information in it. The best approach would be to lazily compute a kind of a MachineInstr property (a flag?) and cache it. If we could, the query for the "prologue" property is immediate.
I was thinking of adding a bit in tsFlags as I know there exist few free positions. 

https://github.com/llvm/llvm-project/pull/113303