[PATCH] D143731: [AMDGPU] Break-up large PHIs for DAGISel

Tue Feb 14 00:14:21 PST 2023

Pierre-vh marked an inline comment as done.
Pierre-vh added a comment.

In D143731#4124715 <https://reviews.llvm.org/D143731#4124715>, @arsenm wrote:

> In D143731#4122195 <https://reviews.llvm.org/D143731#4122195>, @Pierre-vh wrote:
>
>> I kept the threshold at 256 for now because lower values make the transform kick more often and it's not always clear to me whether it's a positive or negative thing. Some tests have more instructions, some have a bit less, etc.
>> I think 256 is good enough to get started, we can always tweak it in a separate commit if benchmarking proves that it's useful. What do you think?
>>
>> Also the following tests are currently broken:
>>
>> - `test_mfma_loop_32xfloat` in `acc-ldst.ll`: We get accvgpr read/writes instead of a load to agpr directly.
>
> Ideally this is the kind of case RegBankSelect is supposed to solve
>
>> I think we need an additional heuristic but I'm not sure what. Maybe we should not transform when one of the value comes from a load (we know there won't be an undef so it should be good) or comes from the same block (loop)?
>> I'm also not sure if extract/insert subvector with a constant operand (in the case of zeroinit) gets folded out; do I need to special-case it?
>
> Really this is a register bank select problem. Fundamentally we always want 32-bit phis. Any regression from not using 32-bit phis is theoretically a failure with some other optimization

In this case I think it's a SIFoldOperand issue (or something strange happening higher that blocks SIFoldOperands). It's not folding the AGPR copies away.
If PHIs aren't broken, we get the areg copy in the entry block:

  %181:vreg_1024_align2 = REG_SEQUENCE (..vreg128s)
  %1:vreg_1024_align2 = COPY %181:vreg_1024_align2
  %243:areg_1024_align2 = COPY %1:vreg_1024_align2, implicit $exec

But when PHIs are broken, we get it in the loop closer to V_MFMA and after the PHIs. I suspect SIFoldOperand doesn't handle the PHI node to "propagate" the areg class up until it reaches the store.

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp:1415
+
+  return B.CreateExtractVector(SubVecTy, Val, B.getInt64(Idx));
+}
----------------
arsenm wrote:
> This must be a new intrinsic, I didn't know it existed. Usually you would do a shufflevector to get the reduced elements
If shufflevector has better handling and gets folded out for sure then it might be simpler to use it.
Do I just do a shufflevector with the same operand twice + the mask of the indexes I want?

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp:1438
+    bool IsVec = false;
+    std::vector<Value *> IncomingValues = {};
+    PHINode *NewPHI = nullptr;
----------------
arsenm wrote:
> Don't need = {}
``` 
warning: missing field 'IncomingValues' initializer [-Wmissing-field-initializers]
```

It's needed because I use brace initializers

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D143731/new/

https://reviews.llvm.org/D143731