[PATCH] D128252: [AMDGPU] Lowering VGPR to SGPR copies to v_readfirstlane_b32 if profitable.

Thu Jul 14 06:53:42 PDT 2022

alex-t added a comment.

In D128252#3651612 <https://reviews.llvm.org/D128252#3651612>, @nhaehnle wrote:

> Perhaps I'm blind, but I don't see where the heuristic takes into account the number of ALU instructions that would be moved from SALU to VALU.

I am not sure that I understand the question.
Currently, we only have VGPR to SGPR copies for the cases where uniform instruction produces VGPR. It's uniform users that are SALU need SGPR.
For each copy we have a choice - convert all its users to VALU or replace the copy with v_readfirstlane_b32.  The algorithm computes the tradeoff.
Inserting v_readfirstlane_b32 we add yet one more VALU instruction. If the SALU chain spawned by the copy is 2 instructions long, inserting v_readfirstlane_b32 is an overkill.
We agreed on the heuristic: 1 v_readfirstlane_b32 for at least 3 SALU.
The score is the length of SALU chain minus the number of v_readfirstlane_b32 to insert and minus the number of SGPR to VGPR copies that need to get the result back to VALU.

So, if we decide to move the whole chain to VALU it already means that keeping it SALU (by replacing the copy with v_readfirstlane_b32) is not profitable.
The number of the SALU instructions that are converting VALU, in this case, is the SChain.size()

================
Comment at: llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp:1134
+      else
+        MIB.addReg(SrcReg);
+    } else {
----------------
rampitec wrote:
> rampitec wrote:
> > alex-t wrote:
> > > rampitec wrote:
> > > > What happens to 16 bit subregs?
> > > VGPR to SGPR copies are inserted by InstrEmitter to adjust the VALU result to the SALU consumer.
> > > The 16bits in VGPR result are packed and adjusted to the consumer by inserting the EXCTRACT_ELEMENT lowered in another place.
> > > What kind of adjustment would you recommend if we have a 16bit VGPR source?
> > > Zero-extend it to 32bit?
> > > 
> > Assume the input like:
> > ```
> > %0:SGPR_LO16 = COPY %1.lo16:VGPR_32
> > ```
> > If I read it right it will produce V_READFIRSTLANE_B32 with a 16 bit destination and source, which does not work. Assume that selection managed to produce such input, which path will it take here?
> JBTW, right now it seems to go via moveToVALU:
> 
> ```
> # RUN: llc -march=amdgcn -mcpu=gfx1100 -run-pass=si-fix-sgpr-copies -verify-machineinstrs -o - %s
> 
> ---
> name:            v16_to_s16
> body:             |
>   bb.0:
>     %0:vgpr_32 = IMPLICIT_DEF
>     %1:sgpr_lo16 = COPY %0.lo16:vgpr_32
>     %2:vgpr_lo16 = COPY %1
>     S_ENDPGM 0, implicit %1, implicit %2
> ...
> ```
> Results in:
> ```
>     %0:vgpr_32 = IMPLICIT_DEF
>     %3:vgpr_lo16 = COPY %0.lo16
>     %2:vgpr_lo16 = COPY %3
>     S_ENDPGM 0, implicit %3, implicit %2
> ```
That is what I tried to explain. To reach the place we're talking about the copy should spawn SALU chain long enough to be selected for v_readfirstlane_b32. We have no SALU instructions that accept 16bit operand explicitly. All 16bit SALU really take 32bit SGPR but only use HI/LO half of it. So, we cannot create the MIR for the case you are concerned with.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D128252/new/

https://reviews.llvm.org/D128252