[PATCH] D96336: [AMDGPU] Save VGPR of whole wave when spilling

Sebastian Neubauer via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Feb 10 01:33:40 PST 2021


sebastian-ne added a comment.

> I am a little dubious about the whole approach.

Me too, I’m also not happy about needing inline assembly, so if you have an idea to improve some or all of that, I’m all ears.

> If every SGPR spill that goes to scratch has to do an extra store+load (or multiple) then is that not potentially worse than the performance hit of reserving an entire VGPR for spilling in the case that we know we are going to have to use one? (I guess perhaps we have no way of knowing we need one?)

Yes. If we knew that we need to spill an SGPR, we would just reserve a VGPR to spill the SGPR to and not spill to scratch at all. The problem is, we don’t know. (Matt plans to fix that by splitting register allocation into two phases, first allocating SGPRs, then VGPRs.)

> I get that this is basically an edge case (and we want to avoid SGPR spill to memory in the long run through other changes), but I wonder if we can qualify/quantify how rare this edge case is?

I fear it’s less rare than we want. We hit this bug in a not-so-big shader that was forced to run at high occupancy, which limited it to 64 VGPRs. However, it should become rare once register allocation always spills SGPRs to VGPRs.

> As an aside, if we are moving to using flat scratch in the main, is it possible to replace most of this with s_scratch_store / s_scratch_load and avoid the need for a VGPR entirely?

That would make sense, but as far as I can tell, the s_scratch instructions were removed in newer hardware.

> We currently unconditionally reserve one VGPR for SGPR spills.

Interesting, I missed that. As the VGPR is reserved in `SITargetLowering::finalizeLowering`, this is currently not done for GlobalISel?



================
Comment at: llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp:1094
+    Register ExecReg = isWave32 ? AMDGPU::EXEC_LO : AMDGPU::EXEC;
+    unsigned ExecXorOpc = isWave32 ? AMDGPU::S_XOR_B32 : AMDGPU::S_XOR_B64;
+    for (unsigned I = 0; I < 2; I++) {
----------------
critson wrote:
> Any reason for XOR rather than NOT?
No reason, I’ll change that. Thanks for the note.
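
For context, the two-iteration loop flips exec so the VGPR gets stored for both the active and the inactive lanes (scratch buffer stores are swizzled per lane, so the same offset serves every lane). With NOT instead of XOR, the wave64 sequence would read roughly like this (a sketch of the intended codegen, register numbers made up, not actual compiler output):

```
buffer_store_dword v0, off, s[52:55], 0  ; store v0 for the currently active lanes
s_not_b64 exec, exec                     ; invert exec to enable the inactive lanes
buffer_store_dword v0, off, s[52:55], 0  ; store v0 for the previously inactive lanes
s_not_b64 exec, exec                     ; restore the original exec mask
```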


================
Comment at: llvm/test/CodeGen/AMDGPU/partial-sgpr-to-vgpr-spills.ll:1115
+; GCN-NEXT:    buffer_store_dword v0, off, s[52:55], 0 ; 4-byte Folded Spill
+; GCN-NEXT:    s_mov_b64 exec, s[0:1]
+; GCN-NEXT:    s_mov_b64 s[0:1], exec
----------------
critson wrote:
> These two instructions are not doing anything.
Right, I’m working on fixing that in a later patch, similar to Jay’s optimization.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D96336/new/

https://reviews.llvm.org/D96336


