[PATCH] D96336: [AMDGPU] Save VGPR of whole wave when spilling

Tue Feb 9 18:29:21 PST 2021

critson added a comment.

I am a little dubious about the whole approach.

If every SGPR spill that goes to scratch has to do an extra store+load (or multiple) then is that not potentially worse than the performance hit of reserving an entire VGPR for spilling in the case that we know we are going to have to use one? (I guess perhaps we have no way of knowing we need one?)

I get that this is basically an edge case (and we want to avoid SGPR spill to memory in the long run through other changes), but I wonder if we can qualify/quantify how rare this edge case is?
If it is truly rare, then I guess it matter a lot less how performant the resulting code is.

As an aside, if we are moving to using flat scratch in the main, is it possible to replace most of this with s_scratch_store / s_scratch_load and avoid the need for an VGPR entirely?

================
Comment at: llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp:1081
+    // Save exec and activate all lanes
+    BuildMI(*MBB, MI, DL, TII->get(ExecOrOpc), SavedExecReg).addImm(-1);
+
----------------
piotr wrote:
> Not sure if that's a concern in this context, but doesn't it potentially clobber SCC?
Clobbering SCC is an ongoing concern with spill code; however, buildSpillLoadStore can already generate arithmetic instructions which can clobber SCC, so this is not a new concern.
It's possible that if there is an issue we are going to run into it faster as this will clobber SCC everytime.

================
Comment at: llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp:1094
+    Register ExecReg = isWave32 ? AMDGPU::EXEC_LO : AMDGPU::EXEC;
+    unsigned ExecXorOpc = isWave32 ? AMDGPU::S_XOR_B32 : AMDGPU::S_XOR_B64;
+    for (unsigned I = 0; I < 2; I++) {
----------------
Any reason for XOR rather than NOT?

================
Comment at: llvm/test/CodeGen/AMDGPU/partial-sgpr-to-vgpr-spills.ll:1115
+; GCN-NEXT:    buffer_store_dword v0, off, s[52:55], 0 ; 4-byte Folded Spill
+; GCN-NEXT:    s_mov_b64 exec, s[0:1]
+; GCN-NEXT:    s_mov_b64 s[0:1], exec
----------------
These two instructions are not doing anything.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D96336/new/

https://reviews.llvm.org/D96336