[PATCH] D96517: [AMDGPU] Optimize SGPR to scratch spilling

Fri Feb 19 01:25:02 PST 2021

sebastian-ne added a comment.

> Shall we optimize the cases where only 1 or 2 SGPRs are to be spilled or reloaded when there's a VGPR scavenged? In this case, we only need one or two loads/stores to spill/reload that SGPR.

The “v_mov and readfirstlane” approach doesn’t work when exec=0, because v_mov only writes lanes that are active in exec.

However, you sparked an idea:
If we can scavenge an SGPR, we can use that to save the VGPR lanes that we clobber.

For example, suppose we want to spill s0 to scratch and s5 is currently unused:

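  v_readlane_b32 s5, v0, 0     ; Save lane 0 of v0 in s5
  v_writelane_b32 v0, s0, 0    ; Copy s0 into lane 0 of v0
  s_mov_b32 s0, exec           ; Save exec; the value of s0 now lives in v0
  s_mov_b32 exec, 1            ; Enable only lane 0
  buffer_store_dword_offset v0, …  ; Store lane 0 of v0 (the s0 value) to memory
  s_mov_b32 exec, s0           ; Restore exec
  v_writelane_b32 v0, s5, 0    ; Restore lane 0 of v0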

Restoring:

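  v_readlane_b32 s5, v0, 0     ; Save lane 0 of v0 in s5
  v_writelane_b32 v0, s0, 0
  s_mov_b32 s0, exec           ; Save exec in s0
  s_mov_b32 exec, 1            ; Enable only lane 0
  buffer_load_dword_offset v0, …  ; Load the spilled value from memory into lane 0 of v0
  s_mov_b32 exec, s0           ; Restore exec
  v_readlane_b32 s0, v0, 0     ; Copy the loaded value into s0
  v_writelane_b32 v0, s5, 0    ; Restore lane 0 of v0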
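
To sanity-check the two sequences above, here is a toy Python model of a single wave32. It is only a sketch, not code from the patch: the `Wave` class and the `spill_s0`/`restore_s0` helpers are made-up names, scratch is modelled as one dword per lane, and it assumes that v_readlane/v_writelane ignore exec while the buffer load/store only touches active lanes.

```python
WAVE_SIZE = 32  # toy wave32; all names in this block are hypothetical


class Wave:
    """Minimal model: two SGPRs, one VGPR (v0), exec, per-lane scratch."""

    def __init__(self):
        self.sgpr = {"s0": 0, "s5": 0}
        self.v0 = [0] * WAVE_SIZE
        self.exec_mask = (1 << WAVE_SIZE) - 1
        self.scratch = [0] * WAVE_SIZE  # assumption: one dword per lane

    def readlane(self, dst, lane):
        # v_readlane_b32 dst, v0, lane (assumed to ignore exec)
        self.sgpr[dst] = self.v0[lane]

    def writelane(self, src, lane):
        # v_writelane_b32 v0, src, lane (assumed to ignore exec)
        self.v0[lane] = self.sgpr[src]

    def store(self):
        # Buffer store of v0: only lanes active in exec write memory.
        for lane in range(WAVE_SIZE):
            if (self.exec_mask >> lane) & 1:
                self.scratch[lane] = self.v0[lane]

    def load(self):
        # Buffer load into v0: only lanes active in exec are overwritten.
        for lane in range(WAVE_SIZE):
            if (self.exec_mask >> lane) & 1:
                self.v0[lane] = self.scratch[lane]


def spill_s0(w):
    w.readlane("s5", 0)            # save lane 0 of v0
    w.writelane("s0", 0)           # s0 -> lane 0 of v0
    w.sgpr["s0"] = w.exec_mask     # s_mov_b32 s0, exec
    w.exec_mask = 1                # s_mov_b32 exec, 1
    w.store()
    w.exec_mask = w.sgpr["s0"]     # s_mov_b32 exec, s0
    w.writelane("s5", 0)           # restore lane 0 of v0


def restore_s0(w):
    w.readlane("s5", 0)            # save lane 0 of v0
    w.writelane("s0", 0)           # kept to mirror the listing; the load overwrites lane 0
    w.sgpr["s0"] = w.exec_mask     # s_mov_b32 s0, exec
    w.exec_mask = 1                # s_mov_b32 exec, 1
    w.load()
    w.exec_mask = w.sgpr["s0"]     # s_mov_b32 exec, s0
    w.readlane("s0", 0)            # loaded value -> s0
    w.writelane("s5", 0)           # restore lane 0 of v0


if __name__ == "__main__":
    w = Wave()
    w.v0 = list(range(100, 100 + WAVE_SIZE))
    w.sgpr["s0"] = 0xDEADBEEF
    w.exec_mask = 0                # the case that breaks v_mov + readfirstlane
    spill_s0(w)
    restore_s0(w)
    print(hex(w.sgpr["s0"]), w.v0[0], w.exec_mask)
```

In this model the round trip works even with exec=0, since the only exec-dependent steps run after exec has been forced to 1.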

The downside is that it makes the code even more complicated, especially the restore, where exec must be exactly 1 so that the load does not clobber other lanes of v0. As written, the code above therefore only works in wave32 mode, not in wave64 mode. The exception is when v0 is a scavenged register, i.e. it is unused in the currently active lanes: then we are allowed to clobber the currently active lanes of v0, and the code above also works in wave64 mode.
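
The wave64 concern can be illustrated with a small Python sketch. It rests on an assumption that is my reading of the comment rather than something it states: a 32-bit write to exec only changes the low half, so in wave64 any active upper lanes stay enabled during the load. The helper names (`set_exec_lo`, `masked_load`, `clobbered_lanes`) are hypothetical.

```python
WAVE64 = 64
LOW32 = 0xFFFFFFFF


def set_exec_lo(exec_mask, value):
    """Model a 32-bit write to exec: only the low 32 bits change (assumed)."""
    return (exec_mask & ~LOW32) | (value & LOW32)


def masked_load(v0, scratch, exec_mask):
    """Model a per-lane load: every lane active in exec is overwritten."""
    return [scratch[i] if (exec_mask >> i) & 1 else v0[i]
            for i in range(len(v0))]


def clobbered_lanes(exec_before):
    """Lanes of v0, other than lane 0, that the restore's load overwrites.

    Lane 0 is excluded because the sequence saves and restores it via s5.
    """
    v0 = list(range(1000, 1000 + WAVE64))   # distinct, non-zero live values
    scratch = [0] * WAVE64                  # whatever happens to be in memory
    exec_during = set_exec_lo(exec_before, 1)
    after = masked_load(v0, scratch, exec_during)
    return [i for i in range(1, WAVE64) if after[i] != v0[i]]


if __name__ == "__main__":
    print(clobbered_lanes(0xFFFFFFFF))      # no upper lanes active
    print(clobbered_lanes((1 << 64) - 1))   # all 64 lanes active
```

Under these assumptions the clobbered lanes are always a subset of the lanes that were active before the sequence. That matches the scavenged-register exception: if v0 holds nothing live in the active lanes, overwriting them is harmless.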

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D96517