[llvm] [AMDGPU] Do not count implicit VGPRs in SIInsertWaitcnts (PR #109049)

Thu Sep 19 11:49:37 PDT 2024

================
@@ -1752,6 +1752,15 @@ bool SIInsertWaitcnts::generateWaitcntInstBefore(MachineInstr &MI,
         const bool IsVGPR = TRI->isVectorRegister(*MRI, Op.getReg());
         for (int RegNo = Interval.first; RegNo < Interval.second; ++RegNo) {
           if (IsVGPR) {
+            // Implicit VGPR defs and uses are never a part of the memory
+            // instructions description and usually present to account for
+            // super-register liveness. Tied implicit sources on loads though
+            // are real uses.
+            // TODO: Most of the other instructions also have implicit uses
+            // for the liveness accounting only.
+            if (Op.isImplicit() && MI.mayLoadOrStore() && !Op.isTied())
----------------
rampitec wrote:

The failing test was image-waterfall-loop-O0.ll; the `s_waitcnt vmcnt(0)` below was missing:
```
   v_mov_b32_e32 v3, s4
   ; kill: killed $vgpr4
   s_xor_saveexec_b32 s4, -1
   s_waitcnt vmcnt(0)
   buffer_load_dword v0, off, s[0:3], s32 offset:80 ; 4-byte Folded Reload
   buffer_load_dword v2, off, s[0:3], s32 offset:84 ; 4-byte Folded Reload
```
This is how the pass debug log looks if I remove the tied check:
```
    VM_CNT(2): 1:v0 0:v4
    LGKM_CNT(0):
    EXP_CNT(0):
    VS_CNT(86):

$vgpr0 = BUFFER_LOAD_DWORD_OFFSET $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr32, 80, 0, 0, implicit $exec, implicit $vgpr0(tied-def 0) :: (load (s32) from %stack.16, addrspace 5)
```
So it reads v0 and merges the load back. This may not be needed for a dword load, but what if we read 16 bits and preserve the other half? The pattern would be the same: a tied def.
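
To illustrate the condition, here is a toy model of the check. It is only a sketch: `Operand`, `Instr`, `skipForWaitcnt`, and `needsVmWait` are hypothetical stand-ins, not the LLVM classes or the pass's real functions. It models source operands only and treats the pending VM_CNT registers as a flat list.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified stand-in for a MachineOperand that is a register use.
struct Operand {
  int Reg;         // VGPR index, e.g. 0 for v0
  bool IsVGPR;     // true for vector registers
  bool IsImplicit; // implicit operand, not part of the instruction description
  bool IsTied;     // tied to a def (e.g. "tied-def 0")
};

// Hypothetical, simplified stand-in for a MachineInstr.
struct Instr {
  bool MayLoadOrStore;
  std::vector<Operand> Ops;
};

// Mirrors the condition from the hunk: an implicit, untied VGPR operand on a
// memory instruction is assumed to exist only for liveness accounting.
bool skipForWaitcnt(const Instr &MI, const Operand &Op) {
  return Op.IsVGPR && Op.IsImplicit && MI.MayLoadOrStore && !Op.IsTied;
}

// Returns true if any counted VGPR source of MI is still pending in VM_CNT.
bool needsVmWait(const Instr &MI, const std::vector<int> &PendingVGPRs) {
  for (const Operand &Op : MI.Ops) {
    if (!Op.IsVGPR || skipForWaitcnt(MI, Op))
      continue;
    for (int Reg : PendingVGPRs)
      if (Reg == Op.Reg)
        return true;
  }
  return false;
}
```

With v0 and v4 pending, as in the log above, a reload whose implicit v0 source is tied makes `needsVmWait` return true. The same operand without the tie is skipped, so the function returns false.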

https://github.com/llvm/llvm-project/pull/109049