[llvm] [AMDGPU][InsertWaitCnts] Optimize loadcnt insertion at function boundaries (PR #169647)

Wed Nov 26 06:12:33 PST 2025

================
@@ -715,6 +715,22 @@ class WaitcntBrackets {
     PendingEvents |= Context->WaitEventMaskForInst[STORE_CNT];
   }
 
+  // Returns true if any VGPR has a pending load (score > lower bound for T).
+  // This is used to optimize waitcnt insertion at function boundaries when the
+  // only pending LOAD_CNT events are from instructions that don't write to
+  // VGPRs (e.g., GLOBAL_INV).
+  bool hasPendingVGPRWait(InstCounterType T) const {
+    unsigned LB = getScoreLB(T);
+    // If VgprUB is -1, no VGPRs have been touched
+    if (VgprUB < 0)
+      return false;
+    for (int RegNo = 0; RegNo <= VgprUB; ++RegNo) {
----------------
jayfoad wrote:

I suggest creating a new WaitEventType specifically for GLOBAL_INV and using it instead of VMEM_READ_ACCESS. Then you can have a fast check for PendingEvents here instead of a slow loop over all VGPRs.
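
A minimal sketch of this suggestion, written as a standalone toy model rather than the actual pass code. `GLOBAL_INV_ACCESS` and `ToyBrackets` are hypothetical names, and the toy assumes `VMEM_READ_ACCESS` is the only other event that counts against the load counter, which is a simplification of the real `WaitcntBrackets`. It shows how a dedicated event bit turns the query into a single mask test instead of a scan over every VGPR score.

```cpp
#include <cassert>

// Simplified stand-in for the pass's event enum. GLOBAL_INV_ACCESS is a
// hypothetical name for the new event type proposed in the comment above.
enum WaitEventType : unsigned {
  VMEM_READ_ACCESS,  // loads that write a VGPR
  GLOBAL_INV_ACCESS, // hypothetical: GLOBAL_INV, which writes no VGPR
  NUM_WAIT_EVENTS
};

// Toy model of the per-block state; the real class tracks much more.
class ToyBrackets {
  unsigned PendingEvents = 0; // one bit per WaitEventType

public:
  // Record that an event of type E is outstanding.
  void setPending(WaitEventType E) {
    assert(E < NUM_WAIT_EVENTS && "unknown event type");
    PendingEvents |= 1u << E;
  }

  // Constant-time query on the bitmask.
  bool hasPendingEvent(WaitEventType E) const {
    return (PendingEvents & (1u << E)) != 0;
  }

  // Replaces the loop over VGPR scores: a VGPR can only be waiting on a
  // load if a VGPR-writing event is pending (assumption of this toy model).
  bool hasPendingVGPRLoad() const {
    return hasPendingEvent(VMEM_READ_ACCESS);
  }
};
```

With this split, a block whose only outstanding load-counter event is a `GLOBAL_INV` reports no pending VGPR load without touching the per-register scores.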

https://github.com/llvm/llvm-project/pull/169647