[llvm] [AMDGPU] Skip VGPR deallocation for waveslot limited kernels (PR #112765)

Fri Oct 18 02:16:49 PDT 2024

================
@@ -2606,15 +2606,24 @@ bool SIInsertWaitcnts::runOnMachineFunction(MachineFunction &MF) {
 
   // Insert DEALLOC_VGPR messages before previously identified S_ENDPGM
   // instructions.
-  for (MachineInstr *MI : ReleaseVGPRInsts) {
-    if (ST->requiresNopBeforeDeallocVGPRs()) {
-      BuildMI(*MI->getParent(), MI, MI->getDebugLoc(), TII->get(AMDGPU::S_NOP))
-          .addImm(0);
+  // Skip deallocation if kernel is waveslot limited vs VGPR limited. A short
+  // waveslot limited kernel runs slower with the deallocation.
+  if (!ReleaseVGPRInsts.empty() &&
+      (MF.getFrameInfo().hasCalls() ||
+       AMDGPU::IsaInfo::getTotalNumVGPRs(ST) /
+               TRI->getNumUsedPhysRegs(*MRI, AMDGPU::VGPR_32RegClass) <
----------------
jayfoad wrote:

Can this use `getNumWavesPerEUWithNumVGPRs`? That might be slightly more correct since it accounts for details like the VGPR allocation granule.
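
To illustrate why the allocation granule can matter here, the following is a minimal standalone sketch, not LLVM's implementation. The register file size (`kTotalVGPRs`), the granule (`kGranule`), and the helper names `naiveWaves` and `granuleAwareWaves` are all hypothetical and chosen only for the example; real values come from the subtarget. It compares the plain division used in the patch with a division that first rounds the VGPR count up to the granule.

```cpp
// Hypothetical numbers for illustration only; not taken from any real subtarget.
constexpr unsigned kTotalVGPRs = 512; // assumed VGPRs available per SIMD
constexpr unsigned kGranule = 8;      // assumed VGPR allocation granule

// Round Value up to the next multiple of Align (Align must be non-zero).
constexpr unsigned alignUp(unsigned Value, unsigned Align) {
  return (Value + Align - 1) / Align * Align;
}

// Plain division, as in the patch: ignores the allocation granule.
// UsedVGPRs must be non-zero.
constexpr unsigned naiveWaves(unsigned UsedVGPRs) {
  return kTotalVGPRs / UsedVGPRs;
}

// Granule-aware variant: hardware allocates VGPRs in whole granules, so the
// per-wave footprint is the used count rounded up to a granule multiple.
constexpr unsigned granuleAwareWaves(unsigned UsedVGPRs) {
  return kTotalVGPRs / alignUp(UsedVGPRs, kGranule);
}

// With 50 VGPRs in use, the footprint is really 56, so one fewer wave fits.
static_assert(naiveWaves(50) == 10, "512 / 50 truncates to 10");
static_assert(granuleAwareWaves(50) == 9, "512 / 56 truncates to 9");

// When the count is already a granule multiple, both estimates agree.
static_assert(naiveWaves(64) == granuleAwareWaves(64), "64 is aligned");
```

Under these assumed numbers the two estimates differ by one wave for a kernel using 50 VGPRs, which could flip the result of a threshold comparison like the one in the hunk above.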

https://github.com/llvm/llvm-project/pull/112765