[llvm] [AMDGPU] Add scheduling DAG mutation for hazard latencies (PR #170075)

Sun Dec 7 22:48:28 PST 2025

perlfu wrote:

> > Specifically this helps with the case of V_CMP output feeding V_CNDMASK instructions.
> 
> Can you explain more and give an example? V_CMP feeding V_CNDMASK is a fast-forward case so it should be fine to schedule them adjacent.

The issue is that in (tight) loops `V_CNDMASK` taints the SGPRs used for the mask, so they require a `S_WAITCNT_DEPCTR`.
e.g.
```
MBB:
  ...
  $sgpr = V_CMP
  $vgpr = V_CNDMASK ..., $sgpr
  ...
  S_CBRANCH %MBB
```

So the fast-forward case you mention ends up requiring a VALU pipeline wait/stall.
This is particularly painful if there are multiple `V_CMP` to `V_CNDMASK` in the loop body.
With this mutation the scheduler is biased to perform SGPR writes (from VALUs), schedule other instructions, then schedule SGPR reads.
Typically this minimizes the impact to a single `S_WAITCNT_DEPCTR` per iteration, with some latency hiding.
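
To make the biasing concrete, here is a minimal standalone sketch of the idea, not the code in this PR. `ToyNode`, `ToyEdge`, `applyHazardLatency` and the latency value passed in are all hypothetical stand-ins; the real mutation operates on the scheduler's own DAG types. It only shows the principle: raise the latency on edges where a VALU writes an SGPR that another VALU reads, so the scheduler prefers to place other work between them.

```cpp
#include <string>
#include <vector>

// Toy stand-ins for scheduler structures (hypothetical, not LLVM types).
struct ToyNode {
  std::string Name;
  bool IsVALU;      // executes on the vector ALU
  bool WritesSGPR;  // defines a scalar register, e.g. V_CMP
};

struct ToyEdge {
  int Pred;          // index of the producer node
  int Succ;          // index of the consumer node
  bool IsSGPRData;   // data dependence carried through an SGPR
  unsigned Latency;  // cycles the scheduler assumes between Pred and Succ
};

// Raise the latency of every VALU-SGPR-write -> VALU-read data edge to at
// least HazardLatency (an assumed, illustrative value chosen by the caller).
// Returns the number of edges that were changed.
unsigned applyHazardLatency(const std::vector<ToyNode> &Nodes,
                            std::vector<ToyEdge> &Edges,
                            unsigned HazardLatency) {
  unsigned Changed = 0;
  for (ToyEdge &E : Edges) {
    const ToyNode &P = Nodes[E.Pred];
    const ToyNode &S = Nodes[E.Succ];
    // Only VALU writes of SGPRs consumed by another VALU are of interest.
    if (!E.IsSGPRData || !P.IsVALU || !P.WritesSGPR || !S.IsVALU)
      continue;
    if (E.Latency < HazardLatency) {
      E.Latency = HazardLatency;
      ++Changed;
    }
  }
  return Changed;
}
```

In this toy model a `V_CMP` to `V_CNDMASK` edge gets the larger latency while an ordinary VGPR dependence is left alone, which is what nudges a latency-aware scheduler to group the SGPR writes, fill in other instructions, and then issue the SGPR reads.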

I could restrict this mutation to loops, although I am not sure whether that analysis is available within the scheduler, as it works per-MBB.

https://github.com/llvm/llvm-project/pull/170075