[llvm] [AMDGPU] Schedule independent instructions between s_barrier_signal and s_barrier_wait (PR #172057)

Fri Dec 12 11:52:14 PST 2025

PrasoonMishra wrote:

Tuning the synthetic latency value:
I initially started with `latency=2000`, but this was too aggressive. For example, in [CodeGen/AMDGPU/ds_read2-gfx1250.ll](https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/AMDGPU/ds_read2-gfx1250.ll), it moved a load-dependent `v_add_f32` into the signal-wait gap, which forced an extra `s_wait_dscnt` and broke the `v_dual_add_f32` dual-issue optimization.

```
; Original:
ds_load_2addr_b32 v[0:1], v4 offset1:8
s_barrier_signal -1
s_barrier_wait -1
ds_load_2addr_b32 v[2:3], v4 offset0:11 offset1:27
s_wait_dscnt 0x0
v_dual_add_f32 v0, v0, v1 :: v_dual_add_f32 v1, v2, v3 ; Single dual-issue instruction
v_add_f32_e32 v0, v0, v1

; With latency=2000:
ds_load_2addr_b32 v[0:1], v4 offset1:8
s_barrier_signal -1
s_wait_dscnt 0x0               ; extra s_wait_dscnt
v_add_f32_e32 v0, v0, v1
s_barrier_wait -1
ds_load_2addr_b32 v[2:3], v4 offset0:11 offset1:27
s_wait_dscnt 0x0
v_add_f32_e32 v1, v2, v3       ; Lost dual-issue
```
Through binary search, I found that 16 is the sweet spot: high enough to schedule independent address calculations, but low enough to avoid pulling in memory-dependent instructions that would force extra waits and break dual-issue optimizations.
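
For reference, the search itself can be sketched roughly as below. This is only an illustration, not the script used for this PR: `largest_safe_latency` and `toy_regresses` are hypothetical names, and the sketch assumes regressions are monotonic in the latency value (if one latency regresses, every larger one does too). In practice the predicate would rebuild, rerun the affected tests, and check for extra `s_wait_dscnt` or lost dual-issue; here it is a toy stand-in with the threshold hard-coded purely to exercise the search.

```python
from typing import Callable


def largest_safe_latency(lo: int, hi: int, regresses: Callable[[int], bool]) -> int:
    """Return the largest latency in [lo, hi] for which regresses() is False.

    Assumes monotonicity: once a latency regresses, all larger ones do too.
    """
    if regresses(lo):
        raise ValueError("lower bound already regresses")
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if regresses(mid):
            hi = mid - 1  # mid is too aggressive, search below it
        else:
            lo = mid  # mid is safe, try something larger
    return lo


def toy_regresses(latency: int) -> bool:
    # Hypothetical stand-in for "run the tests and diff the codegen".
    # The threshold is hard-coded only so the example is runnable.
    return latency > 16


print(largest_safe_latency(1, 2000, toy_regresses))  # prints 16
```

Rounding `mid` up keeps the loop from stalling when `hi == lo + 1`, since the safe side of the range is the lower one.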

https://github.com/llvm/llvm-project/pull/172057