[llvm] [AMDGPU] Remove s_delay_alu for VALU->SGPR->SALU (PR #127212)

Nicolai Hähnle via llvm-commits llvm-commits at lists.llvm.org
Fri Feb 14 08:10:07 PST 2025


https://github.com/nhaehnle commented:

Thanks, this is a good first cut! However, I suspect that we can do better.

Merely skipping this merge doesn't exploit all the information we have. Do we have a test like this (probably best as a .mir test):
```
v_mul_f32 v1, v1, v1
v_cmp  s0, ...
s_or_b32 s0, s0, s1
v_mul_f32 v1, v1, v1   ; no delay alu needed
```
In this case, no delay ALU is needed because the automatic wait for the v_cmp implies that the first v_mul is also done.

On the other hand:
```
v_cmp  s0, ...
v_mul_f32 v1, v1, v1
s_or_b32 s0, s0, s1
v_mul_f32 v1, v1, v1   ; delay alu needed here
```
Here, the automatic wait only waits for the SGPR write and not for the first v_mul, so we still want a delay_alu.

On the third hand:
```
v_cmp  s0, ...
v_mul_f32 v1, v1, v1
v_cmp  s2, ...
s_or_b32 s0, s0, s1
v_mul_f32 v1, v1, v1   ; delay alu NOT needed here
```
In this case, even though the s_or_b32 only depends on the first v_cmp, it waits for *all* outstanding SGPR writes to complete, including the one to s2. That wait implies the first v_mul has also completed, so we don't need a delay_alu here.

So there are a bunch of additional cases to consider. I suggest you look into writing .mir test cases for them that only run the insert delay ALU pass (and check if perhaps similar tests already exist), and take another look at the DelayState and DelayInfo data structures to see how we can handle these cases best.
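
To make the three cases above concrete, here is a toy Python model of the reasoning. It is only a sketch and not how the insert delay ALU pass or its DelayState/DelayInfo structures work. It assumes that VALU instructions complete in issue order, and that an SALU reading an SGPR with a pending VALU write waits for every pending VALU SGPR write. The names `Inst`, `valu`, `salu` and `needs_delay` are hypothetical, the elided v_cmp operands are modelled as empty, and cycle counts are ignored entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Inst:
    kind: str                       # "valu" or "salu"
    defs: frozenset = frozenset()   # registers written, e.g. {"v1"} or {"s0"}
    uses: frozenset = frozenset()   # registers read


def valu(defs, uses=()):
    return Inst("valu", frozenset(defs), frozenset(uses))


def salu(defs, uses=()):
    return Inst("salu", frozenset(defs), frozenset(uses))


def needs_delay(insts):
    """For each instruction, report whether it is a VALU that reads a
    register whose VALU producer may still be in flight (toy model)."""
    pending = []   # in-flight VALU instructions, oldest first
    result = []
    for inst in insts:
        if inst.kind == "salu":
            # Positions of pending VALUs that write an SGPR.
            writers = [i for i, p in enumerate(pending)
                       if any(r.startswith("s") for r in p.defs)]
            # Assumed automatic wait: triggered when the SALU reads an SGPR
            # with a pending VALU write; it covers *all* pending SGPR writes.
            if any(inst.uses & pending[i].defs for i in writers):
                # In-order completion: everything up to the youngest
                # SGPR writer is done as well.
                pending = pending[writers[-1] + 1:]
            result.append(False)
        else:
            hits = [i for i, p in enumerate(pending) if inst.uses & p.defs]
            result.append(bool(hits))
            if hits:
                # A delay here would retire the producer and older VALUs.
                pending = pending[hits[-1] + 1:]
            pending.append(inst)
    return result


v_mul = valu({"v1"}, {"v1"})
s_or = salu({"s0"}, {"s0", "s1"})

case1 = [v_mul, valu({"s0"}), s_or, v_mul]
case2 = [valu({"s0"}), v_mul, s_or, v_mul]
case3 = [valu({"s0"}), v_mul, valu({"s2"}), s_or, v_mul]

for name, case in [("case1", case1), ("case2", case2), ("case3", case3)]:
    print(name, "delay before last v_mul:", needs_delay(case)[-1])
```

Under these assumptions only the second sequence reports a delay before the final v_mul. The real pass would also have to account for latencies and instruction types, which this sketch leaves out.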

https://github.com/llvm/llvm-project/pull/127212