[llvm] [AMDGCN][SIWholeQuadMode] Handle case when SI_KILL_I1_TERMINATOR -1,0 is not the only terminator (PR #122922)
Juan Manuel Martinez Caamaño via llvm-commits
llvm-commits at lists.llvm.org
Wed Jan 29 05:49:54 PST 2025
================
@@ -3361,7 +3361,7 @@ define amdgpu_ps void @test_for_deactivating_lanes_in_wave32(ptr addrspace(6) in
; GFX9-W64-NEXT: s_buffer_load_dword s0, s[0:3], 0x0
; GFX9-W64-NEXT: s_waitcnt lgkmcnt(0)
; GFX9-W64-NEXT: v_cmp_le_f32_e64 vcc, s0, 0
-; GFX9-W64-NEXT: s_andn2_b64 s[4:5], exec, vcc
----------------
jmmartinez wrote:
The behavior is the same in this case. The branch depends on scc and not on execz. Both s_andn2_b64 instructions change scc in the same way, regardless of the destination register. Neither destination register, s[4:5] in one case and exec in the other, is used again in the rest of the function (otherwise there would be a change in behavior).
"SI optimize exec mask pre-RA" has a threshold below which it searches for exec copy instructions to optimize. Once we remove the unconditional branch, we stay under the threshold and the copy gets optimized.
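To illustrate the scc argument, here is a toy C++ model of `s_andn2_b64` (result = src0 & ~src1, scc set when the result is non-zero). The names `AndN2Result`, `sAndN2B64` and `sccWhenWritingTo` are made up for this sketch and are not LLVM APIs; it only shows that scc is a function of the two inputs, not of the register that receives the result.

```cpp
#include <cassert>
#include <cstdint>

// Toy model (hypothetical names, not LLVM code) of S_ANDN2_B64:
// value = src0 & ~src1, SCC = (value != 0).
struct AndN2Result {
  std::uint64_t value; // what would be written to the destination register
  bool scc;            // scalar condition code produced as a side effect
};

inline AndN2Result sAndN2B64(std::uint64_t src0, std::uint64_t src1) {
  std::uint64_t value = src0 & ~src1;
  return {value, value != 0};
}

// The destination only decides where `value` lands; the returned SCC
// depends solely on the two source operands.
inline bool sccWhenWritingTo(std::uint64_t &dst, std::uint64_t src0,
                             std::uint64_t src1) {
  AndN2Result r = sAndN2B64(src0, src1);
  dst = r.value;
  return r.scc;
}
```

Writing the result to a temporary (standing in for s[4:5]) or back to the variable holding exec yields the same scc as long as the inputs are the same.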
Here are the details of the transformation that is going on:
The code for `@test_for_deactivating_lanes_in_wave32` remains pretty much the same until after "SI optimize exec mask pre-RA". Before this optimization, the generated code was (for each case):
```
%7:sreg_64 = COPY $exec
...
dead %7:sreg_64 = S_ANDN2_B64 %7:sreg_64, $vcc, implicit-def $scc
SI_EARLY_TERMINATE_SCC0 implicit $exec, implicit $scc
$exec = S_ANDN2_B64 $exec, $vcc, implicit-def $scc
S_BRANCH %bb.1
```
```
%7:sreg_64 = COPY $exec
...
dead %7:sreg_64 = S_ANDN2_B64 %7:sreg_64, $vcc, implicit-def $scc
SI_EARLY_TERMINATE_SCC0 implicit $exec, implicit $scc
$exec = S_ANDN2_B64_term $exec, $vcc, implicit-def $scc
```
During "SI optimize exec mask pre-RA" the following changes happen:
* In both cases, the last `S_ANDN2_B64/_term` instruction gets detected as redundant (since its result is not used) and is removed.
* In both cases, the pass looks for redundant copies of `$exec`, but only within a threshold of 10 instructions. This is where we go above the threshold in one case and not in the other.
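The effect of such a threshold can be sketched with a bounded backward scan. This is a simplified illustration, not the pass's actual code: the `Inst` struct and `findExecCopyBefore` are hypothetical, and how the real pass counts instructions toward its limit is an assumption here.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, heavily simplified instruction record (not an LLVM type).
struct Inst {
  std::string opcode;
  std::string dst;
  std::string src;
};

// Scan backwards from `from` (exclusive), examining at most `threshold`
// instructions, and return the index of the first COPY of $exec found.
// Requires from <= block.size().
inline std::optional<std::size_t>
findExecCopyBefore(const std::vector<Inst> &block, std::size_t from,
                   std::size_t threshold) {
  std::size_t examined = 0;
  for (std::size_t i = from; i > 0 && examined < threshold; ++examined) {
    --i;
    const Inst &inst = block[i];
    if (inst.opcode == "COPY" && inst.src == "$exec")
      return i;
  }
  return std::nullopt; // gave up: copy is absent or beyond the threshold
}
```

In this model, a single extra instruction at the end of the block (such as an unconditional branch) can push the copy just out of reach of the scan, so the same copy is found in one variant and missed in the other.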
After "SI optimize exec mask pre-RA" the code becomes (again, for each case):
```
%7:sreg_64 = COPY $exec
...
dead %7:sreg_64 = S_ANDN2_B64 %7:sreg_64, $vcc, implicit-def $scc
SI_EARLY_TERMINATE_SCC0 implicit $exec, implicit $scc
```
```
dead $exec = S_ANDN2_B64 $exec, $vcc, implicit-def $scc
SI_EARLY_TERMINATE_SCC0 implicit $exec, implicit $scc
```
https://github.com/llvm/llvm-project/pull/122922