[llvm] [X86] Lower `minimum`/`maximum`/`minimumnum`/`maximumnum` using bitwise operations (PR #170069)
via llvm-commits
llvm-commits at lists.llvm.org
Mon Dec 1 00:40:58 PST 2025
================
@@ -208,12 +198,12 @@ define <32 x half> @test_fminimum_v32f16_szero(<32 x half> %x, <32 x half> %y) "
define <32 x half> @test_fmaximum_v32f16_nans_szero(<32 x half> %x, <32 x half> %y) {
; CHECK-LABEL: test_fmaximum_v32f16_nans_szero:
; CHECK: # %bb.0:
-; CHECK-NEXT: vpmovw2m %zmm0, %k1
-; CHECK-NEXT: vpblendmw %zmm1, %zmm0, %zmm2 {%k1}
+; CHECK-NEXT: vmaxph %zmm1, %zmm0, %zmm2
+; CHECK-NEXT: vpbroadcastw {{.*#+}} zmm1 = [NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]
+; CHECK-NEXT: vpternlogq {{.*#+}} zmm1 = zmm2 & (zmm1 | zmm0)
----------------
valadaptive wrote:
llvm-mca seems to think the new code is faster despite the extra memory load, at least in a tight loop. You can try it yourself with:
```asm
# LLVM-MCA-BEGIN old
vpmovw2m %zmm0, %k1
vpblendmw %zmm0, %zmm1, %zmm2 {%k1}
vmovdqu16 %zmm1, %zmm0 {%k1}
vminph %zmm2, %zmm0, %zmm0
# LLVM-MCA-END
# LLVM-MCA-BEGIN new
vminph %zmm1, %zmm0, %zmm1
vpbroadcastw (%rdi), %zmm2
vpternlogq $248, %zmm2, %zmm0, %zmm1
# LLVM-MCA-END
```
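For anyone decoding the `$248` immediate in the new sequence: `vpternlogq` computes an arbitrary three-input bitwise function, where each result bit is looked up in the 8-bit immediate using the three source bits as the table index. A small Python sketch of that lookup (the helper name `vpternlog` is mine, not from LLVM) shows that `0xF8` (248) encodes `a | (b & c)`:

```python
def vpternlog(imm8, a, b, c, width=64):
    """Model of vpternlogq: per bit position i, build the 3-bit index
    (a_i << 2) | (b_i << 1) | c_i and take that bit of imm8."""
    out = 0
    for i in range(width):
        idx = (((a >> i) & 1) << 2) | (((b >> i) & 1) << 1) | ((c >> i) & 1)
        out |= ((imm8 >> idx) & 1) << i
    return out

# imm8 = 0xF8 sets table entries 3..7, i.e. result = a | (b & c)
a, b, c = 0x00FF00FF00FF00FF, 0x0F0F0F0F0F0F0F0F, 0x3333333333333333
assert vpternlog(0xF8, a, b, c) == a | (b & c)
```

So the three-instruction sequence folds the sign-select blend into a single ternary-logic op instead of a mask-register round trip.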
Here's what I get with `-mcpu=sapphirerapids`:
```
[0] Code Region - old
Iterations: 100
Instructions: 400
Total Cycles: 1103
Total uOps: 400
Dispatch Width: 6
uOps Per Cycle: 0.36
IPC: 0.36
Block RThroughput: 2.0
[snip]
[1] Code Region - new
Iterations: 100
Instructions: 300
Total Cycles: 607
Total uOps: 400
Dispatch Width: 6
uOps Per Cycle: 0.66
IPC: 0.49
Block RThroughput: 1.0
```
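The headline metrics follow directly from the counts above (IPC = Instructions / Total Cycles, uOps Per Cycle = Total uOps / Total Cycles); a quick sanity check:

```python
# Reproduce llvm-mca's derived metrics from the raw counts reported above.
old = {"insts": 400, "cycles": 1103, "uops": 400}
new = {"insts": 300, "cycles": 607, "uops": 400}

for name, r in (("old", old), ("new", new)):
    ipc = round(r["insts"] / r["cycles"], 2)
    upc = round(r["uops"] / r["cycles"], 2)
    print(f"{name}: IPC={ipc} uOps/cycle={upc}")
# old: IPC=0.36 uOps/cycle=0.36
# new: IPC=0.49 uOps/cycle=0.66
```

The new sequence retires the same number of uOps in roughly 55% of the cycles, consistent with the halved block reciprocal throughput.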
https://github.com/llvm/llvm-project/pull/170069