[llvm-bugs] [Bug 50358] New: _mm512_mask_add_epi64 often fails to produce masked vpaddq
via llvm-bugs
llvm-bugs at lists.llvm.org
Sat May 15 23:28:27 PDT 2021
https://bugs.llvm.org/show_bug.cgi?id=50358
Bug ID: 50358
Summary: _mm512_mask_add_epi64 often fails to produce masked
vpaddq
Product: libraries
Version: trunk
Hardware: PC
OS: Linux
Status: NEW
Severity: enhancement
Priority: P
Component: Backend: X86
Assignee: unassignedbugs at nondot.org
Reporter: elrodc at gmail.com
CC: craig.topper at gmail.com, llvm-bugs at lists.llvm.org,
llvm-dev at redking.me.uk, pengfei.wang at intel.com,
spatel+llvm at rotateright.com
Here are a couple examples:
https://godbolt.org/z/PM4Y3Wr14
I encountered this while trying to optimize a mean function that skips `NaN`s.
Skipping some elements means I also have to count the total number of non-`NaN`s.
On my Cascadelake desktop, using doubles to count the non-`NaN`s took 62 ns, with
inner-loop assembly like:
L128:
vmovupd zmm10, zmmword ptr [rdx + 8*rax]
vmovupd zmm11, zmmword ptr [rdx + 8*rax + 64]
vmovupd zmm12, zmmword ptr [rdx + 8*rax + 128]
vmovupd zmm13, zmmword ptr [rdx + 8*rax + 192]
vcmpordpd k3, zmm10, zmm0
vcmpordpd k4, zmm11, zmm0
vcmpordpd k2, zmm12, zmm0
vcmpordpd k1, zmm13, zmm0
vaddpd zmm9 {k3}, zmm9, zmm1
vaddpd zmm8 {k4}, zmm8, zmm1
vaddpd zmm7 {k2}, zmm7, zmm1
vaddpd zmm6 {k1}, zmm6, zmm1
vaddpd zmm5 {k3}, zmm5, zmm10
vaddpd zmm4 {k4}, zmm4, zmm11
vaddpd zmm2 {k2}, zmm2, zmm12
vaddpd zmm3 {k1}, zmm3, zmm13
add rax, 32
cmp rsi, rax
jne L128
Benchmarking 1024 elements took about 53 ns on average.
However, there is a problem: while Cascadelake (my desktop) can execute
`vaddpd` on both ports 0 and 5, many AVX-512 systems, such as icelake-client,
tigerlake, and presumably rocketlake, can only do so on port 0. They can,
however, execute `vpaddq` on port 5:
https://uops.info/html-instr/VPADDQ_ZMM_K_ZMM_ZMM.html
https://uops.info/html-instr/VADDPD_ZMM_K_ZMM_ZMM.html
That means they should, in theory, be able to reach the same or nearly the
same throughput if I switch to counting the non-`NaN`s with integers.
But if I simply swap the floating-point count for an integer one, I get a
performance regression to an average of about 65 ns, because LLVM generates
code like in the second Godbolt example (`vpmovm2q` to convert the mask into
-1/0s, then `vpsubq` to subtract):
L112:
vmovupd zmm9, zmmword ptr [rax + 8*rdi]
vmovupd zmm10, zmmword ptr [rax + 8*rdi + 64]
vmovupd zmm11, zmmword ptr [rax + 8*rdi + 128]
vmovupd zmm12, zmmword ptr [rax + 8*rdi + 192]
vcmpordpd k3, zmm9, zmm0
vcmpordpd k4, zmm10, zmm0
vcmpordpd k2, zmm11, zmm0
vcmpordpd k1, zmm12, zmm0
vpmovm2q zmm13, k3
vpsubq zmm8, zmm8, zmm13
vpmovm2q zmm13, k4
vpsubq zmm7, zmm7, zmm13
vpmovm2q zmm13, k2
vpsubq zmm6, zmm6, zmm13
vpmovm2q zmm13, k1
vpsubq zmm2, zmm2, zmm13
vaddpd zmm5 {k3}, zmm5, zmm9
vaddpd zmm4 {k4}, zmm4, zmm10
vaddpd zmm1 {k2}, zmm1, zmm11
vaddpd zmm3 {k1}, zmm3, zmm12
add rdi, 32
cmp rdx, rdi
jne L112
instead of using `vpaddq`.
The same basic result occurs when using the intrinsic `@llvm.vp.add.v8i64`:
https://godbolt.org/z/sKraG1vcx