<html>
<head>
<base href="https://bugs.llvm.org/">
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW - _mm512_mask_add_epi64 often fails to produce masked vpaddq"
href="https://bugs.llvm.org/show_bug.cgi?id=50358">50358</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>_mm512_mask_add_epi64 often fails to produce masked vpaddq
</td>
</tr>
<tr>
<th>Product</th>
<td>libraries
</td>
</tr>
<tr>
<th>Version</th>
<td>trunk
</td>
</tr>
<tr>
<th>Hardware</th>
<td>PC
</td>
</tr>
<tr>
<th>OS</th>
<td>Linux
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>enhancement
</td>
</tr>
<tr>
<th>Priority</th>
<td>P
</td>
</tr>
<tr>
<th>Component</th>
<td>Backend: X86
</td>
</tr>
<tr>
<th>Assignee</th>
<td>unassignedbugs@nondot.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>elrodc@gmail.com
</td>
</tr>
<tr>
<th>CC</th>
<td>craig.topper@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, pengfei.wang@intel.com, spatel+llvm@rotateright.com
</td>
</tr></table>
<p>
<div>
<pre>Here are a couple examples:
<a href="https://godbolt.org/z/PM4Y3Wr14">https://godbolt.org/z/PM4Y3Wr14</a>
I encountered this when trying to optimize a mean function that skips `NaN`s.
Skipping some elements means I have to count the total number of non-`NaN`s.
On my Cascadelake desktop, using doubles to count the non-`NaN`s took 62 ns, with
inner-loop assembly like:
L128:
vmovupd zmm10, zmmword ptr [rdx + 8*rax]
vmovupd zmm11, zmmword ptr [rdx + 8*rax + 64]
vmovupd zmm12, zmmword ptr [rdx + 8*rax + 128]
vmovupd zmm13, zmmword ptr [rdx + 8*rax + 192]
vcmpordpd k3, zmm10, zmm0
vcmpordpd k4, zmm11, zmm0
vcmpordpd k2, zmm12, zmm0
vcmpordpd k1, zmm13, zmm0
vaddpd zmm9 {k3}, zmm9, zmm1
vaddpd zmm8 {k4}, zmm8, zmm1
vaddpd zmm7 {k2}, zmm7, zmm1
vaddpd zmm6 {k1}, zmm6, zmm1
vaddpd zmm5 {k3}, zmm5, zmm10
vaddpd zmm4 {k4}, zmm4, zmm11
vaddpd zmm2 {k2}, zmm2, zmm12
vaddpd zmm3 {k1}, zmm3, zmm13
add rax, 32
cmp rsi, rax
jne L128
Benchmarking 1024 elements took about 53 ns on average.
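For reference, the scalar logic being vectorized above is just a NaN-skipping mean. This is a sketch of my intent (the function and variable names are mine, not taken from the Godbolt link); the vectorized loop keeps eight partial sums and eight partial counts per zmm register, with `vcmpordpd` producing the mask:

```c
#include <math.h>
#include <stddef.h>

/* NaN-skipping mean: sum the non-NaN elements and count them.
   In the first example, the count is kept as a double so the
   increment can be a masked vaddpd. */
double nan_skipping_mean(const double *x, size_t n) {
    double sum = 0.0;
    double count = 0.0; /* counting with doubles */
    for (size_t i = 0; i < n; ++i) {
        if (!isnan(x[i])) { /* vcmpordpd: ordered compare, false only for NaN */
            sum += x[i];    /* masked vaddpd on the sum accumulators */
            count += 1.0;   /* masked vaddpd on the count accumulators */
        }
    }
    return sum / count;
}
```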
However, there is a problem: while Cascadelake (my desktop) can execute
`vaddpd` on both ports 0 and 5, many AVX512 systems, such as icelake-client,
tigerlake, and presumably rocketlake, can only do so on port 0. They can,
however, execute `vpaddq` on port 5.
<a href="https://uops.info/html-instr/VPADDQ_ZMM_K_ZMM_ZMM.html">https://uops.info/html-instr/VPADDQ_ZMM_K_ZMM_ZMM.html</a>
<a href="https://uops.info/html-instr/VADDPD_ZMM_K_ZMM_ZMM.html">https://uops.info/html-instr/VADDPD_ZMM_K_ZMM_ZMM.html</a>
This means they should theoretically be able to achieve the same, or nearly the
same, throughput if I switch to counting the number of non-`NaN`s with integers.
But if I simply swap floating point for integers, I get a performance
regression to an average of about 65 ns, because LLVM generates code like in the
second Godbolt example (`vpmovm2q` to convert the mask into -1/0 lanes, then
`vpsubq` to subtract):
L112:
vmovupd zmm9, zmmword ptr [rax + 8*rdi]
vmovupd zmm10, zmmword ptr [rax + 8*rdi + 64]
vmovupd zmm11, zmmword ptr [rax + 8*rdi + 128]
vmovupd zmm12, zmmword ptr [rax + 8*rdi + 192]
vcmpordpd k3, zmm9, zmm0
vcmpordpd k4, zmm10, zmm0
vcmpordpd k2, zmm11, zmm0
vcmpordpd k1, zmm12, zmm0
vpmovm2q zmm13, k3
vpsubq zmm8, zmm8, zmm13
vpmovm2q zmm13, k4
vpsubq zmm7, zmm7, zmm13
vpmovm2q zmm13, k2
vpsubq zmm6, zmm6, zmm13
vpmovm2q zmm13, k1
vpsubq zmm2, zmm2, zmm13
vaddpd zmm5 {k3}, zmm5, zmm9
vaddpd zmm4 {k4}, zmm4, zmm10
vaddpd zmm1 {k2}, zmm1, zmm11
vaddpd zmm3 {k1}, zmm3, zmm12
add rdi, 32
cmp rdx, rdi
jne L112
instead of using `vpaddq`.
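The intent, in intrinsics terms, is roughly `counts = _mm512_mask_add_epi64(counts, k, counts, ones);`, which should lower to a single masked `vpaddq`. As an illustration of the semantics (a portable scalar emulation, not the actual AVX-512 code; the helper name is mine), each lane adds only where its mask bit is set:

```c
#include <stdint.h>

/* Scalar emulation of _mm512_mask_add_epi64(src, k, a, b):
   lane i gets a[i] + b[i] when bit i of k is set, else src[i].
   A masked vpaddq computes exactly this in one instruction. */
void mask_add_epi64_emu(uint64_t dst[8], const uint64_t src[8],
                        uint8_t k, const uint64_t a[8], const uint64_t b[8]) {
    for (int i = 0; i < 8; ++i)
        dst[i] = ((k >> i) & 1) ? a[i] + b[i] : src[i];
}
```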
The same basic result occurs when using the intrinsic `@llvm.vp.add.v8i64`:
<a href="https://godbolt.org/z/sKraG1vcx">https://godbolt.org/z/sKraG1vcx</a></pre>
</div>
</p>
<hr>
<span>You are receiving this mail because:</span>
<ul>
<li>You are on the CC list for the bug.</li>
</ul>
</body>
</html>