<html>
<head>
<base href="https://bugs.llvm.org/">
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW - _mm512_mask_add_epi64 often fails to produce masked vpaddq"
href="https://bugs.llvm.org/show_bug.cgi?id=50358">50358</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>_mm512_mask_add_epi64 often fails to produce masked vpaddq
</td>
</tr>
<tr>
<th>Product</th>
<td>libraries
</td>
</tr>
<tr>
<th>Version</th>
<td>trunk
</td>
</tr>
<tr>
<th>Hardware</th>
<td>PC
</td>
</tr>
<tr>
<th>OS</th>
<td>Linux
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>enhancement
</td>
</tr>
<tr>
<th>Priority</th>
<td>P
</td>
</tr>
<tr>
<th>Component</th>
<td>Backend: X86
</td>
</tr>
<tr>
<th>Assignee</th>
<td>unassignedbugs@nondot.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>elrodc@gmail.com
</td>
</tr>
<tr>
<th>CC</th>
<td>craig.topper@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, pengfei.wang@intel.com, spatel+llvm@rotateright.com
</td>
</tr></table>
<p>
<div>
<pre>Here are a couple examples:
<a href="https://godbolt.org/z/PM4Y3Wr14">https://godbolt.org/z/PM4Y3Wr14</a>
I encountered this when trying to optimize a mean function that skips `NaN`s.
Skipping some elements means I have to count the total number of non-`NaN`s.
On my Cascadelake desktop, using doubles to count the non-`NaN`s took 62 ns, with
inner-loop assembly like:
L128:
vmovupd zmm10, zmmword ptr [rdx + 8*rax]
vmovupd zmm11, zmmword ptr [rdx + 8*rax + 64]
vmovupd zmm12, zmmword ptr [rdx + 8*rax + 128]
vmovupd zmm13, zmmword ptr [rdx + 8*rax + 192]
vcmpordpd k3, zmm10, zmm0
vcmpordpd k4, zmm11, zmm0
vcmpordpd k2, zmm12, zmm0
vcmpordpd k1, zmm13, zmm0
vaddpd zmm9 {k3}, zmm9, zmm1
vaddpd zmm8 {k4}, zmm8, zmm1
vaddpd zmm7 {k2}, zmm7, zmm1
vaddpd zmm6 {k1}, zmm6, zmm1
vaddpd zmm5 {k3}, zmm5, zmm10
vaddpd zmm4 {k4}, zmm4, zmm11
vaddpd zmm2 {k2}, zmm2, zmm12
vaddpd zmm3 {k1}, zmm3, zmm13
add rax, 32
cmp rsi, rax
jne L128
Benchmarking 1024 elements took about 53 ns on average.
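For reference, the scalar logic being vectorized above is just a NaN-skipping mean. This is a sketch of my intent (the function and variable names are mine, not taken from the Godbolt link); the vectorized loop keeps eight partial sums and eight partial counts per zmm register, with `vcmpordpd` producing the mask:

```c
#include <math.h>
#include <stddef.h>

/* NaN-skipping mean: sum the non-NaN elements and count them.
   In the first example, the count is kept as a double so the
   increment can be a masked vaddpd. */
double nan_skipping_mean(const double *x, size_t n) {
    double sum = 0.0;
    double count = 0.0; /* counting with doubles */
    for (size_t i = 0; i < n; ++i) {
        if (!isnan(x[i])) { /* vcmpordpd: ordered compare, false only for NaN */
            sum += x[i];    /* masked vaddpd on the sum accumulators */
            count += 1.0;   /* masked vaddpd on the count accumulators */
        }
    }
    return sum / count;
}
```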
However, there is a problem: while Cascadelake (my desktop) can execute
`vaddpd` on both ports 0 and 5, many AVX512 systems, such as icelake-client,
tigerlake, and presumably rocketlake, can only do so on port 0. They can,
however, execute `vpaddq` on port 5.
<a href="https://uops.info/html-instr/VPADDQ_ZMM_K_ZMM_ZMM.html">https://uops.info/html-instr/VPADDQ_ZMM_K_ZMM_ZMM.html</a>
<a href="https://uops.info/html-instr/VADDPD_ZMM_K_ZMM_ZMM.html">https://uops.info/html-instr/VADDPD_ZMM_K_ZMM_ZMM.html</a>
This means they should theoretically be able to achieve the same, or nearly the
same, throughput if I switch to counting the number of non-`NaN`s with integers.
But if I simply swap floating point for integers, I get a performance
regression to an average of about 65 ns, because LLVM generates code like in the
second Godbolt example (`vpmovm2q` to convert the mask into -1/0 lanes, then
`vpsubq` to subtract):
L112:
vmovupd zmm9, zmmword ptr [rax + 8*rdi]
vmovupd zmm10, zmmword ptr [rax + 8*rdi + 64]
vmovupd zmm11, zmmword ptr [rax + 8*rdi + 128]
vmovupd zmm12, zmmword ptr [rax + 8*rdi + 192]
vcmpordpd k3, zmm9, zmm0
vcmpordpd k4, zmm10, zmm0
vcmpordpd k2, zmm11, zmm0
vcmpordpd k1, zmm12, zmm0
vpmovm2q zmm13, k3
vpsubq zmm8, zmm8, zmm13
vpmovm2q zmm13, k4
vpsubq zmm7, zmm7, zmm13
vpmovm2q zmm13, k2
vpsubq zmm6, zmm6, zmm13
vpmovm2q zmm13, k1
vpsubq zmm2, zmm2, zmm13
vaddpd zmm5 {k3}, zmm5, zmm9
vaddpd zmm4 {k4}, zmm4, zmm10
vaddpd zmm1 {k2}, zmm1, zmm11
vaddpd zmm3 {k1}, zmm3, zmm12
add rdi, 32
cmp rdx, rdi
jne L112
instead of using `vpaddq`.
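The intent, in intrinsics terms, is roughly `counts = _mm512_mask_add_epi64(counts, k, counts, ones);`, which should lower to a single masked `vpaddq`. As an illustration of the semantics (a portable scalar emulation, not the actual AVX-512 code; the helper name is mine), each lane adds only where its mask bit is set:

```c
#include <stdint.h>

/* Scalar emulation of _mm512_mask_add_epi64(src, k, a, b):
   lane i gets a[i] + b[i] when bit i of k is set, else src[i].
   A masked vpaddq computes exactly this in one instruction. */
void mask_add_epi64_emu(uint64_t dst[8], const uint64_t src[8],
                        uint8_t k, const uint64_t a[8], const uint64_t b[8]) {
    for (int i = 0; i < 8; ++i)
        dst[i] = ((k >> i) & 1) ? a[i] + b[i] : src[i];
}
```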
The same basic result occurs when using the intrinsic `@llvm.vp.add.v8i64`:
<a href="https://godbolt.org/z/sKraG1vcx">https://godbolt.org/z/sKraG1vcx</a></pre>
</div>
</p>
<hr>
<span>You are receiving this mail because:</span>
<ul>
<li>You are on the CC list for the bug.</li>
</ul>
</body>
</html>