[llvm-bugs] [Bug 50358] New: _mm512_mask_add_epi64 often fails to produce masked vpaddq

via llvm-bugs llvm-bugs at lists.llvm.org
Sat May 15 23:28:27 PDT 2021


https://bugs.llvm.org/show_bug.cgi?id=50358

            Bug ID: 50358
           Summary: _mm512_mask_add_epi64 often fails to produce masked
                    vpaddq
           Product: libraries
           Version: trunk
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: enhancement
          Priority: P
         Component: Backend: X86
          Assignee: unassignedbugs at nondot.org
          Reporter: elrodc at gmail.com
                CC: craig.topper at gmail.com, llvm-bugs at lists.llvm.org,
                    llvm-dev at redking.me.uk, pengfei.wang at intel.com,
                    spatel+llvm at rotateright.com

Here are a couple examples:
https://godbolt.org/z/PM4Y3Wr14
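
For reference, here is a minimal sketch of the kind of function involved (my
own reduction; the Godbolt sources may differ in detail). Ideally the
`_mm512_mask_add_epi64` would compile to a single masked `vpaddq`:

#include <immintrin.h>

/* Count the not-NaN lanes of v into 64-bit integer accumulators.
   Hoped-for codegen: vpaddq zmm {k}, zmm, zmm. In the reported cases LLVM
   instead emits a vpmovm2q + vpsubq pair. */
__m512i count_ordered(__m512i cnt, __m512d v) {
    __mmask8 ord = _mm512_cmp_pd_mask(v, v, _CMP_ORD_Q); /* true where v is not NaN */
    return _mm512_mask_add_epi64(cnt, ord, cnt, _mm512_set1_epi64(1));
}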

I encountered this while trying to optimize a mean function that skips `NaN`s.
Skipping some elements means I also have to count the total number of
non-`NaN`s. On my Cascadelake desktop, using doubles to count the non-`NaN`s
took 62 ns, with inner loop assembly like:

L128:
        vmovupd zmm10, zmmword ptr [rdx + 8*rax]
        vmovupd zmm11, zmmword ptr [rdx + 8*rax + 64]
        vmovupd zmm12, zmmword ptr [rdx + 8*rax + 128]
        vmovupd zmm13, zmmword ptr [rdx + 8*rax + 192]
        vcmpordpd       k3, zmm10, zmm0
        vcmpordpd       k4, zmm11, zmm0
        vcmpordpd       k2, zmm12, zmm0
        vcmpordpd       k1, zmm13, zmm0
        vaddpd  zmm9 {k3}, zmm9, zmm1
        vaddpd  zmm8 {k4}, zmm8, zmm1
        vaddpd  zmm7 {k2}, zmm7, zmm1
        vaddpd  zmm6 {k1}, zmm6, zmm1
        vaddpd  zmm5 {k3}, zmm5, zmm10
        vaddpd  zmm4 {k4}, zmm4, zmm11
        vaddpd  zmm2 {k2}, zmm2, zmm12
        vaddpd  zmm3 {k1}, zmm3, zmm13
        add     rax, 32
        cmp     rsi, rax
        jne     L128
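
For context, the source loop that produces assembly of this shape is roughly
as follows (a simplified C-intrinsics sketch with names of my own; the 4x
unrolling seen above is done by the compiler):

#include <immintrin.h>
#include <stddef.h>

/* NaN-skipping mean, counting the non-NaN elements in a double vector so
   that the count is accumulated with a masked vaddpd. Remainder handling
   is omitted for brevity. */
double nanmean_double_count(const double *x, size_t n) {
    __m512d sum = _mm512_setzero_pd();
    __m512d cnt = _mm512_setzero_pd();
    const __m512d zero = _mm512_setzero_pd();
    const __m512d one  = _mm512_set1_pd(1.0);
    for (size_t i = 0; i + 8 <= n; i += 8) {
        __m512d v = _mm512_loadu_pd(x + i);
        __mmask8 ord = _mm512_cmp_pd_mask(v, zero, _CMP_ORD_Q); /* not NaN */
        cnt = _mm512_mask_add_pd(cnt, ord, cnt, one); /* vaddpd zmm {k}, zmm, zmm */
        sum = _mm512_mask_add_pd(sum, ord, sum, v);   /* vaddpd zmm {k}, zmm, zmm */
    }
    return _mm512_reduce_add_pd(sum) / _mm512_reduce_add_pd(cnt);
}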



Benchmarking 1024 elements took about 53 ns on average.
However, there is a problem: while Cascadelake (my desktop) can execute
`vaddpd` on both ports 0 and 5, many AVX512 systems, such as icelake-client,
tigerlake, and presumably rocketlake, can only do so on port 0. They can,
however, execute `vpaddq` on port 5:
https://uops.info/html-instr/VPADDQ_ZMM_K_ZMM_ZMM.html
https://uops.info/html-instr/VADDPD_ZMM_K_ZMM_ZMM.html

This means they should, in theory, be able to achieve the same (or nearly the
same) throughput if I switch to counting the number of non-`NaN`s with
integers.

But if I simply swap the floating-point counter for an integer one, I get a
performance regression to an average of about 65 ns, because LLVM generates
code like in the second Godbolt example (`vpmovm2q` to convert the mask into
-1/0 elements, then `vpsubq` to subtract them):

L112:
        vmovupd zmm9, zmmword ptr [rax + 8*rdi]
        vmovupd zmm10, zmmword ptr [rax + 8*rdi + 64]
        vmovupd zmm11, zmmword ptr [rax + 8*rdi + 128]
        vmovupd zmm12, zmmword ptr [rax + 8*rdi + 192]
        vcmpordpd       k3, zmm9, zmm0
        vcmpordpd       k4, zmm10, zmm0
        vcmpordpd       k2, zmm11, zmm0
        vcmpordpd       k1, zmm12, zmm0
        vpmovm2q        zmm13, k3
        vpsubq  zmm8, zmm8, zmm13
        vpmovm2q        zmm13, k4
        vpsubq  zmm7, zmm7, zmm13
        vpmovm2q        zmm13, k2
        vpsubq  zmm6, zmm6, zmm13
        vpmovm2q        zmm13, k1
        vpsubq  zmm2, zmm2, zmm13
        vaddpd  zmm5 {k3}, zmm5, zmm9
        vaddpd  zmm4 {k4}, zmm4, zmm10
        vaddpd  zmm1 {k2}, zmm1, zmm11
        vaddpd  zmm3 {k1}, zmm3, zmm12
        add     rdi, 32
        cmp     rdx, rdi
        jne     L112

instead of using a masked `vpaddq`.
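
The integer-counting variant is essentially the same loop with the counter
switched to 64-bit integer lanes (again a sketch of my own; only the count
accumulation changes relative to the earlier one):

#include <immintrin.h>
#include <stddef.h>

/* Same as the previous sketch, except that the non-NaN count is accumulated
   with _mm512_mask_add_epi64, which would ideally lower to a masked vpaddq
   (executable on port 5 on icelake-client/tigerlake). */
double nanmean_int_count(const double *x, size_t n) {
    __m512d sum = _mm512_setzero_pd();
    __m512i cnt = _mm512_setzero_si512();
    const __m512d zero = _mm512_setzero_pd();
    const __m512i one  = _mm512_set1_epi64(1);
    for (size_t i = 0; i + 8 <= n; i += 8) {
        __m512d v = _mm512_loadu_pd(x + i);
        __mmask8 ord = _mm512_cmp_pd_mask(v, zero, _CMP_ORD_Q); /* not NaN */
        cnt = _mm512_mask_add_epi64(cnt, ord, cnt, one); /* hoped-for: masked vpaddq */
        sum = _mm512_mask_add_pd(sum, ord, sum, v);
    }
    return _mm512_reduce_add_pd(sum) / (double)_mm512_reduce_add_epi64(cnt);
}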

The same basic result occurs with the `@llvm.vp.add.v8i64` intrinsic:
https://godbolt.org/z/sKraG1vcx
