[llvm-bugs] [Bug 40148] New: Scalars are always chosen when vectorization cost is equal, even when already using vectors
via llvm-bugs
llvm-bugs at lists.llvm.org
Sun Dec 23 18:54:22 PST 2018
https://bugs.llvm.org/show_bug.cgi?id=40148
Bug ID: 40148
Summary: Scalars are always chosen when vectorization cost is
equal, even when already using vectors
Product: libraries
Version: trunk
Hardware: PC
OS: Windows NT
Status: NEW
Severity: enhancement
Priority: P
Component: Backend: X86
Assignee: unassignedbugs at nondot.org
Reporter: husseydevin at gmail.com
CC: craig.topper at gmail.com, llvm-bugs at lists.llvm.org,
llvm-dev at redking.me.uk, spatel+llvm at rotateright.com
This is the cause of more i64x2 vector multiply bugs. I think this is the broader form of the bug, although I have a specific example.
Basically, what I see is that LLVM will always choose scalar code when the cost
model rates the vectorized and scalar versions as equal. While this is
beneficial for preventing unwanted vectorization, when the code is _already_
operating on vectors (most notably with vector extensions), LLVM still forcibly
chooses scalar, resulting in extraction.
Here, I am trying a few ways to get a pmuludq, as well as implementing ARM's
vmull_u32, all using vector extensions.
This is for trunk. 7.0 fails to vectorize pmuludq_v2, probably because of bug
40032.
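The type definitions are not shown in the report; a plausible reconstruction
using Clang/GCC vector extensions (the names U64, U64x2, and U32x2 are inferred
from the code below and are not confirmed by the original) would be:

```c
#include <stdint.h>

/* Assumed definitions -- not part of the original report. */
typedef uint64_t U64;
typedef uint32_t U32;
typedef U64 U64x2 __attribute__((vector_size(16))); /* two 64-bit lanes */
typedef U32 U32x2 __attribute__((vector_size(8)));  /* two 32-bit lanes */
```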
U64x2 pmuludq_v1(U64x2 top, U64x2 bot)
{
    return (top & 0xFFFFFFFF) * (bot & 0xFFFFFFFF);
}

U64x2 pmuludq_v2(U64x2 top, U64x2 bot)
{
    return (U64x2) {
        (top[0] & 0xFFFFFFFF) * (bot[0] & 0xFFFFFFFF),
        (top[1] & 0xFFFFFFFF) * (bot[1] & 0xFFFFFFFF)
    };
}

/* ARM-style */
U64x2 vmull_u32(U32x2 top, U32x2 bot)
{
    return (U64x2) {
        (U64)bot[0] * (U64)top[0],
        (U64)bot[1] * (U64)top[1]
    };
}
clang version 8.0.0 (trunk 350011)
clang -m32 -O3 -msse4.1
pmuludq_v1:                             # @pmuludq_v1
        pmuludq xmm0, xmm1
        ret
pmuludq_v2:                             # @pmuludq_v2
        pmuludq xmm0, xmm1
        ret
vmull_u32:                              # @vmull_u32
        pmovzxdq xmm1, qword ptr [esp + 4]  # xmm1 = mem[0],zero,mem[1],zero
        pmovzxdq xmm0, qword ptr [esp + 12] # xmm0 = mem[0],zero,mem[1],zero
        pmuludq xmm0, xmm1
        ret
clang -m64 -O3 -msse4.1
pmuludq_v1:                             # @pmuludq_v1
        pmuludq xmm0, xmm1
        ret
pmuludq_v2:                             # @pmuludq_v2
        movq rax, xmm0
        mov eax, eax
        movq rcx, xmm1
        mov ecx, ecx
        imul rcx, rax
        pextrq rax, xmm0, 1
        mov eax, eax
        pextrq rdx, xmm1, 1
        mov edx, edx
        imul rdx, rax
        movq xmm1, rdx
        movq xmm0, rcx
        punpcklqdq xmm0, xmm1           # xmm0 = xmm0[0],xmm1[0]
        ret
vmull_u32:                              # @vmull_u32
        pextrd eax, xmm0, 1
        movd ecx, xmm0
        pextrd edx, xmm1, 1
        movd esi, xmm1
        imul rcx, rsi
        imul rax, rdx
        movq xmm1, rax
        movq xmm0, rcx
        punpcklqdq xmm0, xmm1           # xmm0 = xmm0[0],xmm1[0]
        ret
Godbolt: https://godbolt.org/z/6tGRWn
In the first example, Clang generates the expected pmuludq instruction on both
x86_64 and x86.
The second also generates the expected pmuludq on x86, but goes scalar on
x86_64.
The third generates the expected pmovzxdq and pmuludq on x86, but again goes
scalar on x86_64.
According to -Rpass-missed=".*", we get this remark for each of the x86_64
scalar examples (but not on x86):
"List vectorization was possible but not beneficial with cost 0 >= 0."
Since LLVM sees that there is no difference between a pmuludq and scalar, it
chooses scalar.
When the costs are equal, LLVM should stay in whichever form (vector or scalar)
the code is already in, or at least account for the cost of extraction if it
doesn't already.
Even if the two versions run at the same speed after extraction, the vectorized
version should still be preferred: in the example above, the vectorized
pmuludq_v1 is 5 bytes of code, while the scalar pmuludq_v2 is 55 bytes.