[llvm-bugs] [Bug 40148] New: Scalars are always chosen when vectorization cost is equal, even when already using vectors
via llvm-bugs
llvm-bugs at lists.llvm.org
Sun Dec 23 18:54:22 PST 2018
https://bugs.llvm.org/show_bug.cgi?id=40148
Bug ID: 40148
Summary: Scalars are always chosen when vectorization cost is
equal, even when already using vectors
Product: libraries
Version: trunk
Hardware: PC
OS: Windows NT
Status: NEW
Severity: enhancement
Priority: P
Component: Backend: X86
Assignee: unassignedbugs at nondot.org
Reporter: husseydevin at gmail.com
CC: craig.topper at gmail.com, llvm-bugs at lists.llvm.org,
llvm-dev at redking.me.uk, spatel+llvm at rotateright.com
This is the cause of more i64x2 vector multiply bugs. I think this is the broader form of the bug, although I have a specific example.
Basically, what I see is that LLVM will always choose scalar code when the cost
model rates the vectorized and scalar versions as equal. While this is
beneficial for preventing unwanted vectorization, when the code is _already_
operating on vectors (most notably with vector extensions), LLVM still forcibly
chooses scalar, resulting in extraction.
Here, I am trying a few ways to get a pmuludq, as well as implementing ARM's
vmull_u32, all using vector extensions.
This is for trunk. 7.0 fails to vectorize pmuludq_v2, probably because of bug
40032.
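The type definitions are not shown in the report; a plausible reconstruction
using Clang/GCC vector extensions (the names U64, U64x2, and U32x2 are inferred
from the code below and are not confirmed by the original) would be:

```c
#include <stdint.h>

/* Assumed definitions -- not part of the original report. */
typedef uint64_t U64;
typedef uint32_t U32;
typedef U64 U64x2 __attribute__((vector_size(16))); /* two 64-bit lanes */
typedef U32 U32x2 __attribute__((vector_size(8)));  /* two 32-bit lanes */
```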
U64x2 pmuludq_v1(U64x2 top, U64x2 bot)
{
    return (top & 0xFFFFFFFF) * (bot & 0xFFFFFFFF);
}

U64x2 pmuludq_v2(U64x2 top, U64x2 bot)
{
    return (U64x2) {
        (top[0] & 0xFFFFFFFF) * (bot[0] & 0xFFFFFFFF),
        (top[1] & 0xFFFFFFFF) * (bot[1] & 0xFFFFFFFF)
    };
}

/* ARM-style */
U64x2 vmull_u32(U32x2 top, U32x2 bot)
{
    return (U64x2) {
        (U64)bot[0] * (U64)top[0],
        (U64)bot[1] * (U64)top[1]
    };
}
clang version 8.0.0 (trunk 350011)
clang -m32 -O3 -msse4.1
pmuludq_v1:                             # @pmuludq_v1
        pmuludq xmm0, xmm1
        ret
pmuludq_v2:                             # @pmuludq_v2
        pmuludq xmm0, xmm1
        ret
vmull_u32:                              # @vmull_u32
        pmovzxdq xmm1, qword ptr [esp + 4]  # xmm1 = mem[0],zero,mem[1],zero
        pmovzxdq xmm0, qword ptr [esp + 12] # xmm0 = mem[0],zero,mem[1],zero
        pmuludq xmm0, xmm1
        ret
clang -m64 -O3 -msse4.1
pmuludq_v1:                             # @pmuludq_v1
        pmuludq xmm0, xmm1
        ret
pmuludq_v2:                             # @pmuludq_v2
        movq rax, xmm0
        mov eax, eax
        movq rcx, xmm1
        mov ecx, ecx
        imul rcx, rax
        pextrq rax, xmm0, 1
        mov eax, eax
        pextrq rdx, xmm1, 1
        mov edx, edx
        imul rdx, rax
        movq xmm1, rdx
        movq xmm0, rcx
        punpcklqdq xmm0, xmm1           # xmm0 = xmm0[0],xmm1[0]
        ret
vmull_u32:                              # @vmull_u32
        pextrd eax, xmm0, 1
        movd ecx, xmm0
        pextrd edx, xmm1, 1
        movd esi, xmm1
        imul rcx, rsi
        imul rax, rdx
        movq xmm1, rax
        movq xmm0, rcx
        punpcklqdq xmm0, xmm1           # xmm0 = xmm0[0],xmm1[0]
        ret
Godbolt: https://godbolt.org/z/6tGRWn
In the first example, Clang generates the expected pmuludq instruction on both
x86_64 and x86.
The second also generates the expected pmuludq on x86, but goes scalar on
x86_64.
The third generates the expected pmovzxdq and pmuludq on x86, but again goes
scalar on x86_64.
According to -Rpass-missed=".*", we get this remark for each of the x86_64
scalar examples (but not on x86):
"List vectorization was possible but not beneficial with cost 0 >= 0."
Since LLVM sees that there is no difference between a pmuludq and scalar, it
chooses scalar.
When the costs are equal, LLVM should stay in whichever form (vector or scalar)
the code is already in, or at least account for the cost of extraction if it
doesn't already.
Even if the two versions run at the same speed after extraction, the vectorized
version should still be preferred: in the example above, the vectorized
pmuludq_v1 is 5 bytes of code, while the scalar pmuludq_v2 is 55 bytes.