[llvm-bugs] [Bug 37087] New: vpmovmskb+cmp equivalent to vmovmskps+cmp should maybe lower to vmovmskps
via llvm-bugs
llvm-bugs at lists.llvm.org
Wed Apr 11 05:34:15 PDT 2018
https://bugs.llvm.org/show_bug.cgi?id=37087
Bug ID: 37087
Summary: vpmovmskb+cmp equivalent to vmovmskps+cmp should maybe lower to vmovmskps
Product: new-bugs
Version: trunk
Hardware: PC
OS: All
Status: NEW
Severity: enhancement
Priority: P
Component: new bugs
Assignee: unassignedbugs at nondot.org
Reporter: gonzalobg88 at gmail.com
CC: llvm-bugs at lists.llvm.org
This snippet of code (see it live: https://godbolt.org/g/NuiGgc) generates:
wrong_instr:                            # @wrong_instr
        vmovaps xmm0, xmmword ptr [rsi]
        vcmpeqps xmm0, xmm0, xmmword ptr [rdi]
        vpmovmskb eax, xmm0
        cmp eax, 65535
        sete al
        ret
correct_instr:                          # @correct_instr
        vmovaps xmm0, xmmword ptr [rsi]
        vcmpeqps xmm0, xmm0, xmmword ptr [rdi]
        vmovmskps eax, xmm0
        cmp eax, 15
        sete al
        ret
Note how "wrong_instr" uses, as specified, pmovmskb. AFAICT both snippets are
semantically equivalent.
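The equivalence can be checked with a small scalar model. This is only a sketch, not the code from the godbolt link: the function names are hypothetical, and it assumes that vcmpeqps leaves each 32-bit lane either all-ones or all-zeros, so that only 16 distinct compare results are possible. Under that assumption, "all 16 byte sign bits set" and "all 4 lane sign bits set" should agree for every input.

```c
#include <stdint.h>

/* Scalar model of PMOVMSKB on a 128-bit vector held as four 32-bit lanes:
   bit i of the result is the sign bit of byte i (little-endian byte order). */
static unsigned model_pmovmskb(const uint32_t lanes[4]) {
    unsigned mask = 0;
    for (int byte = 0; byte < 16; ++byte) {
        uint32_t lane = lanes[byte / 4];
        unsigned b = (lane >> (8 * (byte % 4))) & 0xFFu;
        mask |= (b >> 7) << byte;
    }
    return mask;
}

/* Scalar model of MOVMSKPS: bit i of the result is the sign bit of lane i. */
static unsigned model_movmskps(const uint32_t lanes[4]) {
    unsigned mask = 0;
    for (int lane = 0; lane < 4; ++lane)
        mask |= (unsigned)(lanes[lane] >> 31) << lane;
    return mask;
}

/* Enumerates all 16 possible compare results (each lane all-ones or
   all-zeros) and returns 1 if "pmovmskb == 0xFFFF" and "movmskps == 0xF"
   give the same answer for every one of them, 0 otherwise. */
static int checks_agree_on_all_compare_masks(void) {
    for (unsigned pattern = 0; pattern < 16; ++pattern) {
        uint32_t lanes[4];
        for (int lane = 0; lane < 4; ++lane)
            lanes[lane] = ((pattern >> lane) & 1u) ? 0xFFFFFFFFu : 0u;
        int all_bytes = model_pmovmskb(lanes) == 0xFFFFu;
        int all_lanes = model_movmskps(lanes) == 0xFu;
        if (all_bytes != all_lanes)
            return 0;
    }
    return 1;
}
```

The model says nothing about inputs whose lanes are not all-ones or all-zeros; for arbitrary vector contents the two checks are not interchangeable, so the rewrite would only be valid when the mask is known to come from a lane-wide compare.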
On Broadwell and Haswell these instructions have identical performance (from
Agner Fog's instruction tables):
Instruction      µops fused  µops unfused  ports  latency  throughput
PMOVMSKB r,v     1           1             p0     3        1
MOVMSKPS r32,x   1           1             p0     3        1
On Skylake, MOVMSKPS appears to be slightly better:
Instruction      µops fused  µops unfused  ports  latency  throughput
PMOVMSKB r,v     1           1             p0     2-3      1
MOVMSKPS r32,x   1           1             p0     2        1
Depending on the CPU, switching from operating on floating-point vectors to
operating on integer vectors might introduce extra latency, in which case
movmskps would be preferable here, since the mask is produced by a
floating-point compare (vcmpeqps).