[llvm-bugs] [Bug 38788] New: sse4.1 all/any reductions on <4 x i32> inconsistent and possibly suboptimal
via llvm-bugs
llvm-bugs at lists.llvm.org
Fri Aug 31 02:39:39 PDT 2018
https://bugs.llvm.org/show_bug.cgi?id=38788
Bug ID: 38788
Summary: sse4.1 all/any reductions on <4 x i32> inconsistent
and possibly suboptimal
Product: new-bugs
Version: trunk
Hardware: PC
OS: All
Status: NEW
Severity: enhancement
Priority: P
Component: new bugs
Assignee: unassignedbugs at nondot.org
Reporter: gonzalobg88 at gmail.com
CC: llvm-bugs at lists.llvm.org
This was originally reported as https://github.com/rust-lang-nursery/packed_simd/issues/103
When the following functions:

define zeroext i1 @all_sse2(<4 x i32>* noalias nocapture readonly dereferenceable(16) %x) unnamed_addr #0 {
  %0 = bitcast <4 x i32>* %x to <16 x i8>*
  %1 = load <16 x i8>, <16 x i8>* %0, align 16
  %2 = tail call i32 @llvm.x86.sse2.pmovmskb.128(<16 x i8> %1) #2
  %3 = icmp eq i32 %2, 65535
  ret i1 %3
}

define zeroext i1 @any_sse2(<4 x i32>* noalias nocapture readonly dereferenceable(16) %x) unnamed_addr #0 {
  %0 = bitcast <4 x i32>* %x to <16 x i8>*
  %1 = load <16 x i8>, <16 x i8>* %0, align 16
  %2 = tail call i32 @llvm.x86.sse2.pmovmskb.128(<16 x i8> %1) #2
  %3 = icmp ne i32 %2, 0
  ret i1 %3
}

define zeroext i1 @all_sse41(<4 x i32>* noalias nocapture readonly dereferenceable(16) %x) unnamed_addr #0 {
  %0 = bitcast <4 x i32>* %x to <2 x i64>*
  %1 = load <2 x i64>, <2 x i64>* %0, align 16
  %2 = tail call i32 @llvm.x86.sse41.ptestc(<2 x i64> %1, <2 x i64> <i64 -1, i64 -1>) #2
  %3 = icmp eq i32 %2, 1
  ret i1 %3
}

define zeroext i1 @any_sse41(<4 x i32>* noalias nocapture readonly dereferenceable(16) %x) unnamed_addr #0 {
  %0 = bitcast <4 x i32>* %x to <2 x i64>*
  %1 = load <2 x i64>, <2 x i64>* %0, align 16
  %2 = tail call i32 @llvm.x86.sse41.ptestz(<2 x i64> %1, <2 x i64> %1) #2
  %3 = icmp eq i32 %2, 0
  ret i1 %3
}

are compiled and optimized with the SSE4.1 target feature enabled, the SSE2
and SSE4.1 variants of all_ and any_ produce different machine code even though
each pair performs the same operation:

all_sse2:
  movdqa xmm0, xmmword ptr [rdi]
  pmovmskb eax, xmm0
  cmp eax, 65535
  sete al
  ret

any_sse2:
  movdqa xmm0, xmmword ptr [rdi]
  pmovmskb eax, xmm0
  test eax, eax
  setne al
  ret

all_sse41:
  movdqa xmm0, xmmword ptr [rdi]
  pcmpeqd xmm1, xmm1
  ptest xmm0, xmm1
  setb al
  ret

any_sse41:
  movdqa xmm0, xmmword ptr [rdi]
  ptest xmm0, xmm0
  setne al
  ret

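For readers who want to check what "the same operation" means here, the following is a minimal scalar sketch in Python of the four functions. It models a 128-bit register as a Python integer; the helper names (to_u128, pmovmskb, ptestz, ptestc) are made up for this illustration and only mirror the documented flag semantics of PMOVMSKB and PTEST, they are not an LLVM or packed_simd API. Under this model the SSE2 and SSE4.1 variants agree whenever every lane is all-ones or all-zeros (a mask vector, which is presumably the case the report has in mind), and they can differ on arbitrary <4 x i32> values.

```python
ALL_ONES_128 = (1 << 128) - 1


def to_u128(lanes):
    """Pack four 32-bit lanes (lane 0 in the low bits) into one 128-bit integer."""
    value = 0
    for i, lane in enumerate(lanes):
        value |= (lane & 0xFFFFFFFF) << (32 * i)
    return value


def pmovmskb(v):
    """Model of PMOVMSKB: bit i of the result is the top bit of byte i."""
    mask = 0
    for i in range(16):
        byte = (v >> (8 * i)) & 0xFF
        mask |= (byte >> 7) << i
    return mask


def ptestz(a, b):
    """Model of the ZF result of PTEST: 1 if (a AND b) is all zeros."""
    return int((a & b) == 0)


def ptestc(a, b):
    """Model of the CF result of PTEST: 1 if ((NOT a) AND b) is all zeros."""
    return int((~a & b & ALL_ONES_128) == 0)


def all_sse2(lanes):
    return pmovmskb(to_u128(lanes)) == 0xFFFF


def any_sse2(lanes):
    return pmovmskb(to_u128(lanes)) != 0


def all_sse41(lanes):
    return ptestc(to_u128(lanes), ALL_ONES_128) == 1


def any_sse41(lanes):
    v = to_u128(lanes)
    return ptestz(v, v) == 0
```

Calling, for example, all_sse2 and all_sse41 on [0x80808080] * 4 shows the difference on non-mask input: the pmovmskb form only inspects the top bit of each byte, while the ptest form inspects every bit.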
IACA reports (https://gist.github.com/gnzlbg/80d3139393615c18495b1dd7855fc787):
all SSE2 (movmsk) -> Uops: 4, Throughput 1.00 Cycles
all SSE4.1 (ptest) -> Uops: 5, Throughput 1.24 Cycles
any SSE2 (movmsk) -> Uops: 4, Throughput 1.00 Cycles
any SSE4.1 (ptest) -> Uops: 4, Throughput 1.00 Cycles
Instruction-wise, every function has 5 instructions except any_sse41, which
has 4.
So _maybe_ (I am not sure):
- the all_sse41 function should be optimized / lowered to use
  @llvm.x86.sse2.pmovmskb.128 instead of @llvm.x86.sse41.ptestc, which saves
  one uop according to IACA
- the any_sse2 function should be optimized to use @llvm.x86.sse41.ptestz
  instead of @llvm.x86.sse2.pmovmskb.128 when SSE4.1 is enabled, to reduce
  code size
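The reasoning behind these two suggestions can be sketched as a small comparison over the numbers quoted above. The figures are copied from the IACA output and the assembly listings in this report; the ranking rule (fewest uops, then lowest reported throughput figure, then fewest instructions) is an assumption made for this illustration and is not how LLVM's cost model works.

```python
# Numbers copied from the IACA output and the assembly listings in this report.
VARIANTS = {
    "all": {
        "sse2 (pmovmskb)": {"uops": 4, "throughput": 1.00, "instructions": 5},
        "sse4.1 (ptest)": {"uops": 5, "throughput": 1.24, "instructions": 5},
    },
    "any": {
        "sse2 (pmovmskb)": {"uops": 4, "throughput": 1.00, "instructions": 5},
        "sse4.1 (ptest)": {"uops": 4, "throughput": 1.00, "instructions": 4},
    },
}


def preferred(reduction):
    """Pick the variant with the smallest (uops, throughput, instructions) tuple.

    The ordering of the criteria is a hypothetical choice for illustration.
    """
    candidates = VARIANTS[reduction]
    return min(
        candidates,
        key=lambda name: (
            candidates[name]["uops"],
            candidates[name]["throughput"],
            candidates[name]["instructions"],
        ),
    )


if __name__ == "__main__":
    for reduction in ("all", "any"):
        print(reduction, "->", preferred(reduction))
```

With these inputs the rule picks the pmovmskb form for "all" and the ptest form for "any", which matches the two suggestions in the list.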