<html>
<head>
<base href="https://bugs.llvm.org/">
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW - sse4.1 all/any reductions on <4 x i32> inconsistent and possibly suboptimal"
href="https://bugs.llvm.org/show_bug.cgi?id=38788">38788</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>sse4.1 all/any reductions on <4 x i32> inconsistent and possibly suboptimal
</td>
</tr>
<tr>
<th>Product</th>
<td>new-bugs
</td>
</tr>
<tr>
<th>Version</th>
<td>trunk
</td>
</tr>
<tr>
<th>Hardware</th>
<td>PC
</td>
</tr>
<tr>
<th>OS</th>
<td>All
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>enhancement
</td>
</tr>
<tr>
<th>Priority</th>
<td>P
</td>
</tr>
<tr>
<th>Component</th>
<td>new bugs
</td>
</tr>
<tr>
<th>Assignee</th>
<td>unassignedbugs@nondot.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>gonzalobg88@gmail.com
</td>
</tr>
<tr>
<th>CC</th>
<td>llvm-bugs@lists.llvm.org
</td>
</tr></table>
<p>
<div>
<pre>This bug corresponds to <a href="https://github.com/rust-lang-nursery/packed_simd/issues/103">https://github.com/rust-lang-nursery/packed_simd/issues/103</a>.
When the following functions:
define zeroext i1 @all_sse2(<4 x i32>* noalias nocapture readonly
dereferenceable(16) %x) unnamed_addr #0 {
%0 = bitcast <4 x i32>* %x to <16 x i8>*
%1 = load <16 x i8>, <16 x i8>* %0, align 16
%2 = tail call i32 @llvm.x86.sse2.pmovmskb.128(<16 x i8> %1) #2
%3 = icmp eq i32 %2, 65535
ret i1 %3
}
define zeroext i1 @any_sse2(<4 x i32>* noalias nocapture readonly
dereferenceable(16) %x) unnamed_addr #0 {
%0 = bitcast <4 x i32>* %x to <16 x i8>*
%1 = load <16 x i8>, <16 x i8>* %0, align 16
%2 = tail call i32 @llvm.x86.sse2.pmovmskb.128(<16 x i8> %1) #2
%3 = icmp ne i32 %2, 0
ret i1 %3
}
define zeroext i1 @all_sse41(<4 x i32>* noalias nocapture readonly
dereferenceable(16) %x) unnamed_addr #0 {
%0 = bitcast <4 x i32>* %x to <2 x i64>*
%1 = load <2 x i64>, <2 x i64>* %0, align 16
%2 = tail call i32 @llvm.x86.sse41.ptestc(<2 x i64> %1, <2 x i64> <i64 -1,
i64 -1>) #2
%3 = icmp eq i32 %2, 1
ret i1 %3
}
define zeroext i1 @any_sse41(<4 x i32>* noalias nocapture readonly
dereferenceable(16) %x) unnamed_addr #0 {
%0 = bitcast <4 x i32>* %x to <2 x i64>*
%1 = load <2 x i64>, <2 x i64>* %0, align 16
%2 = tail call i32 @llvm.x86.sse41.ptestz(<2 x i64> %1, <2 x i64> %1) #2
%3 = icmp eq i32 %2, 0
ret i1 %3
}
are compiled and optimized with the SSE4.1 target feature enabled, the
all_ and any_ variants produce different machine code, even though each
pair computes the same result when every lane is an all-ones or all-zeros
mask:
all_sse2:
movdqa xmm0, xmmword ptr [rdi]
pmovmskb eax, xmm0
cmp eax, 65535
sete al
ret
any_sse2:
movdqa xmm0, xmmword ptr [rdi]
pmovmskb eax, xmm0
test eax, eax
setne al
ret
all_sse41:
movdqa xmm0, xmmword ptr [rdi]
pcmpeqd xmm1, xmm1
ptest xmm0, xmm1
setb al
ret
any_sse41:
movdqa xmm0, xmmword ptr [rdi]
ptest xmm0, xmm0
setne al
ret
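For reference, here is a minimal scalar sketch of what the four functions
compute. It is written in Python purely as an illustration and is not part of
the original report. The helpers pack_lanes and pmovmskb are hypothetical
names; the model assumes little-endian lanes, as on x86. It shows that the
SSE2 and SSE4.1 variants agree on mask vectors but can differ on arbitrary
input, because PMOVMSKB only inspects the top bit of each byte while PTEST
inspects every bit.

```python
def pack_lanes(lanes):
    """Pack four 32-bit lane values into 16 bytes (hypothetical helper)."""
    return b"".join((lane & 0xFFFFFFFF).to_bytes(4, "little") for lane in lanes)


def pmovmskb(vec):
    """Model PMOVMSKB: bit i of the result is the top bit of byte i."""
    return sum((byte >> 7) << i for i, byte in enumerate(vec))


def all_sse2(vec):
    # Mirrors: icmp eq (pmovmskb x), 65535
    return pmovmskb(vec) == 0xFFFF


def any_sse2(vec):
    # Mirrors: icmp ne (pmovmskb x), 0
    return pmovmskb(vec) != 0


def all_sse41(vec):
    # Mirrors ptestc(x, all-ones): CF is set when every bit of x is set.
    return all(byte == 0xFF for byte in vec)


def any_sse41(vec):
    # Mirrors ptestz(x, x) == 0: ZF is clear when some bit of x is set.
    return any(byte != 0 for byte in vec)
```

On a mask vector such as pack_lanes([-1, 0, -1, 0]) both any_ variants return
True and both all_ variants return False. On a non-mask vector such as
pack_lanes([1, 0, 0, 0]) the model's any_sse2 returns False while any_sse41
returns True, so swapping one lowering for the other is only valid when the
input is known to be a mask.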
IACA reports (<a href="https://gist.github.com/gnzlbg/80d3139393615c18495b1dd7855fc787">https://gist.github.com/gnzlbg/80d3139393615c18495b1dd7855fc787</a>):
all SSE2 (movmsk) -> Uops: 4, Throughput 1.00 Cycles
all SSE4.1 (ptest) -> Uops: 5, Throughput 1.24 Cycles
any SSE2 (movmsk) -> Uops: 4, Throughput 1.00 Cycles
any SSE4.1 (ptest) -> Uops: 4, Throughput 1.00 Cycles
In terms of instruction count, every function has 5 instructions except
any_sse41, which has 4.
So _maybe_ (I am not sure):
- all_sse41 should be lowered to @llvm.x86.sse2.pmovmskb.128 instead of
@llvm.x86.sse41.ptestc, since IACA reports fewer uops for the movmsk version
- when SSE4.1 is enabled, any_sse2 should be lowered to
@llvm.x86.sse41.ptestz instead of @llvm.x86.sse2.pmovmskb.128 to reduce
code size</pre>
</div>
</p>
</body>
</html>