<html>
<head>
<base href="https://bugs.llvm.org/">
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW - [X86] Suboptimal lowering of lshr/ashr <4 x i32> for non-AVX2"
href="https://bugs.llvm.org/show_bug.cgi?id=37441">37441</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>[X86] Suboptimal lowering of lshr/ashr <4 x i32> for non-AVX2
</td>
</tr>
<tr>
<th>Product</th>
<td>libraries
</td>
</tr>
<tr>
<th>Version</th>
<td>6.0
</td>
</tr>
<tr>
<th>Hardware</th>
<td>PC
</td>
</tr>
<tr>
<th>OS</th>
<td>Linux
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>enhancement
</td>
</tr>
<tr>
<th>Priority</th>
<td>P
</td>
</tr>
<tr>
<th>Component</th>
<td>Backend: X86
</td>
</tr>
<tr>
<th>Assignee</th>
<td>unassignedbugs@nondot.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>fabiang@radgametools.com
</td>
</tr>
<tr>
<th>CC</th>
<td>llvm-bugs@lists.llvm.org
</td>
</tr></table>
<p>
<div>
<pre>target triple = "x86_64-unknown-linux-gnu"
define <4 x i32> @f(<4 x i32>, <4 x i32>) {
%a = lshr <4 x i32> %0, %1
ret <4 x i32> %a
}
produces:
f: # @f
.cfi_startproc
# %bb.0:
movaps %xmm1, %xmm2
psrlq $32, %xmm2
movaps %xmm0, %xmm3
psrld %xmm2, %xmm3
movaps %xmm1, %xmm2
psrldq $12, %xmm2 # xmm2 = xmm2[12,13,14,15],zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero
movaps %xmm0, %xmm4
psrld %xmm2, %xmm4
movsd %xmm3, %xmm4 # xmm4 = xmm3[0],xmm4[1]
pshufd $237, %xmm4, %xmm2 # xmm2 = xmm4[1,3,2,3]
xorps %xmm3, %xmm3
movaps %xmm1, %xmm4
punpckldq %xmm3, %xmm4 # xmm4 = xmm4[0],xmm3[0],xmm4[1],xmm3[1]
movaps %xmm0, %xmm5
psrld %xmm4, %xmm5
punpckhdq %xmm3, %xmm1 # xmm1 = xmm1[2],xmm3[2],xmm1[3],xmm3[3]
psrld %xmm1, %xmm0
movsd %xmm5, %xmm0 # xmm0 = xmm5[0],xmm0[1]
pshufd $232, %xmm0, %xmm0 # xmm0 = xmm0[0,2,2,3]
punpckldq %xmm2, %xmm0 # xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1]
retq
The zero extension of the scalarized shift amounts in LowerShifts is "tighter"
than necessary, because for all of the SSE2 variable shift instructions, any
shift amount between 64 and UINT64_MAX produces the same result.
Proposal: bitcast to <8 x i16> and use a [ 0,1,1,1, -1,-1,-1,-1 ] shuffle for
the first shift, [ 2,3,3,3, -1,-1,-1,-1 ] for the second shift. These can be
lowered as PSHUFLW, replacing both various other shuffle ops and one MOVAPS per
lane. If the high word of a 32-bit shift amount is 0, this zero-extends it; if
it's not, the shuffle produces a different value than regular zero-extension
would, but this doesn't change the result: it maps a shift amount >=64 into a
different (larger) shift amount >=64.
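
The equivalence claimed above can be checked with a small model. This is only
a sketch: it models PSRLD (so lshr, not ashr), the function names are
hypothetical and not LLVM APIs, and it assumes the usual SSE2 behaviour that a
PSRLD count above 31 clears every lane.

```python
# Illustrative model only; function names are hypothetical, not LLVM APIs.

def psrld(lanes, count):
    """PSRLD xmm, xmm: logical right shift of four 32-bit lanes by a 64-bit
    count. Any count above 31 clears every lane."""
    if count > 31:
        return [0, 0, 0, 0]
    return [(v >> count) & 0xFFFFFFFF for v in lanes]

def pshuflw_indices(imm8):
    """Decode a PSHUFLW immediate into its four source word indices."""
    return [(imm8 >> (2 * i)) & 3 for i in range(4)]

def shuffled_count(amounts, imm8):
    """Low 64 bits of PSHUFLW(amounts, imm8): the count PSRLD would read."""
    words = []
    for lane in amounts[:2]:  # the low qword holds 32-bit lanes 0 and 1
        words += [lane & 0xFFFF, lane >> 16]
    picked = [words[i] for i in pshuflw_indices(imm8)]
    return sum(w << (16 * i) for i, w in enumerate(picked))

def same_as_zext(value, amounts, lane, imm8):
    """True if shifting by the shuffled count matches shifting by the
    zero-extended amount of the given lane."""
    shuffled = psrld(value, shuffled_count(amounts, imm8))
    return shuffled == psrld(value, amounts[lane])
```

For example, same_as_zext(v, [0x10005, 0, 0, 0], 0, 0x54) holds for any v: the
shuffled count is not 0x10005, but both counts are far above 31, so both shifts
yield zero.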
The sequence for the final merge, which needs to pick lane 0 from R0, lane 1
from R1 and so on (Rn being the input shifted by the lane-n amount), looks like
it's targeting SSE4.1+ (with PBLENDW). This sequence of three <4 x i32>
shuffles should be more suitable for pre-SSE4.1:
tmp0 = shufflevector(R0, R1, { 0, -1, -1, 5 }); // punpcklqdq
tmp1 = shufflevector(R2, R3, { 2, -1, -1, 7 }); // punpckhqdq
result = shufflevector(tmp0, tmp1, { 0, 3, 4, 7 }); // shufps
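
As a sanity check of the three masks above, here is a minimal model of
shufflevector on 4-element lists. The helper names are hypothetical; -1 is
treated as an undefined lane (None), and the SHUFPS decoding assumes the
standard immediate layout (two selectors from the destination, two from the
source).

```python
# Illustrative model only; helper names are hypothetical.

def shufflevector(a, b, mask):
    """LLVM-style shuffle: indices 0-3 pick from a, 4-7 from b, -1 is undef."""
    both = a + b
    return [None if i == -1 else both[i] for i in mask]

def shufps_mask(imm8):
    """Decode a SHUFPS immediate into the equivalent shufflevector mask."""
    sel = [(imm8 >> (2 * i)) & 3 for i in range(4)]
    # Result lanes 0-1 come from the first operand, lanes 2-3 from the second.
    return [sel[0], sel[1], sel[2] + 4, sel[3] + 4]

def merge(r0, r1, r2, r3):
    """Pick lane n from Rn using the three shuffles from the proposal."""
    tmp0 = shufflevector(r0, r1, [0, -1, -1, 5])    # punpcklqdq
    tmp1 = shufflevector(r2, r3, [2, -1, -1, 7])    # punpckhqdq
    return shufflevector(tmp0, tmp1, [0, 3, 4, 7])  # shufps

# Label every element so the origin of each result lane is visible.
regs = [[f"R{k}[{i}]" for i in range(4)] for k in range(4)]
merged = merge(*regs)
```

Here merged contains exactly one element from each of R0..R3, taken from lane
0, 1, 2 and 3 respectively, and shufps_mask(0xcc) decodes to the { 0, 3, 4, 7 }
mask used for the final shuffle.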
The entire sequence could be turned into something like
movaps xmm2, xmm0 ; copy of input
pshuflw xmm3, xmm1, 0x54 ; [0,1,1,1] shuffle for lane0 amt
psrld xmm0, xmm3 ; lane0 shift
movaps xmm3, xmm2 ; copy of input
pshuflw xmm4, xmm1, 0xfe ; [2,3,3,3] shuffle for lane1 amt
psrld xmm3, xmm4 ; lane1 shift
punpcklqdq xmm0, xmm3 ; [lane0, X, X, lane1]
movaps xmm3, xmm2 ; copy of input
pshufd xmm4, xmm1, 0xaa ; broadcast lane2 amt
psrldq xmm4, 12 ; lane2 amt (zero-extended)
psrld xmm3, xmm4 ; lane2 shift
psrldq xmm1, 12 ; lane3 amt (zero-extended)
psrld xmm2, xmm1 ; lane3 shift
punpckhqdq xmm3, xmm2 ; [lane2, X, X, lane3]
shufps xmm0, xmm3, 0xcc ; [lane0, lane1, lane2, lane3]
saving 5 instructions over the existing version (targeting SSE2), and using one
fewer temporary register.</pre>
</div>
</p>
</body>
</html>