<html>
<head>
<base href="https://llvm.org/bugs/" />
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW --- - The vectorization costs of SHL/SRL/SRA need to be adjusted to vectorize the loop in the testcase"
href="https://llvm.org/bugs/show_bug.cgi?id=23582">23582</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>The vectorization costs of SHL/SRL/SRA need to be adjusted to vectorize the loop in the testcase
</td>
</tr>
<tr>
<th>Product</th>
<td>libraries
</td>
</tr>
<tr>
<th>Version</th>
<td>trunk
</td>
</tr>
<tr>
<th>Hardware</th>
<td>PC
</td>
</tr>
<tr>
<th>OS</th>
<td>Linux
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>normal
</td>
</tr>
<tr>
<th>Priority</th>
<td>P
</td>
</tr>
<tr>
<th>Component</th>
<td>Loop Optimizer
</td>
</tr>
<tr>
<th>Assignee</th>
<td>unassignedbugs@nondot.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>wmi@google.com
</td>
</tr>
<tr>
<th>CC</th>
<td>aschwaighofer@apple.com, llvmbugs@cs.uiuc.edu, nrotem@apple.com
</td>
</tr>
<tr>
<th>Classification</th>
<td>Unclassified
</td>
</tr></table>
<p>
<div>
<pre>Created <span class=""><a href="attachment.cgi?id=14344" name="attach_14344" title="testcase 1.cc">attachment 14344</a> <a href="attachment.cgi?id=14344&action=edit" title="testcase 1.cc">[details]</a></span>
testcase 1.cc
For the attached testcase 1.cc, the kernel loop is not vectorized because the
vectorization costs of SHL/SRL/SRA are set to high values in the SSE2CostTable
in X86TTIImpl::getArithmeticInstrCost:
  { ISD::SHL, MVT::v8i16, 8*10 }, // Scalarized.
  { ISD::SHL, MVT::v4i32, 2*5  }, // We optimized this using mul.
  { ISD::SHL, MVT::v2i64, 2*10 }, // Scalarized.
  { ISD::SRL, MVT::v8i16, 8*10 }, // Scalarized.
  { ISD::SRL, MVT::v4i32, 4*10 }, // Scalarized.
  { ISD::SRL, MVT::v2i64, 2*10 }, // Scalarized.
  { ISD::SRA, MVT::v8i16, 8*10 }, // Scalarized.
  { ISD::SRA, MVT::v4i32, 4*10 }, // Scalarized.
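For context, the lookup over this table works roughly as follows. This is a
paraphrased, standalone sketch with stand-in types of my own (the real
CostTblEntry and MVT live in the LLVM tree), not the exact source:

// Standalone sketch of the cost-table lookup pattern used by
// X86TTIImpl::getArithmeticInstrCost (paraphrased; stand-in types).
#include <cstdio>

enum SimpleISD { SHL, SRL, SRA };
enum SimpleVT { v8i16, v4i32, v2i64 };

struct CostTblEntry {
  SimpleISD ISD;
  SimpleVT Type;
  unsigned Cost; // in "simple instruction" units
};

static const CostTblEntry SSE2CostTable[] = {
  { SHL, v8i16, 8*10 }, { SHL, v4i32, 2*5  }, { SHL, v2i64, 2*10 },
  { SRL, v8i16, 8*10 }, { SRL, v4i32, 4*10 }, { SRL, v2i64, 2*10 },
  { SRA, v8i16, 8*10 }, { SRA, v4i32, 4*10 },
};

// First matching entry wins. The vectorizer weighs this against the
// scalar cost, so an entry like 8*10 makes a v8i16 shift look
// prohibitively expensive and the loop stays scalar.
unsigned getShiftCost(SimpleISD ISD, SimpleVT VT) {
  for (const CostTblEntry &E : SSE2CostTable)
    if (E.ISD == ISD && E.Type == VT)
      return E.Cost;
  return 1; // assumed default for this sketch
}

int main() {
  std::printf("cost(SRA, v4i32) = %u\n", getShiftCost(SRA, v4i32)); // 40
}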
But x86 supports psllw/pslld/psllq, psrlw/psrld/psrlq, and psraw/psrad, so I
don't understand why those costs need to be set so high. It looks related to
<a href="http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20130401/170439.html">http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20130401/170439.html</a>,
but I could not find a testcase there.
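To illustrate that these instructions handle a whole vector in one shot when
the shift count is uniform, here is a minimal SSE2 intrinsics example (my own
illustration, not part of the testcase); psrad with an immediate count maps to
_mm_srai_epi32, and psrad with the count in a register maps to _mm_sra_epi32:

// Minimal illustration: SSE2 arithmetic right shifts of <4 x i32>
// with a uniform count are single psrad instructions.
#include <emmintrin.h>
#include <cstdio>

int main() {
  __m128i v = _mm_set_epi32(-64, 64, -16, 16);
  __m128i by_imm = _mm_srai_epi32(v, 2);                   // psrad with immediate
  __m128i by_reg = _mm_sra_epi32(v, _mm_cvtsi32_si128(2)); // psrad with xmm count
  int out[4];
  _mm_storeu_si128((__m128i *)out, by_imm);
  std::printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]); // 4 -4 16 -16
  _mm_storeu_si128((__m128i *)out, by_reg);
  std::printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]); // same result
}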
The kernel loop in 1.cc:
  for (; k < f; k++) {
    m = h[k].ival;
    m += h[k + 1].ival;
    i[k].ival -= (l + n * m) >> j;
  }
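For reference, a self-contained reduction of the kernel (my reconstruction,
not the attached 1.cc; the int16_t fields and int parameters are guesses read
off the movswl/movzwl/movw in the assembly below):

// Reconstruction of the kernel with assumed types; the real testcase
// is in attachment 14344.
#include <cstdint>

struct Elem { int16_t ival; };

void kernel(Elem *h, Elem *i, int k, int f, int l, int n, int j) {
  for (; k < f; k++) {
    int m = h[k].ival;               // sign-extending 16-bit load (movswl)
    m += h[k + 1].ival;
    i[k].ival -= (l + n * m) >> j;   // uniform, loop-invariant shift amount
  }
}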
~/workarea/llvm-r237614/build/bin/clang++ -std=c++11 -O2 -S 1.cc
The assembly:
.LBB0_19:                               # %for.body
        movswl  -2(%rdi), %ebp
        movswl  (%rdi), %ebx
        addl    %ebp, %ebx
        imull   %edx, %ebx
        addl    %esi, %ebx
        sarl    %cl, %ebx
        movzwl  (%rax), %ebp
        subl    %ebx, %ebp
        movw    %bp, (%rax)
        addq    $2, %rdi
        addq    $2, %rax
        decl    %r12d
        jne     .LBB0_19
For 1.cc, if we adjust SSE2CostTable to be:
  { ISD::SHL, MVT::v8i16, 1 }, // Scalarized.
  { ISD::SHL, MVT::v4i32, 1 }, // We optimized this using mul.
  { ISD::SHL, MVT::v2i64, 1 }, // Scalarized.
  { ISD::SRL, MVT::v8i16, 1 }, // Scalarized.
  { ISD::SRL, MVT::v4i32, 1 }, // Scalarized.
  { ISD::SRL, MVT::v2i64, 1 }, // Scalarized.
  { ISD::SRA, MVT::v8i16, 1 }, // Scalarized.
  { ISD::SRA, MVT::v4i32, 1 }, // Scalarized.
then the kernel loop in 1.cc is vectorized very well. (For LLVM after
r235455, the patch at <a href="http://reviews.llvm.org/D9865">http://reviews.llvm.org/D9865</a> is needed to
generate good vectorization code.)
.LBB0_19:                               # %vector.body
        xorps     %xmm3, %xmm3
        movss     %xmm2, %xmm3          # xmm3 = xmm2[0],xmm3[1,2,3]
        movq      -2(%rdx), %xmm4       # xmm4 = mem[0],zero
        punpcklwd %xmm4, %xmm4          # xmm4 = xmm4[0,0,1,1,2,2,3,3]
        psrad     $16, %xmm4
        movq      (%rdx), %xmm5         # xmm5 = mem[0],zero
        punpcklwd %xmm5, %xmm5          # xmm5 = xmm5[0,0,1,1,2,2,3,3]
        psrad     $16, %xmm5
        paddd     %xmm4, %xmm5
        pshufd    $245, %xmm5, %xmm4    # xmm4 = xmm5[1,1,3,3]
        pmuludq   %xmm0, %xmm5
        pshufd    $232, %xmm5, %xmm5    # xmm5 = xmm5[0,2,2,3]
        pshufd    $245, %xmm0, %xmm6    # xmm6 = xmm0[1,1,3,3]
        pmuludq   %xmm4, %xmm6
        pshufd    $232, %xmm6, %xmm4    # xmm4 = xmm6[0,2,2,3]
        punpckldq %xmm4, %xmm5          # xmm5 = xmm5[0],xmm4[0],xmm5[1],xmm4[1]
        paddd     %xmm1, %xmm5
        psrad     %xmm3, %xmm5
        movq      (%rsi), %xmm3         # xmm3 = mem[0],zero
        punpcklwd %xmm7, %xmm3          # xmm3 = xmm3[0],xmm7[0],xmm3[1],xmm7[1],xmm3[2],xmm7[2],xmm3[3],xmm7[3]
        psubw     %xmm5, %xmm3
        pshuflw   $232, %xmm3, %xmm3    # xmm3 = xmm3[0,2,2,3,4,5,6,7]
        pshufhw   $232, %xmm3, %xmm3    # xmm3 = xmm3[0,1,2,3,4,6,6,7]
        pshufd    $232, %xmm3, %xmm3    # xmm3 = xmm3[0,2,2,3]
        movq      %xmm3, (%rsi)
        addq      $8, %rdx
        addq      $8, %rsi
        addq      $-4, %rdi
        jne       .LBB0_19
The adjustment above improved one of our benchmarks by 4%.
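Note that the loop's shift amount j is loop-invariant, which is exactly the
uniform-count case that psraw/psrad handle in one instruction. If a blanket
cost of 1 is too aggressive in general, a narrower adjustment might key the
cheap cost off the shift count being uniform. A sketch of that idea
(standalone stand-in types of my own, not a patch against X86TTIImpl):

// Sketch only: charge the cheap cost when the shift count is uniform,
// since the SSE2 shift instructions shift all lanes by one shared count.
enum OperandValueKind { OK_AnyValue, OK_UniformValue, OK_UniformConstantValue };

unsigned sse2ShiftCost(OperandValueKind CountKind, unsigned NumElts) {
  if (CountKind == OK_UniformValue || CountKind == OK_UniformConstantValue)
    return 1;          // one psraw/psrad/psrlw/... covers the whole vector
  return NumElts * 10; // non-uniform counts still scalarize on SSE2
}</pre>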
</div>
</p>
<hr>
<span>You are receiving this mail because:</span>
<ul>
<li>You are on the CC list for the bug.</li>
</ul>
</body>
</html>