[LLVMbugs] [Bug 23582] New: The vectorization costs of SHL/SRL/SRA need to be adjusted to vectorize the loop in the testcase
bugzilla-daemon at llvm.org
Tue May 19 15:51:06 PDT 2015
https://llvm.org/bugs/show_bug.cgi?id=23582
Bug ID: 23582
Summary: The vectorization costs of SHL/SRL/SRA need to be
adjusted to vectorize the loop in the testcase
Product: libraries
Version: trunk
Hardware: PC
OS: Linux
Status: NEW
Severity: normal
Priority: P
Component: Loop Optimizer
Assignee: unassignedbugs at nondot.org
Reporter: wmi at google.com
CC: aschwaighofer at apple.com, llvmbugs at cs.uiuc.edu,
nrotem at apple.com
Classification: Unclassified
Created attachment 14344
--> https://llvm.org/bugs/attachment.cgi?id=14344&action=edit
testcase 1.cc
For the attached testcase 1.cc, the kernel loop is not vectorized because the
vectorization costs of SHL/SRL/SRA are set to high values in the SSE2CostTable
in X86TTIImpl::getArithmeticInstrCost:
{ ISD::SHL, MVT::v8i16, 8*10 }, // Scalarized.
{ ISD::SHL, MVT::v4i32, 2*5 }, // We optimized this using mul.
{ ISD::SHL, MVT::v2i64, 2*10 }, // Scalarized.
{ ISD::SRL, MVT::v8i16, 8*10 }, // Scalarized.
{ ISD::SRL, MVT::v4i32, 4*10 }, // Scalarized.
{ ISD::SRL, MVT::v2i64, 2*10 }, // Scalarized.
{ ISD::SRA, MVT::v8i16, 8*10 }, // Scalarized.
{ ISD::SRA, MVT::v4i32, 4*10 }, // Scalarized.
But x86 supports psllw/pslld/psllq, psrlw/psrld/psrlq, and psraw/psrad, so I
don't understand why those costs need to be set so high. The values appear to
be related to
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20130401/170439.html,
but I could not find a testcase there.
The kernel loop in 1.cc:
for (; k < f; k++) {
  m = h[k].ival;
  m += h[k + 1].ival;
  i[k].ival -= (l + n * m) >> j;
}
~/workarea/llvm-r237614/build/bin/clang++ -std=c++11 -O2 -S 1.cc
The assembly:
.LBB0_19: # %for.body
movswl -2(%rdi), %ebp
movswl (%rdi), %ebx
addl %ebp, %ebx
imull %edx, %ebx
addl %esi, %ebx
sarl %cl, %ebx
movzwl (%rax), %ebp
subl %ebx, %ebp
movw %bp, (%rax)
addq $2, %rdi
addq $2, %rax
decl %r12d
jne .LBB0_19
For 1.cc, if we adjust SSE2CostTable to be:
{ ISD::SHL, MVT::v8i16, 1 }, // Scalarized.
{ ISD::SHL, MVT::v4i32, 1 }, // We optimized this using mul.
{ ISD::SHL, MVT::v2i64, 1 }, // Scalarized.
{ ISD::SRL, MVT::v8i16, 1 }, // Scalarized.
{ ISD::SRL, MVT::v4i32, 1 }, // Scalarized.
{ ISD::SRL, MVT::v2i64, 1 }, // Scalarized.
{ ISD::SRA, MVT::v8i16, 1 }, // Scalarized.
{ ISD::SRA, MVT::v4i32, 1 }, // Scalarized.
then the kernel loop in 1.cc is vectorized well. (For LLVM after r235455,
the patch at http://reviews.llvm.org/D9865 is needed to generate the good
vectorized code.)
.LBB0_19: # %vector.body
xorps %xmm3, %xmm3
movss %xmm2, %xmm3 # xmm3 = xmm2[0],xmm3[1,2,3]
movq -2(%rdx), %xmm4 # xmm4 = mem[0],zero
punpcklwd %xmm4, %xmm4 # xmm4 = xmm4[0,0,1,1,2,2,3,3]
psrad $16, %xmm4
movq (%rdx), %xmm5 # xmm5 = mem[0],zero
punpcklwd %xmm5, %xmm5 # xmm5 = xmm5[0,0,1,1,2,2,3,3]
psrad $16, %xmm5
paddd %xmm4, %xmm5
pshufd $245, %xmm5, %xmm4 # xmm4 = xmm5[1,1,3,3]
pmuludq %xmm0, %xmm5
pshufd $232, %xmm5, %xmm5 # xmm5 = xmm5[0,2,2,3]
pshufd $245, %xmm0, %xmm6 # xmm6 = xmm0[1,1,3,3]
pmuludq %xmm4, %xmm6
pshufd $232, %xmm6, %xmm4 # xmm4 = xmm6[0,2,2,3]
punpckldq %xmm4, %xmm5 # xmm5 = xmm5[0],xmm4[0],xmm5[1],xmm4[1]
paddd %xmm1, %xmm5
psrad %xmm3, %xmm5
movq (%rsi), %xmm3 # xmm3 = mem[0],zero
punpcklwd %xmm7, %xmm3 # xmm3 = xmm3[0],xmm7[0],xmm3[1],xmm7[1],xmm3[2],xmm7[2],xmm3[3],xmm7[3]
psubw %xmm5, %xmm3
pshuflw $232, %xmm3, %xmm3 # xmm3 = xmm3[0,2,2,3,4,5,6,7]
pshufhw $232, %xmm3, %xmm3 # xmm3 = xmm3[0,1,2,3,4,6,6,7]
pshufd $232, %xmm3, %xmm3 # xmm3 = xmm3[0,2,2,3]
movq %xmm3, (%rsi)
addq $8, %rdx
addq $8, %rsi
addq $-4, %rdi
jne .LBB0_19
The adjustment above improves one of our benchmarks by 4%.