[LLVMbugs] [Bug 23582] New: The vectorization costs of SHL/SRL/SRA need to be adjusted to vectorize the loop in the testcase

bugzilla-daemon at llvm.org
Tue May 19 15:51:06 PDT 2015


https://llvm.org/bugs/show_bug.cgi?id=23582

            Bug ID: 23582
           Summary: The vectorization costs of SHL/SRL/SRA need to be
                    adjusted to vectorize the loop in the testcase
           Product: libraries
           Version: trunk
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P
         Component: Loop Optimizer
          Assignee: unassignedbugs at nondot.org
          Reporter: wmi at google.com
                CC: aschwaighofer at apple.com, llvmbugs at cs.uiuc.edu,
                    nrotem at apple.com
    Classification: Unclassified

Created attachment 14344
  --> https://llvm.org/bugs/attachment.cgi?id=14344&action=edit
testcase 1.cc

For the attached testcase 1.cc, the kernel loop is not vectorized because the
vectorization costs of SHL/SRL/SRA are set to high values in the SSE2CostTable
in X86TTIImpl::getArithmeticInstrCost:

    { ISD::SHL,  MVT::v8i16,  8*10 }, // Scalarized.
    { ISD::SHL,  MVT::v4i32,  2*5 }, // We optimized this using mul.
    { ISD::SHL,  MVT::v2i64,  2*10 }, // Scalarized.

    { ISD::SRL,  MVT::v8i16,  8*10 }, // Scalarized.
    { ISD::SRL,  MVT::v4i32,  4*10 }, // Scalarized.
    { ISD::SRL,  MVT::v2i64,  2*10 }, // Scalarized.

    { ISD::SRA,  MVT::v8i16,  8*10 }, // Scalarized.
    { ISD::SRA,  MVT::v4i32,  4*10 }, // Scalarized.
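
For reference, a simplified, self-contained model of how this table is
consulted (the real code uses LLVM's CostTblEntry/CostTableLookup helpers
inside X86TTIImpl::getArithmeticInstrCost; the names below are illustrative
only). A large tabulated cost makes the vectorized loop look more expensive
than its scalar form, so the loop vectorizer gives up:

    enum class Opc { SHL, SRL, SRA };
    enum class VT  { v8i16, v4i32, v2i64 };

    struct CostTblEntry { Opc ISD; VT Type; unsigned Cost; };

    static const CostTblEntry SSE2CostTable[] = {
      { Opc::SHL, VT::v8i16, 8*10 }, { Opc::SHL, VT::v4i32, 2*5  },
      { Opc::SHL, VT::v2i64, 2*10 }, { Opc::SRL, VT::v8i16, 8*10 },
      { Opc::SRL, VT::v4i32, 4*10 }, { Opc::SRL, VT::v2i64, 2*10 },
      { Opc::SRA, VT::v8i16, 8*10 }, { Opc::SRA, VT::v4i32, 4*10 },
    };

    // Returns the tabulated cost, or -1 when there is no entry (the
    // caller then falls back to a generic default cost).
    int lookupShiftCost(Opc ISD, VT Ty) {
      for (const CostTblEntry &E : SSE2CostTable)
        if (E.ISD == ISD && E.Type == Ty)
          return (int)E.Cost;
      return -1;
    }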

But x86 supports psllw/pslld/psllq, psrlw/psrld/psrlq, and psraw/psrad, so I
don't understand why those costs need to be set so high. The values appear to
go back to
http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20130401/170439.html,
but I could not find a testcase there.
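
To make that concrete: a uniform shift amount (the same count for every lane,
as with the loop-invariant ">> j" in the kernel below) maps to a single
hardware shift. A small SSE2 intrinsics sketch (the helper names are mine):

    #include <emmintrin.h> // SSE2 intrinsics

    // psrad with an immediate count: all four i32 lanes shifted at once.
    __m128i sra_by_imm(__m128i v) {
      return _mm_srai_epi32(v, 3);
    }

    // psrad with a run-time (but still uniform) count taken from the low
    // bits of an XMM register.
    __m128i sra_by_var(__m128i v, int count) {
      return _mm_sra_epi32(v, _mm_cvtsi32_si128(count));
    }

What SSE2 genuinely lacks is a per-lane variable shift (that only arrives with
AVX2's vpsravd and friends), which is presumably what these table entries were
pessimizing; but that penalty should not apply to the uniform case.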

The kernel loop in 1.cc:
        for (; k < f; k++) {
          m = h[k].ival;
          m += h[k + 1].ival;
          i[k].ival -= (l + n * m) >> j;
        }
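
Since the attachment is not inlined here, a minimal standalone harness
consistent with this loop and with the scalar assembly below would look
roughly as follows; all declarations are guesses, not the actual contents of
1.cc:

    // Hypothetical reconstruction, not the attached 1.cc. Types are
    // inferred from the assembly (movswl/movzwl/movw suggest 16-bit ival
    // fields; sarl %cl suggests a variable shift count).
    struct Elem { short ival; };

    void kernel(Elem *h, Elem *i, int k, int f, int l, int n, int j) {
      int m;
      for (; k < f; k++) {
        m = h[k].ival;
        m += h[k + 1].ival;
        i[k].ival -= (l + n * m) >> j;
      }
    }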

~/workarea/llvm-r237614/build/bin/clang++ -std=c++11 -O2 -S 1.cc
The assembly:
.LBB0_19:                               # %for.body
        movswl  -2(%rdi), %ebp
        movswl  (%rdi), %ebx
        addl    %ebp, %ebx
        imull   %edx, %ebx
        addl    %esi, %ebx
        sarl    %cl, %ebx
        movzwl  (%rax), %ebp
        subl    %ebx, %ebp
        movw    %bp, (%rax)
        addq    $2, %rdi
        addq    $2, %rax
        decl    %r12d
        jne     .LBB0_19
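
As an aside, the vectorizer's decision can be inspected directly with clang's
optimization-remark flags, which report why a loop is left scalar:

    clang++ -std=c++11 -O2 -Rpass-missed=loop-vectorize -Rpass-analysis=loop-vectorize -S 1.cc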

For 1.cc, if we adjust the SSE2CostTable entries to:
    { ISD::SHL,  MVT::v8i16,  1 },
    { ISD::SHL,  MVT::v4i32,  1 },
    { ISD::SHL,  MVT::v2i64,  1 },

    { ISD::SRL,  MVT::v8i16,  1 },
    { ISD::SRL,  MVT::v4i32,  1 },
    { ISD::SRL,  MVT::v2i64,  1 },

    { ISD::SRA,  MVT::v8i16,  1 },
    { ISD::SRA,  MVT::v4i32,  1 },

then the kernel loop in 1.cc is vectorized well. (For LLVM after r235455, the
patch at http://reviews.llvm.org/D9865 is also needed to generate good
vectorized code.)

.LBB0_19:                               # %vector.body
        xorps   %xmm3, %xmm3
        movss   %xmm2, %xmm3            # xmm3 = xmm2[0],xmm3[1,2,3]
        movq    -2(%rdx), %xmm4         # xmm4 = mem[0],zero
        punpcklwd       %xmm4, %xmm4    # xmm4 = xmm4[0,0,1,1,2,2,3,3]
        psrad   $16, %xmm4
        movq    (%rdx), %xmm5           # xmm5 = mem[0],zero
        punpcklwd       %xmm5, %xmm5    # xmm5 = xmm5[0,0,1,1,2,2,3,3]
        psrad   $16, %xmm5
        paddd   %xmm4, %xmm5
        pshufd  $245, %xmm5, %xmm4      # xmm4 = xmm5[1,1,3,3]
        pmuludq %xmm0, %xmm5
        pshufd  $232, %xmm5, %xmm5      # xmm5 = xmm5[0,2,2,3]
        pshufd  $245, %xmm0, %xmm6      # xmm6 = xmm0[1,1,3,3]
        pmuludq %xmm4, %xmm6
        pshufd  $232, %xmm6, %xmm4      # xmm4 = xmm6[0,2,2,3]
        punpckldq       %xmm4, %xmm5    # xmm5 = xmm5[0],xmm4[0],xmm5[1],xmm4[1]
        paddd   %xmm1, %xmm5
        psrad   %xmm3, %xmm5
        movq    (%rsi), %xmm3           # xmm3 = mem[0],zero
        punpcklwd       %xmm7, %xmm3    # xmm3 = xmm3[0],xmm7[0],xmm3[1],xmm7[1],xmm3[2],xmm7[2],xmm3[3],xmm7[3]
        psubw   %xmm5, %xmm3
        pshuflw $232, %xmm3, %xmm3      # xmm3 = xmm3[0,2,2,3,4,5,6,7]
        pshufhw $232, %xmm3, %xmm3      # xmm3 = xmm3[0,1,2,3,4,6,6,7]
        pshufd  $232, %xmm3, %xmm3      # xmm3 = xmm3[0,2,2,3]
        movq    %xmm3, (%rsi)
        addq    $8, %rdx
        addq    $8, %rsi
        addq    $-4, %rdi
        jne     .LBB0_19

The adjustment above improved one of our benchmarks by 4%.
