[LLVMbugs] [Bug 11775] New: [AVX] opportunity for better code by transforming vec w/one element used to scalar

Mon Jan 16 16:36:10 PST 2012

http://llvm.org/bugs/show_bug.cgi?id=11775

             Bug #: 11775
           Summary: [AVX] opportunity for better code by transforming vec
                    w/one element used to scalar
           Product: new-bugs
           Version: trunk
          Platform: PC
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P
         Component: new bugs
        AssignedTo: unassignedbugs at nondot.org
        ReportedBy: matt at pharr.org
                CC: llvmbugs at cs.uiuc.edu
    Classification: Unclassified

Created attachment 7887
  --> http://llvm.org/bugs/attachment.cgi?id=7887
examples

The attached test case has two versions of a loop over float values in memory,
where each time through the loop an 8-wide vector of floats is loaded and added
to an accumulated <8 x float> sum, which the function returns.

In the first version, foo(), the %iter_val42342 value is a vector that has the
value <0,1,2,3,4,5,6,7> the first time through the loop, <8,9,10,...> the
second time through, and so forth.  As it turns out, this value is used only in
an extractelement instruction, the result of which is used to index into the
array of floats.

Here is the generated code for the loop body (with top of tree, llc
-mattr=+avx):

LBB0_1:                                 ## %foreach_full_body
                                        ## =>This Inner Loop Header: Depth=1
    addl    $8, %ecx
    vmovd    %ecx, %xmm3
    vinsertf128    $1, %xmm3, %ymm3, %ymm3
    vpermilps    $0, %ymm3, %ymm3 ## ymm3 = ymm3[0,0,0,0,4,4,4,4]
    vmovd    %xmm2, %edx
    shll    $2, %edx
    movslq    %edx, %rdx
    vmovups    (%rdi,%rdx), %ymm2
    vaddps    %ymm2, %ymm0, %ymm0
    vextractf128    $1, %ymm3, %xmm2
    vextractf128    $1, %ymm1, %xmm4
    vpaddd    %xmm4, %xmm2, %xmm2
    vpaddd    %xmm1, %xmm3, %xmm3
    cmpl    %eax, %ecx
    vinsertf128    $1, %xmm2, %ymm3, %ymm2
    jl    LBB0_1

The code does all of the work needed to maintain every element of the vector,
even though only one element is needed.  This is doubly painful with AVX, which
has only 4-wide integer instructions, so each 8-wide integer add is split into
two 128-bit halves and reassembled.  It also inhibits other optimizations.

In the bar() function in the attached test case, I've manually replaced this
vector with a scalar value.  The resulting code is much nicer:

LBB1_1:                                 ## %foreach_full_body
                                        ## =>This Inner Loop Header: Depth=1
    movslq    %ecx, %rcx
    vmovups    (%rdi,%rcx), %ymm1
    vaddps    %ymm1, %ymm0, %ymm0
    addl    $32, %ecx
    addl    $8, %edx
    cmpl    %eax, %edx
    jl    LBB1_1

This suggests that it might be worthwhile to look for computations on vectors
where only one of the elements is ever used, and to lower them to the
corresponding scalar computation where possible.
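
For readers without the attachment, here is a rough C sketch of the two loop
shapes described above.  It is not the attached test case: the function names
sum_vec_index and sum_scalar_index, the WIDTH macro, and the exact way the
index vector is updated are assumptions made for illustration, and n is
assumed to be a multiple of 8.  The point is only that a lane-wise add on a
vector whose lane 0 is the sole lane read can be replaced by a scalar add.

```c
#define WIDTH 8

/* foo()-like shape (hypothetical reconstruction): an 8-lane index vector is
 * carried across iterations, but only lane 0 is ever read. */
void sum_vec_index(const float *data, int n, float sum[WIDTH]) {
    int idx[WIDTH];
    for (int l = 0; l < WIDTH; ++l) {
        idx[l] = l;          /* <0,1,2,3,4,5,6,7> on the first iteration */
        sum[l] = 0.0f;
    }
    for (int counter = 0; counter < n; counter += WIDTH) {
        int base = idx[0];   /* plays the role of the extractelement */
        for (int l = 0; l < WIDTH; ++l)
            sum[l] += data[base + l];
        for (int l = 0; l < WIDTH; ++l)
            idx[l] += WIDTH; /* every lane is updated, seven of them for nothing */
    }
}

/* bar()-like shape (hypothetical reconstruction): the index vector has been
 * replaced by the one scalar that was actually used. */
void sum_scalar_index(const float *data, int n, float sum[WIDTH]) {
    for (int l = 0; l < WIDTH; ++l)
        sum[l] = 0.0f;
    int base = 0;
    for (int counter = 0; counter < n; counter += WIDTH) {
        for (int l = 0; l < WIDTH; ++l)
            sum[l] += data[base + l];
        base += WIDTH;       /* scalar add instead of an 8-wide integer add */
    }
}
```

Both functions compute the same sums; the second simply never materializes
the unused lanes, which is the rewrite the optimizer would ideally perform.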
