[LLVMbugs] [Bug 8429] New: vectorized udiv/urem with constant pot-divisor are scalarized

Thu Oct 21 08:34:42 PDT 2010

http://llvm.org/bugs/show_bug.cgi?id=8429

           Summary: vectorized udiv/urem with constant pot-divisor are
                    scalarized
           Product: libraries
           Version: 2.8
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: enhancement
          Priority: P
         Component: Backend: X86
        AssignedTo: unassignedbugs at nondot.org
        ReportedBy: sroland at vmware.com
                CC: llvmbugs at cs.uiuc.edu

Consider this function:

define <4 x i32> @udiv_vec(<4 x i32> %var) {
entry:
  %0 = udiv <4 x i32> %var, <i32 16, i32 16, i32 16, i32 16>
  ret <4 x i32> %0
}

LLVM 2.8 produces this on x86_64 with SSE4.1 (with only SSE2 it gets worse due
to the lack of pextrd):
        pextrd  $1, %xmm0, %eax
        shrl    $4, %eax
        movd    %xmm0, %ecx
        shrl    $4, %ecx
        movd    %ecx, %xmm1
        pinsrd  $1, %eax, %xmm1
        pextrd  $2, %xmm0, %eax
        shrl    $4, %eax
        pinsrd  $2, %eax, %xmm1
        pextrd  $3, %xmm0, %eax
        shrl    $4, %eax
        movdqa  %xmm1, %xmm0
        pinsrd  $3, %eax, %xmm0
        ret

But if the divisor is not only a power of two but also the same for all 4
elements, as is the case here, a single vector shift would obviously be
preferred:

        psrld   $4, %xmm0
        ret
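
For illustration, the equivalence behind that single shift can be sketched in
portable C. This is only a model of the lane-wise semantics, not compiler
output: the type v4u32 and the functions udiv16_ref and udiv16_shift are
hypothetical names made up for this sketch.

```c
#include <stdint.h>

/* Hypothetical model of a <4 x i32> value: four independent 32-bit lanes. */
typedef struct { uint32_t lane[4]; } v4u32;

/* Reference semantics: unsigned division of every lane by the constant 16. */
static v4u32 udiv16_ref(v4u32 v) {
    for (int i = 0; i < 4; i++)
        v.lane[i] /= 16u;
    return v;
}

/* Proposed lowering: logical shift right of every lane by log2(16) = 4,
   which is the per-lane operation that psrld $4 performs. */
static v4u32 udiv16_shift(v4u32 v) {
    for (int i = 0; i < 4; i++)
        v.lane[i] >>= 4;
    return v;
}
```

Both functions return the same lanes for any input, since unsigned division by
2^k and a logical right shift by k are the same operation.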

The same applies to urem, though that one would also require loading the mask
constant into an xmm register. I guess the same applies to <8 x i16> values
(though I did not test that) and to <16 x i8>. Due to the lack of byte shifts
the latter would require some more work, but in any case I think it would be
much cheaper than doing extract/shift/insert for each of the 16 elements
individually.
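
As a rough sketch of both ideas in portable C (again a model, not generated
code): urem by 16 reduces to a mask with 15, and a byte-wise udiv by 16 can be
emulated with a 16-bit shift followed by a mask, roughly what a psrlw plus a
pand with a constant would compute. The function names urem16_mask and
udiv16_two_bytes are hypothetical, and packing two i8 lanes into one uint16_t
is an assumption made for this illustration.

```c
#include <stdint.h>

/* urem by 16 on one 32-bit lane: keep only the low log2(16) = 4 bits. */
static uint32_t urem16_mask(uint32_t x) {
    return x & 15u;
}

/* Two adjacent i8 lanes packed into one 16-bit lane (low byte = first lane).
   A 16-bit shift right by 4 moves the low nibble of the high byte into the
   top of the low byte; the mask 0x0F0F clears those stray bits, leaving
   (hi / 16) in the high byte and (lo / 16) in the low byte. */
static uint16_t udiv16_two_bytes(uint16_t pair) {
    return (uint16_t)((pair >> 4) & 0x0F0Fu);
}
```

The mask is the extra work that the missing byte shift costs: one additional
constant and one additional vector operation for all 16 elements, instead of
an extract/shift/insert sequence per element.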
