[LLVMbugs] [Bug 15524] New: vector truncation generates pretty terrible code without ssse3
bugzilla-daemon at llvm.org
Fri Mar 15 09:51:31 PDT 2013
http://llvm.org/bugs/show_bug.cgi?id=15524
Bug ID: 15524
Summary: vector truncation generates pretty terrible code without ssse3
Product: libraries
Version: trunk
Hardware: PC
OS: Linux
Status: NEW
Severity: enhancement
Priority: P
Component: Backend: X86
Assignee: unassignedbugs at nondot.org
Reporter: sroland at vmware.com
CC: llvmbugs at cs.uiuc.edu
Classification: Unclassified
If pshufb (which requires ssse3) isn't available, "common" vector truncations
generate pretty terrible code, typically doing scalar element extracts/inserts
instead of using shuffles.
E.g. this:
define i64 @trunc(<4 x i32> %inval) {
entry:
%0 = trunc <4 x i32> %inval to <4 x i16>
%1 = bitcast <4 x i16> %0 to i64
ret i64 %1
}
generates
pextrw $4, %xmm0, %ecx
pextrw $6, %xmm0, %eax
movlhps %xmm0, %xmm0 # xmm0 = xmm0[0,0]
pshuflw $8, %xmm0, %xmm0 # xmm0 = xmm0[0,2,0,0,4,5,6,7]
pinsrw $2, %ecx, %xmm0
pinsrw $3, %eax, %xmm0
movd %xmm0, %rax
ret
(and don't ask me what the "movlhps" is even doing there, as no one cares about
the upper 64 bits). If ssse3 is available, this works fine (a single pshufb
instruction).
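For reference, the ssse3 version expressed with intrinsics is roughly the
following (just a sketch to show the shuffle mask; the function name is made up
and this isn't literally what the backend emits):

#include <tmmintrin.h>  /* SSSE3 */
#include <stdint.h>

/* 4 x i32 -> 4 x i16 truncation as a single byte shuffle: pick the low
 * two bytes of each dword into the low 64 bits; mask bytes with the high
 * bit set (-1 here) zero the corresponding result byte. */
static inline uint64_t trunc_4x32_to_4x16_ssse3(__m128i v)
{
    const __m128i mask = _mm_setr_epi8(0, 1, 4, 5, 8, 9, 12, 13,
                                       -1, -1, -1, -1, -1, -1, -1, -1);
    v = _mm_shuffle_epi8(v, mask);          /* pshufb */
    return (uint64_t)_mm_cvtsi128_si64(v);  /* movq %xmm0, %rax */
}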
However, there is really no need to go vector->scalar->vector at all; this can
be done trivially with 3 shuffles using only sse2:
pshuflw $8, %xmm0, %xmm0
pshufhw $8, %xmm0, %xmm0
pshufd $8, %xmm0, %xmm0
movd %xmm0, %rax
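Written with sse2 intrinsics, that sequence is simply (a sketch, function name
made up; the 0x08 immediate selects elements 0,2,0,0):

#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* 4 x i32 -> 4 x i16 truncation with three shuffles: move the low word of
 * each dword into the even word slots of each half, then pull the two
 * relevant dwords together into the low 64 bits. */
static inline uint64_t trunc_4x32_to_4x16_sse2(__m128i v)
{
    v = _mm_shufflelo_epi16(v, 0x08);       /* pshuflw $8 */
    v = _mm_shufflehi_epi16(v, 0x08);       /* pshufhw $8 */
    v = _mm_shuffle_epi32(v, 0x08);         /* pshufd $8 */
    return (uint64_t)_mm_cvtsi128_si64(v);  /* movd %xmm0, %rax */
}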
Even worse (WAY worse) is the same with 16bit->8bit:
define i64 @trunc(<8 x i16> %inval) {
entry:
%0 = trunc <8 x i16> %inval to <8 x i8>
%1 = bitcast <8 x i8> %0 to i64
ret i64 %1
}
pextrw $3, %xmm0, %ecx
shll $8, %ecx
pextrw $2, %xmm0, %eax
movzbl %al, %eax
orl %ecx, %eax
pextrw $1, %xmm0, %ecx
shll $8, %ecx
movd %xmm0, %edx
movzbl %dl, %edx
orl %ecx, %edx
movdqa %xmm0, %xmm1
pinsrw $0, %edx, %xmm1
pinsrw $1, %eax, %xmm1
pextrw $5, %xmm0, %eax
shll $8, %eax
pextrw $4, %xmm0, %ecx
movzbl %cl, %ecx
orl %eax, %ecx
pinsrw $2, %ecx, %xmm1
pextrw $7, %xmm0, %eax
shll $8, %eax
pextrw $6, %xmm0, %ecx
movzbl %cl, %ecx
orl %eax, %ecx
pinsrw $3, %ecx, %xmm1
movd %xmm1, %rax
ret
While we don't have byte shuffles here, this could be emulated with and/shift/or
followed by the same shuffle sequence as in the 32bit->16bit case above.
However, that is still too complicated; an optimal version would just do
(obviously that's not real code, but you get the idea):
pand %xmm0, <8 x 0x00ff>
packuswb %xmm0, %xmm0 (second source can be anything)
movd %xmm0, %rax
(we can't use this trick for 32bit->16bit because there's no unsigned
dword->word pack without sse41)
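As an sse2 intrinsics sketch (again just illustrating the idea, names made up):
the pand clears the high byte of every word, so packuswb can't saturate and the
low bytes land directly in the low 64 bits:

#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* 8 x i16 -> 8 x i8 truncation: mask off the high bytes so the unsigned
 * pack can't saturate, then pack and move the low 64 bits to a GPR. */
static inline uint64_t trunc_8x16_to_8x8_sse2(__m128i v)
{
    v = _mm_and_si128(v, _mm_set1_epi16(0x00ff)); /* pand with <8 x 0x00ff> */
    v = _mm_packus_epi16(v, v);                   /* packuswb; second source is don't-care */
    return (uint64_t)_mm_cvtsi128_si64(v);        /* movd %xmm0, %rax */
}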
That is probably at least an order of magnitude faster...
Granted, it's only a problem if there's no ssse3, but some fairly recent CPUs
lack it (e.g. AMD Barcelona).