<html>

    <head>

      <base href="http://llvm.org/bugs/" />

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW --- - vector truncation generates pretty terrible code without ssse3"

   href="http://llvm.org/bugs/show_bug.cgi?id=15524">15524</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>vector truncation generates pretty terrible code without ssse3

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>libraries

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>Linux

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>enhancement

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>Backend: X86

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>sroland@vmware.com

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>llvmbugs@cs.uiuc.edu

          </td>

        </tr>

        <tr>

          <th>Classification</th>

          <td>Unclassified

          </td>

        </tr></table>

      <p>

        <div>

<pre>Without pshufb (which requires SSSE3), "common" vector truncations

generate pretty terrible code: the backend generally falls back to element

extracts/inserts instead of using shuffles.

E.g. this:

define i64 @trunc(<4 x i32> %inval) {

entry:

  %0 = trunc <4 x i32> %inval to <4 x i16>

  %1 = bitcast <4 x i16> %0 to i64

  ret i64 %1

}

generates

        pextrw  $4, %xmm0, %ecx

        pextrw  $6, %xmm0, %eax

        movlhps %xmm0, %xmm0            # xmm0 = xmm0[0,0]

        pshuflw $8, %xmm0, %xmm0        # xmm0 = xmm0[0,2,0,0,4,5,6,7]

        pinsrw  $2, %ecx, %xmm0

        pinsrw  $3, %eax, %xmm0

        movd    %xmm0, %rax

        ret

(and don't ask me what the "movlhps" is even doing there, as no one cares about

the upper 64 bits). If SSSE3 is available, this works OK (a single pshufb

instruction).

However, there is really no need at all to go vector->scalar->vector; it can be

trivially done with 3 shuffles using only SSE2:

       pshuflw $8, %xmm0, %xmm0

       pshufhw $8, %xmm0, %xmm0

       pshufd  $8, %xmm0, %xmm0

       movd    %xmm0, %rax
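
For anyone who wants to check the shuffle immediates, here is a minimal C sketch
of the same three-shuffle sequence using SSE2 intrinsics. The function name
trunc_v4i32_to_v4i16 is made up for illustration, and the sketch assumes an
x86-64 target (for _mm_cvtsi128_si64); it is not what the backend emits today.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: truncate <4 x i32> to <4 x i16> and return the
 * four 16-bit results packed into the low 64 bits. */
static uint64_t trunc_v4i32_to_v4i16(__m128i v)
{
    /* pshuflw $8: low words become [w0, w2, w0, w0] */
    v = _mm_shufflelo_epi16(v, 0x08);
    /* pshufhw $8: high words become [w4, w6, w4, w4] */
    v = _mm_shufflehi_epi16(v, 0x08);
    /* pshufd $8: dwords become [d0, d2, d0, d0], so the low 64 bits
     * now hold w0, w2, w4, w6 */
    v = _mm_shuffle_epi32(v, 0x08);
    /* movd/movq: move the low 64 bits to a general-purpose register */
    return (uint64_t)_mm_cvtsi128_si64(v);
}
```

Each shuffle only has to get two elements into the right place; the remaining
lanes are don't-cares because only the low 64 bits are returned.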

Even worse (WAY worse) is the same with 16bit->8bit:

define i64 @trunc(<8 x i16> %inval) {

entry:

  %0 = trunc <8 x i16> %inval to <8 x i8>

  %1 = bitcast <8 x i8> %0 to i64

  ret i64 %1

}

generates

        pextrw  $3, %xmm0, %ecx

        shll    $8, %ecx

        pextrw  $2, %xmm0, %eax

        movzbl  %al, %eax

        orl     %ecx, %eax

        pextrw  $1, %xmm0, %ecx

        shll    $8, %ecx

        movd    %xmm0, %edx

        movzbl  %dl, %edx

        orl     %ecx, %edx

        movdqa  %xmm0, %xmm1

        pinsrw  $0, %edx, %xmm1

        pinsrw  $1, %eax, %xmm1

        pextrw  $5, %xmm0, %eax

        shll    $8, %eax

        pextrw  $4, %xmm0, %ecx

        movzbl  %cl, %ecx

        orl     %eax, %ecx

        pinsrw  $2, %ecx, %xmm1

        pextrw  $7, %xmm0, %eax

        shll    $8, %eax

        pextrw  $6, %xmm0, %ecx

        movzbl  %cl, %ecx

        orl     %eax, %ecx

        pinsrw  $3, %ecx, %xmm1

        movd    %xmm1, %rax

        ret

While we don't have byte shuffles here, this could be emulated with and/shift/or

followed by the same shuffle sequence as the 32bit->16bit case above.

However, this is still too complicated, and an optimal version would just do

(obviously that's not real code, but you get the idea):

       pand     %xmm0, <8 x 0x00ff>

       packuswb %xmm0, %xmm0 (second source can be anything)

       movd     %xmm0, %rax

(we can't use this trick for 32bit->16bit because there is no unsigned

dword->word pack (packusdw) without SSE4.1)

That is probably at least an order of magnitude faster...
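
The pand/packuswb idea above can be sketched in C with SSE2 intrinsics as
follows. The function name trunc_v8i16_to_v8i8 is made up for illustration, and
the sketch assumes an x86-64 target (for _mm_cvtsi128_si64).

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: truncate <8 x i16> to <8 x i8> and return the
 * eight bytes packed into the low 64 bits. */
static uint64_t trunc_v8i16_to_v8i8(__m128i v)
{
    /* pand: clear the high byte of every word so each value is in
     * 0..255 and the saturating pack below cannot clamp anything */
    v = _mm_and_si128(v, _mm_set1_epi16(0x00ff));
    /* packuswb: pack words to unsigned bytes; the second source only
     * fills the upper 64 bits, which are ignored here */
    v = _mm_packus_epi16(v, v);
    /* movd/movq: move the low 64 bits to a general-purpose register */
    return (uint64_t)_mm_cvtsi128_si64(v);
}
```

The mask is what turns the saturating pack into a plain truncation: without it,
any word above 0x00ff would be clamped to 0xff instead of keeping its low byte.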

Granted, it's only a problem if there's no SSSE3, but some fairly recent CPUs

lack it (e.g. AMD Barcelona).</pre>

        </div>

      </p>


    </body>

</html>