<html>

    <head>

      <base href="https://llvm.org/bugs/" />

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW --- - Terrible shuffle lowering for zip of two i8 values (all backends)"

   href="https://llvm.org/bugs/show_bug.cgi?id=31301">31301</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>Terrible shuffle lowering for zip of two i8 values (all backends)

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>libraries

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>Windows NT

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>normal

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>Backend: ARM

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>efriedma@codeaurora.org

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com

          </td>

        </tr>

        <tr>

          <th>Classification</th>

          <td>Unclassified

          </td>

        </tr></table>

      <p>

        <div>

        <pre>C testcase for ARM:

#include <arm_neon.h>

uint8x8_t f(char* x, char *y)

{

  return vzip_u8(vld1_dup_u8(x), vld1_dup_u8(y)).val[0];

}

IR testcase:

define <8 x i8> @vdup_zip(i8* nocapture readonly %x, i8* nocapture readonly %y)

 {

entry:

  %0 = load i8, i8* %x, align 1

  %1 = insertelement <8 x i8> undef, i8 %0, i32 0

  %lane = shufflevector <8 x i8> %1, <8 x i8> undef, <8 x i32> <i32 0, i32 0,

i32 0, i32 0, i32 undef, i32 undef, i32 undef, i32 undef>

  %2 = load i8, i8* %y, align 1

  %3 = insertelement <8 x i8> undef, i8 %2, i32 0

  %lane3 = shufflevector <8 x i8> %3, <8 x i8> undef, <8 x i32> <i32 0, i32 0,

i32 0, i32 0, i32 undef, i32 undef, i32 undef, i32 undef>

  %vzip.i = shufflevector <8 x i8> %lane, <8 x i8> %lane3, <8 x i32> <i32 0,

i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11>

  ret <8 x i8> %vzip.i

}

IR looks fine.  CodeGen gives:

        ldrb    r0, [r0]

        ldrb    r1, [r1]

        vmov.8  d16[0], r0

        vmov.8  d16[1], r1

        vmov.8  d16[2], r0

        vmov.8  d16[3], r1

        vmov.8  d16[4], r0

        vmov.8  d16[5], r1

        vmov.8  d16[6], r0

        vmov.8  d16[7], r1

        vmov    r0, r1, d16

        bx      lr

i.e. we've managed to blow up a simple three-instruction NEON sequence (two

dup loads and a zip) into ten instructions (two ldrb plus eight vmov.8).
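
For reference, here is a minimal scalar C model of what the testcase computes.
It is only a sketch of the lane semantics: dup8, zip_lo8 and f_model are
hypothetical helper names made up for illustration, not NEON intrinsics.

```c
#include <stdint.h>

/* Scalar model of vld1_dup_u8: broadcast one loaded byte to all 8 lanes. */
static void dup8(const uint8_t *p, uint8_t out[8])
{
  for (int i = 0; i < 8; i++)
    out[i] = *p;
}

/* Scalar model of vzip_u8(a, b).val[0]: interleave the low halves of a and b,
   i.e. the shuffle mask <0, 8, 1, 9, 2, 10, 3, 11> from the IR testcase. */
static void zip_lo8(const uint8_t a[8], const uint8_t b[8], uint8_t out[8])
{
  for (int i = 0; i < 4; i++) {
    out[2 * i] = a[i];
    out[2 * i + 1] = b[i];
  }
}

/* Model of f(x, y): the result lanes alternate *x, *y, *x, *y, ... */
static void f_model(const uint8_t *x, const uint8_t *y, uint8_t out[8])
{
  uint8_t a[8], b[8];
  dup8(x, a);
  dup8(y, b);
  zip_lo8(a, b, out);
}
```

The result only ever depends on the two loaded bytes, which is why the
two-dups-plus-zip form is the natural lowering.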

Slight variant for testing on architectures which have `16 x i8`, not `8 x i8`:

define <16 x i8> @vdup_zip(i8* nocapture readonly %x, i8* nocapture readonly

%y)  {

entry:

  %0 = load i8, i8* %x, align 1

  %1 = insertelement <16 x i8> undef, i8 %0, i32 0

  %lane = shufflevector <16 x i8> %1, <16 x i8> undef, <16 x i32> <i32 0, i32

0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 undef, i32 undef, i32 undef,

i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>

  %2 = load i8, i8* %y, align 1

  %3 = insertelement <16 x i8> undef, i8 %2, i32 0

  %lane3 = shufflevector <16 x i8> %3, <16 x i8> undef, <16 x i32> <i32 0, i32

0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 undef, i32 undef, i32 undef,

i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>

  %vzip.i = shufflevector <16 x i8> %lane, <16 x i8> %lane3, <16 x i32> <i32 0,

i32 16, i32 1, i32 17, i32 2, i32 18, i32 3, i32 19, i32 4, i32 20, i32 5, i32

21, i32 6, i32 22, i32 7, i32 23>

  ret <16 x i8> %vzip.i

}

It looks like DAGCombine turns the IR into a BUILD_VECTOR, and the ARM backend

can't recover the shape.  Actually, it looks like every backend fails to

produce the obvious lowering: AArch64 generates a sequence of ins instructions,

x86 generates a bunch of vpinsrb instructions, and SystemZ generates a sequence

of vlvgb instructions.  PowerPC at least manages to generate a shuffle, but it

emits two extra instructions because it doesn't pick the right shuffle.
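
To make "recover the shape" concrete: after the combine, the BUILD_VECTOR
operands are just the two loaded scalars alternating (x, y, x, y, ...).  Below
is a minimal sketch of a check for that pattern, written as plain C rather
than SelectionDAG code.  The function name is_zip_of_two_splats and the
integer operand ids are assumptions made for illustration; none of this is
LLVM API.

```c
#include <stdbool.h>
#include <stddef.h>

/* ops[i] is an opaque id for the scalar feeding lane i; a negative id means
   the lane is undef.  Returns true if all defined even lanes use one scalar
   and all defined odd lanes use one scalar, and reports both ids. */
static bool is_zip_of_two_splats(const int *ops, size_t n, int *even, int *odd)
{
  int e = -1, o = -1;

  if (n == 0 || n % 2 != 0)
    return false;

  for (size_t i = 0; i < n; i++) {
    int *slot = (i % 2 == 0) ? &e : &o;
    if (ops[i] < 0)
      continue;              /* an undef lane matches anything */
    if (*slot < 0)
      *slot = ops[i];        /* first defined lane of this parity */
    else if (*slot != ops[i])
      return false;          /* a third scalar, or the wrong parity */
  }

  if (e < 0 || o < 0)
    return false;            /* one parity is entirely undef */

  *even = e;
  *odd = o;
  return true;
}
```

On a match, a target could in principle emit two splats and a zip instead of
one insert per lane; whether such a check belongs in DAGCombine or in each
backend's BUILD_VECTOR lowering is left open here.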

I'm not exactly sure what the right solution looks like here; maybe we can do

something more helpful on a target-independent level than just throwing away

the shuffles and creating a BUILD_VECTOR?</pre>

        </div>

      </p>


    </body>

</html>