<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - vector-compare code does unnecessary widening/narrowing"

   href="https://bugs.llvm.org/show_bug.cgi?id=38916">38916</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>vector-compare code does unnecessary widening/narrowing

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>new-bugs

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>All

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>All

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>normal

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>new bugs

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>srj@google.com

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>llvm-bugs@lists.llvm.org

          </td>

        </tr></table>

      <p>

        <div>

        <pre>The change in <a href="https://reviews.llvm.org/rL339875">https://reviews.llvm.org/rL339875</a> seems to have regressed the

quality of some vector-compare code generation (for x86 at least) in Halide.

Halide is attempting to generate code that compares two <8 x i8> vectors:

```

    vec<8 x i8> ones = {1,1,1,1,1,1,1,1};

    vec<8 x i8> twos = {2,2,2,2,2,2,2,2};

    vec<8 x i8> a = load_vec_a();

    vec<8 x i8> b = load_vec_b();

    // result should contain 1 for each byte that matches, 2 for each that does not

    vec<8 x i8> result = (a == b) ? ones : twos;

```

The unoptimized LLVM IR we generate for the above is:

```

    %9 = bitcast i8* %load_vec_a to <8 x i8>*

    %10 = load <8 x i8>, <8 x i8>* %9

    %11 = bitcast i8* %load_vec_b to <8 x i8>*

    %12 = load <8 x i8>, <8 x i8>* %11

    %13 = shufflevector <8 x i8> %10, <8 x i8> undef, <16 x i32> <i32 0, i32 1,

i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32

undef, i32 undef, i32 undef, i32 undef, i32 undef>

    %14 = shufflevector <8 x i8> %12, <8 x i8> undef, <16 x i32> <i32 0, i32 1,

i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32

undef, i32 undef, i32 undef, i32 undef, i32 undef>

    %15 = icmp eq <16 x i8> %13, %14

    %16 = shufflevector <16 x i1> %15, <16 x i1> undef, <8 x i32> <i32 0, i32

1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>

    %17 = shufflevector <8 x i1> %16, <8 x i1> undef, <16 x i32> <i32 0, i32 1,

i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32

undef, i32 undef, i32 undef, i32 undef, i32 undef>

    %18 = select <16 x i1> %17, <16 x i8> <i8 1, i8 1, i8 1, i8 1, i8 1, i8 1,

i8 1, i8 1, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8

undef, i8 undef>, <16 x i8> <i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8 2, i8

undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef>

    %19 = shufflevector <16 x i8> %18, <16 x i8> undef, <8 x i32> <i32 0, i32

1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>

    %20 = bitcast i8* %result to <8 x i8>*

    store <8 x i8> %19, <8 x i8>* %20

```
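
The shuffles in the IR above only pad each 8-lane vector to 16 lanes with undef
and later drop that padding again, so the low 8 lanes are computed entirely at
i8 width. A rough Python model of that data flow, assuming None as a stand-in
for undef and using hypothetical helper names (it illustrates what the IR
computes, not how LLVM lowers it):

```python
UNDEF = None  # assumption: None models an undef lane


def widen_to_16(v):
    """Pad an 8-lane vector to 16 lanes with undef (the widening shufflevector)."""
    return list(v) + [UNDEF] * 8


def low_8(v):
    """Keep lanes 0..7 of a 16-lane vector (the narrowing shufflevector)."""
    return list(v[:8])


def icmp_eq(x, y):
    """Lane-wise equality; a lane with an undef input stays undef in this model."""
    return [UNDEF if p is UNDEF or q is UNDEF else p == q for p, q in zip(x, y)]


def select(cond, t, f):
    """Lane-wise select; an undef condition yields an undef lane in this model."""
    return [UNDEF if c is UNDEF else (tv if c else fv)
            for c, tv, fv in zip(cond, t, f)]


def compare_select(a, b):
    cmp16 = icmp_eq(widen_to_16(a), widen_to_16(b))  # %13, %14, %15
    cond16 = widen_to_16(low_8(cmp16))               # %16, %17
    ones = widen_to_16([1] * 8)
    twos = widen_to_16([2] * 8)
    return low_8(select(cond16, ones, twos))         # %18, %19


result = compare_select([1, 2, 3, 4, 5, 6, 7, 8], [1, 0, 3, 0, 5, 0, 7, 0])
assert result == [1, 2, 1, 2, 1, 2, 1, 2]
```

No lane in this model ever holds a value wider than 8 bits, which is why a
byte-wise compare is the expected lowering.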

On x86 (with SSE4), we expect this to compile to something like:

```

    vmovq     load_vec_a, %xmm0

    vmovq     load_vec_b, %xmm1

    vpcmpeqb  %xmm1, %xmm0, %xmm0

    vpaddb    .LCPI0_0(%rip), %xmm0, %xmm0  ## LCPI0_0 = <2,2,2,2,2,2,2,2,u,u,u,u,u,u,u,u>

    vmovq     %xmm0, result

```
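
The short sequence above needs no separate select because of a small
arithmetic identity: vpcmpeqb writes 0xFF (that is, -1) into each byte lane
that matches and 0x00 into each lane that does not, so adding the constant 2
with byte wraparound gives 1 for a match and 2 for a mismatch. A minimal Python
sketch of that per-lane arithmetic, with hypothetical function names and plain
lists standing in for the low 8 lanes of the xmm registers:

```python
def select_reference(a, b):
    """The intended result: 1 where the bytes match, 2 where they do not."""
    return [1 if x == y else 2 for x, y in zip(a, b)]


def cmpeq_then_add(a, b):
    """Model of vpcmpeqb followed by vpaddb with a constant vector of 2s."""
    mask = [0xFF if x == y else 0x00 for x, y in zip(a, b)]  # vpcmpeqb
    return [(m + 2) % 256 for m in mask]                     # vpaddb wraps mod 256


a = [0, 1, 2, 3, 250, 251, 252, 253]
b = [0, 9, 2, 9, 250, 9, 252, 9]
assert select_reference(a, b) == cmpeq_then_add(a, b)
```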

After <a href="https://reviews.llvm.org/rL339875">https://reviews.llvm.org/rL339875</a>, however, the IR above compiles to
something more like:

```

    vpmovzxbw   load_vec_a, %xmm0 

    vpmovzxbw   load_vec_b, %xmm1

    vpcmpeqw    %xmm1, %xmm0, %xmm0

    vpacksswb   %xmm0, %xmm0, %xmm0

    vpsllw      $7, %xmm0, %xmm0

    vpand       .LCPI0_0(%rip), %xmm0, %xmm0  ## LCPI0_0 = <0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0>

    vpxor       %xmm1, %xmm1, %xmm1

    vpcmpgtb    %xmm0, %xmm1, %xmm0

    vpaddb      .LCPI0_1(%rip), %xmm0, %xmm0  ## LCPI0_1 = <2,2,2,2,2,2,2,2,u,u,u,u,u,u,u,u>

    vmovq       %xmm0, result

```

Besides being twice as long, the sequence is just odd: why are we widening the
values to 16 bits when the source, intermediate, and result are all 8 bits
wide? (Note that current top-of-tree produces slightly different output from
this second example, but the fundamental pathology of unnecessary widening and
narrowing is still in place.)</pre>

        </div>

      </p>


    </body>

</html>