<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - [X86] [BtVer2] 256-bit integer horizontal add idiom not fully expanded using PHADDD"

   href="https://bugs.llvm.org/show_bug.cgi?id=39936">39936</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>[X86] [BtVer2] 256-bit integer horizontal add idiom not fully expanded using PHADDD

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>libraries

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>Windows NT

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>enhancement

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>Backend: X86

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>andrea.dibiagio@gmail.com

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>craig.topper@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com

          </td>

        </tr></table>

      <p>

        <div>

        <pre>This is a spin off of <a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - [X86] AVX2 should use an extract_subvector and phadd for the first step of a pairwise v8i32 addition reduction"

   href="show_bug.cgi?id=39921">bug 39921</a>.

The following code performs an integer reduction using operator ADD. The type

is __v8si, so it could be implemented in three steps of horizontal adds.

```

int foo(__v8si A) {

  __v8si Lo = __builtin_shufflevector(A, A, 0, 2, 4, 6, -1, -1, -1, -1);

  __v8si Hi = __builtin_shufflevector(A, A, 1, 3, 5, 7, -1, -1, -1, -1);

  __v8si Step = Lo + Hi;

  Lo = __builtin_shufflevector(Step, Step, 0, 2, -1, -1, -1, -1, -1, -1);

  Hi = __builtin_shufflevector(Step, Step, 1, 3, -1, -1, -1, -1, -1, -1);

  Step = Lo + Hi;

  Hi = __builtin_shufflevector(Step, Step, 1, -1, -1, -1, -1, -1, -1, -1);

  Step += Hi;

  return Step[0];

}

```

Instead, on BtVer2, we currently generate this:

        vextractf128    $1, %ymm0, %xmm1

        vshufps $136, %xmm1, %xmm0, %xmm2 # xmm2 = xmm0[0,2],xmm1[0,2]

        vshufps $221, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[1,3],xmm1[1,3]

        vpaddd  %xmm0, %xmm2, %xmm0

        vphaddd %xmm0, %xmm0, %xmm0

        vphaddd %xmm0, %xmm0, %xmm0

        vmovd   %xmm0, %eax

        retq

We could have generate this instead:

        vextractf128    $1, %ymm0, %xmm1

        vphaddd %xmm0, %xmm1, %xmm0

        vphaddd %xmm0, %xmm0, %xmm0

        vphaddd %xmm0, %xmm0, %xmm0

        vmovd   %xmm0, %eax

        retq

It looks like that our target specific `combineAnd()` routine is unable to

match the following DAG due to the presence of extract_subvector nodes.

      t6: v8i32 = vector_shuffle<0,2,4,6,u,u,u,u> t2, undef:v8i32

    t33: v4i32 = extract_subvector t6, Constant:i64<0>

      t7: v8i32 = vector_shuffle<1,3,5,7,u,u,u,u> t2, undef:v8i32

    t34: v4i32 = extract_subvector t7, Constant:i64<0>

  t35: v4i32 = add t33, t34

If we try to hoist the extract_subvector, instead of shrinking the binary

computation (and therefore shrink the shuffle operands), we end up in an

infinite loop of combine. That is because the DAGCombiner would always attempt

to sink a extract_subvector of a binop into the operands of the binop itself.

It is worth mentioning that if we compile the following (equivalent) C++ code,

then we get the optimal HADDD sequence:

```

int foo(__v8si A) {

  __v4si Lo = __builtin_shufflevector(A, A, 0, 2, 4, 6);

  __v4si Hi = __builtin_shufflevector(A, A, 1, 3, 5, 7);

  __v4si Step = Lo + Hi;

  Lo = __builtin_shufflevector(Step, Step, 0, 2, -1, -1);

  Hi = __builtin_shufflevector(Step, Step, 1, 3, -1, -1);

  Step = Lo + Hi;

  Hi = __builtin_shufflevector(Step, Step, 1, -1, -1, -1);

  Step += Hi;

  return Step[0];

}

```

        vextractf128    $1, %ymm0, %xmm1

        vphaddd %xmm1, %xmm0, %xmm0

        vphaddd %xmm0, %xmm0, %xmm0

        vphaddd %xmm0, %xmm0, %xmm0

        vmovd   %xmm0, %eax

        retq

Note how the shuffle mask is shrunk in the IR, so we only manipulate 128-bit

values in practice (excluding vector A in input to the function).

Does it mean that we can do something better at IR level, rather than

complicating existing target specific/independent dag combine rules?

Do we have a demanded-elts kind of analysis at IR level? If so, then we may be

able to realize that all those shufflevectors could be shrunk before we even

reach the code generator.</pre>

        </div>

      </p>

      <hr>

      <span>You are receiving this mail because:</span>

      <ul>

          <li>You are on the CC list for the bug.</li>

      </ul>

    </body>

</html>