<html>

    <head>

      <base href="https://llvm.org/bugs/" />

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW --- - [X86][SSE] Failure to split vector loads for scalarized operations"

   href="https://llvm.org/bugs/show_bug.cgi?id=30986">30986</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>[X86][SSE] Failure to split vector loads for scalarized operations

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>libraries

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>Windows NT

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>normal

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>Backend: X86

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>llvm-dev@redking.me.uk

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>llvm-bugs@lists.llvm.org, mkuper@google.com, spatel+llvm@rotateright.com, zvi.rackover@intel.com

          </td>

        </tr>

        <tr>

          <th>Classification</th>

          <td>Unclassified

          </td>

        </tr></table>

      <p>

        <div>

        <pre>As discussed on <a href="https://reviews.llvm.org/D26521">https://reviews.llvm.org/D26521</a>.

When a vector operation is scalarized, the vector load feeding it is not split

into scalar loads, resulting in a lot of extra (potentially very slow)

vector -> GPR traffic:

clang -S -O3 -march=btver2

#include &lt;x86intrin.h&gt;

__m128i popcnt1(__m128i *in) {

  return (__m128i) {

    __builtin_popcountll(in[0][0]),

    __builtin_popcountll(in[0][1]) };

}

popcnt1(long long __vector(2)*):

        vmovdqu (%rdi), %xmm0

        vmovq   %xmm0, %rax

        vpextrq $1, %xmm0, %rcx

        popcntq %rax, %rax

        popcntq %rcx, %rcx

        vmovq   %rcx, %xmm0

        vmovq   %rax, %xmm1

        vpunpcklqdq     %xmm0, %xmm1, %xmm0 # xmm0 = xmm1[0],xmm0[0]

        retq

It would be better as:

popcnt1(long long __vector(2)*):

        popcntq (%rdi), %rax

        popcntq 8(%rdi), %rcx

        vmovq   %rcx, %xmm0

        vmovq   %rax, %xmm1

        vpunpcklqdq     %xmm0, %xmm1, %xmm0 # xmm0 = xmm1[0],xmm0[0]

        retq
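
For illustration only, a minimal source-level sketch of the same computation
with the two 64-bit lanes read through a scalar pointer, so that no 128-bit
vector load appears in the source. The names popcnt1_scalar and v2i64 are
hypothetical and not part of the test case above; v2i64 is assumed to stand in
for __m128i via the GCC/Clang vector_size extension. Whether the backend then
folds each load into popcntq has not been checked here.

```c
/* Hypothetical stand-in for __m128i (assumption: GCC/Clang vector_size
   extension is available). */
typedef long long v2i64 __attribute__((vector_size(16)));

/* Reads each 64-bit lane as a scalar, counts its set bits, and packs the
   two counts back into a vector in lane order. */
v2i64 popcnt1_scalar(const long long *in) {
  long long lo = __builtin_popcountll((unsigned long long)in[0]);
  long long hi = __builtin_popcountll((unsigned long long)in[1]);
  return (v2i64){ lo, hi };
}
```

This only rewrites the source by hand; the report is about the backend doing
the equivalent split automatically for the original vector-typed input.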

A similar problem occurs when the source vector has been spilled: it is first

restored to a vector register and then transferred to GPRs. In some cases the

results are then transferred back to a vector register and spilled again,

when we could have spilled the scalar values in vector order directly...</pre>

        </div>

      </p>


    </body>

</html>