<html>

    <head>

      <base href="http://llvm.org/bugs/" />

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW --- - [SLP vectorization] scalar loads not combined into vector load"

   href="http://llvm.org/bugs/show_bug.cgi?id=19657">19657</a>

          </td>

        </tr>


        <tr>

          <th>Summary</th>

          <td>[SLP vectorization] scalar loads not combined into vector load

          </td>

        </tr>


        <tr>

          <th>Product</th>

          <td>libraries

          </td>

        </tr>


        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>


        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>


        <tr>

          <th>OS</th>

          <td>All

          </td>

        </tr>


        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>


        <tr>

          <th>Severity</th>

          <td>normal

          </td>

        </tr>


        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>


        <tr>

          <th>Component</th>

          <td>Loop Optimizer

          </td>

        </tr>


        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>


        <tr>

          <th>Reporter</th>

          <td>spatel+llvm@rotateright.com

          </td>

        </tr>


        <tr>

          <th>CC</th>

          <td>llvmbugs@cs.uiuc.edu

          </td>

        </tr>


        <tr>

          <th>Classification</th>

          <td>Unclassified

          </td>

        </tr></table>

      <p>

        <div>

        <pre>Using:

$ ./clang -v 

clang version 3.5.0 (207996)

Target: x86_64-apple-darwin13.1.0

Thread model: posix


Given the following C code, which computes x[i] * x[i] + x[i] for 4

double-precision array elements:

$ cat vmul.c

void foo(double *x) {

    x[0] = x[0] * x[0] + x[0];

    x[1] = x[1] * x[1] + x[1];

    x[2] = x[2] * x[2] + x[2];

    x[3] = x[3] * x[3] + x[3];

}


The generated code loads x[2] and x[3] with two scalar loads (vmovsd and

vmovhpd), while all other operations are vectorized. Combining the two scalar

loads into a single vector load, as is already done for x[0] and x[1], would

improve both code size and speed:


$ ./clang -O2 -S -o - vmul.c -march=btver2

...

    vmovupd    (%rdi), %xmm0               <---- vector load: good

    vmulpd    %xmm0, %xmm0, %xmm1

    vaddpd    %xmm1, %xmm0, %xmm0

    vmovupd    %xmm0, (%rdi)

    vmovsd    16(%rdi), %xmm0             <---- scalar load: bad

    vmovhpd    24(%rdi), %xmm0, %xmm0      <---- scalar load: bad

    vmulpd    %xmm0, %xmm0, %xmm1

    vaddpd    %xmm1, %xmm0, %xmm0

    vmovupd    %xmm0, 16(%rdi)

    popq    %rbp

    retq</pre>

        </div>
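        <div>

        <pre>For reference, the desired shape of the code (one 16-byte load per pair of
elements) can be sketched by hand in C. This is only an illustration and not
compiler output or a proposed fix: the names foo_vec and v2df are made up
here, and the sketch assumes the GCC/Clang vector_size extension and
__builtin_memcpy are available.

```c
/* Hypothetical hand-vectorized equivalent of foo(); not compiler output.
   Assumes the GCC/Clang vector_size extension and __builtin_memcpy. */
typedef double v2df __attribute__((vector_size(16)));

void foo_vec(double *x) {
    v2df lo, hi;

    /* One 16-byte load per pair; memcpy makes no alignment assumption. */
    __builtin_memcpy(&lo, x, sizeof lo);      /* x[0], x[1] */
    __builtin_memcpy(&hi, x + 2, sizeof hi);  /* x[2], x[3] */

    /* Element-wise x * x + x on both lanes of each vector. */
    lo = lo * lo + lo;
    hi = hi * hi + hi;

    /* One 16-byte store per pair. */
    __builtin_memcpy(x, &lo, sizeof lo);
    __builtin_memcpy(x + 2, &hi, sizeof hi);
}
```

With this form, the load of x[2] and x[3] is expressed as a single 16-byte
access, which is the pattern the report asks the vectorizer to produce for
the scalar source.</pre>

        </div>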

      </p>


    </body>

</html>