<html>
    <head>
      <base href="https://bugs.llvm.org/">
    </head>
    <body><table border="1" cellspacing="0" cellpadding="8">
        <tr>
          <th>Bug ID</th>
          <td><a class="bz_bug_link 
          bz_status_NEW "
   title="NEW - vector load and store instructions (LD4, ST4) slow execution performance"
   href="https://bugs.llvm.org/show_bug.cgi?id=44655">44655</a>
          </td>
        </tr>

        <tr>
          <th>Summary</th>
          <td>vector load and store instructions (LD4, ST4) slow execution performance
          </td>
        </tr>

        <tr>
          <th>Product</th>
          <td>libraries
          </td>
        </tr>

        <tr>
          <th>Version</th>
          <td>9.0
          </td>
        </tr>

        <tr>
          <th>Hardware</th>
          <td>PC
          </td>
        </tr>

        <tr>
          <th>OS</th>
          <td>Linux
          </td>
        </tr>

        <tr>
          <th>Status</th>
          <td>NEW
          </td>
        </tr>

        <tr>
          <th>Keywords</th>
          <td>performance
          </td>
        </tr>

        <tr>
          <th>Severity</th>
          <td>enhancement
          </td>
        </tr>

        <tr>
          <th>Priority</th>
          <td>P
          </td>
        </tr>

        <tr>
          <th>Component</th>
          <td>Backend: AArch64
          </td>
        </tr>

        <tr>
          <th>Assignee</th>
          <td>unassignedbugs@nondot.org
          </td>
        </tr>

        <tr>
          <th>Reporter</th>
          <td>sbiersdorff@nvidia.com
          </td>
        </tr>

        <tr>
          <th>CC</th>
          <td>arnaud.degrandmaison@arm.com, llvm-bugs@lists.llvm.org, peter.smith@linaro.org, Ties.Stuij@arm.com
          </td>
        </tr></table>
      <p>
        <div>
        <pre>Created <span class=""><a href="attachment.cgi?id=23061" name="attach_23061" title="LL file snippet">attachment 23061</a> <a href="attachment.cgi?id=23061&action=edit" title="LL file snippet">[details]</a></span>
LL file snippet

The following generated assembly takes twice as long to execute as a
version that only loads registers in pairs (or one by one):

  1303 │220:   ld4    {v2.2d-v5.2d}, [x13], #64
  4888 │       ld4    {v16.2d-v19.2d}, [x14]
 20143 │       fmla   v16.2d, v2.2d, v1.2d
    68 │       fmla   v17.2d, v3.2d, v1.2d
  1071 │       fmla   v18.2d, v4.2d, v1.2d
   293 │       fmla   v19.2d, v5.2d, v1.2d
  4524 │       st4    {v16.2d-v19.2d}, [x14], #64
 15579 │       subs   x15, x15, #0x2
    11 │     ↑ b.ne   220

Loading the registers in pairs (LDP/STP) performs much better, even though
that results in more instructions being executed:

   487 │234:   ldp    q2, q3, [x12, #32]
  1106 │       ldp    q4, q5, [x12], #64
  2694 │       ldp    q6, q7, [x13, #32]
  2898 │       ldp    q16, q17, [x13]
  3847 │       subs   x14, x14, #0x2
  5440 │       fmla   v6.2d, v2.2d, v1.2d
  1689 │       fmla   v16.2d, v4.2d, v1.2d
  3530 │       fmla   v17.2d, v5.2d, v1.2d
  1315 │       fmla   v7.2d, v3.2d, v1.2d
   135 │       stp    q6, q7, [x13, #32]
   865 │       stp    q16, q17, [x13], #64
  2649 │     ↑ b.ne   234

This assembly is generated from a simple DAXPY loop unrolled by a factor
of 4. A snippet of the .ll file is attached.
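
For reference, a loop of roughly the following shape is what is meant by a
DAXPY loop unrolled by 4. This is only a hypothetical C sketch, not the
attached code: the function name, the parameter names, and the assumption
that n is a non-negative multiple of 4 are all illustrative.

```c
/* Hypothetical sketch of DAXPY (y = a*x + y) manually unrolled by 4.
   Assumes n is a non-negative multiple of 4; no remainder loop is shown. */
void daxpy_unrolled4(long n, double a, const double *x, double *y)
{
    for (long i = 0; i != n; i += 4) {
        /* Four independent multiply-adds per iteration. */
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
}
```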

Two questions. First, the slow code is only generated when opt is passed
'-O2'; which pass could be responsible for vectorizing these loads and
stores? Second, what is the rationale for generating LD4/ST4 instructions if
they execute so much slower than the equivalent LDP/STP sequences?</pre>
        </div>
      </p>


    </body>
</html>