<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - vector load and store instructions (LD4, ST4) slow execution performance"

   href="https://bugs.llvm.org/show_bug.cgi?id=44655">44655</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>vector load and store instructions (LD4, ST4) slow execution performance

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>libraries

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>9.0

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>Linux

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Keywords</th>

          <td>performance

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>enhancement

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>Backend: AArch64

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>sbiersdorff@nvidia.com

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>arnaud.degrandmaison@arm.com, llvm-bugs@lists.llvm.org, peter.smith@linaro.org, Ties.Stuij@arm.com

          </td>

        </tr></table>

      <p>

        <div>

        <pre>Created <span class=""><a href="attachment.cgi?id=23061" name="attach_23061" title="LL file snippet">attachment 23061</a> <a href="attachment.cgi?id=23061&action=edit" title="LL file snippet">[details]</a></span>

LL file snippet

The following generated assembly takes twice as long to execute as a

version that only loads registers in pairs (or one by one):

  1303 │220:   ld4    {v2.2d-v5.2d}, [x13], #64

  4888 │       ld4    {v16.2d-v19.2d}, [x14]

 20143 │       fmla   v16.2d, v2.2d, v1.2d

    68 │       fmla   v17.2d, v3.2d, v1.2d

  1071 │       fmla   v18.2d, v4.2d, v1.2d

   293 │       fmla   v19.2d, v5.2d, v1.2d

  4524 │       st4    {v16.2d-v19.2d}, [x14], #64

 15579 │       subs   x15, x15, #0x2

    11 │     ↑ b.ne   220

It is much better to load in pairs of scalars (even though that results in more

instructions being executed):

   487 │234:   ldp    q2, q3, [x12, #32]

  1106 │       ldp    q4, q5, [x12], #64

  2694 │       ldp    q6, q7, [x13, #32]

  2898 │       ldp    q16, q17, [x13]

  3847 │       subs   x14, x14, #0x2

  5440 │       fmla   v6.2d, v2.2d, v1.2d

  1689 │       fmla   v16.2d, v4.2d, v1.2d

  3530 │       fmla   v17.2d, v5.2d, v1.2d

  1315 │       fmla   v7.2d, v3.2d, v1.2d

   135 │       stp    q6, q7, [x13, #32]

   865 │       stp    q16, q17, [x13], #64

  2649 │     ↑ b.ne   234

This assembly is generated from a simple DAXPY loop unrolled by a

factor of 4. A snippet of the .ll file is attached.
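
For reference, a minimal sketch of the kind of source loop meant here. The
original source is not part of this report, so the function name, the
signature, and the remainder loop are assumptions made for illustration only:

```c
/* Hypothetical reconstruction: DAXPY (y = a*x + y), manually unrolled by 4.
 * The name daxpy_unrolled4 and its signature are illustrative assumptions. */
void daxpy_unrolled4(long n, double a, const double *x, double *y)
{
    long i = 0;

    /* Main loop: four consecutive elements per iteration. */
    for (; n - i >= 4; i += 4) {
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }

    /* Remainder loop for the last n % 4 elements. */
    for (; n - i > 0; i++)
        y[i] += a * x[i];
}
```

Each iteration of the main loop touches four adjacent doubles in x and y,
which is consistent with the four-register groups in the listings above.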

Two questions. First, the slow code is only generated when opt is passed '-O2';

which pass could be responsible for vectorizing these loads and stores? Second,

what is the rationale for generating LD4/ST4 instructions if they execute so

much slower than their scalar equivalents?</pre>

        </div>

      </p>


    </body>

</html>