[llvm-bugs] [Bug 44655] New: vector load and store instructions (LD4, ST4) slow execution performance

Fri Jan 24 16:33:50 PST 2020

https://bugs.llvm.org/show_bug.cgi?id=44655

            Bug ID: 44655
           Summary: vector load and store instructions (LD4, ST4) slow
                    execution performance
           Product: libraries
           Version: 9.0
          Hardware: PC
                OS: Linux
            Status: NEW
          Keywords: performance
          Severity: enhancement
          Priority: P
         Component: Backend: AArch64
          Assignee: unassignedbugs at nondot.org
          Reporter: sbiersdorff at nvidia.com
                CC: arnaud.degrandmaison at arm.com,
                    llvm-bugs at lists.llvm.org, peter.smith at linaro.org,
                    Ties.Stuij at arm.com

Created attachment 23061
  --> https://bugs.llvm.org/attachment.cgi?id=23061&action=edit
LL file snippet

The following generated assembly takes twice as long to execute as a
version that only loads registers in pairs (or one by one):

  1303 │220:   ld4    {v2.2d-v5.2d}, [x13], #64
  4888 │       ld4    {v16.2d-v19.2d}, [x14]
 20143 │       fmla   v16.2d, v2.2d, v1.2d
    68 │       fmla   v17.2d, v3.2d, v1.2d
  1071 │       fmla   v18.2d, v4.2d, v1.2d
   293 │       fmla   v19.2d, v5.2d, v1.2d
  4524 │       st4    {v16.2d-v19.2d}, [x14], #64
 15579 │       subs   x15, x15, #0x2
    11 │     ↑ b.ne   220

It is much better to load pairs of scalars (even though that results in more
instructions being executed):

   487 │234:   ldp    q2, q3, [x12, #32]
  1106 │       ldp    q4, q5, [x12], #64
  2694 │       ldp    q6, q7, [x13, #32]
  2898 │       ldp    q16, q17, [x13]
  3847 │       subs   x14, x14, #0x2
  5440 │       fmla   v6.2d, v2.2d, v1.2d
  1689 │       fmla   v16.2d, v4.2d, v1.2d
  3530 │       fmla   v17.2d, v5.2d, v1.2d
  1315 │       fmla   v7.2d, v3.2d, v1.2d
   135 │       stp    q6, q7, [x13, #32]
   865 │       stp    q16, q17, [x13], #64
  2649 │     ↑ b.ne   234

This assembly is generated from a simple DAXPY loop unrolled by a factor of 4.
A snippet of the .ll file is attached.
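
For readers without the attachment, here is a minimal sketch of the kind of
source loop described above. It is an assumption about the shape of the code,
not the reporter's actual source: the function name daxpy_unrolled4 and the
remainder loop are hypothetical, and whether a given compiler version turns it
into LD4/ST4 depends on the target and flags.

```c
#include <stddef.h>

/* DAXPY: y[i] = a * x[i] + y[i], manually unrolled by a factor of 4.
 * Hypothetical reconstruction for illustration only. */
void daxpy_unrolled4(size_t n, double a, const double *x, double *y)
{
    size_t i = 0;

    /* Main loop: four consecutive elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        y[i + 0] += a * x[i + 0];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }

    /* Remainder loop for n not divisible by 4. */
    for (; i < n; ++i)
        y[i] += a * x[i];
}
```

The four statements in the loop body access x and y with a stride of 4
elements each, which is the pattern a vectorizer may treat as an interleaved
access group.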

Two questions. First, the slow code is only generated when opt is passed '-O2';
which pass could be responsible for vectorizing these loads and stores? Second,
what is the rationale for generating LD4/ST4 instructions if they execute so
much slower than their scalar equivalents?
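
As background for the second question, the following is a scalar C model of
what the .2d forms of LD4 and ST4 do architecturally: LD4 reads eight
consecutive doubles and de-interleaves them across four registers, and ST4
re-interleaves them on the way back. The type v2d and the helpers ld4_2d,
st4_2d and daxpy8_via_ld4 are hypothetical names used only for this sketch. It
models the data movement, not timing.

```c
#include <stddef.h>

/* Model of one 128-bit register holding two doubles (.2d arrangement). */
typedef struct { double lane[2]; } v2d;

/* LD4 {Vt.2d-Vt+3.2d}: element r of structure s goes to register r, lane s. */
void ld4_2d(const double *mem, v2d regs[4])
{
    for (size_t s = 0; s < 2; ++s)
        for (size_t r = 0; r < 4; ++r)
            regs[r].lane[s] = mem[4 * s + r];
}

/* ST4 is the inverse: re-interleave four registers into memory. */
void st4_2d(double *mem, const v2d regs[4])
{
    for (size_t s = 0; s < 2; ++s)
        for (size_t r = 0; r < 4; ++r)
            mem[4 * s + r] = regs[r].lane[s];
}

/* One iteration of the LD4-based loop body: 8 doubles of x and y. */
void daxpy8_via_ld4(double a, const double *x, double *y)
{
    v2d vx[4], vy[4];

    ld4_2d(x, vx);
    ld4_2d(y, vy);
    for (size_t r = 0; r < 4; ++r)          /* four FMLA instructions */
        for (size_t s = 0; s < 2; ++s)
            vy[r].lane[s] += a * vx[r].lane[s];
    st4_2d(y, vy);
}
```

Because the multiply-add is purely element-wise, the de-interleave and
re-interleave cancel out, so this computes the same result as loading the
eight doubles contiguously, which is what the LDP/STP version does.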
