<html>
<head>
<base href="https://bugs.llvm.org/">
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW - vector load and store instructions (LD4, ST4) slow execution performance"
href="https://bugs.llvm.org/show_bug.cgi?id=44655">44655</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>vector load and store instructions (LD4, ST4) slow execution performance
</td>
</tr>
<tr>
<th>Product</th>
<td>libraries
</td>
</tr>
<tr>
<th>Version</th>
<td>9.0
</td>
</tr>
<tr>
<th>Hardware</th>
<td>PC
</td>
</tr>
<tr>
<th>OS</th>
<td>Linux
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Keywords</th>
<td>performance
</td>
</tr>
<tr>
<th>Severity</th>
<td>enhancement
</td>
</tr>
<tr>
<th>Priority</th>
<td>P
</td>
</tr>
<tr>
<th>Component</th>
<td>Backend: AArch64
</td>
</tr>
<tr>
<th>Assignee</th>
<td>unassignedbugs@nondot.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>sbiersdorff@nvidia.com
</td>
</tr>
<tr>
<th>CC</th>
<td>arnaud.degrandmaison@arm.com, llvm-bugs@lists.llvm.org, peter.smith@linaro.org, Ties.Stuij@arm.com
</td>
</tr></table>
<p>
<div>
<pre>Created <span class=""><a href="attachment.cgi?id=23061" name="attach_23061" title="LL file snippet">attachment 23061</a> <a href="attachment.cgi?id=23061&action=edit" title="LL file snippet">[details]</a></span>
LL file snippet
The following generated assembly takes twice as long to execute as a
version that only loads registers in pairs (or one by one):
1303 │220: ld4 {v2.2d-v5.2d}, [x13], #64
4888 │ ld4 {v16.2d-v19.2d}, [x14]
20143 │ fmla v16.2d, v2.2d, v1.2d
68 │ fmla v17.2d, v3.2d, v1.2d
1071 │ fmla v18.2d, v4.2d, v1.2d
293 │ fmla v19.2d, v5.2d, v1.2d
4524 │ st4 {v16.2d-v19.2d}, [x14], #64
15579 │ subs x15, x15, #0x2
11 │ ↑ b.ne 220
Loading the registers in pairs with LDP is much faster (even though more
instructions are executed):
487 │234: ldp q2, q3, [x12, #32]
1106 │ ldp q4, q5, [x12], #64
2694 │ ldp q6, q7, [x13, #32]
2898 │ ldp q16, q17, [x13]
3847 │ subs x14, x14, #0x2
5440 │ fmla v6.2d, v2.2d, v1.2d
1689 │ fmla v16.2d, v4.2d, v1.2d
3530 │ fmla v17.2d, v5.2d, v1.2d
1315 │ fmla v7.2d, v3.2d, v1.2d
135 │ stp q6, q7, [x13, #32]
865 │ stp q16, q17, [x13], #64
2649 │ ↑ b.ne 234
This assembly is generated from a simple DAXPY loop unrolled by a
factor of 4. A snippet of the .ll file is attached.
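For reference, a DAXPY loop unrolled by a factor of 4 might look like the
C sketch below. This is an illustrative assumption only: the original source
is not part of this report, and the function name daxpy_unrolled4 is
hypothetical.

```c
/* Hypothetical reconstruction, not the reporter's source:
 * computes y[i] = a * x[i] + y[i] with the loop unrolled by 4.
 * Assumes n is non-negative. */
void daxpy_unrolled4(long n, double a, const double *x, double *y)
{
    long i = 0;
    long n4 = n - n % 4;   /* largest multiple of 4 not exceeding n */

    /* Main loop: four independent multiply-adds per iteration. */
    for (; i != n4; i += 4) {
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }

    /* Remainder loop for the last n % 4 elements. */
    for (; i != n; ++i)
        y[i] += a * x[i];
}
```

The four statements in the main loop touch consecutive elements of x and y,
which is the kind of access pattern a vectorizer may turn into wide vector
loads and stores.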
Two questions. First, the slow code is only generated when opt is passed '-O2';
which pass could be responsible for vectorizing these loads and stores? Second,
what is the rationale for generating LD4/ST4 instructions if they execute so
much slower than their scalar equivalents?</pre>
</div>
</p>
</body>
</html>