<html>
<head>
<base href="https://bugs.llvm.org/">
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW - vfmaq_lane_f16 generates a dup"
href="https://bugs.llvm.org/show_bug.cgi?id=45467">45467</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>vfmaq_lane_f16 generates a dup
</td>
</tr>
<tr>
<th>Product</th>
<td>tools
</td>
</tr>
<tr>
<th>Version</th>
<td>trunk
</td>
</tr>
<tr>
<th>Hardware</th>
<td>PC
</td>
</tr>
<tr>
<th>OS</th>
<td>Linux
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>enhancement
</td>
</tr>
<tr>
<th>Priority</th>
<td>P
</td>
</tr>
<tr>
<th>Component</th>
<td>opt
</td>
</tr>
<tr>
<th>Assignee</th>
<td>unassignedbugs@nondot.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>fbarchard@google.com
</td>
</tr>
<tr>
<th>CC</th>
<td>llvm-bugs@lists.llvm.org
</td>
</tr></table>
<p>
<div>
<pre>clang on AArch64 (Android) generates a separate dup for each vfmaq_lane_f16 instead of using the by-element form of fmla.
This is the inner loop of the function xnn_f16_gemm_ukernel_4x8__neonfp16arith_ld64:
ldr d4, [x21],#8
ldr d5, [x19],#8
ldr d6, [x22],#8
ldr d7, [x7],#8
ldp q16, q17, [x4]
dup v18.8h, v4.h[0]
sub x23, x23, #0x8
cmp x23, #0x7
fmla v0.8h, v18.8h, v16.8h
dup v18.8h, v5.h[0]
fmla v3.8h, v18.8h, v16.8h
dup v18.8h, v6.h[0]
fmla v2.8h, v18.8h, v16.8h
dup v18.8h, v7.h[0]
fmla v1.8h, v18.8h, v16.8h
dup v16.8h, v4.h[1]
dup v18.8h, v5.h[1]
fmla v0.8h, v16.8h, v17.8h
dup v16.8h, v6.h[1]
fmla v3.8h, v18.8h, v17.8h
dup v18.8h, v7.h[1]
fmla v2.8h, v16.8h, v17.8h
fmla v1.8h, v18.8h, v17.8h
ldp q16, q17, [x4,#32]
dup v18.8h, v4.h[2]
dup v4.8h, v4.h[3]
add x4, x4, #0x40
fmla v0.8h, v18.8h, v16.8h
dup v18.8h, v5.h[2]
fmla v3.8h, v18.8h, v16.8h
dup v18.8h, v6.h[2]
fmla v2.8h, v18.8h, v16.8h
dup v18.8h, v7.h[2]
fmla v1.8h, v18.8h, v16.8h
dup v5.8h, v5.h[3]
dup v6.8h, v6.h[3]
dup v7.8h, v7.h[3]
fmla v0.8h, v4.8h, v17.8h
fmla v3.8h, v5.8h, v17.8h
fmla v2.8h, v6.8h, v17.8h
fmla v1.8h, v7.8h, v17.8h
b.hi 2eb70 &lt;xnn_f16_gemm_ukernel_4x8__neonfp16arith_ld64+0x9c&gt;
Instead of the two-instruction sequence
dup v18.8h, v5.h[0]
fmla v3.8h, v18.8h, v16.8h
the compiler could generate a single by-element fmla:
fmla v3.8h, v16.8h, v5.h[0]
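
To make the equivalence of the two sequences concrete, here is a minimal scalar sketch in portable C. It is an illustration only, not the XNNPACK source: float stands in for float16, and the names LANES, fma_dup_then_vector and fma_by_element are hypothetical.

```c
/* Scalar model of one 8-lane fmla. Assumption: float replaces float16
   so the sketch compiles on any host. */
enum { LANES = 8 };

/* Models the current codegen: broadcast a[lane] into a temporary vector,
   then do a vector-by-vector multiply-accumulate. */
static void fma_dup_then_vector(float acc[LANES], const float b[LANES],
                                const float a[4], int lane) {
    float splat[LANES];
    for (int i = 0; i != LANES; ++i) splat[i] = a[lane];        /* dup  v18.8h, v5.h[lane]   */
    for (int i = 0; i != LANES; ++i) acc[i] += splat[i] * b[i]; /* fmla v3.8h, v18.8h, v16.8h */
}

/* Models the requested codegen: the lane is read directly by the
   multiply-accumulate, so no temporary vector is needed. */
static void fma_by_element(float acc[LANES], const float b[LANES],
                           const float a[4], int lane) {
    for (int i = 0; i != LANES; ++i) acc[i] += b[i] * a[lane];  /* fmla v3.8h, v16.8h, v5.h[lane] */
}
```

Both functions compute acc[i] += b[i] * a[lane] for every lane, so the dup only costs an extra instruction and a temporary register.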
This is a similar (not identical) kernel,
xnn_f16_gemm_ukernel_4x16__aarch64_neonfp16arith_ld32,
written in assembly. It uses the by-element fmla directly:
ldr s0, [x3],#4
ldp q20, q21, [x5],#32
ldr s1, [x11],#4
ldr s2, [x12],#4
ldr s3, [x4],#4
fmla v16.8h, v20.8h, v0.h[0]
fmla v17.8h, v21.8h, v0.h[0]
fmla v18.8h, v20.8h, v1.h[0]
fmla v19.8h, v21.8h, v1.h[0]
ldp q22, q23, [x5],#32
fmla v28.8h, v20.8h, v2.h[0]
fmla v29.8h, v21.8h, v2.h[0]
fmla v30.8h, v20.8h, v3.h[0]
fmla v31.8h, v21.8h, v3.h[0]
fmla v16.8h, v22.8h, v0.h[1]
fmla v17.8h, v23.8h, v0.h[1]
fmla v18.8h, v22.8h, v1.h[1]
fmla v19.8h, v23.8h, v1.h[1]
fmla v28.8h, v22.8h, v2.h[1]
fmla v29.8h, v23.8h, v2.h[1]
subs x0, x0, #0x4
fmla v30.8h, v22.8h, v3.h[1]
fmla v31.8h, v23.8h, v3.h[1]
b.cs 2f704 &lt;xnn_f16_gemm_ukernel_4x16__aarch64_neonfp16arith_ld32+0x64&gt;
Benchmarking both functions on a Pixel 4 (Cortex-A76), the intrinsics version is
about 1.9 times slower than the assembly version:
f16_gemm_4x16__aarch64_neonfp16arith_ld32 6618948
f16_gemm_4x8__neonfp16arith_ld64 12543422
clang --version
Android (5900059 based on r365631c) clang version 9.0.8
(<a href="https://android.googlesource.com/toolchain/llvm-project">https://android.googlesource.com/toolchain/llvm-project</a>
207d7abc1a2abf3ef8d4301736d6a7ebc224a290) (based on LLVM 9.0.8svn)
See also
<a href="https://github.com/google/XNNPACK/blob/master/src/f16-gemm/gen/4x8-neonfp16arith-ld64.c">https://github.com/google/XNNPACK/blob/master/src/f16-gemm/gen/4x8-neonfp16arith-ld64.c</a></pre>
</div>
</p>
</body>
</html>