<html>
    <head>
      <base href="https://bugs.llvm.org/">
    </head>
    <body><table border="1" cellspacing="0" cellpadding="8">
        <tr>
          <th>Bug ID</th>
          <td><a class="bz_bug_link 
          bz_status_NEW "
   title="NEW - vfmaq_lane_f16 generates a dup"
   href="https://bugs.llvm.org/show_bug.cgi?id=45467">45467</a>
          </td>
        </tr>

        <tr>
          <th>Summary</th>
          <td>vfmaq_lane_f16 generates a dup
          </td>
        </tr>

        <tr>
          <th>Product</th>
          <td>tools
          </td>
        </tr>

        <tr>
          <th>Version</th>
          <td>trunk
          </td>
        </tr>

        <tr>
          <th>Hardware</th>
          <td>PC
          </td>
        </tr>

        <tr>
          <th>OS</th>
          <td>Linux
          </td>
        </tr>

        <tr>
          <th>Status</th>
          <td>NEW
          </td>
        </tr>

        <tr>
          <th>Severity</th>
          <td>enhancement
          </td>
        </tr>

        <tr>
          <th>Priority</th>
          <td>P
          </td>
        </tr>

        <tr>
          <th>Component</th>
          <td>opt
          </td>
        </tr>

        <tr>
          <th>Assignee</th>
          <td>unassignedbugs@nondot.org
          </td>
        </tr>

        <tr>
          <th>Reporter</th>
          <td>fbarchard@google.com
          </td>
        </tr>

        <tr>
          <th>CC</th>
          <td>llvm-bugs@lists.llvm.org
          </td>
        </tr></table>
      <p>
        <div>
        <pre>clang targeting AArch64 (Android) generates a separate dup for each
vfmaq_lane_f16 instead of using the by-element form of fmla.

This is the inner loop generated for the function
xnn_f16_gemm_ukernel_4x8__neonfp16arith_ld64:
 ldr    d4, [x21],#8
 ldr    d5, [x19],#8
 ldr    d6, [x22],#8
 ldr    d7, [x7],#8
 ldp    q16, q17, [x4]
 dup    v18.8h, v4.h[0]
 sub    x23, x23, #0x8
 cmp    x23, #0x7
 fmla    v0.8h, v18.8h, v16.8h
 dup    v18.8h, v5.h[0]
 fmla    v3.8h, v18.8h, v16.8h
 dup    v18.8h, v6.h[0]
 fmla    v2.8h, v18.8h, v16.8h
 dup    v18.8h, v7.h[0]
 fmla    v1.8h, v18.8h, v16.8h
 dup    v16.8h, v4.h[1]
 dup    v18.8h, v5.h[1]
 fmla    v0.8h, v16.8h, v17.8h
 dup    v16.8h, v6.h[1]
 fmla    v3.8h, v18.8h, v17.8h
 dup    v18.8h, v7.h[1]
 fmla    v2.8h, v16.8h, v17.8h
 fmla    v1.8h, v18.8h, v17.8h
 ldp    q16, q17, [x4,#32]
 dup    v18.8h, v4.h[2]
 dup    v4.8h, v4.h[3]
 add    x4, x4, #0x40
 fmla    v0.8h, v18.8h, v16.8h
 dup    v18.8h, v5.h[2]
 fmla    v3.8h, v18.8h, v16.8h
 dup    v18.8h, v6.h[2]
 fmla    v2.8h, v18.8h, v16.8h
 dup    v18.8h, v7.h[2]
 fmla    v1.8h, v18.8h, v16.8h
 dup    v5.8h, v5.h[3]
 dup    v6.8h, v6.h[3]
 dup    v7.8h, v7.h[3]
 fmla    v0.8h, v4.8h, v17.8h
 fmla    v3.8h, v5.8h, v17.8h
 fmla    v2.8h, v6.8h, v17.8h
 fmla    v1.8h, v7.8h, v17.8h
 b.hi    2eb70 <xnn_f16_gemm_ukernel_4x8__neonfp16arith_ld64+0x9c>

Instead of
 dup    v18.8h, v5.h[0]
 fmla    v3.8h, v18.8h, v16.8h
the compiler could generate a single by-element fmla:
 fmla    v3.8h, v16.8h, v5.h[0]
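
To make the requested fold concrete, here is a minimal scalar sketch in C of
what the two sequences compute. It is only a model: it uses float in place of
half precision, and the names fma_via_dup, fma_by_element and models_agree are
hypothetical. It illustrates that the dup is redundant, not how the backend
should implement the transformation.

```c
/* Scalar model of the two instruction sequences, with float standing in
   for half precision. All names here are hypothetical. */
enum { LANES = 8, A_LANES = 4 };

/* Models: dup vt.8h, vm.h[lane]  followed by  fmla vd.8h, vt.8h, vn.8h */
static void fma_via_dup(float acc[LANES], const float b[LANES],
                        const float a[A_LANES], int lane) {
  float tmp[LANES];
  for (int i = 0; i != LANES; ++i) tmp[i] = a[lane];        /* dup */
  for (int i = 0; i != LANES; ++i) acc[i] += tmp[i] * b[i]; /* fmla, vector form */
}

/* Models: fmla vd.8h, vn.8h, vm.h[lane] */
static void fma_by_element(float acc[LANES], const float b[LANES],
                           const float a[A_LANES], int lane) {
  for (int i = 0; i != LANES; ++i) acc[i] += b[i] * a[lane];
}

/* Returns 1 when both models produce identical accumulators for every
   lane of a, using small integer inputs so the arithmetic is exact. */
static int models_agree(void) {
  const float a[A_LANES] = {2.0f, 3.0f, 4.0f, 5.0f};
  float b[LANES];
  for (int i = 0; i != LANES; ++i) b[i] = (float)i;

  for (int lane = 0; lane != A_LANES; ++lane) {
    float x[LANES], y[LANES];
    for (int i = 0; i != LANES; ++i) x[i] = y[i] = 1.0f;
    fma_via_dup(x, b, a, lane);
    fma_by_element(y, b, a, lane);
    for (int i = 0; i != LANES; ++i) {
      if (x[i] != y[i]) return 0;
    }
  }
  return 1;
}
```

Both functions update each accumulator element as acc[i] + b[i] * a[lane], so
the broadcast into a temporary register adds an instruction without changing
the result.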

For comparison, this is the inner loop of a similar (not identical) kernel,
xnn_f16_gemm_ukernel_4x16__aarch64_neonfp16arith_ld32,
which is hand-written in assembly:
ldr     s0, [x3],#4
ldp     q20, q21, [x5],#32
ldr     s1, [x11],#4
ldr     s2, [x12],#4
ldr     s3, [x4],#4
fmla    v16.8h, v20.8h, v0.h[0]
fmla    v17.8h, v21.8h, v0.h[0]
fmla    v18.8h, v20.8h, v1.h[0]
fmla    v19.8h, v21.8h, v1.h[0]
ldp     q22, q23, [x5],#32
fmla    v28.8h, v20.8h, v2.h[0]
fmla    v29.8h, v21.8h, v2.h[0]
fmla    v30.8h, v20.8h, v3.h[0]
fmla    v31.8h, v21.8h, v3.h[0]
fmla    v16.8h, v22.8h, v0.h[1]
fmla    v17.8h, v23.8h, v0.h[1]
fmla    v18.8h, v22.8h, v1.h[1]
fmla    v19.8h, v23.8h, v1.h[1]
fmla    v28.8h, v22.8h, v2.h[1]
fmla    v29.8h, v23.8h, v2.h[1]
subs    x0, x0, #0x4
fmla    v30.8h, v22.8h, v3.h[1]
fmla    v31.8h, v23.8h, v3.h[1]
b.cs    2f704 <xnn_f16_gemm_ukernel_4x16__aarch64_neonfp16arith_ld32+0x64>

Benchmarking both functions on a Pixel 4 (Cortex-A76), the intrinsics version
is about 1.89 times slower:
f16_gemm_4x16__aarch64_neonfp16arith_ld32                6618948 
f16_gemm_4x8__neonfp16arith_ld64                         12543422

Output of clang --version:

Android (5900059 based on r365631c) clang version 9.0.8
(<a href="https://android.googlesource.com/toolchain/llvm-project">https://android.googlesource.com/toolchain/llvm-project</a>
207d7abc1a2abf3ef8d4301736d6a7ebc224a290) (based on LLVM 9.0.8svn)

See also the source of the intrinsics kernel:
<a href="https://github.com/google/XNNPACK/blob/master/src/f16-gemm/gen/4x8-neonfp16arith-ld64.c">https://github.com/google/XNNPACK/blob/master/src/f16-gemm/gen/4x8-neonfp16arith-ld64.c</a></pre>
        </div>
      </p>


    </body>
</html>