[llvm-bugs] [Bug 45467] New: vfmaq_lane_f16 generates a dup

Tue Apr 7 14:02:54 PDT 2020

https://bugs.llvm.org/show_bug.cgi?id=45467

            Bug ID: 45467
           Summary: vfmaq_lane_f16 generates a dup
           Product: tools
           Version: trunk
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: enhancement
          Priority: P
         Component: opt
          Assignee: unassignedbugs at nondot.org
          Reporter: fbarchard at google.com
                CC: llvm-bugs at lists.llvm.org

Clang targeting AArch64 (Android) generates a separate dup for each
vfmaq_lane_f16 call instead of using the by-element form of fmla.

For the function xnn_f16_gemm_ukernel_4x8__neonfp16arith_ld64, this is the
inner loop:
 ldr    d4, [x21],#8
 ldr    d5, [x19],#8
 ldr    d6, [x22],#8
 ldr    d7, [x7],#8
 ldp    q16, q17, [x4]
 dup    v18.8h, v4.h[0]
 sub    x23, x23, #0x8
 cmp    x23, #0x7
 fmla    v0.8h, v18.8h, v16.8h
 dup    v18.8h, v5.h[0]
 fmla    v3.8h, v18.8h, v16.8h
 dup    v18.8h, v6.h[0]
 fmla    v2.8h, v18.8h, v16.8h
 dup    v18.8h, v7.h[0]
 fmla    v1.8h, v18.8h, v16.8h
 dup    v16.8h, v4.h[1]
 dup    v18.8h, v5.h[1]
 fmla    v0.8h, v16.8h, v17.8h
 dup    v16.8h, v6.h[1]
 fmla    v3.8h, v18.8h, v17.8h
 dup    v18.8h, v7.h[1]
 fmla    v2.8h, v16.8h, v17.8h
 fmla    v1.8h, v18.8h, v17.8h
 ldp    q16, q17, [x4,#32]
 dup    v18.8h, v4.h[2]
 dup    v4.8h, v4.h[3]
 add    x4, x4, #0x40
 fmla    v0.8h, v18.8h, v16.8h
 dup    v18.8h, v5.h[2]
 fmla    v3.8h, v18.8h, v16.8h
 dup    v18.8h, v6.h[2]
 fmla    v2.8h, v18.8h, v16.8h
 dup    v18.8h, v7.h[2]
 fmla    v1.8h, v18.8h, v16.8h
 dup    v5.8h, v5.h[3]
 dup    v6.8h, v6.h[3]
 dup    v7.8h, v7.h[3]
 fmla    v0.8h, v4.8h, v17.8h
 fmla    v3.8h, v5.8h, v17.8h
 fmla    v2.8h, v6.8h, v17.8h
 fmla    v1.8h, v7.8h, v17.8h
 b.hi    2eb70 <xnn_f16_gemm_ukernel_4x8__neonfp16arith_ld64+0x9c>

Instead of
 dup    v18.8h, v5.h[0]
 fmla    v3.8h, v18.8h, v16.8h
the compiler could generate
 fmla    v3.8h, v16.8h, v5.h[0]
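
To illustrate why the two sequences are interchangeable, here is a minimal
scalar sketch in portable C. It is not taken from XNNPACK or LLVM: the function
names fmla_via_dup and fmla_by_element are hypothetical, and float stands in for
half precision so the code builds on any host. It models only which lanes are
multiplied and accumulated, not half-precision rounding or the fused rounding
step of a real fmla.

```c
#include <stddef.h>

#define LANES 8 /* a .8h register holds eight half-precision lanes */

/* Models the two-instruction sequence:
 *   dup  vt.8h, vm.h[lane]
 *   fmla vd.8h, vt.8h, vn.8h
 * `a` plays the role of the 4-lane d register loaded with ldr. */
void fmla_via_dup(float acc[LANES], const float b[LANES],
                  const float a[4], int lane) {
  float tmp[LANES];
  for (size_t i = 0; i < LANES; i++) {
    tmp[i] = a[lane]; /* dup: broadcast one lane to all eight */
  }
  for (size_t i = 0; i < LANES; i++) {
    acc[i] += tmp[i] * b[i]; /* fmla (vector): lane-wise multiply-accumulate */
  }
}

/* Models the single instruction:
 *   fmla vd.8h, vn.8h, vm.h[lane]
 * The broadcast is implicit, so no temporary register is needed. */
void fmla_by_element(float acc[LANES], const float b[LANES],
                     const float a[4], int lane) {
  for (size_t i = 0; i < LANES; i++) {
    acc[i] += b[i] * a[lane];
  }
}
```

Both functions leave the same values in acc for the same inputs; the by-element
form simply avoids the temporary and the extra instruction.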

This is a similar (not identical) kernel that is written in assembly,
xnn_f16_gemm_ukernel_4x16__aarch64_neonfp16arith_ld32:
ldr     s0, [x3],#4
ldp     q20, q21, [x5],#32
ldr     s1, [x11],#4
ldr     s2, [x12],#4
ldr     s3, [x4],#4
fmla    v16.8h, v20.8h, v0.h[0]
fmla    v17.8h, v21.8h, v0.h[0]
fmla    v18.8h, v20.8h, v1.h[0]
fmla    v19.8h, v21.8h, v1.h[0]
ldp     q22, q23, [x5],#32
fmla    v28.8h, v20.8h, v2.h[0]
fmla    v29.8h, v21.8h, v2.h[0]
fmla    v30.8h, v20.8h, v3.h[0]
fmla    v31.8h, v21.8h, v3.h[0]
fmla    v16.8h, v22.8h, v0.h[1]
fmla    v17.8h, v23.8h, v0.h[1]
fmla    v18.8h, v22.8h, v1.h[1]
fmla    v19.8h, v23.8h, v1.h[1]
fmla    v28.8h, v22.8h, v2.h[1]
fmla    v29.8h, v23.8h, v2.h[1]
subs    x0, x0, #0x4
fmla    v30.8h, v22.8h, v3.h[1]
fmla    v31.8h, v23.8h, v3.h[1]
b.cs    2f704 <xnn_f16_gemm_ukernel_4x16__aarch64_neonfp16arith_ld32+0x64>

Benchmarked on a Pixel 4 (Cortex-A76), the intrinsics version is 1.89 times
slower:
f16_gemm_4x16__aarch64_neonfp16arith_ld32                6618948 
f16_gemm_4x8__neonfp16arith_ld64                         12543422

clang --version

Android (5900059 based on r365631c) clang version 9.0.8
(https://android.googlesource.com/toolchain/llvm-project
207d7abc1a2abf3ef8d4301736d6a7ebc224a290) (based on LLVM 9.0.8svn)

See also
https://github.com/google/XNNPACK/blob/master/src/f16-gemm/gen/4x8-neonfp16arith-ld64.c
