[cfe-users] Why doesn't clang generate an fmla instruction for the vmlaq_f32 intrinsic on armv8-a?

Mon Apr 20 23:25:24 PDT 2020

    I use clang 9 to build code that contains many arm64 NEON intrinsics. I
use vmlaq_f32 to perform multiply-accumulate operations on the float32x4_t
data type. I expected an fmla instruction to be generated, but clang
generates an fmul and an fadd instruction instead. For simple functions this
is not an issue, but for functions that use a lot of NEON registers clang 9
generates inefficient code that frequently spills NEON registers to the
stack and reloads them. If clang generated fmla instructions, the 32 NEON
registers would be more than enough.
    BTW: I have tested a function that uses vmlaq_f32 heavily. If I build
it for armv7-a, clang generates very efficient code (it emits vmla
instructions in this case), but if I build it for armv8-a, the generated
code looks very inefficient, with many stores to and loads from the stack.
    Is there a way to force clang 9 to generate fmla instructions for
vmlaq_f32? Thanks.
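
    One possible workaround, sketched under assumptions: on AArch64 the
ACLE describes vmlaq_f32 as a separate multiply and add (two roundings),
while vfmaq_f32 is the fused form that corresponds to fmla. If a single
rounding is acceptable for the computation, switching intrinsics may give
the desired instruction. The function name fused_madd_f32, the array layout,
and the non-NEON fallback are hypothetical and only for illustration.
Building with -ffp-contract=fast may also allow the compiler to contract the
fmul/fadd pair, but that is not verified here for clang 9.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: acc[i] += b[i] * c[i] for n floats.
 * To keep the sketch short, n must be a multiple of 4. */
void fused_madd_f32(float *acc, const float *b, const float *c, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
#if defined(__aarch64__) && defined(__ARM_NEON)
        float32x4_t va = vld1q_f32(acc + i);
        float32x4_t vb = vld1q_f32(b + i);
        float32x4_t vc = vld1q_f32(c + i);
        /* vfmaq_f32(a, b, c) computes a + b * c with a single rounding;
         * this is the intrinsic expected to map to fmla on AArch64. */
        va = vfmaq_f32(va, vb, vc);
        vst1q_f32(acc + i, va);
#else
        /* Portable fallback so the sketch also builds off AArch64;
         * this path is plain C and is not guaranteed to be fused. */
        for (size_t j = 0; j < 4; ++j)
            acc[i + j] = acc[i + j] + b[i + j] * c[i + j];
#endif
    }
}
```

    Note that the fused and non-fused forms can differ in the last bit of
the result, so replacing vmlaq_f32 with vfmaq_f32 is a numerical change as
well as a code-generation change.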