<div dir="ltr">I am using clang 9 to build code that contains many arm64 intrinsics. I use vmlaq_f32 to perform multiply-accumulate operations on float32x4_t values. I expected an fmla instruction to be generated, but clang emits separate fmul and fadd instructions instead. For a simple function this is not an issue, but for a function that uses many NEON registers, clang 9 generates inefficient code that frequently spills NEON registers to the stack and reloads them. If clang generated fmla instructions, the 32 NEON registers would be more than enough.<div>BTW: I have tested a function that uses vmlaq_f32 heavily. When I build it for armv7-a, the generated code is very efficient (clang emits vmla instructions in this case), but when I build it for armv8-a, the generated code looks very inefficient, with many stores to and loads from the stack.</div><div>Is there a way to force clang 9 to generate an fmla instruction for vmlaq_f32? Thanks.</div></div>
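<div>To make the question concrete, here is a minimal sketch of the kind of kernel I mean. The function names (mla_unfused, mla_fused) and the loop shape are hypothetical and not taken from my real code. The first variant uses vmlaq_f32, which is the one I see compiled to fmul + fadd on arm64. The second uses vfmaq_f32, the fused multiply-add intrinsic, which as far as I understand is the one that corresponds to fmla. Note that the fused form rounds once instead of twice, so results can differ in the last bit. The non-arm64 branch is only a plain scalar fallback so the sketch compiles elsewhere; it does not model fused rounding.</div>

```c
#include <stddef.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical kernel: acc[i] += a[i] * b[i].
 * Assumes n is a multiple of 4. Uses vmlaq_f32 (unfused multiply-accumulate). */
void mla_unfused(float *acc, const float *a, const float *b, size_t n)
{
#if defined(__aarch64__)
    for (size_t i = 0; i < n; i += 4) {
        float32x4_t vacc = vld1q_f32(acc + i);
        float32x4_t va   = vld1q_f32(a + i);
        float32x4_t vb   = vld1q_f32(b + i);
        /* vacc + va * vb, computed as a multiply followed by an add */
        vacc = vmlaq_f32(vacc, va, vb);
        vst1q_f32(acc + i, vacc);
    }
#else
    /* Scalar fallback for non-arm64 builds */
    for (size_t i = 0; i < n; ++i)
        acc[i] = acc[i] + a[i] * b[i];
#endif
}

/* Same kernel, but with vfmaq_f32 (fused multiply-add, single rounding). */
void mla_fused(float *acc, const float *a, const float *b, size_t n)
{
#if defined(__aarch64__)
    for (size_t i = 0; i < n; i += 4) {
        float32x4_t vacc = vld1q_f32(acc + i);
        float32x4_t va   = vld1q_f32(a + i);
        float32x4_t vb   = vld1q_f32(b + i);
        /* vacc + va * vb as one fused operation */
        vacc = vfmaq_f32(vacc, va, vb);
        vst1q_f32(acc + i, vacc);
    }
#else
    /* Scalar fallback; not fused, kept only so the sketch builds everywhere */
    for (size_t i = 0; i < n; ++i)
        acc[i] = acc[i] + a[i] * b[i];
#endif
}
```

<div>Another thing that may be worth trying, though I have not verified it with clang 9, is keeping vmlaq_f32 and building with -ffp-contract=fast, which permits the compiler to contract a separate multiply and add into a fused operation.</div>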