<div dir="ltr">Hi Tim,<br><br><br><div><div class="gmail_extra"><br><br><div class="gmail_quote"><br>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="im"><br>

> cortex-a15 vfpv4 : vmla instruction emitted (which is a NEON instruction)<br>

<br>

</div>I get a VFP vmla here rather than a NEON one (clang -target<br>

armv7-linux-gnueabihf -mcpu=cortex-a15): "vmla.f32 s0, s1, s2". Are<br>

you seeing something different?<br></blockquote><div><br></div><div>As per <span name="Renato Golin" class="">Renato's comment above, vmla is a NEON instruction while vfma is a VFP instruction. Correct me if I am wrong on this.<br>

</span></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<div class="im"><br>

> However, if gcc emits vmla (NEON) instruction with cortex-a8 then shouldn't<br>

> LLVM also emit vmla (NEON) instruction?<br>

<br>

</div>It appears we've decided in the past that vmla just isn't worth it on<br>

Cortex-A8. There's this comment in the source:<br>

<br>

// Some processors have FP multiply-accumulate instructions that don't<br>

// play nicely with other VFP / NEON instructions, and it's generally better<br>

// to just not use them.<br>

<br>

Sufficient benchmarking evidence could overturn that decision, but I<br>

assume the people who added it in the first place didn't do so on a<br>

whim.<br>

<div class="im"><br>

> The performance gain with vmla instruction is huge.<br>

<br>

</div>Is it, on Cortex-A8? The TRM refers to them jumping across pipelines<br>

in odd ways, and that was a very primitive core so it's almost<br>

certainly not going to be just as good as a vmul (in fact if I'm<br>

reading correctly, it takes pretty much exactly the same time as<br>

separate vmul and vadd instructions, 10 cycles vs 2 * 5).<br></blockquote><div><br></div><div>It may seem that the total number of cycles is more or less the same for a single vmla and for vmul+vadd. However, when the vmul+vadd combination is used instead of vmla, intermediate results are generated that need to be stored in memory for later access. This leads to a lot of load/store operations being inserted, which degrades performance. Correct me if I am wrong on this, but my observations to date have shown this. <br>
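<br>To make this easier to check, here is a minimal sketch of the kind of multiply-accumulate loop I have in mind. The function name mac and its parameters are made up for illustration and are not taken from my actual code. Compiling it with the command line you used plus -O2 -S, once with -mcpu=cortex-a8 and once with -mcpu=cortex-a15, should show whether the loop body becomes a single vmla.f32 or a vmul.f32/vadd.f32 pair, and whether any extra loads/stores appear around it:<br>

```c
/* Illustrative kernel only: the name and signature are hypothetical. */
float mac(const float *a, const float *b, float acc, int n)
{
    int i;
    for (i = 0; i != n; ++i) {
        /* Candidate for one vmla.f32; otherwise a vmul.f32 followed by a vadd.f32. */
        acc += a[i] * b[i];
    }
    return acc;
}
```

<br>I have not attached assembly output here; this is only meant as a starting point for comparing the two targets.<br>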

</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

Cheers.<br>

<span class=""><font color="#888888"><br>

Tim.<br>

</font></span></blockquote></div><br><br clear="all"><br>-- <br>With regards,<br>Suyog Sarda<br>

</div></div></div>