<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On 19 December 2013 11:16, suyog sarda <span dir="ltr">&lt;<a href="mailto:sardask01@gmail.com" target="_blank">sardask01@gmail.com</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div><div class="gmail_extra">Test case name: llvm/projects/test-suite/SingleSource/Benchmarks/Misc/matmul_f64_4x4.c. This is a 4x4 matrix multiplication; with small changes it can be made a 3x3 matrix multiplication, which is simpler to understand.<br>

</div></div></div></div></div></blockquote><div><br></div><div>This is one very specific case. How does it behave in all the other cases? Normally, every big improvement comes with a cost, and if you only look at the benchmark you're tuning for, you'll never see it. It may be that the cost is small and that we decide to pay the price, but not until we know what the cost is.</div>

<div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div><div class="gmail_extra">

</div><div class="gmail_extra">This was tested on real hardware. Time taken for a 4x4 matrix multiplication:<br></div></div></div></div></div></blockquote><div><br></div><div>Which hardware? Cortex-A7? A8? A9? A15?</div><div><br>

</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><div><div class="gmail_extra">Also, Renato stated that "there is a pipeline stall between two sequential VMLAs (possibly due to the need of re-use of some registers) and this made code much slower than a sequence of VMLA+VMUL+VADD". Yet when I use -mcpu=cortex-a15 as an option, clang emits vmla instructions back to back (sequentially). Is there something different about cortex-a15 regarding pipeline stalls, such that we ignore back-to-back vmla hazards there?<br>

</div></div></div></div></div></blockquote><div></div></div><br></div><div class="gmail_extra">A8 and A15 are quite different beasts. I haven't read about this hazard in the A15 manual, so I suspect that they have fixed whatever was causing the stall.</div>

<div class="gmail_extra"><br></div><div class="gmail_extra">cheers,</div><div class="gmail_extra">--renato</div></div>