[LLVMdev] LLVM ARM VMLA instruction

Thu Dec 19 03:27:25 PST 2013

On 19 December 2013 11:16, suyog sarda <sardask01 at gmail.com> wrote:

> Test case name :
> llvm/projects/test-suite/SingleSource/Benchmarks/Misc/matmul_f64_4x4.c  -
> This is a 4x4 matrix multiplication, we can make small changes to make it a
> 3x3 matrix multiplication for making things simple to understand .
>

This is one very specific case. How does it behave in all other cases?
Normally, every big improvement comes with a cost, and if you only look at
the benchmark you're tuning to, you'll never see it. It may be that the
cost is small and that we decide to pay the price, but not until we know
what the cost is.

> This was tested on real hardware. Time taken for a 4x4 matrix
> multiplication:
>

What hardware? A7? A8? A9? A15?

> Also, as stated by Renato - "there is a pipeline stall between two
> sequential VMLAs (possibly due to the need of re-use of some registers) and
> this made code much slower than a sequence of VMLA+VMUL+VADD" , when i use
> -mcpu=cortex-a15 as option, clang emits vmla instructions back to
> back(sequential) . Is there something different with cortex-a15 regarding
> pipeline stalls, that we are ignoring back to back vmla hazards?
>

A8 and A15 are quite different beasts. I haven't read about this hazard in
the A15 manual, so I suspect that they have fixed whatever was causing the
stall.
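
For readers who want to look at the generated code themselves, here is a
minimal sketch of the kind of kernel under discussion. It is not the
test-suite source (matmul_f64_4x4.c); the names matmul and
matmul_self_check are hypothetical, and the flat row-major layout is an
assumption made to keep the example short. The inner loop is a plain
multiply-accumulate, which is the pattern a backend may lower either to
back-to-back VMLAs or to separate VMUL and VADD instructions.

```c
#include <assert.h>

#define N 4

/* Hypothetical helper, not taken from the test-suite:
 * c = a * b for N x N matrices of doubles stored row-major in flat arrays. */
void matmul(const double *a, const double *b, double *c)
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            double acc = 0.0;
            for (int k = 0; k < N; k++) {
                /* acc = acc + a*b: the multiply-accumulate pattern that
                 * may become a VMLA, or a VMUL followed by a VADD. */
                acc += a[i * N + k] * b[k * N + j];
            }
            c[i * N + j] = acc;
        }
    }
}

/* Sanity check: multiplying by the identity must return the input. */
void matmul_self_check(void)
{
    const double a[N * N] = {
         1,  2,  3,  4,
         5,  6,  7,  8,
         9, 10, 11, 12,
        13, 14, 15, 16
    };
    const double id[N * N] = {
        1, 0, 0, 0,
        0, 1, 0, 0,
        0, 0, 1, 0,
        0, 0, 0, 1
    };
    double c[N * N];

    matmul(a, id, c);
    for (int i = 0; i < N * N; i++)
        assert(c[i] == a[i]);
}
```

Assuming a clang that can target ARM, compiling this with -O3 -S once
with -mcpu=cortex-a8 and once with -mcpu=cortex-a15 and comparing the
assembly is one way to see which sequence is chosen for each core. Timing
on real hardware is still needed to judge whether the choice pays off.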

cheers,
--renato