[LLVMdev] LLVM ARM VMLA instruction

suyog sarda sardask01 at gmail.com
Thu Dec 19 05:30:30 PST 2013


 Test case name :
>> llvm/projects/test-suite/SingleSource/Benchmarks/Misc/matmul_f64_4x4.c  -
>> This is a 4x4 matrix multiplication; we can make small changes to turn it
>> into a 3x3 matrix multiplication to keep things simple to understand.
>>
>
> This is one very specific case. How does that behave on all other cases?
> Normally, every big improvement comes with a cost, and if you only look at
> the benchmark you're tuning to, you'll never see it. It may be that the
> cost is small and that we decide to pay the price, but not until we know
> what that cost is.
>
>
I agree that we should approach this as a whole rather than in bits and
pieces. I was basically comparing the performance of clang- and gcc-generated
code for the benchmarks in llvm trunk. I found that wherever there were
floating point ops (specifically floating point multiplication), performance
with clang was worse. On analyzing those cases further, I came across the
vmla instruction emitted by gcc. The test cases hit by clang's poor
performance are:

Test case                                                                  No. of vmla instructions emitted by gcc
                                                                           (clang does not emit vmla for cortex-a8)
=========================================================================  ========================================
llvm/projects/test-suite/SingleSource/Benchmarks/Misc-C++/Large/sphereflake    55
llvm/projects/test-suite/SingleSource/Benchmarks/Misc-C++/Large/ray.cpp        40
llvm/projects/test-suite/SingleSource/Benchmarks/Misc/ffbench.c                 8
llvm/projects/test-suite/SingleSource/Benchmarks/Misc/matmul_f64_4x4.c         18
llvm/projects/test-suite/SingleSource/Benchmarks/BenchmarkGame/n-body.c        36

The vmul+vadd instruction pair comes with the extra overhead of load/store
ops, as can be seen in the generated assembly. With the -mcpu=cortex-a15
option clang performs better, as it emits vmla instructions.
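To make the pattern concrete, a stripped-down kernel along the lines of the
one below (an illustrative sketch only, not the actual benchmark source; the
function name is made up) contains the multiply-accumulate that gcc folds
into vmla.f64 but clang for cortex-a8 leaves as vmul.f64 + vadd.f64:

    /* Illustrative sketch of the multiply-accumulate pattern in these
     * benchmarks (file and function names are hypothetical).
     * Each iteration computes acc += a[i] * b[i]; a backend that fuses the
     * multiply and add emits a single vmla.f64, otherwise it emits a
     * vmul.f64 followed by a vadd.f64, plus the extra register traffic
     * mentioned above. */
    double mac_kernel(const double *a, const double *b, int n)
    {
        double acc = 0.0;
        for (int i = 0; i < n; ++i)
            acc += a[i] * b[i];
        return acc;
    }

Compiling this with clang -S -mcpu=cortex-a8 versus -mcpu=cortex-a15
(optimizations enabled) should show the same vmul+vadd versus vmla difference
in the generated assembly as the benchmarks above.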


>
> This was tested on real hardware. Time taken for a 4x4 matrix
>> multiplication:
>>
>
> What hardware? A7? A8? A9? A15?
>

I tested it on an A15; I don't have access to an A8 right now, but I intend
to test on an A8 as well. I compiled the code for the A8, and since it ran
fine on the A15 without any crash, I went ahead with the cortex-a8 option.
I don't think I will get A8 hardware soon, so could someone please check it
on A8 hardware as well (sorry for the trouble)?


>
>
>> Also, as stated by Renato - "there is a pipeline stall between two
>> sequential VMLAs (possibly due to the need to re-use some registers) and
>> this made code much slower than a sequence of VMLA+VMUL+VADD" - when I use
>> -mcpu=cortex-a15 as the option, clang emits vmla instructions back to
>> back (sequentially). Is there something different about cortex-a15
>> regarding pipeline stalls, such that we can ignore back-to-back vmla
>> hazards?
>>
>
> A8 and A15 are quite different beasts. I haven't read about this hazard in
> the A15 manual, so I suspect that they have fixed whatever was causing the
> stall.
>

OK. I couldn't find a reference for this. If the pipeline stall issue was
fixed in cortex-a15, then the LLVM developers must already know about it
(and hence we emit vmla for cortex-a15), but I couldn't find any comment
related to this in the code. Can someone please point it out? Also, I would
be glad to know the place in the code where we start differentiating between
cortex-a8 and cortex-a15 for code generation.
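For concreteness, the back-to-back case being discussed is an accumulation
chain like the one in the sketch below (illustrative only): every
multiply-add feeds the same accumulator, so each vmla depends on the result
of the previous one, which is exactly where the reported A8 stall would hurt.

    /* Illustrative sketch: a chain of dependent multiply-accumulates.
     * If each statement is selected as a vmla into the same destination
     * register, the vmlas end up back to back and serially dependent;
     * splitting them into vmul + vadd (or using separate accumulators)
     * avoids immediately re-using the result register. */
    double chain(double a0, double b0, double a1, double b1,
                 double a2, double b2, double a3, double b3)
    {
        double acc = 0.0;
        acc += a0 * b0;   /* vmla #1                       */
        acc += a1 * b1;   /* vmla #2, waits on #1's result */
        acc += a2 * b2;   /* vmla #3, waits on #2's result */
        acc += a3 * b3;   /* vmla #4, waits on #3's result */
        return acc;
    }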



-- 
With regards,
Suyog Sarda

