[LLVMdev] LLVM ARM VMLA instruction

suyog sarda sardask01 at gmail.com
Thu Dec 19 03:16:44 PST 2013


On Thu, Dec 19, 2013 at 4:36 PM, Renato Golin <renato.golin at linaro.org> wrote:

> On 19 December 2013 08:50, suyog sarda <sardask01 at gmail.com> wrote:
>
>> It may seem that the total number of cycles is more or less the same for a
>> single vmla and for vmul+vadd. However, when the vmul+vadd combination is
>> used instead of vmla, intermediate results are generated that need to be
>> stored in memory for future access. This leads to a lot of load/store ops
>> being inserted, which degrades performance. Correct me if I am wrong on
>> this, but my observations to date have shown this.
>>
>
> VMLA.F can be either NEON or VFP on A series and the encoding will
> determine which will be used. In assembly files, the difference is mainly
> the type vs. the registers used.
>
> The problem we were trying to avoid a long time ago was well researched by
> Evan Cheng, and it showed that there is a pipeline stall between two
> sequential VMLAs (possibly due to the need to re-use some registers), and
> this made code much slower than a sequence of VMLA+VMUL+VADD.
>
> Also, please note that, as accurate as cycle counts go, according to the
> A9 manual, one VFP VMLA takes almost as long as a pair of VMUL+VADD to
> provide the results, so a sequence of VMUL+VADD might be faster, in some
> contexts or cores, than half the sequence of VMLAs.
>
> As Tim and David said and I agree, without hard data, anything we say
> might be used against us. ;)
>
>

Sorry folks, I didn't specify the actual test case and results in detail
previously. The details are as follows:

Test case name:
llvm/projects/test-suite/SingleSource/Benchmarks/Misc/matmul_f64_4x4.c -
This is a 4x4 matrix multiplication; we can make small changes to turn it
into a 3x3 matrix multiplication to keep things simple to understand.

clang version: trunk (latest as of today, 19 Dec 2013)
GCC version: 4.5 (I checked with 4.8 as well)

Flags passed to both GCC and clang: -march=armv7-a -mfloat-abi=softfp
-mfpu=vfpv3-d16 -mcpu=cortex-a8
Optimization level used: -O3
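For reference, the invocations look roughly like the following (the cross-compiler name arm-linux-gnueabi-gcc, the target triple, and the file names are illustrative assumptions; adjust for your toolchain):

```shell
# Sketch of the compile commands described above; cross-compiler name,
# target triple, and file names are assumptions, not from the thread.
clang -target armv7a-linux-gnueabi -O3 -march=armv7-a -mfloat-abi=softfp \
      -mfpu=vfpv3-d16 -mcpu=cortex-a8 -S matmul_f64_4x4.c -o matmul_clang.s

arm-linux-gnueabi-gcc -O3 -march=armv7-a -mfloat-abi=softfp \
      -mfpu=vfpv3-d16 -mcpu=cortex-a8 -S matmul_f64_4x4.c -o matmul_gcc.s

# Count the vmla instructions each compiler emitted
grep -c vmla matmul_clang.s matmul_gcc.s
```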

Clang emits no vmla instruction, but GCC happily emits it.


This was tested on real hardware. Time taken for a 4x4 matrix
multiplication:

clang : ~14 secs
gcc : ~9 secs


Time taken for a 3x3 matrix multiplication:

clang : ~6.5 secs
gcc : ~5 secs


When the flag -mcpu=cortex-a8 is changed to -mcpu=cortex-a15, clang emits
vmla instructions (GCC emits them in both cases).

Time for 4x4 matrix multiplication :

clang : ~8.5 secs
GCC : ~9 secs

Time for 3x3 matrix multiplication :

clang : ~3.8 secs
GCC : ~5 secs

Please let me know if I am missing something. (The -ffast-math option
doesn't help in this case.) On examining the assembly code for the various
scenarios above, I reached the conclusion stated earlier about the extra
load/store ops.
Also, as Renato stated - "there is a pipeline stall between two sequential
VMLAs (possibly due to the need of re-use of some registers) and this made
code much slower than a sequence of VMLA+VMUL+VADD" - when I use
-mcpu=cortex-a15, clang emits vmla instructions back to back (sequentially).
Is there something different about cortex-a15 with regard to pipeline
stalls, such that we can ignore back-to-back vmla hazards?

-- 
With regards,
Suyog Sarda


More information about the llvm-dev mailing list