[LLVMdev] LLVM ARM VMLA instruction

suyog sarda sardask01 at gmail.com
Thu Dec 19 01:28:46 PST 2013


On Thu, Dec 19, 2013 at 2:43 PM, Tim Northover <t.p.northover at gmail.com> wrote:

> > As per Renato's comment above, vmla is a NEON instruction while vfma is
> > a VFP instruction. Correct me if I am wrong on this.
>
> My version of the ARM architecture reference manual (v7 A & R) lists
> versions requiring NEON and versions requiring VFP. (Section
> A8.8.337). Split in just the way you'd expect (SIMD variants need
> NEON).
>

I will check on this part.


>
> > It may seem that the total number of cycles is more or less the same for
> > a single vmla and for vmul+vadd. However, when the vmul+vadd combination
> > is used instead of vmla, intermediate results are generated which need
> > to be stored in memory for future access.
>
> Well, it increases register pressure slightly I suppose, but there's
> no need to store anything to memory unless that gets critical.
>
> > Correct me if I am wrong on this, but my observations to date have
> > shown this.
>
> Perhaps. Actual data is needed, I think, if you seriously want to
> change this behaviour in LLVM. The test-suite might be a good place to
> start, though it'll give an incomplete picture without the externals
> (SPEC & other things).
>
> Of course, if we're just speculating we can carry on.
>

I wasn't speculating. Let's take the example of a simple 3*3 matrix
multiplication (no loops; all multiplications and additions are hard-coded,
i.e. all the operations are expanded,
e.g. Result[0][0] = A[0][0]*B[0][0] + A[0][1]*B[1][0] + A[0][2]*B[2][0] and
so on for all 9 elements of the result).
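
For concreteness, a minimal sketch of such fully expanded code might look
like the following. The function name matmul3x3 and the use of float are
assumptions; the actual test source is not shown in this message. Each of
the 9 result elements takes 3 multiplies and 2 adds, i.e. 27 multiplies and
18 adds in the source.

```c
/* Fully unrolled 3x3 matrix multiply, R = A * B, with no loops.
   Hypothetical name and element type (float); illustration only. */
void matmul3x3(float A[3][3], float B[3][3], float R[3][3])
{
    /* Row 0 of the result */
    R[0][0] = A[0][0]*B[0][0] + A[0][1]*B[1][0] + A[0][2]*B[2][0];
    R[0][1] = A[0][0]*B[0][1] + A[0][1]*B[1][1] + A[0][2]*B[2][1];
    R[0][2] = A[0][0]*B[0][2] + A[0][1]*B[1][2] + A[0][2]*B[2][2];

    /* Row 1 of the result */
    R[1][0] = A[1][0]*B[0][0] + A[1][1]*B[1][0] + A[1][2]*B[2][0];
    R[1][1] = A[1][0]*B[0][1] + A[1][1]*B[1][1] + A[1][2]*B[2][1];
    R[1][2] = A[1][0]*B[0][2] + A[1][1]*B[1][2] + A[1][2]*B[2][2];

    /* Row 2 of the result */
    R[2][0] = A[2][0]*B[0][0] + A[2][1]*B[1][0] + A[2][2]*B[2][0];
    R[2][1] = A[2][0]*B[0][1] + A[2][1]*B[1][1] + A[2][2]*B[2][1];
    R[2][2] = A[2][0]*B[0][2] + A[2][1]*B[1][2] + A[2][2]*B[2][2];
}
```

In each expression, the second and third products are candidates for a
multiply-accumulate, which gives 9 plain multiplies plus 18
multiply-accumulates if every such opportunity is taken.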

If I compile the above code with "clang -O3 -mcpu=cortex-a8 -mfpu=vfpv3-d16"
(my ARM has only 16 floating-point registers, hence vfpv3-d16), there are
27 vmul, 18 vadd, 23 store and 30 load ops in total.
If the same code is compiled with gcc with the same options, there are
9 vmul, 18 vmla, 9 store and 20 load ops. So it's clear that extra
load/store ops get added with clang because it is not emitting the vmla
instruction. Won't this lead to performance degradation?

I would also like to know how the accuracy of vmla compares with that of a
vmul and vadd pair.
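
One way to make the accuracy question concrete is to compare a
multiply-add whose product is rounded to float before the add against one
whose product is kept unrounded. The sketch below is only an illustration
under stated assumptions: both function names are hypothetical, the wide
version uses double arithmetic as a stand-in for a multiply-add with an
unrounded product, and it says nothing about which of the two behaviours
vmla actually has.

```c
/* Product rounded to float, then added: models a separate multiply
   followed by a separate add. The volatile store forces the rounding and
   keeps the compiler from contracting the expression into a fused op. */
float mul_then_add(float a, float b, float c)
{
    volatile float p = a * b;
    return p + c;
}

/* Product kept exact (the product of two floats always fits in a double);
   the sum is computed in double and only then converted to float. */
float mul_add_wide(float a, float b, float c)
{
    return (float)((double)a * (double)b + (double)c);
}
```

With a = b = 1 + 2^-12 and c = -(1 + 2^-11), mul_then_add returns 0 while
mul_add_wide returns 2^-24, because the exact product 1 + 2^-11 + 2^-24
loses its last bit when it is rounded to float.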


-- 
With regards,
Suyog Sarda