[LLVMdev] LLVM ARM VMLA instruction

Renato Golin renato.golin at linaro.org
Thu Dec 19 03:06:03 PST 2013


On 19 December 2013 08:50, suyog sarda <sardask01 at gmail.com> wrote:

> It may seem that the total number of cycles is more or less the same for a
> single vmla and for vmul+vadd. However, when the vmul+vadd combination is
> used instead of vmla, intermediate results are generated which need to be
> stored in memory for future access. This leads to a lot of load/store ops
> being inserted, which degrades performance. Correct me if I am wrong on
> this, but my observations to date have shown this.
>

On A-series cores, VMLA.F can be either a NEON or a VFP instruction, and
the encoding determines which one is used. In assembly files, the
difference shows up mainly in the type and the registers used.
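
As a rough illustration (not part of the original thread), the same C
multiply-accumulate can end up in either form. The function names below are
made up for this example, and the instructions named in the comments are
only what a compiler targeting an A-series core might emit, depending on
flags; none of this is guaranteed.

```c
#include <stddef.h>

/* Scalar multiply-accumulate. Assumption: on ARM this may be lowered to a
 * VFP instruction on S registers, e.g. "vmla.f32 s0, s1, s2", or to a
 * separate vmul.f32 + vadd.f32 pair. */
float mla_scalar(float acc, float a, float b)
{
    return acc + a * b;
}

/* Four-lane multiply-accumulate. Assumption: if the loop is vectorized,
 * it may become a NEON instruction on Q registers, e.g.
 * "vmla.f32 q0, q1, q2". */
void mla_vec4(float acc[4], const float a[4], const float b[4])
{
    for (size_t i = 0; i < 4; ++i)
        acc[i] += a[i] * b[i];
}
```

Inspecting the generated assembly for a given target is the only way to see
which form was actually chosen.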

The problem we were trying to avoid a long time ago was well researched by
Evan Cheng, who showed that there is a pipeline stall between two
sequential VMLAs (possibly because some registers need to be reused), and
this made the code much slower than a sequence of VMLA+VMUL+VADD.
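
A minimal source-level sketch of the two shapes being compared, with
hypothetical function names. The comments describe one plausible lowering
only; a compiler is free to pick either instruction sequence for either
function, so this shows the pattern, not what any particular LLVM version
emits.

```c
/* Two dependent accumulations into the same value. A straightforward
 * lowering is two back-to-back VMLAs, where the second one needs the
 * result of the first. */
float two_mla(float acc, float a, float b, float c, float d)
{
    acc += a * b;   /* candidate for a VMLA */
    acc += c * d;   /* candidate for a second, dependent VMLA */
    return acc;
}

/* Same arithmetic with the second step written as an explicit multiply
 * followed by an add, roughly the VMLA+VMUL+VADD shape. */
float mla_then_mul_add(float acc, float a, float b, float c, float d)
{
    float t;
    acc += a * b;   /* candidate for a VMLA */
    t = c * d;      /* candidate for a VMUL */
    return acc + t; /* candidate for a VADD */
}
```

Timing both variants on the core in question is what would provide the hard
data; the source alone says nothing about which one is faster.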

Also, please note that, as far as cycle counts can be trusted, according to
the A9 manual one VFP VMLA takes almost as long as a VMUL+VADD pair to
provide its result, so a sequence of VMUL+VADD might be faster, in some
contexts or on some cores, than half the sequence of VMLAs.

As Tim and David said and I agree, without hard data, anything we say might
be used against us. ;)

cheers,
--renato