[LLVMdev] speed up memcpy intrinsic using ARM Neon registers
Bob Wilson
bob.wilson at apple.com
Wed Nov 11 09:20:24 PST 2009
On Nov 11, 2009, at 3:27 AM, Rodolph Perfetta wrote:
>
> If you know about the alignment, maybe use structured load/store
> (vst1.64/vld1.64 {dn-dm}). You may also want to work on whole cache lines
> (64 bytes on A8). You can find more in this discussion:
> http://groups.google.com/group/beagleboard/browse_thread/thread/12c7bd415fbc0993/e382202f1a92b0f8?lnk=gst&q=memcpy&pli=1
>
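For reference, here is a minimal sketch of the kind of loop described above.
The register choices, the prefetch distance, and the assumed calling
convention (r0 = dst, r1 = src, r2 = byte count that is a multiple of 64,
both pointers 16-byte aligned) are illustrative guesses, not a tuned
implementation:

    @ NEON copy loop (sketch): moves one 64-byte A8 cache line per iteration.
    copy_loop:
        pld     [r1, #256]            @ prefetch ahead; distance to be tuned
        vld1.64 {d0-d3}, [r1:128]!    @ load 32 bytes
        vld1.64 {d4-d7}, [r1:128]!    @ load 32 more (one A8 line = 64 bytes)
        vst1.64 {d0-d3}, [r0:128]!    @ store 32 bytes
        vst1.64 {d4-d7}, [r0:128]!    @ store 32 bytes
        subs    r2, r2, #64
        bgt     copy_loop
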
>> Even if it's not faster, it's still a code size win, which is also
>> important.
>
> Yes, but NEON will drive up your power consumption, so if you are not
> faster you will drain your battery faster (assuming you care, of course).
>
> In general we wouldn't recommend writing memcpy using NEON unless you can
> detect the exact core you will be running on: on A9, NEON will not give you
> any speedup; you'll just end up using more power. NEON is a SIMD engine.
>
> If one wanted to write memcpy on A9, we would recommend something like:
> * do not use NEON
> * use PLD (3-6 cache lines ahead, to be tuned)
> * ldm/stm whole cache lines (32 bytes on A9)
> * align destination
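
A rough sketch of that A9-style loop, assuming r0 = dst (already aligned),
r1 = src, and r2 = a byte count that is a multiple of 32; the register list
and the prefetch distance are only illustrative:

    @ ARM-only copy loop for A9 (sketch): one iteration moves a whole
    @ 32-byte A9 cache line; r3-r10 is just an illustrative register choice.
    copy_loop_a9:
        pld     [r1, #128]            @ a few lines ahead; to be tuned per core
        ldmia   r1!, {r3-r10}         @ load 32 bytes
        stmia   r0!, {r3-r10}         @ store 32 bytes
        subs    r2, r2, #32
        bgt     copy_loop_a9
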
Thanks, Rodolph. That is very helpful.
Can you comment on David Conrad's message in this thread regarding a ~20
cycle penalty for an ARM store following a NEON store to the same 16-byte
block? If the memcpy size is not a multiple of 8, we need some ARM
load/store instructions to copy the tail end of it. The context here is
LLVM generating inline code for small copies, so if there is a penalty like
that, it is probably not worthwhile to use NEON unless the alignment shows
that the tail will be in a separate 16-byte block. (And what's up with the
16-byte divisions? I thought the cache lines are 64 bytes....)
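
For concreteness, the case in question would look something like this
(purely an illustration, not what LLVM actually emits): a 20-byte inline
copy with r0 = dst and r1 = src, both 8-byte aligned.

        vld1.64 {d0, d1}, [r1]!       @ NEON copies bytes 0-15
        vst1.64 {d0, d1}, [r0]!
        ldr     r2, [r1]              @ ARM copies the 4-byte tail (bytes 16-19);
        str     r2, [r0]              @ if dst is 8-byte but not 16-byte aligned,
                                      @ this store lands in the same 16-byte block
                                      @ as part of the NEON store above, which is
                                      @ where the reported ~20-cycle stall would hit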