[LLVMdev] speed up memcpy intrinsic using ARM Neon registers

Rodolph Perfetta rodolph.perfetta at arm.com
Wed Nov 11 03:27:59 PST 2009


> >> I tried to speed up Dhrystone on ARM Cortex-A8 by optimizing the
> >> memcpy intrinsic. I used the Neon load multiple instruction to move
> >> up to 48 bytes at a time. Over 15 scalar instructions collapsed
> >> down into these 2 Neon instructions.
> 
> Nice. Thanks for working on this. It has long been on my todo list.
> 
> >>
> >>      fldmiad r3, {d0, d1, d2, d3, d4, d5}  @ SrcLine dhrystone.c 359
> >>      fstmiad r1, {d0, d1, d2, d3, d4, d5}
> >>
> >> It seems like this should be faster. But I did not see any
> >> appreciable speedup.

If you know the alignment, consider using the structured load/store
instructions (vld1.64/vst1.64 {dn-dm}). You may also want to work on whole
cache lines (64 bytes on A8). You can find more in this discussion:
http://groups.google.com/group/beagleboard/browse_thread/thread/12c7bd415fbc0993/e382202f1a92b0f8?lnk=gst&q=memcpy&pli=1
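
As a rough illustration of the whole-cache-line idea, here is a minimal C
sketch that moves 64 bytes per iteration using NEON intrinsics. The function
name is hypothetical, the exact vld1/vst1 encoding (and any alignment hint)
is left to the compiler, and the non-NEON fallback exists only so the sketch
builds on other targets. It has not been measured on an A8.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: copies n bytes in 64-byte steps (one Cortex-A8
 * cache line per iteration). Any trailing bytes beyond the last full
 * 64-byte block are NOT copied; the caller must handle them. */
void copy_lines_neon_sketch(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i + 64 <= n; i += 64) {
#if defined(__ARM_NEON)
        /* Four 16-byte loads followed by four 16-byte stores. */
        uint8x16_t q0 = vld1q_u8(src + i);
        uint8x16_t q1 = vld1q_u8(src + i + 16);
        uint8x16_t q2 = vld1q_u8(src + i + 32);
        uint8x16_t q3 = vld1q_u8(src + i + 48);
        vst1q_u8(dst + i, q0);
        vst1q_u8(dst + i + 16, q1);
        vst1q_u8(dst + i + 32, q2);
        vst1q_u8(dst + i + 48, q3);
#else
        /* Assumption: plain fallback when NEON is not available. */
        memcpy(dst + i, src + i, 64);
#endif
    }
}
```

Whether this beats the fldmiad/fstmiad pair would have to be benchmarked on
the actual core.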
 
> Even if it's not faster, it's still a code size win which is also
> important.

Yes, but NEON will drive up your power consumption, so if you are not faster
you will drain your battery faster (assuming you care, of course).

In general we wouldn't recommend writing memcpy using NEON unless you can
detect the exact core you will be running on: on A9, NEON will not give you
any speedup; you'll just end up using more power. NEON is a SIMD engine.

If one wanted to write memcpy on A9 we would recommend something like:
 * do not use NEON
 * use PLD (3-6 cache lines ahead, to be tuned)
 * ldm/stm whole cache lines (32 bytes on A9)
 * align destination
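
The four points above can be sketched in C roughly as follows. This is only
an illustration under stated assumptions: the function name and the
prefetch distance of 4 lines are hypothetical, __builtin_prefetch is the
GCC/Clang builtin (which those compilers may lower to PLD on ARM), and
whether the fixed-size 32-byte copy becomes ldm/stm depends on the compiler.
A tuned version would likely be written in assembly.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 32     /* Cortex-A9 cache line size, per the list above */
#define PREFETCH_LINES 4  /* hypothetical value in the 3-6 range; to be tuned */

/* Hypothetical name; not a drop-in replacement for the libc memcpy. */
void *copy_a9_sketch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Copy leading bytes until the destination is cache-line aligned. */
    while (n > 0 && ((uintptr_t)d % LINE_BYTES) != 0) {
        *d++ = *s++;
        n--;
    }

    /* Copy whole cache lines, prefetching the source a few lines ahead.
     * The prefetch is only a hint and does not fault, so it may point
     * past the end of the source buffer. */
    while (n >= LINE_BYTES) {
        __builtin_prefetch(
            (const void *)((uintptr_t)s + PREFETCH_LINES * LINE_BYTES));
        memcpy(d, s, LINE_BYTES);  /* fixed size: compiler expands inline */
        d += LINE_BYTES;
        s += LINE_BYTES;
        n -= LINE_BYTES;
    }

    /* Copy the remaining tail bytes. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

Aligning the destination rather than the source follows the last bullet;
the source is then read at whatever alignment it happens to have.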

Cheers,
Rodolph.




