[LLVMdev] speed up memcpy intrinsic using ARM Neon registers

Mon Nov 9 23:25:20 PST 2009

On Nov 9, 2009, at 11:13 PM, Evan Cheng wrote:

>>
>> On the A8, an ARM store after NEON stores to the same 16-byte block
>> incurs a ~20 cycle penalty since the NEON unit executes behind ARM.
>> It's worse if the NEON store was split across a 16-byte boundary;
>> then there could be a 50 cycle stall.
>>
>> See http://hardwarebug.org/2008/12/31/arm-neon-memory-hazards/ for
>> some more details and benchmarks.
>
> If that's the case, then for A8 we should only do this when there
> won't be trailing scalar loads / stores.

It should be safe if the start pointer is known to be 16-byte aligned.
The trailing stores then won't be in the same 16-byte chunk as the last
NEON store.

-Chris
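
To illustrate the alignment argument above, here is a minimal sketch in C.
It assumes a copy that issues 16-byte NEON stores from the start of the
destination and finishes any remainder with scalar stores. The function
name and the address model are hypothetical; this is not LLVM's lowering
code, only a check of when the scalar tail would land in the same 16-byte
block as the last NEON store.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Block size of the hazard described in the thread (16 bytes). */
#define NEON_BLOCK 16u

/*
 * Hypothetical helper: returns true if the scalar tail of a copy would
 * store into the same 16-byte block as the last NEON store.
 *
 * Assumption: the copy writes floor(len / 16) NEON stores of 16 bytes
 * each, starting at dst, and the remaining len % 16 bytes with scalar
 * stores.
 */
bool tail_shares_block_with_neon(uintptr_t dst, size_t len)
{
    size_t neon_bytes = len - (len % NEON_BLOCK); /* written by NEON   */
    size_t tail_bytes = len - neon_bytes;         /* written by scalar */

    /* No hazard if either kind of store is absent. */
    if (neon_bytes == 0 || tail_bytes == 0)
        return false;

    uintptr_t last_neon_byte  = dst + neon_bytes - 1;
    uintptr_t first_tail_byte = dst + neon_bytes;

    /* Same block iff both addresses have the same block index. */
    return (last_neon_byte / NEON_BLOCK) == (first_tail_byte / NEON_BLOCK);
}
```

With a 16-byte-aligned destination such as 0x1000 and a length of 40, the
NEON stores end exactly on a block boundary and the function returns false.
With a destination of 0x1004 and the same length, the tail starts in the
middle of the block the last NEON store touched and it returns true.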