[LLVMdev] speed up memcpy intrinsic using ARM Neon registers

Tue Nov 10 11:27:31 PST 2009

On Nov 9, 2009, at 11:25 PM, Chris Lattner wrote:

> 
> On Nov 9, 2009, at 11:13 PM, Evan Cheng wrote:
> 
>>> 
>>> On the A8, an ARM store after NEON stores to the same 16-byte block
>>> incurs a ~20 cycle penalty since the NEON unit executes behind ARM.
>>> It's worse if the NEON store was split across a 16-byte boundary, then
>>> there could be a 50 cycle stall.
>>> 
>>> See http://hardwarebug.org/2008/12/31/arm-neon-memory-hazards/ for
>>> some more details and benchmarks.
>> 
>> If that's the case, then for A8 we should only do this when there
>> won't be trailing scalar load / stores.
> 
> It should be safe if the start pointer is known 16-byte aligned.  The trailing stores won't be in the same 16-byte chunk.

According to
http://hardwarebug.org/2008/12/31/arm-neon-memory-hazards/

There are also secondary effects if the load and store fall within the same 64-byte block.
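
To make the two conditions concrete, here is a rough sketch in C of the address arithmetic under discussion. It assumes a copy that writes the bulk with 16-byte NEON stores and the remaining n % 16 bytes with scalar ARM stores. The helper names same_block and tail_shares_block are hypothetical and not part of LLVM, and the block sizes of 16 and 64 bytes are simply the figures quoted in this thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if addresses a and b lie in the same naturally
   aligned block of `block` bytes. `block` must be a power of two. */
static bool same_block(uintptr_t a, uintptr_t b, uintptr_t block) {
    return (a & ~(block - 1)) == (b & ~(block - 1));
}

/* Hypothetical model of a copy of n bytes to dst: 16-byte NEON stores for
   the bulk, scalar ARM stores for the last n % 16 bytes.
   Returns true if the first scalar store lands in the same `block`-byte
   region as the last byte written by a NEON store. */
static bool tail_shares_block(uintptr_t dst, size_t n, uintptr_t block) {
    size_t neon_bytes = n & ~(size_t)15;   /* bytes written by NEON stores */

    /* No NEON stores at all, or no scalar tail: nothing to collide. */
    if (neon_bytes == 0 || neon_bytes == n)
        return false;

    uintptr_t last_neon_byte  = dst + neon_bytes - 1;
    uintptr_t first_tail_byte = dst + neon_bytes;
    return same_block(last_neon_byte, first_tail_byte, block);
}
```

With a 16-byte-aligned dst, tail_shares_block(dst, n, 16) is always false, which matches Chris's point. With block set to 64 it is true unless the NEON portion happens to end exactly on a 64-byte boundary, which is the case the secondary effects would apply to.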

Evan

> 
> -Chris