[LLVMdev] speed up memcpy intrinsic using ARM Neon registers
Evan Cheng
evan.cheng at apple.com
Tue Nov 10 11:27:31 PST 2009
On Nov 9, 2009, at 11:25 PM, Chris Lattner wrote:
>
> On Nov 9, 2009, at 11:13 PM, Evan Cheng wrote:
>
>>>
>>> On the A8, an ARM store after NEON stores to the same 16-byte block
>>> incurs a ~20 cycle penalty since the NEON unit executes behind ARM.
>>> It's worse if the NEON store was split across a 16-byte boundary: in
>>> that case there could be a 50 cycle stall.
>>>
>>> See http://hardwarebug.org/2008/12/31/arm-neon-memory-hazards/ for
>>> some more details and benchmarks.
>>
>> If that's the case, then for A8 we should only do this when there
>> won't be trailing scalar load / stores.
>
> It should be safe if the start pointer is known 16-byte aligned. The trailing stores won't be in the same 16-byte chunk.
According to
http://hardwarebug.org/2008/12/31/arm-neon-memory-hazards/
there are secondary effects if the loads / stores fall within the same
64-byte block.
Evan
>
> -Chris
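
The alignment argument in this thread can be sketched as a small predicate. This is only an illustration under stated assumptions, not what LLVM's memcpy lowering actually does: it assumes the copy is emitted as 16-byte NEON stores covering the largest multiple of 16 bytes, followed by scalar ARM stores for the remainder. The helper name tail_shares_block and its parameters are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not an LLVM API.
 *
 * Assumed lowering: the first (size rounded down to a multiple of 16) bytes
 * at dst are written with 16-byte NEON stores, and the remaining bytes are
 * written with scalar ARM stores.
 *
 * Returns true if a trailing scalar store lands in the same block-byte
 * aligned block as a byte written by a NEON store. */
static bool tail_shares_block(uintptr_t dst, size_t size, size_t block)
{
    assert(block != 0);

    size_t vec_bytes = size & ~(size_t)15;  /* bytes covered by NEON stores */
    size_t tail_bytes = size - vec_bytes;   /* bytes left for scalar stores */

    /* No NEON stores or no scalar tail means there is nothing to collide. */
    if (vec_bytes == 0 || tail_bytes == 0)
        return false;

    /* The two regions are adjacent, so they can share a block only if the
     * last NEON byte and the first scalar byte fall in the same block. */
    uintptr_t last_vec_byte = dst + vec_bytes - 1;
    uintptr_t first_tail_byte = dst + vec_bytes;
    return last_vec_byte / block == first_tail_byte / block;
}
```

With a 16-byte aligned dst, tail_shares_block(dst, size, 16) is always false, which is Chris's point. Passing block = 64 shows Evan's concern: a 20-byte copy to a 64-byte aligned dst still puts the scalar tail in the same 64-byte block as the NEON store.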