[LLVMdev] speed up memcpy intrinsic using ARM Neon registers
Chris Lattner
clattner at apple.com
Mon Nov 9 23:25:20 PST 2009
On Nov 9, 2009, at 11:13 PM, Evan Cheng wrote:
>>
>> On the A8, an ARM store after NEON stores to the same 16-byte block
>> incurs a ~20 cycle penalty since the NEON unit executes behind ARM.
>> It's worse if the NEON store was split across a 16-byte boundary,
>> then there could be a 50 cycle stall.
>>
>> See http://hardwarebug.org/2008/12/31/arm-neon-memory-hazards/ for
>> some more details and benchmarks.
>
> If that's the case, then for A8 we should only do this when there
> won't be trailing scalar load / stores.
It should be safe if the start pointer is known to be 16-byte aligned.
With 16-byte NEON stores, the trailing scalar stores then won't be in
the same 16-byte chunk as any NEON store.
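
A rough sketch of that reasoning, in C. This is not LLVM code: the
helper name tail_shares_neon_block is hypothetical, and it assumes the
copy is lowered as 16-byte NEON stores for the bulk followed by scalar
stores for the remaining len % 16 bytes. It only checks whether the
scalar tail would land in a 16-byte block that a NEON store also wrote.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only.
 * Assumed lowering: the first (len / 16) * 16 bytes are written with
 * 16-byte NEON stores, the remaining len % 16 bytes with scalar stores.
 * Returns true if the scalar tail touches a 16-byte block that a NEON
 * store also wrote (the case that stalls on the A8). */
static bool tail_shares_neon_block(uintptr_t dst, size_t len)
{
    size_t neon_bytes = len & ~(size_t)15; /* bytes written by NEON */
    size_t tail_bytes = len & 15;          /* bytes written by scalar stores */

    if (neon_bytes == 0 || tail_bytes == 0)
        return false; /* only one kind of store, so no hazard */

    uintptr_t last_neon_byte  = dst + neon_bytes - 1;
    uintptr_t first_tail_byte = dst + neon_bytes;

    /* The tail starts right after the NEON region, so the two can only
     * overlap in the block holding the last NEON byte. */
    return (last_neon_byte >> 4) == (first_tail_byte >> 4);
}
```

With dst = 0x1000 and len = 20 the NEON store fills the block at
0x1000 exactly and the 4 tail bytes start at 0x1010, so the helper
returns false. With dst = 0x1008 the NEON store straddles two blocks
and the tail at 0x1018 lands in the second one, so it returns true.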
-Chris