[LLVMdev] speed up memcpy intrinsic using ARM Neon registers

Mon Nov 9 23:13:28 PST 2009

On Nov 9, 2009, at 5:59 PM, David Conrad wrote:

> On Nov 9, 2009, at 7:34 PM, Neel Nagar wrote:
>
>> I tried to speed up Dhrystone on ARM Cortex-A8 by optimizing the
>> memcpy intrinsic. I used the Neon load multiple instruction to move up
>> to 48 bytes at a time. Over 15 scalar instructions collapsed down
>> into these 2 Neon instructions.

Nice. Thanks for working on this. It has long been on my todo list.

>>
>>      fldmiad r3, {d0, d1, d2, d3, d4, d5}  @ SrcLine dhrystone.c 359
>>      fstmiad r1, {d0, d1, d2, d3, d4, d5}
>>
>> It seems like this should be faster. But I did not see any
>> appreciable speedup.

Even if it's not faster, it's still a code size win, which is also
important. Are we generating correctly aligned NEON loads / stores?

>>
>> I think the patch is correct. The code runs fine.
>>
>> I have attached my patch for "lib/Target/ARM/ARMISelLowering.cpp" to
>> this email.
>>
>> Does this look like the right modification?
>>
>> Does anyone have any insights into why this is not way faster than
>> using scalar registers?
>
> On the A8, an ARM store after NEON stores to the same 16-byte block
> incurs a ~20 cycle penalty since the NEON unit executes behind ARM.
> It's worse if the NEON store was split across a 16-byte boundary: in
> that case there could be a 50 cycle stall.
>
> See http://hardwarebug.org/2008/12/31/arm-neon-memory-hazards/ for
> some more details and benchmarks.

If that's the case, then for A8 we should only use the NEON load /
store multiple sequence when there won't be trailing scalar loads /
stores.
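
A rough sketch of that condition, in C++. This is only an illustration
under stated assumptions, not the code from the attached patch: the
names planInlineMemcpy, CopyPlan, and avoidTrailingScalar are
hypothetical, and the 48-byte limit is simply the six D registers
(d0-d5) used in the fldmiad / fstmiad pair quoted above.

```cpp
#include <cstdint>

// Hypothetical constants: one NEON D register holds 8 bytes, and the
// quoted example moves d0-d5, i.e. 6 registers = 48 bytes.
constexpr unsigned kDRegBytes = 8;
constexpr unsigned kMaxDRegs = 6;
constexpr unsigned kMaxNeonBytes = kDRegBytes * kMaxDRegs;

// Hypothetical result type: whether to use the NEON load/store
// multiple pair, and how many D registers it would move.
struct CopyPlan {
  bool useNeon;
  unsigned dRegs;
};

// Decide whether a fixed-size inline memcpy should use NEON.
// avoidTrailingScalar would be set for cores such as Cortex-A8, where
// a scalar store following a NEON store to the same 16-byte block is
// reported to stall.
inline CopyPlan planInlineMemcpy(std::uint64_t sizeBytes,
                                 bool avoidTrailingScalar) {
  // Too small for even one D register, or too large for one pair.
  if (sizeBytes < kDRegBytes || sizeBytes > kMaxNeonBytes)
    return {false, 0};

  // Bytes left over after the D-register transfers would have to be
  // copied with scalar loads / stores.
  const unsigned tail = static_cast<unsigned>(sizeBytes % kDRegBytes);
  if (avoidTrailingScalar && tail != 0)
    return {false, 0};

  return {true, static_cast<unsigned>(sizeBytes / kDRegBytes)};
}
```

With avoidTrailingScalar set, a 48-byte copy would still use six D
registers, while a 44-byte copy would fall back to the scalar path
because its last 4 bytes would need a scalar store right after the
NEON store.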

Evan
