[LLVMdev] speed up memcpy intrinsic using ARM Neon registers
Evan Cheng
evan.cheng at apple.com
Mon Nov 9 23:13:28 PST 2009
On Nov 9, 2009, at 5:59 PM, David Conrad wrote:
> On Nov 9, 2009, at 7:34 PM, Neel Nagar wrote:
>
>> I tried to speed up Dhrystone on ARM Cortex-A8 by optimizing the
>> memcpy intrinsic. I used the Neon load multiple instruction to move up
>> to 48 bytes at a time. Over 15 scalar instructions collapsed down
>> into these 2 Neon instructions.
Nice. Thanks for working on this. It has long been on my todo list.
>>
>> fldmiad r3, {d0, d1, d2, d3, d4, d5} @ SrcLine dhrystone.c 359
>> fstmiad r1, {d0, d1, d2, d3, d4, d5}
>>
>> It seems like this should be faster. But I did not see any
>> appreciable speedup.
Even if it's not faster, it's still a code size win which is also
important. Are we generating the right aligned NEON load / stores?
>>
>> I think the patch is correct. The code runs fine.
>>
>> I have attached my patch for "lib/Target/ARM/ARMISelLowering.cpp" to
>> this email.
>>
>> Does this look like the right modification?
>>
>> Does anyone have any insights into why this is not way faster than
>> using scalar registers?
>
> On the A8, an ARM store after NEON stores to the same 16-byte block
> incurs a ~20 cycle penalty since the NEON unit executes behind ARM.
> It's worse if the NEON store was split across a 16-byte boundary, then
> there could be a 50 cycle stall.
>
> See http://hardwarebug.org/2008/12/31/arm-neon-memory-hazards/ for
> some more details and benchmarks.
If that's the case, then for A8 we should only do this when there
won't be trailing scalar load / stores.
Evan
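
A minimal sketch of the heuristic suggested above, assuming the copy size
is a compile-time constant. It splits the copy into a part moved with NEON
D registers (8 bytes each, up to six per load/store multiple, i.e. 48 bytes
as in the fldmiad/fstmiad pair quoted earlier) and a scalar tail, and only
picks the NEON path when no tail remains. All names here (CopyPlan,
planCopy, neonPairs, useNeonOnA8) are hypothetical and are not part of
ARMISelLowering.cpp or any LLVM API.

```cpp
#include <cstdint>

// Hypothetical constants for illustration only.
const uint64_t kDRegBytes = 8;      // size of one NEON D register
const uint64_t kMaxNeonChunk = 48;  // six D registers per load/store multiple

// How a fixed-size copy would be split between NEON and scalar moves.
struct CopyPlan {
  uint64_t neonBytes;   // bytes moved through D registers
  uint64_t scalarTail;  // leftover bytes needing scalar loads/stores
};

// Use as many whole D registers as fit; the remainder is the scalar tail.
inline CopyPlan planCopy(uint64_t size) {
  uint64_t neon = size - size % kDRegBytes;
  CopyPlan plan = {neon, size - neon};
  return plan;
}

// Number of load-multiple/store-multiple pairs needed for the NEON part.
inline uint64_t neonPairs(uint64_t neonBytes) {
  return (neonBytes + kMaxNeonChunk - 1) / kMaxNeonChunk;
}

// Assumed A8 policy: take the NEON path only when it covers the whole
// copy, so no scalar store follows a NEON store to nearby memory.
inline bool useNeonOnA8(uint64_t size) {
  CopyPlan plan = planCopy(size);
  return plan.neonBytes != 0 && plan.scalarTail == 0;
}
```

Under this sketch a 48-byte copy would take the NEON path with one
load/store pair, while a 30-byte copy (24 NEON bytes plus a 6-byte tail)
would stay on scalar code. Whether that trade-off is right would still
need to be measured on real hardware.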