[LLVMdev] speed up memcpy intrinsic using ARM Neon registers
David Conrad
lessen42 at gmail.com
Mon Nov 9 17:59:07 PST 2009
On Nov 9, 2009, at 7:34 PM, Neel Nagar wrote:
> I tried to speed up Dhrystone on ARM Cortex-A8 by optimizing the
> memcpy intrinsic. I used the Neon load multiple instruction to move up
> to 48 bytes at a time. Over 15 scalar instructions collapsed down
> into these 2 Neon instructions.
>
> fldmiad r3, {d0, d1, d2, d3, d4, d5} @ SrcLine dhrystone.c 359
> fstmiad r1, {d0, d1, d2, d3, d4, d5}
>
> It seems like this should be faster. But I did not see any
> appreciable speedup.
>
> I think the patch is correct. The code runs fine.
>
> I have attached my patch for "lib/Target/ARM/ARMISelLowering.cpp" to
> this email.
>
> Does this look like the right modification?
>
> Does anyone have any insights into why this is not way faster than
> using scalar registers?
On the A8, an ARM store that follows a NEON store to the same 16-byte
block incurs a penalty of roughly 20 cycles, because the NEON unit
executes behind the ARM pipeline. It's worse if the NEON store was
split across a 16-byte boundary; in that case the stall could be as
long as 50 cycles.
See http://hardwarebug.org/2008/12/31/arm-neon-memory-hazards/ for
some more details and benchmarks.
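
To make the two conditions above concrete, here is a rough sketch in C of
the address arithmetic involved. The helper names are hypothetical; they
are not part of LLVM or of the attached patch, and they only model the
"same 16-byte block" and "split across a 16-byte boundary" rules as
described above, not the actual A8 pipeline.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Block size the hazard is described in terms of (from the text above). */
enum { HAZARD_BLOCK = 16 };

/* Hypothetical helper: index of the 16-byte block containing addr. */
static uintptr_t block_of(uintptr_t addr) {
    return addr / HAZARD_BLOCK;
}

/* True if a store of len bytes starting at addr spans more than one
 * 16-byte block, i.e. it is split across a 16-byte boundary. */
static bool crosses_block(uintptr_t addr, size_t len) {
    return len > 0 && block_of(addr) != block_of(addr + len - 1);
}

/* True if a later scalar store [s, s + slen) touches any 16-byte block
 * that an earlier NEON store [n, n + nlen) also touched. */
static bool stores_share_block(uintptr_t n, size_t nlen,
                               uintptr_t s, size_t slen) {
    if (nlen == 0 || slen == 0)
        return false;
    return block_of(n) <= block_of(s + slen - 1) &&
           block_of(s) <= block_of(n + nlen - 1);
}
```

For example, with a 48-byte NEON store starting at 0x1000, a following
4-byte ARM store at 0x102c lands in a block the NEON store wrote, while
one at 0x1030 does not. If the code after the memcpy writes a scalar
field right next to the copied data, that would be one possible way to
hit the penalty, though I have not checked that this is what happens in
Dhrystone.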