[LLVMdev] speed up memcpy intrinsic using ARM Neon registers
Neel Nagar
neelnagar42 at gmail.com
Mon Nov 9 16:34:54 PST 2009
I tried to speed up Dhrystone on an ARM Cortex-A8 by optimizing the
memcpy intrinsic. I used the Neon load-multiple instruction to move up
to 48 bytes at a time. More than 15 scalar instructions collapsed
into these two Neon instructions:
fldmiad r3, {d0, d1, d2, d3, d4, d5} @ SrcLine dhrystone.c 359
fstmiad r1, {d0, d1, d2, d3, d4, d5}
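For readers who want the same idea in portable form, here is a rough C sketch of what the instruction pair above does: load six 64-bit values (48 bytes) into temporaries, then store all six. It is not the code from the patch. The names copy_in_48_byte_blocks and BLOCK_BYTES are made up for illustration, and the array regs only stands in for d0..d5; whether a compiler actually maps it to those registers is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { BLOCK_BYTES = 48 };  /* six 64-bit D registers, d0..d5 */

/* Hypothetical helper, not the code from the patch: copies n bytes,
   moving each full 48-byte block as "load six doublewords, then store six". */
static void copy_in_48_byte_blocks(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    uint64_t regs[6];                    /* stands in for d0..d5 */

    assert(dst != NULL && src != NULL);
    assert(sizeof regs == BLOCK_BYTES);

    while (n >= BLOCK_BYTES) {
        memcpy(regs, s, BLOCK_BYTES);    /* like: fldmiad r3, {d0-d5} */
        memcpy(d, regs, BLOCK_BYTES);    /* like: fstmiad r1, {d0-d5} */
        s += BLOCK_BYTES;
        d += BLOCK_BYTES;
        n -= BLOCK_BYTES;
    }
    if (n > 0)
        memcpy(d, s, n);                 /* tail shorter than one block */
}
```

A call such as copy_in_48_byte_blocks(dst, src, 100) would move two full blocks and then a 4-byte tail.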
It seems like this should be faster, but I did not see any appreciable speedup.
I think the patch is correct; the generated code runs fine.
I have attached my patch for "lib/Target/ARM/ARMISelLowering.cpp" to this email.
Does this look like the right modification?
Does anyone have any insights into why this is not way faster than
using scalar registers?
I am using a BeagleBoard.
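One way to check whether the copy itself matters to the overall score is to time it in isolation. The following is only a sketch under stated assumptions: time_copies is a hypothetical helper, it measures CPU time with the standard clock() call, and it times whatever memcpy the compiler and C library provide, so it would need to be built with and without the patch to compare. The caller must pass n greater than zero.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical harness: performs `iters` copies of `n` bytes (n > 0) and
   returns the elapsed CPU time in seconds as reported by clock(). */
static double time_copies(void *dst, const void *src, size_t n,
                          unsigned long iters)
{
    volatile unsigned char sink = 0;     /* volatile so the reads are kept */
    clock_t start = clock();

    for (unsigned long i = 0; i < iters; i++) {
        memcpy(dst, src, n);
        /* Read one copied byte so each copy has an observable use. */
        sink ^= ((const unsigned char *)dst)[i % n];
    }

    clock_t end = clock();
    (void)sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Dividing the result by iters gives an approximate per-copy cost, which can then be compared against the total run time of the benchmark.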
Thanks,
Neel Nagar
-------------- next part --------------
A non-text attachment was scrubbed...
Name: memcpy_neon_091109.patch
Type: application/octet-stream
Size: 2040 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20091110/61765619/attachment.obj>