[LLVMdev] speed up memcpy intrinsic using ARM Neon registers
    Rodolph Perfetta 
    rodolph.perfetta at arm.com
       
    Fri Nov 13 03:27:39 PST 2009
    
    
  
> Can you comment on David Conrad's message in this thread regarding
> a ~20 cycle penalty for an ARM store following a NEON store to the
> same 16-byte block?
That is correct for A8: a NEON store followed by an ARM store to the same
16-byte block will incur a penalty (20 cycles sounds about right) while the
CPU ensures there are no data hazards.
A9 does not have this penalty.
> If the memcpy size is not a multiple of 8, we need some ARM load/store
> instructions to copy the tail end of it. The context here is LLVM
> generating inline code for small copies, so if there is a penalty
> like that, it is probably not worthwhile to use NEON unless the
> alignment shows that the tail will be in a separate 16-byte block.
I agree it is probably not worthwhile (though I assume using NEON relieves
pressure on your register allocator). Mixing ARM and NEON memory operations
is usually not recommended.
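
To make the alignment condition from the quoted question concrete, here is a
minimal sketch in C. It assumes the NEON part of the copy uses 8-byte stores
and that the ARM tail stores start right after it; the helper name and both
constants are illustrative, not anything LLVM defines.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Size of the block within which A8 applies the NEON-then-ARM store penalty. */
#define LSU_BLOCK_BYTES 16u
/* Assumption: the NEON part of the copy is done in 8-byte chunks. */
#define NEON_CHUNK_BYTES 8u

/* Hypothetical helper: returns true when the first ARM tail store would land
   in the same 16-byte block as the last NEON store. */
bool tail_hits_neon_block(uintptr_t dst, size_t n)
{
    size_t bulk = n & ~(size_t)(NEON_CHUNK_BYTES - 1); /* bytes stored by NEON */
    size_t tail = n - bulk;                             /* bytes stored by ARM  */

    if (bulk == 0 || tail == 0)
        return false; /* only one kind of store is used, nothing is mixed */

    /* The last NEON byte is at dst + bulk - 1 and the first ARM byte is at
       dst + bulk. They share a block unless the tail starts exactly on a
       16-byte boundary. */
    return ((dst + bulk) % LSU_BLOCK_BYTES) != 0;
}
```

For example, a 12-byte copy to a 16-byte aligned destination mixes both store
types inside one block, while a 20-byte copy to the same destination puts the
4-byte tail at the start of the next block.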
Also, the NEON engines tend to have a deeper pipeline than the ARM integer
cores, so the delay to store the first bytes is likely to be higher using
NEON (although it should be faster afterwards). So for a very small memcpy
(20 bytes or less), ARM will be faster. For best performance, remember to use
PLD.
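
As a rough illustration of the PLD advice, the sketch below issues a prefetch
hint from C with the GCC/Clang builtin `__builtin_prefetch`, which on ARM
targets is normally lowered to a PLD instruction. The function name, the
64-byte prefetch distance, and the once-per-64-bytes cadence are assumptions
for the example, not tuned values.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: prefetch one 64-byte line ahead of the current read position. */
#define PREFETCH_AHEAD_BYTES 64u
/* Number of 32-bit words in 64 bytes. */
#define WORDS_PER_PREFETCH 16u

/* Hypothetical word-copy loop using ARM integer loads/stores, with a prefetch
   hint issued once per 64 bytes of source data. */
void copy_words_with_prefetch(uint32_t *dst, const uint32_t *src, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        if (i % WORDS_PER_PREFETCH == 0) {
            /* Compute the hint address as an integer so that no out-of-bounds
               pointer arithmetic is performed. The hint is for a read (0)
               with low temporal locality (0). */
            uintptr_t ahead = (uintptr_t)(src + i) + PREFETCH_AHEAD_BYTES;
            __builtin_prefetch((const void *)ahead, 0, 0);
        }
        dst[i] = src[i];
    }
}
```

A prefetch is only a hint, so the copy result is the same with or without it;
whether it pays off for a given size is something to measure on the target
core.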
For A9 you have more to take into account: A9 is a superscalar, dual-issue,
out-of-order, speculative CPU, but this only applies to the ARM integer
core; NEON and VFP are single-issue and in-order. However, an ARM instruction
can be issued together with a NEON or VFP instruction. So if you have some
VFP/NEON code before the memcpy, by the time the CPU reaches the inline NEON
memcpy it might not have finished the previous NEON/VFP instruction, and
you'll have to wait ...
> (And what's up with the 16-byte divisions? I thought the cache
> lines are 64 bytes....)
The cache line is 64 bytes on A8 and 32 bytes on A9. 16 bytes is the size of
an internal buffer used by the load/store unit.
Cheers,
Rodolph.
    
    