[PATCH] D55263: [CodeGen][ExpandMemcmp] Add an option for allowing overlapping loads.

Peter Cordes via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Dec 6 03:53:15 PST 2018


pcordes added a comment.

In D55263#1320043 <https://reviews.llvm.org/D55263#1320043>, @spatel wrote:

> I just looked over the codegen changes so far, but I want to add some more knowledgeable x86 hackers to have a look too. There are 2 concerns:
>
> 1. Are there any known uarch problems with overlapping loads?


No, other than that it implies unaligned accesses.  Even overlapping stores are fine; they're absorbed by the store buffer.
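For concreteness, here's a rough C sketch of the overlapping-loads idea (my own illustration of the strategy, not the code this patch actually emits; `differ7` is a hypothetical helper): an equality compare of, say, 7 bytes can use two 4-byte loads per buffer where the second pair overlaps the first by one byte, instead of a 4 + 2 + 1 byte chain.

```c
#include <stdint.h>
#include <string.h>

/* Sketch of an inline memcmp(a, b, 7) == 0 expansion with overlapping loads:
 * compare bytes [0,4) and bytes [3,7); the one-byte overlap is harmless for an
 * equality test and avoids separate 2-byte and 1-byte tail compares. */
static int differ7(const void *a, const void *b) {
    uint32_t a0, b0, a1, b1;
    memcpy(&a0, a, 4);                    /* bytes [0,4) */
    memcpy(&b0, b, 4);
    memcpy(&a1, (const char *)a + 3, 4);  /* bytes [3,7), overlapping byte 3 */
    memcpy(&b1, (const char *)b + 3, 4);
    return ((a0 ^ b0) | (a1 ^ b1)) != 0;  /* nonzero iff the buffers differ */
}
```

(The `memcpy` calls are just the portable way to write unaligned loads; with a constant size they compile to plain loads.)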

With very recently stored data, we might be introducing store-forwarding stalls by misaligning a load relative to an earlier store.  (Separate from the issue of absolute alignment.)

But if the data was copied there with a pair of overlapping loads/stores, then hopefully we load/store in an order that allows one load to fully overlap one of the stores that put the data there.  (glibc `memcpy` uses a pair of overlapping loads + a pair of stores for sizes up to 2x the vector width.  https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S.html#19  has nice comments describing the strategy.  But I forget what happens for inlined memcpy with compile-time-constant sizes in gcc and llvm.  This is only relevant where `memcmp` can inline, and that's likely to be the cases where a `memcpy` would also have inlined, if there was a memcpy involved in the source at all.)
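To make the store-forwarding concern above concrete, here's a minimal sketch (again my own illustration, assuming the usual Intel/AMD forwarding rules; `partial_overlap_reload` is a hypothetical name): a load that overlaps a recent store but isn't fully contained in it can't be forwarded from the store buffer, so it stalls until the store commits to L1d.

```c
#include <stdint.h>
#include <string.h>

/* A 4-byte store to buf+2 followed by an 8-byte load of buf[0..7]: the load
 * partially overlaps the in-flight store, so it can't be satisfied by store
 * forwarding and has to wait for the store to reach L1d cache.
 * (buf must point to at least 8 readable bytes.) */
uint64_t partial_overlap_reload(char *buf, uint32_t v) {
    memcpy(buf + 2, &v, sizeof v);   /* store bytes [2,6) */
    uint64_t x;
    memcpy(&x, buf, sizeof x);       /* load bytes [0,8): partial overlap */
    return x;
}
```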

> 2. Are there any known uarch problems with unaligned accesses (either scalar or SSE)?

*Unaligned* loads are a potential minor slowdown if they cross cache-line boundaries (or, on AMD, possibly even 32-byte or 16-byte boundaries).  There is literally zero penalty when they don't cross any relevant boundary on modern CPUs (on Intel, that's 64-byte cache lines).
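For reference, the boundary condition is just address arithmetic; a tiny helper like this (mine, not part of the patch) shows exactly when a given access pays the line-split penalty:

```c
#include <stdint.h>
#include <stddef.h>

/* An n-byte access starting at p crosses a 64-byte cache line iff its first
 * and last bytes land in different lines.  Assumes n >= 1. */
static int splits_cache_line(const void *p, size_t n) {
    uintptr_t a = (uintptr_t)p;
    return (a / 64) != ((a + n - 1) / 64);
}
```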

On Core2 and earlier, and K8 and earlier, `movups` or `movdqu` unaligned 16-byte loads are slowish even if the vector load doesn't cross a cache-line boundary.  (The instruction decodes to multiple uops using a pessimistic strategy.)  Nehalem and K10 have efficient unaligned vector loads.  (Nehalem and Bulldozer have efficient unaligned vector *stores*.)

But I expect it's still worth it vs. a `memcmp` library function call, even for 16-byte vectors on old CPUs.

*Page* splits (4k boundary) are much slower on Intel before Skylake.  Apparently Intel discovered that page splits are more common in real life than they had been tuning for, so they put extra hardware in Skylake to make the latency no worse than a cache-line split, and throughput still decent, when both sides get TLB hits.

I tested some of this a while ago: https://stackoverflow.com/questions/45128763/how-can-i-accurately-benchmark-unaligned-access-speed-on-x86-64
That has a decent summary of the things to watch out for when worrying about unaligned loads.

----

On non-x86, I'm not sure how unaligned loads are handled in hardware.  I know many ISAs do support them: MIPS32r6 requires support, and I think AArch64 does too.  I can't comment on the efficiency.  I think it takes a significant amount of transistors to make unaligned access as cheap as it is on modern x86, but it's probably still worth it vs. spending more instructions: one unaligned load is probably not going to cost much more than 2 or 3 aligned loads.


Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D55263/new/

https://reviews.llvm.org/D55263




