[PATCH] D55263: [CodeGen][ExpandMemcmp] Add an option for allowing overlapping loads.

Andrea Di Biagio via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Dec 6 04:54:51 PST 2018


andreadb added a comment.

In D55263#1321184 <https://reviews.llvm.org/D55263#1321184>, @pcordes wrote:

> In D55263#1320043 <https://reviews.llvm.org/D55263#1320043>, @spatel wrote:
>
> > I just looked over the codegen changes so far, but I want to add some more knowledgeable x86 hackers to have a look too. There are 2 concerns:
> >  <snip>
> >
> > 2. Are there any known uarch problems with unaligned accesses (either scalar or SSE)?
>
>
> *Unaligned* loads are a potential minor slowdown if they cross cache-line boundaries.  (Or on AMD, maybe even 32-byte or 16-byte boundaries.)  There is literally zero penalty when they don't cross any relevant boundary on modern CPUs (on Intel, that's 64-byte cache lines).


+1

On AMD, a misaligned store or load operation suffers a minimum one-cycle penalty if it crosses a boundary (definitely 16-byte for AMD Family 15h/16h processors).

I also agree with Peter when he says that it is still worth paying a small penalty vs. doing a memcmp library call.
Speaking about memcpy: It is worth mentioning that - at least on Jaguar - the LS knows how to minimize the impact of stalls due to repeated misaligned accesses. Quoting the AMDfam15h SOG: "Writes to the Data cache which are unaligned in an "address" are written in two cycles. If consecutive unaligned addressed 128-bit loads are written they can be coalesced such that the 64-bit portions of 128-bit writes which were unaligned can be merged and written 128-bits at a time, removing most the stall penalties. This is performed in the Store Coalescing Buffer (SCB)."

Back on topic:
For x86, I think this patch is a nice improvement. Not sure about other targets.



================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:2902
     Options.LoadSizes.push_back(1);
+    // All GPR loads can be unaligned, and vector loads too starting form SSE2.
+    Options.AllowOverlappingLoads = true;
----------------
courbet wrote:
> andreadb wrote:
> > s/form/from.
> > 
> > Strictly speaking, SSE1 provides MOVUPS for unaligned vector FP loads.
> > However, it gets problematic when comparing vectors for equality; using CMPEQPS is not going to work as expected for the case where one of the operands is NaN.
> > One of your tests shows that the expansion is effectively disabled if the target is SSE but not SSE2. However, as John wrote, I don't see where the check for the SSE2 feature is done... 
> If you look  a few lines above, we only allow expansion with 16 bytes to happen if `ST->hasSSE2()`. 
Cheers. 


Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D55263/new/

https://reviews.llvm.org/D55263




