[PATCH] D55263: [CodeGen][ExpandMemcmp] Add an option for allowing overlapping loads.
Andrea Di Biagio via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Thu Dec 6 04:54:51 PST 2018
andreadb added a comment.
In D55263#1321184 <https://reviews.llvm.org/D55263#1321184>, @pcordes wrote:
> In D55263#1320043 <https://reviews.llvm.org/D55263#1320043>, @spatel wrote:
>
> > I just looked over the codegen changes so far, but I want to add some more knowledgeable x86 hackers to have a look too. There are 2 concerns:
> > <snip>
> >
> > 2. Are there any known uarch problems with unaligned accesses (either scalar or SSE)?
>
>
> *Unaligned* loads are a potential minor slowdown if they cross cache-line boundaries. (Or on AMD, maybe even 32-byte or even 16-byte boundaries). There is literally zero penalty when they don't cross any relevant boundary on modern CPUs (on Intel, that's 64-byte cache lines).
+1
On AMD, a misaligned store or load operation suffers a minimum one-cycle penalty if it crosses a 16-byte boundary (this is definitely the case for AMD Family 15h/16h processors).
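For what it is worth, the "crosses a 16-byte boundary" condition is easy to express; a tiny illustrative helper, not something from the patch:

  #include <cstdint>

  // True if an access of Size bytes starting at Addr touches more than one
  // naturally aligned 16-byte block.
  static bool crosses16ByteBoundary(std::uintptr_t Addr, unsigned Size) {
    return (Addr & 15u) + Size > 16;
  }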
I also agree with Peter when he says that it is still worth paying a small penalty for misaligned accesses rather than falling back to a memcmp library call.
A side note about misaligned stores (more relevant for memcpy than for memcmp): it is worth mentioning that - at least on Jaguar - the load/store (LS) unit knows how to minimize the impact of stalls due to repeated misaligned accesses. Quoting the AMD Family 16h SOG: "Writes to the Data cache which are unaligned in an "address" are written in two cycles. If consecutive unaligned addressed 128-bit loads are written they can be coalesced such that the 64-bit portions of 128-bit writes which were unaligned can be merged and written 128-bits at a time, removing most of the stall penalties. This is performed in the Store Coalescing Buffer (SCB)."
Back on topic:
For x86, I think this patch is a nice improvement. Not sure about other targets.
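To make the overlapping-loads idea concrete, here is a hand-written C++ sketch of what the expansion boils down to for an equality-only memcmp of 15 bytes. This is not the exact IR that ExpandMemcmp emits, and the load64 helper is just for illustration:

  #include <cstdint>
  #include <cstring>

  // Unaligned 8-byte load; compilers lower the memcpy to a single MOV on x86.
  static inline std::uint64_t load64(const void *P) {
    std::uint64_t V;
    std::memcpy(&V, P, sizeof(V));
    return V;
  }

  // memcmp(A, B, 15) == 0, expanded with two overlapping 8-byte loads that
  // cover bytes [0,8) and [7,15). Byte 7 is compared twice, which is harmless
  // for an equality test, and it avoids the 8+4+2+1 chain of smaller loads.
  bool equal15(const char *A, const char *B) {
    std::uint64_t Lo = load64(A)     ^ load64(B);
    std::uint64_t Hi = load64(A + 7) ^ load64(B + 7);
    return (Lo | Hi) == 0;
  }

With SSE2, the same trick scales up to pairs of overlapping 16-byte MOVDQU loads compared with PCMPEQB for sizes between 17 and 32 bytes.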
================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:2902
Options.LoadSizes.push_back(1);
+ // All GPR loads can be unaligned, and vector loads too starting form SSE2.
+ Options.AllowOverlappingLoads = true;
----------------
courbet wrote:
> andreadb wrote:
> > s/form/from.
> >
> > Strictly speaking, SSE1 provides MOVUPS for unaligned vector FP loads.
> > However, it gets problematic when comparing vectors for equality; using CMPEQPS is not going to work as expected for the case where one of the operands is NaN.
> > One of your tests shows that the expansion is effectively disabled if the target has SSE but not SSE2. However, as John wrote, I don't see where the check for the SSE2 feature is done...
> If you look a few lines above, we only allow expansion with 16 bytes to happen if `ST->hasSSE2()`.
Cheers.
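As an aside, here is a tiny standalone example of the NaN issue I mentioned above, using the plain SSE intrinsics (the program itself is only illustrative):

  #include <cstdio>
  #include <limits>
  #include <emmintrin.h>  // SSE2 intrinsics (also pulls in SSE1)

  int main() {
    // Four lanes holding the same quiet-NaN bit pattern.
    __m128 NaNs = _mm_set1_ps(std::numeric_limits<float>::quiet_NaN());

    // SSE1 CMPEQPS: NaN == NaN is false, so the mask is 0 even though the
    // bit patterns are identical.
    int FPMask = _mm_movemask_ps(_mm_cmpeq_ps(NaNs, NaNs));

    // SSE2 PCMPEQB: a purely bitwise comparison, so the mask is all ones.
    int IntMask = _mm_movemask_epi8(
        _mm_cmpeq_epi8(_mm_castps_si128(NaNs), _mm_castps_si128(NaNs)));

    std::printf("cmpeqps mask = 0x%x, pcmpeqb mask = 0x%x\n", FPMask, IntMask);
    return 0;
  }

That is why a bytewise-equality expansion with 16-byte loads has to be gated on SSE2 and the integer compares rather than on SSE1.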
Repository:
rL LLVM
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D55263/new/
https://reviews.llvm.org/D55263