[PATCH] D55263: [CodeGen][ExpandMemcmp] Add an option for allowing overlapping loads.
Peter Cordes via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Sat Dec 15 08:00:58 PST 2018
pcordes added inline comments.
================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:2892-2893
TTI::MemCmpExpansionOptions Options;
// TODO: enable AVX512 when the DAG is ready.
// if (ST->hasAVX512()) Options.LoadSizes.push_back(64);
if (ST->hasAVX2()) Options.LoadSizes.push_back(32);
----------------
courbet wrote:
> spatel wrote:
> > Independent of this patch, but I think this can be enabled. See:
> > rL342989
> Great, I'll look at it in a followup patch.
Be very careful of sprinkling small bits of 512-bit vector stuff into code that isn't already heavily using 512-bit vectors.
It's fine for tune=KNL, but for Skylake-avx512 (and generic with `-mavx512f`) tuning keep in mind that executing one 512-bit vector instruction on Intel Skylake puts the whole core into AVX512 mode, reducing max turbo significantly and shutting down the vector ALUs on port 1. (So `vpaddd` throughput goes down from 3 to 2, for example). And on CPUs without a second 512-bit FMA unit (on port 5 with higher latency) that can be powered up, throughput on FP everything, integer multiplies and shifts, and many other instructions goes down too, even without considering the extra resource conflicts from having fewer vector ALU ports. (e.g. many Xeon Bronze chips have only 512-bit FMA). https://stackoverflow.com/questions/50517770/disable-avx-512-intrinsics-in-vs-2017#comment88055158_50517770
BTW, the integer ALUs on port1 stay active, so it can still run scalar integer stuff like popcnt even when it's shut down for instructions like `pxor`.
----
I believe this happens even from just copying with `vmovdqa64`, even without using any 512-bit ALU instructions like `vpcmpb`.
This can have a significant overall negative impact on code that's mostly scalar, and doesn't have many / any loops that benefit from 512-bit vectors.
----
(Note that 256-bit vectors with AVX512VL can be great, taking advantage AVX512 mask registers and new instructions, and twice as many xmm/ymm registers with ymm16..31.
You can even avoid VZEROUPPER for short non-looping uses of 256-bit registers, like for inline memcmp, by using only those new regs that can't be accesses with legacy SSE. At the minor cost of always needing the longer 4-byte EVEX encoding, not a 2 or 3 byte VEX prefix. Another possible downside is leaving more FPU state dirty for context switches: xsaveopt can omit saving upper halves of YMM regs if they're all clean. And possibly tying up more physical registers, but only vzeroall would avoid that. Each PRF entry is at least 256 bit wide, probably actually 512-bit on Intel CPUs.
But make sure you never omit ZVEROUPPER after using a zmm register, otherwise the Intel CPUs will be stuck with slower-turbo: https://chat.stackoverflow.com/transcript/message/43768745#43768745 even though the port 1 vector ALU shutdown only lasts while 512-bit uops are actually in flight. Actually, BeeOnRope reported that dirtying ZMM16..31 didn't leave max-turbo affected, but using a 512-bit uop would still cause a downclock to power up AVX512 HW so we don't want to randomly use ZMM regs for that reason. Switching clocks takes 10s of thousands of cycles, so this is bad.
Anyway, https://stackoverflow.com/questions/49019614/is-it-useful-to-use-vzeroupper-if-your-programlibraries-contain-no-sse-instruct mentions some of these side-benefits of vzeroupper)
================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:2902
Options.LoadSizes.push_back(1);
+ // All GPR loads can be unaligned, and vector loads too starting form SSE2.
+ Options.AllowOverlappingLoads = true;
----------------
andreadb wrote:
> courbet wrote:
> > andreadb wrote:
> > > s/form/from.
> > >
> > > Strictly speaking, SSE1 provides MOVUPS for unaligned vector FP loads.
> > > However, it gets problematic when comparing vectors for equality; using CMPEQPS is not going to work as expected for the case where one of the operands is NaN.
> > > One of your tests shows that the expansion is effectively disabled if the target is SSE but not SSE2. However, as John wrote, I don't see where we check for feature SSE2 is done...
> > If you look a few lines above, we only allow expansion with 16 bytes to happen if `ST->hasSSE2()`.
> Cheers.
The comment is still giving the wrong reason: unaligned loads aren't the problem, lack of SIMD compare instructions are the reason we need SSE2 and AVX2, not SSE1 and AVX1, for 16 and 32-byte expansion of memcmp.
How about:
// All GPR and vector loads can be unaligned. SIMD compare requires integer vectors (SSE2/AVX2)
Repository:
rL LLVM
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D55263/new/
https://reviews.llvm.org/D55263
More information about the llvm-commits
mailing list