[PATCH] D55263: [CodeGen][ExpandMemcmp] Add an option for allowing overlapping loads.

Tue Dec 18 04:40:33 PST 2018

davezarzycki added inline comments.

================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:2892-2893
     TTI::MemCmpExpansionOptions Options;
     // TODO: enable AVX512 when the DAG is ready.
     // if (ST->hasAVX512()) Options.LoadSizes.push_back(64);
     if (ST->hasAVX2()) Options.LoadSizes.push_back(32);
----------------
pcordes wrote:
> courbet wrote:
> > spatel wrote:
> > > Independent of this patch, but I think this can be enabled. See:
> > > rL342989
> > Great, I'll look at it in a followup patch.
> Be very careful of sprinkling small bits of 512-bit vector stuff into code that isn't already heavily using 512-bit vectors.
> 
> It's fine for tune=KNL, but for Skylake-avx512 (and generic with `-mavx512f`) tuning, keep in mind that executing a single 512-bit vector instruction on Intel Skylake puts the whole core into AVX512 mode, reducing max turbo significantly and shutting down the vector ALUs on port 1.  (So `vpaddd` throughput goes down from 3 to 2 per clock, for example.)  And on CPUs without a second 512-bit FMA unit (on port 5, with higher latency) that can be powered up, throughput for everything FP, integer multiplies and shifts, and many other instructions goes down too, even without considering the extra resource conflicts from having fewer vector ALU ports.  (E.g. many Xeon Bronze chips have only one 512-bit FMA unit.)  https://stackoverflow.com/questions/50517770/disable-avx-512-intrinsics-in-vs-2017#comment88055158_50517770
> 
> BTW, the integer ALUs on port 1 stay active, so that port can still run scalar integer stuff like `popcnt` even when it's shut down for vector instructions like `pxor`.
> 
> ----
> 
> I believe this happens even from just copying with `vmovdqa64`, even without using any 512-bit ALU instructions like `vpcmpb`.
> 
> This can have a significant overall negative impact on code that's mostly scalar, and doesn't have many / any loops that benefit from 512-bit vectors.  
> 
> ----
> 
> (Note that 256-bit vectors with AVX512VL can be great, taking advantage of AVX512 mask registers and new instructions, and twice as many xmm/ymm registers with ymm16..31.
> 
> You can even avoid VZEROUPPER for short non-looping uses of 256-bit registers, like for inline memcmp, by using only those new regs that can't be accessed with legacy SSE.  This comes at the minor cost of always needing the longer 4-byte EVEX encoding, not a 2 or 3 byte VEX prefix.  Another possible downside is leaving more FPU state dirty for context switches: xsaveopt can omit saving upper halves of YMM regs if they're all clean.  It also possibly ties up more physical registers, but only vzeroall would avoid that.  Each PRF entry is at least 256 bits wide, probably actually 512 bits on Intel CPUs.
> 
> But make sure you never omit VZEROUPPER after using a zmm register, otherwise the Intel CPUs will be stuck with slower turbo: https://chat.stackoverflow.com/transcript/message/43768745#43768745 even though the port 1 vector ALU shutdown only lasts while 512-bit uops are actually in flight.  Actually, BeeOnRope reported that dirtying ZMM16..31 didn't leave max turbo affected, but using a 512-bit uop would still cause a downclock to power up the AVX512 HW, so we don't want to randomly use ZMM regs for that reason.  Switching clocks takes tens of thousands of cycles, so this is bad.
> 
> Anyway, https://stackoverflow.com/questions/49019614/is-it-useful-to-use-vzeroupper-if-your-programlibraries-contain-no-sse-instruct mentions some of these side-benefits of vzeroupper.)
Hi @pcordes – Just FYI, the compiler already sprinkles in AVX512 for `memcpy` and `memset`. Also, auto-vectorization can sprinkle in unprofitable AVX512 code. From what I've seen, the advice seems to be: use `-mno-avx512f` if the sprinkled results aren't profitable.
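
To make the trade-off concrete, here is a minimal standalone sketch of gating the memcmp load sizes on a tuning preference rather than on feature bits alone. Everything in it is hypothetical and only for illustration: `SubtargetSketch`, its fields, and `memcmpLoadSizes` are not the real X86Subtarget or TTI interfaces, and this is not what the patch does.

```cpp
#include <vector>

// Hypothetical stand-in for subtarget queries; NOT the real X86Subtarget API.
struct SubtargetSketch {
  bool HasSSE2 = false;
  bool HasAVX2 = false;
  bool HasAVX512 = false;
  // Assumed tuning knob: the widest vector width (in bits) that the
  // tuning model is willing to use outside of heavily vectorized code.
  unsigned PreferVectorWidth = 128;
};

// Returns candidate load sizes in bytes, largest first.
// A 64-byte (512-bit) load is offered only when the feature is present
// AND the tuning preference allows 512-bit vectors.
std::vector<unsigned> memcmpLoadSizes(const SubtargetSketch &ST) {
  std::vector<unsigned> Sizes;
  if (ST.HasAVX512 && ST.PreferVectorWidth >= 512)
    Sizes.push_back(64);
  if (ST.HasAVX2 && ST.PreferVectorWidth >= 256)
    Sizes.push_back(32);
  if (ST.HasSSE2)
    Sizes.push_back(16);
  // Scalar load sizes are always available.
  for (unsigned S : {8u, 4u, 2u, 1u})
    Sizes.push_back(S);
  return Sizes;
}
```

With a gate like this, a target that has AVX512 but prefers 256-bit vectors would still get 32-byte loads, and only a tuning that explicitly prefers 512-bit vectors would see 64-byte loads in expanded `memcmp`.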

Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D55263/new/

https://reviews.llvm.org/D55263