[PATCH] D111029: [X86] Prefer 512-bit vectors on Ice/Rocket/TigerLake (PR48336)

Peter Cordes via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Oct 4 15:26:55 PDT 2021


pcordes added a comment.

Probably good for -march=icelake-client.  We need to check on -march=icelake-server (launched April 2021), ideally trying it with some benchmarks where only some parts of the code can auto-vectorize.

That's always the concern: one small part of a program hurting frequency for the rest, or one short-lived process lowering frequency for other processes on the same physical core.  Server CPUs are probably more likely than client chips to have use-cases where multiple programs share a core, some waking up only for short intervals, so one program hurting CPU frequency for everything else is more of a concern there.
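For example, something like this (a hedged sketch, not from this patch; the names and numbers are just for illustration) is the shape of benchmark I'd want to see: one hot loop that auto-vectorizes next to serial scalar work, built once with `-mprefer-vector-width=256` and once with `=512`, comparing wall time and package power:

```c
// mixed.c - sketch of an "only part of the code vectorizes" benchmark.
// Build both ways and compare, e.g.:
//   clang -O3 -march=icelake-server -mprefer-vector-width=256 mixed.c -o mixed256
//   clang -O3 -march=icelake-server -mprefer-vector-width=512 mixed.c -o mixed512
#include <stddef.h>
#include <stdio.h>

#define N (1 << 16)
static float a[N], b[N], c[N];

// Hot part: straightforwardly auto-vectorizable FMA work.
static void vec_part(void) {
    for (size_t i = 0; i < N; i++)
        c[i] = a[i] * b[i] + c[i];
}

// Scalar part: a serial dependency chain the vectorizer won't touch,
// standing in for the rest of the program that still cares about clock speed.
static unsigned scalar_part(unsigned x) {
    for (int i = 0; i < 100000; i++)
        x = x * 1103515245u + 12345u;
    return x;
}

int main(void) {
    unsigned sink = 1;
    for (int rep = 0; rep < 10000; rep++) {
        vec_part();
        sink = scalar_part(sink);
    }
    printf("%u %f\n", sink, (double)c[0]);  // keep the results live
    return 0;
}
```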

Any frequency change at all has a cost in stopping the CPU clock until the new voltage settles (something like 8.5 microseconds = 8500 nanoseconds), but yes, it seems on those client chips there isn't a frequency change between scalar and light-512 or even heavy-512, at least with multiple cores active.  I think the HW power management keeps the CPU at the new frequency for long enough that the worst case isn't a huge number of transitions per second, though.
See https://stackoverflow.com/questions/45472147/lost-cycles-on-intel-an-inconsistency-between-rdtsc-and-cpu-clk-unhalted-ref-ts re: clock-halt time when switching frequency on Skylake.
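Rough bound (my arithmetic, with an assumed dwell time, not an Ice Lake measurement): even if code toggled between frequency levels as fast as power management allows, the halted fraction stays small:

```c
// Back-of-the-envelope bound, not a measurement.  halt_us is the ~8.5 us
// clock-halt per transition mentioned above; dwell_us is an *assumed*
// minimum residence time at a frequency level before the next transition.
#include <stdio.h>

int main(void) {
    double halt_us  = 8.5;
    double dwell_us = 500.0;   // assumption; substitute a measured value
    printf("worst-case halted fraction ~= %.2f%%\n",
           100.0 * halt_us / (halt_us + dwell_us));   // ~1.7% with these inputs
    return 0;
}
```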

Travis's article didn't have test results from -march=icelake-server Xeon chips; that's something to check on.
If they still have only one 512-bit FMA unit per core, the max power for FMA-heavy workloads (the most power-intensive thing you can do) is not much higher than with AVX2 keeping both halves of the FMA unit busy separately.  (But wider cache accesses can mean a bit more power.)  Anyway, there's reason to hope that Ice Lake server might similarly not be too badly affected.

Having any 512-bit uops in flight (or something like that) also shuts down the vector ALU on port 1.  But AFAIK that recovers quickly, and is usually still worth it.  For something that bottlenecks on SIMD integer stuff like bitwise booleans / vpternlogd, or integer addition, 2x work per uop * 2/3 uop throughput (ports 0 and 5 still have vector ALUs, so 2 of the 3 ports) is still a factor of 4/3 = 1.333 speedup.  Most things will hopefully benefit more than that, and having scalar integer and load/store uops in the pipeline often means you don't bottleneck just on back-end ALU ports.  (Wider loads/stores often mean fewer total uops, which helps with other bottlenecks, and out-of-order exec can see farther.)
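Concretely, a loop like this (my example, not from the patch) is the shape that 4/3 figure applies to:

```c
// Hedged example of the loop shape the 4/3 figure applies to.  On an AVX-512
// target (e.g. clang -O3 -march=icelake-client), the three bitwise ops per
// element typically fold into one vpternlogd per vector.  Going from
// -mprefer-vector-width=256 to =512 doubles the elements per uop; even if
// vector-ALU throughput drops from 3 ports to 2 while 512-bit uops are in
// flight, that's still >= 4/3 the work per cycle.
#include <stdint.h>
#include <stddef.h>

void bitwise_select(uint32_t *dst, const uint32_t *a,
                    const uint32_t *b, const uint32_t *mask, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = (a[i] & mask[i]) | (b[i] & ~mask[i]);   // one ternary boolean fn
}
```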

For something that bottlenecks on front-end throughput, we could expect more like 2x speedup.

Ice Lake has a shuffle unit on port 1 that can handle some common 128/256-bit shuffles like `vpshufb ymm`, but not `zmm`.  If auto-vectorization ends up needing a lot of shuffling, 256-bit might be a win.  Also, with really wide vectors it's easier for a bad vectorization strategy to get really expensive.
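For reference, a minimal intrinsics sketch of that difference (my example, assuming only what's described above about port 1): the same in-lane byte-LUT shuffle at 256 and 512 bits:

```c
// Build with e.g. clang -O3 -march=icelake-client.  Per the note above, the
// ymm form can also use the extra shuffle unit on port 1 on Ice Lake; the zmm
// form can't, so it's limited to the main shuffle unit on port 5, and
// shuffle-heavy code may see less than 2x from going to 512-bit.
#include <immintrin.h>

// Requires AVX2.
__m256i lut_bytes_256(__m256i lut, __m256i idx) {
    return _mm256_shuffle_epi8(lut, idx);   // vpshufb ymm
}

// Requires AVX-512BW (present on Ice Lake client and server).
__m512i lut_bytes_512(__m512i lut, __m512i idx) {
    return _mm512_shuffle_epi8(lut, idx);   // vpshufb zmm
}
```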


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D111029/new/

https://reviews.llvm.org/D111029


