[PATCH] D41096: [X86] Initial support for prefer-vector-width function attribute

Craig Topper via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Dec 12 16:03:45 PST 2017


craig.topper added a comment.

In https://reviews.llvm.org/D41096#952981, @hfinkel wrote:

> I really do want to make sure that I understand the problem (before I continue suggesting solutions). To summarize:
>
> - AVX-512 is twice the length of AVX2, and so using AVX-512 over AVX2 should give a 2x speedup, but...


I want to avoid using the terms AVX-512 and AVX2 here and instead talk about ZMM and YMM registers, or vector width. There are new instructions introduced after AVX512F, as part of the AVX512VL instruction set, that use only XMM and YMM registers and are not subject to this frequency issue. Our documentation really doesn't make that clear, since it just says "AVX2". Enabling the avx512vl subtarget feature implies support for AVX512F, and thus support for 512-bit vectors.
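
For context, this patch exposes the width preference as a string function attribute in IR. A minimal sketch, assuming the "prefer-vector-width" spelling this patch introduces (illustrative, not taken from the patch's tests):

    ; Ask the backend to prefer YMM/XMM over ZMM for @foo, even though
    ; the target features would allow 512-bit vectors.
    define void @foo() #0 {
    entry:
      ret void
    }

    attributes #0 = { "prefer-vector-width"="256" "target-features"="+avx512f,+avx512vl" }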

> - Using AVX-512 vs. using AVX2 can decrease the clock rate. For some chips, this could be nearly a 30% decrease, although it's normally closer to 20% (see, e.g., https://en.wikichip.org/wiki/intel/frequency_behavior). That web page also explains that the clock-speed effect does not apply to all AVX-512 instructions, but only to complicated ones (e.g., floating-point and integer multiplication).
> - Mixing AVX-512 and AVX2 instructions also has negative throughput effects on the AVX2 instructions; this is explained in section 15.19 of the optimization manual (https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf), which says:
> 
>> Skylake microarchitecture has two port schemes, one for using 256-bit or less registers, and another for using 512-bit registers.
>> 
>> When using registers up to or including 256 bits, FMA operations dispatch to ports 0 and 1 and SIMD operations dispatch to ports 0, 1 and 5. When using 512-bit register operations, both FMA and SIMD operations dispatch to ports 0 and 5.
>>  The maximum register width in the reservation station (RS) determines the 256 or 512 port scheme. Notice that when you use AVX-512 encoded instructions with YMM registers, the instructions are considered to be 256-bit wide.
>> 
>> The result of the 512-bit port scheme is that XMM or YMM code dispatches to 2 ports (0 and 5) instead of 3 ports (0, 1 and 5) and may have lower throughput and longer latency compared to the 256-bit port scheme.
> 
> 
> 
> - But there's an additional complication, discussed in 15.20 of the optimization manual:
> 
>> Some processors based on Skylake microarchitecture have two Intel AVX-512 FMA units, on ports 0 and 5, while other processors based on Skylake microarchitecture have a single Intel AVX-512 FMA unit, which is located on port 0.
>> 
>> Code that is optimized to run on a processor with two FMA units might not be optimal when run on a processor with one FMA unit.
> 
> If these are the relevant causes of the problem, then I suggest that we do the following:
> 
> 1. We really have two different Skylake cores for optimization purposes: the ones that execute AVX-512 only on port 0/1, and the ones that can also execute AVX-512 on port 5. We should stop pretending that these are the same cores, calling them both skylake-avx512, and call the former something else (say skylake-avx512-1p). These cores should have different scheduling models. We should benchmark these differently, and for the port-0-only variant, we should increase the TTI costs for all AVX-512 instructions by a factor of 2 (which is right because TTI returns reciprocal throughputs for vectorization), or maybe a little more than 2 (to account for the extra ILP constraints) if necessary.

Yes, there are two variants of skylake-avx512, but there doesn't seem to be a good way to autodetect which one we're on for -march=native.

In general, I don't think our cost models distinguish based on CPUs, do they? Aren't they based only on subtarget features?
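
For illustration, the lookup pattern in X86TTIImpl today is roughly the following (a simplified sketch, not the verbatim source; the table names stand in for the real cost arrays):

    // Cost tables are gated purely on subtarget feature bits, so two
    // CPUs exposing the same features get identical costs.
    if (ST->hasAVX512())
      if (const auto *Entry = CostTableLookup(AVX512CostTbl, ISD, LT.second))
        return LT.first * Entry->Cost;
    if (ST->hasAVX2())
      if (const auto *Entry = CostTableLookup(AVX2CostTbl, ISD, LT.second))
        return LT.first * Entry->Cost;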

> 2. We need a way for the X86 backend to penalize mixing 512-bit vectors with smaller vector types. It can do this by increasing the cost of the smaller non-FMA vector operations, when 512-bit vectors are around, by 30% (to account for the fact that, when 512-bit instructions are around, the peak throughput of those instructions is decreased from 3/cycle to 2/cycle). We can add a TTI interface that allows the target to analyze the loop, generate some state object, and then use that state object when generating the per-instruction costs.
> 3. We need a way for the target to specify a penalty factor for a loop (etc.) when vectorizing. Right now, if there appears to be a positive speedup, then the associated vectorization factor is chosen. To adjust for the clock-rate decrease, if we're generating 512-bit instructions, we should apply a penalty factor of, say, 0.7, so that the estimate that vectorization will be profitable includes the effect of the (potentially) decreased clock rate.
> 
>   All of that might not be enough, however, because the clock-rate effects are not entirely local. We could have an absolute cap for small-trip-count loops, and for SLP vectorization. For the loop vectorizer, look at `LoopVectorizationCostModel::computeFeasibleMaxVF` (also, I'll take back something I said: picking a vectorization factor based on the smallest type, not the largest one, doesn't seem to be enabled by default right now, because -vectorizer-maximize-bandwidth is false by default, although it looks like we'd like it to be on, see r306936). There are a couple of places in the SLP vectorizer where the VF is computed, and capping those seems straightforward.

Yeah, the VF is calculated from the largest scalar type among the loads, stores, and phis, I think.
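
Roughly, that computation looks like this (a paraphrase of LoopVectorizationCostModel::computeFeasibleMaxVF, not the exact code):

    // The feasible VF comes from the widest vector register the target
    // reports, divided by the widest scalar type seen in the loop
    // (among loads, stores, and phis).
    unsigned WidestRegister = TTI.getRegisterBitWidth(/*Vector=*/true);
    unsigned SmallestType, WidestType;
    std::tie(SmallestType, WidestType) = getSmallestAndWidestTypes();
    unsigned MaxVF = WidestRegister / WidestType; // e.g. 512 / 32 = 16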

> Now maybe none of this really helps, because you end up with loops with dynamically small trip counts, where the vectorization speedup is negligible, but the presence of the wide vector instructions causes clock-rate decreases. Maybe you can't even reliably multiversion and branch around the vector code (because even speculatively executing the vector instructions triggers the problem), and then we need to decide how much we care about these cases vs. speedups in other areas. However, I think that we should start by modeling what we can model, and then evaluate things from there.

As you said, the effect is not local: once you trigger it, there is a timer that keeps the penalty in effect. I'm sure speculation would trigger it too, since this penalty is about power delivery to the execution units.

I appreciate the cost-modeling suggestions, and I think there could be a good long-term solution in that direction, but it will require a lot more tuning effort, and it's unclear whether it can be made to work.

What Intel wants to see implemented right now is a way to remove as much ZMM register usage as possible by default on skylake-avx512 without losing the AVX512VL capabilities. If enabling avx512vl didn't automatically imply the availability of avx512f and the 512-bit intrinsics, we could easily just turn 512-bit support off by default in the legalizer. But the dependencies don't work that way.
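
To make that direction concrete, the kind of gating this implies in type legalization would look something like the following (a hypothetical sketch; the getPreferVectorWidth() accessor is assumed from this patch series, and the final condition may well differ):

    // Only register the 512-bit vector register classes when the
    // preferred vector width still allows ZMM usage.
    if (Subtarget.hasAVX512() && Subtarget.getPreferVectorWidth() >= 512) {
      addRegisterClass(MVT::v16f32, &X86::VR512RegClass);
      addRegisterClass(MVT::v8f64,  &X86::VR512RegClass);
      addRegisterClass(MVT::v16i32, &X86::VR512RegClass);
      addRegisterClass(MVT::v8i64,  &X86::VR512RegClass);
    }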


https://reviews.llvm.org/D41096




