[PATCH] D41096: [X86] Initial support for prefer-vector-width function attribute

Hal Finkel via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Dec 12 15:05:12 PST 2017


hfinkel added a comment.

In https://reviews.llvm.org/D41096#952829, @craig.topper wrote:

> How do you propose to control the cap? I don't think we want to default it to 256 for skx as that would make our codegen worse (or at the very least very different) from our avx2 codegen.


I really do want to make sure that I understand the problem (before I continue suggesting solutions). To summarize:

- AVX-512 vectors are twice the width of AVX2 vectors, and so using AVX-512 instead of AVX2 should, ideally, give a 2x speedup, but...
- Using AVX-512 instead of AVX2 can decrease the clock rate. For some chips, this can be nearly a 30% decrease, although it's normally closer to 20% (see, e.g., https://en.wikichip.org/wiki/intel/frequency_behavior). That page also explains that the clock-speed effect does not apply to all AVX-512 instructions, but only to complicated ones (e.g., floating-point operations and integer multiplication). Some rough break-even arithmetic follows the quoted sections below.
- Mixing AVX-512 and AVX2 instructions also has negative throughput effects on the AVX2 instructions. This is explained in section 15.19 of the optimization manual (https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf), which says:

> Skylake microarchitecture has two port schemes, one for using 256-bit or less registers, and another for using 512-bit registers.
> 
> When using registers up to or including 256 bits, FMA operations dispatch to ports 0 and 1 and SIMD operations dispatch to ports 0, 1 and 5. When using 512-bit register operations, both FMA and SIMD operations dispatch to ports 0 and 5.
>  The maximum register width in the reservation station (RS) determines the 256 or 512 port scheme. Notice that when you use AVX-512 encoded instructions with YMM registers, the instructions are considered to be 256-bit wide.
> 
> The result of the 512-bit port scheme is that XMM or YMM code dispatches to 2 ports (0 and 5) instead of 3 ports (0, 1 and 5) and may have lower throughput and longer latency compared to the 256-bit port scheme.



- But there's an additional complication, discussed in section 15.20 of the optimization manual:

> Some processors based on Skylake microarchitecture have two Intel AVX-512 FMA units, on ports 0 and 5, while other processors based on Skylake microarchitecture have a single Intel AVX-512 FMA unit, which is located on port 0.
> 
> Code that is optimized to run on a processor with two FMA units might not be optimal when run on a processor with one FMA unit.
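
To put rough numbers on the frequency trade-off (purely illustrative, using the ~20-30% figures above): if 512-bit vectorization would give a 2x speedup at a fixed frequency, the net speedup is roughly 2.0 * 0.8 = 1.6x under a 20% clock decrease and 2.0 * 0.7 = 1.4x under a 30% decrease. Put differently, under a 30% decrease, any region whose fixed-frequency speedup from 512-bit vectors is below about 1/0.7 ~= 1.43x becomes a net loss; that is the motivation for the 0.7 penalty factor suggested in item 3 below.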

If these are the relevant causes of the problem, then I suggest that we do the following:

1. We really have two different Skylake cores for optimization purposes: the ones that execute AVX-512 only on port 0/1, and the ones that can also execute AVX-512 on port 5. We should stop pretending that these are the same core by calling them both skylake-avx512, and instead call the former something else (say, skylake-avx512-1p). These cores should have different scheduling models, and we should benchmark them separately. For the port-0-only variant, we should increase the TTI costs of all AVX-512 instructions by a factor of 2 (which is right because TTI returns reciprocal throughputs for vectorization), or maybe a little more than 2 (to account for the extra ILP constraints) if necessary.
2. We need a way for the X86 backend to penalize mixing 512-bit vectors with smaller vector types. It can do this by increasing the cost of the smaller non-FMA vector operations, when 512-bit vectors are around, by roughly 30% (to account for the fact that, when 512-bit instructions are present, the peak throughput of those smaller operations drops from 3/cycle to 2/cycle). We can add a TTI interface that allows the target to analyze the loop, generate some state object, and then use that state object when computing the per-instruction costs (see the interface sketch after this list).
3. We need a way for the target to specify a penalty factor for a loop (etc.) when vectorizing. Right now, if there appears to be any positive speedup, the associated vectorization factor is chosen. To adjust for the clock-rate decrease, if we're generating 512-bit instructions, we should apply a penalty factor of, say, 0.7, so that the estimate of whether vectorization will be profitable includes the effect of the (potentially) decreased clock rate.
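
To make the shape of (1)-(3) concrete, here is a rough sketch of what such a TTI interface could look like. Everything here is made up for illustration (VectorCostContext, getVectorCostContext, and getVectorizationPenalty are not existing APIs), and the scaling factors in the comments are just the numbers from above:

  // Sketch only -- hypothetical TTI additions; none of these names exist today.
  class Loop;   // stand-ins for the LLVM types, forward-declared here
  class Type;   // to keep the sketch self-contained

  // Per-loop state object from item (2): the target analyzes the loop once and
  // records whatever it needs for the per-instruction cost adjustments.
  struct VectorCostContext {
    bool Has512BitOps = false;   // the vectorized loop will contain 512-bit ops
    bool SingleFMAUnit = false;  // the port-0-only AVX-512 FMA variant from (1)
  };

  // Hypothetical hooks (declarations only, to show the shape):
  struct HypotheticalVectorCostTTI {
    // Built once per loop and candidate VF, before any per-instruction costing.
    VectorCostContext getVectorCostContext(const Loop *L, unsigned VF) const;

    // Per-instruction cost queries consult the context: e.g., scale 512-bit op
    // costs by ~2x on the single-FMA-unit parts (item 1), and raise the cost
    // of narrower SIMD ops by ~30% when 512-bit ops are present, since their
    // dispatch drops from three ports to two (item 2).
    int getArithmeticInstrCost(unsigned Opcode, Type *Ty,
                               const VectorCostContext &Ctx) const;

    // Item (3): a whole-loop profitability scaling factor, e.g. 0.7 when the
    // loop will contain 512-bit instructions, to model the clock-rate decrease.
    float getVectorizationPenalty(const VectorCostContext &Ctx) const;
  };

The vectorizer would then multiply its estimated speedup for a candidate VF by the penalty factor before concluding that the VF is profitable.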

All of that might not be enough, however, because the clock-rate effects are not entirely local. We could have an absolute cap on the vector width for small-trip-count loops, and for SLP vectorization. For the loop vectorizer, look at `LoopVectorizationCostModel::computeFeasibleMaxVF` (also, I'll take back something I said earlier: picking the vectorization factor based on the smallest type, rather than the largest one, doesn't seem to be enabled by default right now, because -vectorizer-maximize-bandwidth is false by default, although it looks like we'd like it to be on; see r306936). There are a couple of places in the SLP vectorizer where the VF is computed, and capping those seems straightforward.
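
For the cap itself, here is a minimal sketch under the assumption that the preferred width (from the prefer-vector-width attribute or a target hook) arrives as a bit width, applied roughly where computeFeasibleMaxVF derives the maximum VF from the widest register width (TTI.getRegisterBitWidth(true)) and the widest element type; the function and parameter names are made up:

  #include <algorithm>

  // Sketch only: clamp the register width used to derive the maximum VF.
  // PreferredWidthBits == 0 means "no cap".
  unsigned capFeasibleMaxVF(unsigned WidestRegisterBits,
                            unsigned WidestTypeBits,
                            unsigned PreferredWidthBits) {
    if (PreferredWidthBits)
      WidestRegisterBits = std::min(WidestRegisterBits, PreferredWidthBits);
    // e.g., a 256-bit cap with 32-bit elements limits the VF to 8 instead of 16.
    return std::max(WidestRegisterBits / WidestTypeBits, 1u);
  }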

Now, maybe none of this really helps, because you end up with loops with dynamically-small trip counts, where the vectorization speedup is negligible but the presence of the wide vector instructions still causes clock-rate decreases. Maybe you can't even reliably multiversion and branch around the vector code (because even speculatively executing the vector instructions triggers the problem); in that case, we need to decide how much we care about these cases vs. speedups in other areas. However, I think that we should start by modeling what we can model, and then evaluate things from there.


https://reviews.llvm.org/D41096




