[llvm-dev] [RFC] Allow loop vectorizer to choose vector widths that generate illegal types

Wed Jun 15 23:24:23 PDT 2016

Hi Michael, 

Thank you for working on this. The loop vectorizer tries a bunch of different vectorization factors and stops at the widest word size mostly because of compile time concerns. On every vectorization factors that we check we have to scan all of the instructions in the loop and make multiple calls into TTI. If you decide to increase the VF enumeration space then you will linearly increase the compile time of the loop vectorizer. I think that it would be a good idea to explore this compile-time vs performance tradeoff with numbers.

The cost model is designed to be a fast approximation of SelectionDAG. We don't want to duplicate every optimization in SelectionDAG into the cost model because this would make the code model (and the optimizer) difficult to maintain. If the cost model does not represent an operation that you care about then you should add it to the cost tables. 

I don't understand how selecting wide vectors would eliminate the need to have loop widening.  Loop widening happens to break data dependencies and allow more parallelism. If you have two independent arithmetic operations then they can go into different execution units, or to pipelined execution units. Your mixed-typed loops would cause a shuffle across registers (which we can't model well in the cost model, for obvious reasons) that will pack multiple lanes into a smaller vector and this would introduce a data dependency. 

Maybe you should start by increasing the enumeration space (by 2X, for example) under a flag and see if you get any performance gains. 

-Nadav

On Jun 15, 2016, at 03:48 PM, Michael Kuperstein <mkuper at google.com> wrote:

Hello,

Currently the loop vectorizer will, by default, not consider vectorization factors that would make it generate types that do not fit into the target platform's vector registers. That is, if the widest scalar type in the scalar loop is i64, and the platform's largest vector register is 256-bit wide, we will not consider a VF above 4.

We have a command line option (-mllvm -vectorizer-maximize-bandwidth), that will choose VFs for consideration based on the narrowest scalar type instead of the widest one, but I don't believe it has been widely tested. If anyone has had an opportunity to play around with it, I'd love to hear about the results.

What I'd like to do is:
Step 1: Make -vectorizer-maximize-bandwidth the default. This should improve the performance of loops that contain mixed-width types.
Step 2: Remove the artificial width limitation altogether, and base the vectorization factor decision purely on the cost model. This should allow us to get rid of the interleaving code in the loop vectorizer, and get interleaving for "free" from the legalizer instead.

There are two potential road-blocks I see - the cost-model, and the legalizer. To make this work, we need to:
a) Model the cost of operations on illegal types better. Right now, what we get is sometimes completely ridiculous (e.g. see http://reviews.llvm.org/D21251).
b) Make sure the cost model actually stops us when the VF becomes too large. This is mostly a question of correctly estimating the register pressure. In theory, that should not be a issue - we already rely on this estimate to choose the interleaving factor, so using the same logic to upper-bound the VF directly shouldn't make things worse.
c) Ensure the legalizer is up to the task of emitting good code for overly wide vectors. I've talked about this with Chandler, and his opinion (Chandler, please correct me if I'm wrong) is that on x86, the legalizer is likely to be able to handle this. This may not be true for other platforms. So, I'd like to try to make this the default on a platform-by-platform basis, starting with x86.

What do you think? Does this seem like a step in the right direction? Anything important I'm missing?

Thanks,
  Michael
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160616/2fe93d3b/attachment.html>