[PATCH] D82227: SLP: honor requested max vector size merging PHIs

Wed Nov 11 12:38:50 PST 2020

rampitec added a comment.

In D82227#2389389 <https://reviews.llvm.org/D82227#2389389>, @jonpa wrote:

> In D82227#2389284 <https://reviews.llvm.org/D82227#2389284>, @rampitec wrote:
>
>> The other pass which calls getRegisterBitWidth(true) is LoopVectorize. Do you mean you want to have different heuristics for loop and straight-line vectorization?
>
> Well, the definition of that hook per the comment is "The width of the largest scalar or vector register type", so I don't see how it could be a variable to play with. It should simply reflect the size of the vector register - 128 bits for SystemZ.

Well, the name of the callback probably does not reflect its actual use. In practice it is the requested vectorization width. When called with Vector = true it affects exactly two places: it sets the vectorization width for the loop vectorizer and for SLP.
Earlier in the comments there seemed to be a consensus that a target which wants wider vectorization should simply return a bigger number from getRegisterBitWidth().
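
To illustrate the point, here is a minimal standalone sketch of how a reported vector width can cap the number of lanes a vectorizer will form. ToyTarget and maxLanes are hypothetical names made up for this example; this is not the real TTI interface or the actual SLP code, only the idea that the hook's return value acts as a width limit.

```cpp
#include <cassert>

// Hypothetical stand-in for a target description; not the real TTI class.
struct ToyTarget {
  unsigned ScalarBits; // width reported for scalar registers
  unsigned VectorBits; // width reported for vector registers

  // Mirrors the shape of getRegisterBitWidth(bool Vector) for illustration.
  unsigned getRegisterBitWidth(bool Vector) const {
    return Vector ? VectorBits : ScalarBits;
  }
};

// Largest power-of-two lane count that fits in the reported vector width,
// also limited by how many scalars are available to bundle.
// Assumes ElemBits is nonzero.
unsigned maxLanes(const ToyTarget &T, unsigned ElemBits, unsigned NumScalars) {
  unsigned Cap = T.getRegisterBitWidth(/*Vector=*/true) / ElemBits;
  unsigned Lanes = 1;
  while (Lanes * 2 <= Cap && Lanes * 2 <= NumScalars)
    Lanes *= 2;
  return Lanes;
}
```

With a reported width of 128 bits and 64-bit elements, maxLanes gives 2 lanes no matter how many scalars are available; a target reporting 256 bits would get 4 lanes from the same input.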

> In the original discussion there was a suggestion to look into the TTI costs on your target for those very wide vector types, a <32 x ...> PHI instruction...?  Why isn't it enough to use TTI?

The problem with using the costs returned from TTI is exactly this: they were ignored here, and PHI vectorization tried to grab as much as it could.

> Why would it make sense to only vectorize to <2 x double> and not <4 x double>?  The latter is just 2 vector regs, and that is completely fine... In my case it is obvious that the final result of the vectorizer is greatly improved by allowing an over-wide vector type, even though in the most simple case 2 x <2 x double> should give the same output as a split <4 x double>....  I am not sure yet exactly why this makes for many more vector fp-add/fp-mul in the output... Note that with your patch those instructions are not vectorized at all anymore, but are left scalar! So there is some vectorization that is lost by always doing max <2 x double> and never wider...
>
> I wonder why is it better to do 2 x <2 x double> rather than <4 x double>, they will both use two vector registers... (not just for PHIs, but generally)?

Making it wider than we can actually lower is bad in two ways:

1. It eliminates the possibility of removing dead lanes as dead code.
2. Much more importantly, it requires allocating a wider register. In our case it was literally asking for registers 1024 bits wide (yes, we can have such tuples), which leads to spilling and, in some cases, even to an inability to allocate registers.
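
The register arithmetic behind both positions can be sketched as follows. The helper names are hypothetical, and the 32-bit element width in the second case is an assumption: the thread only mentions a <32 x ...> PHI and a 1024-bit tuple.

```cpp
#include <cassert>

// Total width in bits of a vector with Lanes elements of ElemBits each.
unsigned vectorBits(unsigned Lanes, unsigned ElemBits) {
  return Lanes * ElemBits;
}

// Number of native registers of RegBits needed to hold the whole vector,
// rounded up. Assumes RegBits is nonzero.
unsigned regsNeeded(unsigned Lanes, unsigned ElemBits, unsigned RegBits) {
  return (vectorBits(Lanes, ElemBits) + RegBits - 1) / RegBits;
}
```

For <4 x double> on 128-bit registers, regsNeeded(4, 64, 128) is 2, which is the SystemZ case described above. A 32-lane vector of 32-bit elements gives vectorBits(32, 32) == 1024, the width that had to be allocated as one tuple in our case.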

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D82227/new/

https://reviews.llvm.org/D82227