[PATCH] D82227: SLP: honor requested max vector size merging PHIs

Wed Nov 11 11:15:03 PST 2020

jonpa added a comment.

In D82227#2389284 <https://reviews.llvm.org/D82227#2389284>, @rampitec wrote:

> In D82227#2389258 <https://reviews.llvm.org/D82227#2389258>, @jonpa wrote:
>
>> In D82227#2389246 <https://reviews.llvm.org/D82227#2389246>, @rampitec wrote:
>>
>>> In D82227#2389212 <https://reviews.llvm.org/D82227#2389212>, @jonpa wrote:
>>>
>>>> I am afraid that this patch actually has a bad impact on performance on SystemZ, and unfortunately this was not known until now. It would be very appreciated if we could rework this and get back the old behaviour on SystemZ somehow...
>>>>
>>>> See https://bugs.llvm.org/show_bug.cgi?id=48155
>>>
>>> I believe if a target wants to have wider vectors it needs to increase the size returned from its TTIImpl::getRegisterBitWidth(). Can you try increasing a return from SystemZTTIImpl::getRegisterBitWidth()?
>>
>> I did try to override this with -slp-max-reg-size, and that works... However getRegisterBitWidth() is also used by other passes, like the LoopVectorizer, so it seems wrong to change that value as it is defined just for the purpose of tuning a particular optimization...
>
> The other pass which calls getRegisterBitWidth(true) is LoopVectorize. Do you mean you want to have different heuristics for loop and straight-line vectorization?

Well, the definition of that hook per the comment is "The width of the largest scalar or vector register type", so I don't see how it could be a variable to play with. It should simply reflect the size of the vector register - 128 bits for SystemZ.

In the original discussion there was a suggestion to look into the TTI costs on your target for those very wide vector types, a <32 x ...> PHI instruction...?  Why isn't it enough to use TTI?

Why would it make sense to only vectorize to <2 x double> and not <4 x double>?  The latter is just 2 vector regs, and that is completely fine... In my case it is obvious that the final result of the vectorizer is greatly improved by allowing an over-wide vector type, even though in the most simple case 2 x <2 x double> should give the same output as a split <4 x double>....  I am not sure yet exactly why this makes for many more vector fp-add/fp-mul in the output... Note that with your patch those instructions are not vectorized at all anymore, but are left scalar! So there is some vectorization that is lost by always doing max <2 x double> and never wider...

I wonder why is it better to do 2 x <2 x double> rather than <4 x double>, they will both use two vector registers... (not just for PHIs, but generally)?

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D82227/new/

https://reviews.llvm.org/D82227