[PATCH] D82227: SLP: honor requested max vector size merging PHIs

Jonas Paulsson via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Nov 20 08:35:27 PST 2020


jonpa added a comment.

In D82227#2395271 <https://reviews.llvm.org/D82227#2395271>, @vdmitrie wrote:

> In D82227#2389212 <https://reviews.llvm.org/D82227#2389212>, @jonpa wrote:
>
>> I am afraid that this patch actually has a bad impact on performance on SystemZ, and unfortunately this was not known until now. It would be very appreciated if we could rework this and get back the old behaviour on SystemZ somehow...
>>
>> See https://bugs.llvm.org/show_bug.cgi?id=48155
>
> We also observed a regression on x86, for imagick, after this patch. I'm not sure whether we observed the same case, but it was definitely related to this patch.
> Here is what happened:
> The SLP vectorizer started from 5 PHIs of i64 type, and it turned out, unfortunately, that only the last four of them were perfectly vectorizable with VF=4.
> The max register size blindly cut that list, taking the first four, which were unprofitable to vectorize, and thus left the good group outside the window that could even be tried. SLP finally ended up vectorizing just two of them with VF=2.
> Before the patch, after the failed attempt to vectorize the first four scalars, the vectorizer took the next four (the last in the list) and succeeded.
> Assuming the max vector size is a power of two, it probably makes sense to cut the list at 2*MaxRegSize-1 rather than at MaxRegSize.

I think I am seeing something very similar on SystemZ: the important BB has 5 PHIs of type double, where the first one has different operand opcodes (a constant) than the others. Before this patch, that group of 5 was passed to tryToVectorizeList(), which first tried PHIs 0-3 (failing due to the different operands of PHI #0) and then succeeded with 1-4. This patch changed that so that groups of just 2 are passed to tryToVectorizeList() instead; 0-1 fails, and then it seems some VF=2 vectorization takes place instead.

By using -slp-max-reg-size=320 I get exactly 5 elements, and the old behaviour for this example is restored, with <4 x double> instructions as the result. If I instead change the loop in vectorizeChainsInBlock() to retry on the next instruction in the group ('IncIt = SameTypeIt' => 'IncIt++'), vectorization is tried with groups [1,2] and [3,4], which I thought might be as good as the [1,2,3,4] group, which will be split later anyway. This, however, did not work: only [3,4] was actually vectorized, because [1,2] was considered too expensive. I tried adjusting with a negative value for -slp-threshold, and even though that enabled more vectorization, the performance was not improved at all.
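
To make the windowing concrete, here is a small self-contained C++ sketch. It is not the actual vectorizeChainsInBlock()/tryToVectorizeList() code, just a stand-alone model of this particular block; isProfitable() is a hard-wired stand-in for the real getTreeCost()-based check, and the helper names are mine:

  // Five same-typed PHIs where only the group [1,4] forms a profitable
  // VF=4 tree (PHI #0 has the differing operands).
  #include <algorithm>
  #include <iostream>

  static bool isProfitable(int Start, int VF) {
    return Start == 1 && VF == 4; // only PHIs 1..4, as a group of four
  }

  // Rough model of tryToVectorizeList(): try the largest power-of-two VF
  // that fits the group, sliding the start on failure ("try 0-3, then
  // 1-4"), then fall back to smaller VFs.
  static bool tryList(int Begin, int NumElts) {
    if (NumElts < 2)
      return false;
    int MaxVF = 1;
    while (MaxVF * 2 <= NumElts)
      MaxVF *= 2;
    for (int VF = MaxVF; VF >= 2; VF /= 2)
      for (int Start = Begin; Start + VF <= Begin + NumElts; ++Start)
        if (isProfitable(Start, VF)) {
          std::cout << "  vectorized PHIs [" << Start << ".."
                    << Start + VF - 1 << "] with VF=" << VF << "\n";
          return true;
        }
    return false;
  }

  int main() {
    const int NumPHIs = 5;
    // Cap 2 models the patched behaviour on SystemZ (128-bit registers,
    // 64-bit elements); cap 5 models -slp-max-reg-size=320, i.e. the old
    // behaviour for this block.
    for (int Cap : {2, 5}) {
      std::cout << "window cap " << Cap << ":\n";
      bool Any = false;
      for (int I = 0; I < NumPHIs; I += Cap)
        Any |= tryList(I, std::min(Cap, NumPHIs - I));
      if (!Any)
        std::cout << "  the profitable VF=4 group is never tried\n";
    }
  }

With a cap of 2 the profitable [1,4] group never even ends up inside a window that gets tried, which matches what I am seeing above.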

The performance improvement is only present with VF=4, which also adds 3 shufflevectors, so it seems that SLP can shuffle a few vectors and still produce better code than at VF=2.

> The problem with using the costs returned from TTI is exactly this: they were ignored here, and PHI vectorization was trying to grab as much as it could.

It seems that might simply be missing, then? tryToVectorizeList() calls getTreeCost(), which queries TTI for these costs, so it seems this is the place to increase the cost to avoid the vectorization, isn't it? Could you provide a test case and explain why this approach is not working?
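
For reference, the gate I have in mind is (if I read the code right) the 'Cost < -SLPCostThreshold' check near the end of tryToVectorizeList(). Below is a stand-alone paraphrase of that decision; the helper name and the example costs are invented for illustration, only the shape of the comparison mirrors the real code:

  #include <iostream>

  // A tree is only vectorized when its cost (vector minus scalar, as
  // accumulated from TTI queries in getTreeCost()) beats the threshold.
  static bool worthVectorizing(int TreeCost, int SLPCostThreshold) {
    return TreeCost < -SLPCostThreshold; // shape of the check in tryToVectorizeList()
  }

  int main() {
    const int Threshold = 0; // default -slp-threshold
    // If TTI reports the problematic wide tree as cheap (negative cost) it
    // is vectorized; raising the relevant TTI costs would push TreeCost up
    // and reject it here, with no need to cap how many PHIs are collected.
    std::cout << worthVectorizing(-2, Threshold) << "\n"; // 1: vectorize
    std::cout << worthVectorizing(1, Threshold) << "\n";  // 0: keep scalar
  }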

Just because some other place is using this heuristic doesn't mean that it is necessarily optimal - it could very well be the other way around, so that it really shouldn't be limited in those places either...


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D82227/new/

https://reviews.llvm.org/D82227


