[PATCH] D82227: SLP: honor requested max vector size merging PHIs

Stanislav Mekhanoshin via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Nov 20 12:45:59 PST 2020


rampitec added a comment.

In D82227#2408693 <https://reviews.llvm.org/D82227#2408693>, @jonpa wrote:

>> It did not work for the reduced test I have submitted with the change for a simple reason: it was not checked.
>
> Yeah - but if you have it and post it here we can work together on it...  Maybe an .ll/.bc file with a runline which gives in the output what you need to avoid. Maybe even an llc runline on that to show the spilling...

That was a really long time ago. I tried to find the original .bc from the failing app, but I cannot locate it anymore :(

>> I can understand the argument that controlling it with register size might not be a best approach. In this case we can just expose another target callback, specifically for the vectorization purposes.
>
> Would it make sense to have SLP first try the full group, and then as long as it's not profitable reiterate with half of the previous group size? In other words first try 32 in your case, and then start over with a max of 16, then 8, all the way down to 2 unless TTI costs returned a profitable total cost? That is one idea.. An alternative might be to have SLP look at the tree it wants to convert and do a register pressure estimate and add that to the total cost...  This is assuming that greater VFs are beneficial, which I at least think they are at the moment...

That is assuming you believe wider vectorization is a bonus. It might be on some targets, but it definitely is not for AMDGPU. We have very wide register tuples, but really only 2-element vector ALU instructions (and 4-element vector loads and stores). Nonetheless, since we have these wide registers, RA will use them, which increases register pressure; the generated code then operates on subregs of these wide tuples. In fact, just by returning a twice-wider vector register size I see an increase in the number of consumed registers, which directly lowers performance.
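To make the mechanism concrete: SLP currently caps the width of the trees it builds by the vector register size the target reports through TTI, optionally overridden by -slp-max-reg-size. The sketch below is only a rough paraphrase of that logic as it looked around this time in SLPVectorizer.cpp; the exact variable names and plumbing differ. The point is that reporting a twice-wider register makes SLP form twice-wider bundles, which is exactly what backfires on AMDGPU.

  #include "llvm/Analysis/TargetTransformInfo.h"
  using namespace llvm;

  // Rough paraphrase (names approximate, not copied verbatim) of how the
  // SLP vectorizer derives the maximum vector width it will use.
  static unsigned computeMaxVecRegSize(const TargetTransformInfo &TTI,
                                       unsigned CmdLineOverride /*0 = unset*/) {
    // If -slp-max-reg-size was given on the command line, it wins.
    if (CmdLineOverride)
      return CmdLineOverride;
    // Otherwise ask the target how wide its vector registers are; this
    // single knob is what currently bounds the bundles SLP will form.
    return TTI.getRegisterBitWidth(/*Vector=*/true);
  }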

A target which does not have such wide registers does not face this problem: its vectors will simply be split at lowering. And that is the real difference here.

If for some reason the vector register width is not a good enough driver for vectorization, I would rather create yet another target callback. It just happens that register width is what is currently used across LLVM to control this, but we can change that.
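For the sake of discussion, such a hook could look roughly like the sketch below. Everything here is hypothetical: the name getMaxVectorizationWidthInBits and its placement are made up for illustration and are not an existing TTI interface. The idea is only that the vectorizers could consult a vectorizer-specific limit instead of reusing the physical register width.

  #include "llvm/Analysis/TargetTransformInfo.h"
  using namespace llvm;

  // Hypothetical, illustration-only hook: a cap (in bits) meant specifically
  // for the vectorizers, decoupled from the physical vector register width
  // the target reports elsewhere. NOT an existing TTI interface.
  static unsigned getMaxVectorizationWidthInBits(const TargetTransformInfo &TTI) {
    // A real version would be a TTI virtual that targets override, e.g.
    // AMDGPU returning 64 (2 x 32-bit elements) to match its 2-element
    // vector ALU, while targets that profit from wide trees return their
    // full vector register width.
    return TTI.getRegisterBitWidth(/*Vector=*/true); // default: today's behaviour
  }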


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D82227/new/

https://reviews.llvm.org/D82227


