[PATCH] D122148: [SLP] Peak into loads when hitting the RecursionMaxDepth

Dave Green via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Mar 29 03:05:40 PDT 2022


dmgreen added inline comments.


================
Comment at: llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:4105
+  // Gather if we hit the RecursionMaxDepth, unless this is a load (or z/sext of
+  // a load), in which case peak through to include it in the tree, without
+  // ballooning over-budget.
----------------
ktkachov wrote:
> typo nit: should be "peek"
Oh right, yeah. Cheers. I'll fix that.


================
Comment at: llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:3998
+  if (Depth >= RecursionMaxDepth &&
+      !(S.MainOp && match(S.MainOp, m_ZExtOrSExtOrSelf(m_Load(m_Value()))))) {
     LLVM_DEBUG(dbgs() << "SLP: Gathering due to max recursion depth.\n");
----------------
ABataev wrote:
> dmgreen wrote:
> > ABataev wrote:
> > > It might be not profitable, e.g. if vector extension is not free, while scalar extension is free, and loads are not vectorizable. Also, what if you have an extension of a load and some other instructions, not loads/extractelements, etc.?
> > I'm not sure I understand what you mean by unprofitable? This just stops zext(load) from being forced to be a gather if it hits the max depth. It should just mean that those nodes are either better (not gathers) or the same, and shouldn't lead to regressions. It was previously hitting an arbitrary limit - you could say the same about any arbitrary limit causing arbitrary problems. Giving the loads the ability to order nicely should be a bigger win.
> > 
> > For the second part - do you mean an AltOp? If so then that makes sense; we can add a check for that, making sure it is the same as the MainOp.
> 1. The cost of vector sext/zext is larger than the cost of scalar sext/zext (which might be free in many cases).
> 2. If S.MainOp is zext/sext(load), it does not mean that all values are zext/sext(load); they might be zext/sext(load, extract, binop, etc.), since you're checking only the MainOp.
Can't that be true for any limit though? We have an arbitrary limit of 12 at the moment. Decreasing the limit to 11 would mean some zexts are treated like gathers, not vectorized, and the cost of zexting the loads may be cheaper for scalars than it is for vectors. The same would be true for decreasing the limit to 10 or 9. We would end up picking the limit where we most expect to find zext(load), which is probably very low. (Or just never vectorizing zext(load) if the load is a gather.)

But in general, if the loads can be vectorized nicely (either contiguously or in clusters), then it should be a gain. The better vectorization of the loads would overcome the difference in cost between the scalar and vector zext. Across all the code out there, we should expect this to improve performance more often than it decreases it.

For the second point, do you have any suggestions? As a simple heuristic, this seemed like a nice enough check to me, balancing the complexity of the check against the expected outcome. Should I make it an all_of?
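To make the "all_of" suggestion concrete, here is a minimal standalone sketch of the idea: instead of matching only S.MainOp, require every value in the bundle to be a load or a zext/sext of a load before peeking past RecursionMaxDepth. Note this uses a hypothetical `Instr` stand-in and `std::all_of` rather than LLVM's IR types and `m_ZExtOrSExtOrSelf` matcher, purely to illustrate the shape of the check, not the actual SLPVectorizer implementation.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-ins for IR instructions; not LLVM APIs.
enum class Op { Load, ZExt, SExt, Add };

struct Instr {
  Op Opcode;
  const Instr *Operand = nullptr; // single operand, used by ext instructions
};

// Mirrors the intent of m_ZExtOrSExtOrSelf(m_Load(...)): true for a load
// itself, or for a zext/sext whose operand is a load.
static bool isLoadOrExtOfLoad(const Instr &I) {
  if (I.Opcode == Op::Load)
    return true;
  if ((I.Opcode == Op::ZExt || I.Opcode == Op::SExt) && I.Operand)
    return I.Operand->Opcode == Op::Load;
  return false;
}

// The all_of version of the guard: only peek past the depth limit when the
// whole bundle consists of loads / extended loads, not just the main op.
static bool shouldPeekPastMaxDepth(const std::vector<Instr> &Bundle) {
  return std::all_of(Bundle.begin(), Bundle.end(), isLoadOrExtOfLoad);
}
```

With this shape, a bundle like {load, zext(load)} would still peek through, while a mixed bundle like {load, add} would fall back to gathering at the limit.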


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D122148/new/

https://reviews.llvm.org/D122148
