[PATCH] D122148: [SLP] Peek into loads when hitting the RecursionMaxDepth

Alexey Bataev via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Mar 29 15:25:49 PDT 2022


ABataev added inline comments.


================
Comment at: llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:3998
+  if (Depth >= RecursionMaxDepth &&
+      !(S.MainOp && match(S.MainOp, m_ZExtOrSExtOrSelf(m_Load(m_Value()))))) {
     LLVM_DEBUG(dbgs() << "SLP: Gathering due to max recursion depth.\n");
----------------
dmgreen wrote:
> ABataev wrote:
> > dmgreen wrote:
> > > ABataev wrote:
> > > > dmgreen wrote:
> > > > > ABataev wrote:
> > > > > > It might not be profitable, e.g. if the vector extension is not free while the scalar extension is free, and the loads are not vectorizable. Also, what if you have an extension of a load mixed with some other instructions, not loads/extractelements, etc.?
> > > > > I'm not sure I understand what you mean by unprofitable? This just stops zext(load) from being forced to be a gather when it hits the max depth. It should just mean that those nodes are either better (not gathers) or the same, and shouldn't lead to regressions. It was previously hitting an arbitrary limit, and you could say the same of any arbitrary limit causing arbitrary problems. Giving the loads the ability to order nicely should be a bigger win.
> > > > > 
> > > > > For the second part - do you mean an AltOp? If so, that makes sense; we can add a check for that, making sure it is the same as the MainOp.
> > > > 1. The cost of a vector sext/zext is larger than the cost of a scalar sext/zext (which might be free in many cases).
> > > > 2. If S.MainOp is zext/sext(load), it does not mean that all values are zext/sext(load); they might be zext/sext(load, extract, binop, etc.), since you're checking only the MainOp.
> > > Can't that be true for any limit though? We have an arbitrary limit of 12 at the moment. Decreasing the limit to 11 would mean some zexts are treated as gathers and not vectorized, and the cost of zexting the loads may be cheaper for scalars than it is for vectors. The same would be true for decreasing the limit to 10 or 9. We would end up picking the limit where we most expect to find zext(load), which is probably very low. (Or just never vectorizing zext(load) if the load is a gather.)
> > > 
> > > But in general, if the loads can be vectorized nicely (either contiguously or in clusters) then it should be a gain. The better vectorization of the loads would overcome the difference in cost between the scalar and vector zext. For all the code out there, we should expect this to improve performance more often than it decreases it.
> > > 
> > > For the second point, do you have any suggestions? As a simple heuristic, this seemed like a nice enough check to me, balancing the complexity of the check against the expected outcome. Should I make it an all_of?
> > > Can't that be true for any limit though? We have an arbitrary limit of 12 at the moment. Decreasing the limit to 11 would mean some zexts are treated as gathers and not vectorized, and the cost of zexting the loads may be cheaper for scalars than it is for vectors. The same would be true for decreasing the limit to 10 or 9. We would end up picking the limit where we most expect to find zext(load), which is probably very low. (Or just never vectorizing zext(load) if the load is a gather.)
> > 
> > Yes, but we already have some optimizations/numbers/data for this limit. Yes, this limit is not optimal, but it is stable, and lots of apps show that it is good enough. Any changes here are very sensitive.
> > 
> > > 
> > > But in general, if the loads can be vectorized nicely (either contiguously or in clusters) then it should be a gain. The better vectorization of the loads would overcome the difference in cost between the scalar and vector zext. For all the code out there, we should expect this to improve performance more often than it decreases it.
> > 
> > The problem here is that you're not checking for loads across the whole bundle; you just check that you have a single load. That will work for plain loads, but not for zext/sext.
> > 
> > > 
> > > For the second point, do you have any suggestions? As a simple heuristic, this seemed like a nice enough check to me, balancing the complexity of the check against the expected outcome. Should I make it an all_of?
> > 
> > Yes, it is better to have all_of here, at least for zext/sext. For the loads themselves it is enough just to check S.MainOp.
> > 
> I tried the llvm test suite - there were only 2 changes I saw with this patch (according to the file hashes). One was the same code in a different order, and the other was essentially the same, just with slightly better reuse of values. Both had the same number of instructions.
> 
> I don't think I agree that the old limit was particularly special. It was an arbitrary point, and it would seem that increasing it should be beneficial in general. A lot of graphs don't get that big, but the ones that do can benefit from including the loads.
> 
> Can you give more details about where you expect this to cause problems? Is there a particular benchmark you are worried about?
I already said that I tried to increase it and there were lots of regressions. Try your patch on SPEC CPU or another benchmark suite.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D122148/new/

https://reviews.llvm.org/D122148
