[PATCH] D30247: Epilog loop vectorization

Wed Feb 22 03:29:59 PST 2017

rengolin added a comment.

I think this is an interesting idea, though with limited applicability. Furthermore, the way you implemented it makes it completely orthogonal to the vectoriser, and at odds with the `VPlan` strategy being discussed.

Before VPlans get in, I'd assume you would sort the strategies by cost and, if they were in a beneficial order (e.g. 4 > 2 > 1), you'd then split the loop into N parts, one for each size. The problem here is that the trade-offs are not clear, and this is probably only beneficial for *very* large VF (32+), because now you're adding more run time checks, shuffles and moves between scalar and vector register banks, which are not always free.
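To make the "split the loop into N parts" idea concrete, here is a rough sketch of how a trip count would be divided across sub-loops of decreasing width. `LoopPart` and `splitTripCount` are hypothetical names for illustration only, not anything in the vectoriser, and the sketch ignores the run time checks and register moves mentioned above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of LLVM: describes one sub-loop produced by
// splitting the original loop.
struct LoopPart {
  unsigned Width;      // lanes processed per step of this sub-loop
  uint64_t Iterations; // original-loop iterations covered by this sub-loop
};

// Given a trip count and widths sorted from widest to narrowest, return how
// many original iterations each sub-loop would cover. If the last width is 1,
// the final part is the scalar remainder and nothing is left uncovered.
std::vector<LoopPart> splitTripCount(uint64_t TripCount,
                                     const std::vector<unsigned> &Widths) {
  std::vector<LoopPart> Parts;
  uint64_t Remaining = TripCount;
  for (unsigned W : Widths) {
    assert(W != 0 && "vector width must be non-zero");
    // Each sub-loop only runs whole steps of W lanes.
    uint64_t Covered = (Remaining / W) * W;
    Parts.push_back({W, Covered});
    Remaining -= Covered;
  }
  return Parts;
}
```

For a trip count of 23 and widths 8, 4, 1, this gives 16 iterations in the width-8 loop, 4 in the width-4 loop and 3 scalar iterations. Each extra part is another loop with its own entry check, which is where the unclear trade-off comes from.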

A few more ideas...

1. If the second loop needs to be >16, then just unroll however many instructions to match 16 lanes, no loops necessary.
2. Link this optimisation to code size restrictions. We don't want to run this at Os.
3. Once VPlans go in, this would be a separate VPlan that could be applied on top of others, so we may need to change the VPlan implementation to allow that.

Finally, tests. Even though this is just a proposal, with tests you can show your proposal "in action" and allow us to discuss specific details of the pass that could be too opaque or intricate to realise just by looking at the code.

cheers,
--renato

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:3345
+  }
+  DEBUG(dbgs() << "Epilog vectorization is beneficial with width : "
+               << EpilogVectorLoopWidth << " in Function: "
----------------
So, you're assuming that just having more than 16 iterations is "beneficial", but you haven't done any cost analysis. It may very well be that the cost is just not worth it, especially for smaller vector sizes.
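
As a rough illustration of the kind of check I mean, the sketch below compares a scalar-only epilog against a vectorised one. Everything here is an assumption for the sake of discussion: `EpilogCosts` and `isEpilogVectorizationProfitable` are hypothetical names, the costs are abstract units rather than values from the real cost model, and the number of remaining iterations is taken as known, which it usually is not at compile time.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical cost inputs in abstract units, not taken from LLVM's cost model.
struct EpilogCosts {
  uint64_t ScalarIterCost; // cost of one scalar iteration
  uint64_t VectorIterCost; // cost of one vector step covering EpilogVF lanes
  uint64_t SetupOverhead;  // extra run time checks, shuffles, register moves
};

// Returns true only if vectorising the epilog is strictly cheaper than
// running all remaining iterations in the scalar loop.
bool isEpilogVectorizationProfitable(uint64_t RemainingIters,
                                     unsigned EpilogVF,
                                     const EpilogCosts &C) {
  assert(EpilogVF > 1 && "epilog vector width must be greater than one");
  uint64_t ScalarOnly = RemainingIters * C.ScalarIterCost;
  uint64_t VectorSteps = RemainingIters / EpilogVF;
  uint64_t ScalarTail = RemainingIters % EpilogVF; // still runs as scalar
  uint64_t WithVectorEpilog = C.SetupOverhead +
                              VectorSteps * C.VectorIterCost +
                              ScalarTail * C.ScalarIterCost;
  return WithVectorEpilog < ScalarOnly;
}
```

With a scalar cost of 4, a vector cost of 6 and an overhead of 10, an epilog of width 4 pays off for 15 remaining iterations (40 vs. 60) but not for 3 (22 vs. 12), which is the small-size case I'm worried about.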

Repository:
  rL LLVM

https://reviews.llvm.org/D30247