[PATCH] D89566: [LV] Epilogue Vectorization with Optimal Control Flow

Fri Nov 27 09:09:31 PST 2020

dmgreen added inline comments.

================
Comment at: llvm/test/Transforms/LoopVectorize/ARM/pointer_iv.ll:292

 define hidden void @pointer_phi_v16i8_add1(i8* noalias nocapture readonly %A, i8* noalias nocapture %B, i32 %y) {
 ; CHECK-LABEL: @pointer_phi_v16i8_add1(
----------------
bmahjour wrote:
> dmgreen wrote:
> > I was surprised to see an MVE test like this chose to try and epilogue vectorize. I had presumed that would not happen on MVE - we only have a single vector width with no interleaving - the benefit of trying to do a single <8 x i8> iteration after a <16 x i8> main loop is not going to be worth the additional branching/setup we have to do, unfortunately. I ran some extra tests and added a mve-qabs.ll test, where again the <16 x i8> loop is getting a remainder where it isn't beneficial.
> > 
> > I don't believe that MVE is a vector target that would ever benefit from epilogue vectorization, unfortunately. Can we get some sort of target hook that allows us to disable it? Perhaps something that sets a maximum epilogue vectorization factor given a VF * UF main loop? That would allow us to set it to none, whilst others tune it for their needs, like possibly always having the fallback as a 64-bit vector under aarch64 (just a thought, not sure if that's the best idea or not, but it at least allows targets to tune things).
> > I ran some extra tests and added a mve-qabs.ll test, where again the <16 x i8> loop is getting a remainder where it isn't beneficial.
> 
> Is it degrading performance or just not beneficial (harmless)? As I mentioned before the heuristic in this patch is not very good, but putting the cost-modeling in the critical path for getting the codegen implemented is also not desirable. I had suggested to disable this transformation by default until a proper cost-model is implemented, to which some people disagreed.
> 
> In order to come up with a meaningful target hook it would be helpful to know what machine characteristics in MVE cause epilogue vectorization to not be beneficial. Are there existing TTI hooks that we can use (e.g. `getMaxInterleaveFactor() > 1`)?
> Is it degrading performance or just not beneficial (harmless)?

Degrading performance, unfortunately. It doesn't happen in a lot of tests, but the decrease was between 0% and 25%, apparently. The M in MVE stands for microcontroller (umm, I think), so it can be somewhat constrained and can be especially hurt by inefficient codegen that would not be as bad on other cores/vector architectures.

The max interleaving factor will be 1 for any MVE target. They also only have 128-bit vectors, with none of the 64-bit wide vectors that would be present in NEON. Essentially that means that for i8 we would either vectorize x 16 (which is excellent), or x 8 using extends if we can't for x 16. The x 8 would be beneficial on its own with enough iterations, I think, but doing a single iteration at x 8 does not overcome the additional cost from outside the vector loops.
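
To make the "single iteration at x 8" point concrete, here is a rough sketch of how a trip count splits across the main vector loop, the epilogue vector loop and the scalar remainder. `LoopSplit` and `splitTripCount` are hypothetical names for illustration, not LLVM code, and the sketch assumes UF = 1 and ignores cases where the vectorizer must leave a scalar iteration behind.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type for illustration only; not part of LLVM.
struct LoopSplit {
  uint64_t MainIters;     // iterations of the main vector loop
  uint64_t EpilogueIters; // iterations of the epilogue vector loop
  uint64_t ScalarIters;   // iterations left for the scalar remainder loop
};

// Splits TripCount assuming UF == 1 (as on MVE).
// Whatever the main loop leaves over goes first to the epilogue vector
// loop, then to the scalar loop.
inline LoopSplit splitTripCount(uint64_t TripCount, uint64_t MainVF,
                                uint64_t EpilogueVF) {
  assert(MainVF != 0 && EpilogueVF != 0 && "vector factors must be nonzero");
  uint64_t Remainder = TripCount % MainVF;
  return {TripCount / MainVF, Remainder / EpilogueVF, Remainder % EpilogueVF};
}
```

With a main VF of 16 and an epilogue VF of 8 the remainder is always below 16, so the epilogue loop runs at most once; a trip count of 27, for instance, gives one main iteration, one epilogue iteration and three scalar iterations.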

Using getMaxInterleaveFactor to limit this for the moment would work for me. I have no strong opinions on enabling this by default or not, but you may want the very initial commit to default to false with a commit soon after to enable it. That way if someone does revert this, at least they are only reverting the flipping of the switch and not the whole patch.
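
Roughly what I have in mind, as a standalone sketch rather than the actual patch: `TargetInfoSketch` stands in for the TTI query (the real hook being `getMaxInterleaveFactor`), and `mayVectorizeEpilogue` and the `EnableEpilogueVectorization` flag are made-up names for illustration.

```cpp
// Hypothetical stand-in for the target query; in LLVM the value would
// come from the TTI hook getMaxInterleaveFactor.
struct TargetInfoSketch {
  unsigned MaxInterleaveFactor;
};

// Hypothetical predicate: consider epilogue vectorization only when the
// option is switched on and the target can interleave at all.
inline bool mayVectorizeEpilogue(const TargetInfoSketch &TTI,
                                 bool EnableEpilogueVectorization) {
  if (!EnableEpilogueVectorization)
    return false; // initial commit could default to this path
  return TTI.MaxInterleaveFactor > 1; // MVE reports 1, so it is skipped
}
```

Keeping the on/off switch separate from the target check means the follow-up commit only needs to flip the default.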

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D89566/new/

https://reviews.llvm.org/D89566