[PATCH] D154264: [LV] Skip VFs < iterations remaining for epilogue vectorization.

Sun Jul 2 12:45:22 PDT 2023

Ayal added inline comments.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:5679
       return {ForcedEC, 0, 0};
     else {
       LLVM_DEBUG(dbgs() << "LEV: Epilogue vectorization forced factor is not "
----------------
nit (unrelated): avoid else after return.
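
For illustration only, a minimal standalone sketch of the early-return shape this nit asks for. `pickForcedWidth` and its parameters are hypothetical stand-ins, not the code in LoopVectorize.cpp, and 0 is used here as an assumed "no forced factor" value.

```cpp
// Hypothetical helper, not LLVM code: returns the forced width when a plan
// exists for it, and 0 (assumed "none" value) otherwise.
unsigned pickForcedWidth(unsigned ForcedWidth, bool HasPlan) {
  if (HasPlan)
    return ForcedWidth;
  // No 'else' needed: the branch above has already returned, so the
  // fallback code can sit at the outer indentation level.
  return 0;
}
```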

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:5709

-  for (auto &NextVF : ProfitableVFs)
+  ScalarEvolution &SE = *PSE.getSE();
+  const SCEV *TC =
----------------
nit: can set `Type *TCType = Legal->getWidestInductionType();` and use it below.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:5722
+      continue;
+
     if (((!NextVF.Width.isScalable() && MainLoopVF.isScalable() &&
----------------
Would it be worth early-continuing in this order: first (a) if !hasPlanWithVF, then (b) if NextVF >= EstimatedRuntimeVF, and last (c) if NextVF >= RemainingIterations?

Note that for IC==1 and non-scalable VFs, check (c) subsumes check (b).

Can the checks below be simplified, given that EstimatedRuntimeVF == MainLoopVF if !MainLoopVF.isScalable()?
Instead of checking Result.Width.isScalar(), would it be better to check whether Result is still uninitialized, i.e., == VectorizationFactor::Disabled()? Alternatively, teach isMoreProfitable() to prefer any computed cost over an uncomputed one.
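
To make the suggested ordering concrete, here is a rough standalone sketch. Everything in it is a simplified assumption rather than the LLVM code: `Candidate` is a hypothetical stand-in for VectorizationFactor restricted to fixed-width VFs, `HasPlan` stands in for hasPlanWithVF, a `Width` of 0 is an assumed sentinel for the "still uninitialized" result, and the per-lane cost comparison is only a placeholder for isMoreProfitable(). The comparison operators follow the wording of checks (a) to (c) above.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, simplified model of a candidate VF (fixed-width only).
struct Candidate {
  uint64_t Width;       // vectorization factor
  uint64_t CostPerLane; // assumed cost metric; lower is better
  bool HasPlan;         // stands in for hasPlanWithVF(Width)

  // Assumed sentinel for "no epilogue VF chosen yet".
  static Candidate Disabled() { return {0, 0, false}; }
  bool isDisabled() const { return Width == 0; }
};

Candidate selectEpilogueVF(const std::vector<Candidate> &ProfitableVFs,
                           uint64_t EstimatedRuntimeVF,
                           uint64_t RemainingIterations) {
  Candidate Result = Candidate::Disabled();
  for (const Candidate &NextVF : ProfitableVFs) {
    // (a) Skip candidates without a plan.
    if (!NextVF.HasPlan)
      continue;
    // (b) Skip candidates at least as wide as the main loop's estimated VF.
    if (NextVF.Width >= EstimatedRuntimeVF)
      continue;
    // (c) Skip candidates at least as wide as the remaining iterations.
    if (NextVF.Width >= RemainingIterations)
      continue;
    // Take the first surviving candidate unconditionally (Result is still
    // uninitialized), then keep the cheaper one.
    if (Result.isDisabled() || NextVF.CostPerLane < Result.CostPerLane)
      Result = NextVF;
  }
  return Result;
}
```

With candidates of width 16, 8 and 4, an estimated main-loop VF of 16 and 5 remaining iterations, only width 4 survives all three checks in this model. Testing for the sentinel rather than for a scalar width keeps "nothing selected yet" distinct from any selected candidate.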

================
Comment at: llvm/test/Transforms/LoopVectorize/X86/gather_scatter.ll:866
 ; AVX512-NEXT:    [[TMP8:%.*]] = add nuw nsw i64 [[TMP7]], 8
-; AVX512-NEXT:    [[UGLYGEP:%.*]] = getelementptr i8, ptr [[DEST:%.*]], i64 [[TMP8]]
+; AVX512-NEXT:    [[SCEVGEP:%.*]] = getelementptr i8, ptr [[DEST:%.*]], i64 [[TMP8]]
 ; AVX512-NEXT:    [[TMP9:%.*]] = shl nuw i64 [[TMP6]], 2
----------------
nit: these UGLY-to-SCEV changes are unrelated.

================
Comment at: llvm/test/Transforms/LoopVectorize/X86/limit-vf-by-tripcount.ll:7

 ; TODO: Make sure selected VF for the epilog loop doesn't exceed remaining TC.
 define void @test_tc_17_no_epilogue_vectorization(ptr noalias %src, ptr noalias %dst) #0 {
----------------
Does this patch address this TODO?

================
Comment at: llvm/test/Transforms/LoopVectorize/X86/pr42674.ll:8
 ; a VF=64,UF=4 loop, but the scalar trip count is only 128 so
 ; the vector loop was dead code leaving only a scalar remainder.
 define zeroext i8 @sum() {
----------------
Note that the change in generated code for this test appears to be only **indirectly** related to fixing the VF selected for the **epilog** loop.
This loop is vectorized with VF=64, UF=2 both with and without the patch. Without the patch, its epilog loop is also vectorized (with VF=32); that epilog loop is cleaned up by instcombine and simplifycfg, but the main loop itself remains. With the patch, the epilog loop is not vectorized, and the main loop, which executes a single vector iteration, gets cleaned up.
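
The arithmetic behind the "single vector iteration" can be sketched as follows. This is only a model: `splitTripCount` is a hypothetical helper, and it assumes the iterations left for the epilog are the trip count modulo VF * UF, which may not match exactly how the patch computes them.

```cpp
#include <cstdint>

// Hypothetical helper, not LLVM code: splits a constant trip count into
// full vector iterations of the main loop and leftover scalar iterations.
struct LoopSplit {
  uint64_t VectorIterations;    // iterations of the main vector loop
  uint64_t RemainingIterations; // iterations left for an epilog loop
};

inline LoopSplit splitTripCount(uint64_t TripCount, uint64_t VF, uint64_t UF) {
  const uint64_t Step = VF * UF; // elements consumed per vector iteration
  return {TripCount / Step, TripCount % Step};
}
```

For the trip count of 128 mentioned in the test comment and VF=64, UF=2, this gives one vector iteration and zero remaining iterations, so under this model no epilog VF can be smaller than the remaining iteration count.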

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D154264/new/

https://reviews.llvm.org/D154264