[PATCH] D115261: [LV] Disable runtime unrolling for vectorized loops.

Florian Hahn via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Aug 16 09:10:48 PDT 2022


fhahn updated this revision to Diff 453036.
fhahn added a comment.
Herald added a subscriber: zzheng.
Herald added a project: All.

In D115261#3188312 <https://reviews.llvm.org/D115261#3188312>, @dmgreen wrote:

>> I believe many targets already disable runtime unrolling for loops that contain vector instructions. For example AArch64 does that, though X86 currently does not. This is the principal alternative I would see, to move that logic up into the generic unroll preferences. It would be the difference between not unrolling loops that LLVM vectorized and not unrolling vector loops in general -- I assume the preference would be the former, as this patch does?
>
> In ARM-MVE we do already, both for vectorized code from the vectorizer and vector code from the frontend (intrinsics for example). That is largely to do with the way MVE handles tail-predication and hardware loops, which makes runtime unrolling generally detrimental.
>
> For AArch64 we inherit some of the unrolling preferences from the base class, so I would expect this to alter in-order and out-of-order unrolling decisions to some degree. The main reason we disabled the extra unrolling of loops with vectors was to not over-unroll loops past the point that it is useful - as in, if you have a loop acting on i8's that is vectorized x16, then "interleaved" x2 or x4, then further runtime unrolled, you end up with a fast loop body that handles 64 or 128 or even 256 elements at a time. For too many trip counts you just don't enter that loop, or do most of the processing in the remainder loop. (Also we didn't want to break a lot of the hand-crafted intrinsic code that is out there that already unrolls the loop to carefully fit into the number of registers available).

That's a good point. AFAICT AArch64 disables unrolling of loops with vector instructions in general (not just runtime unrolling, which IIRC is still not enabled by default). I guess an alternative would be to do this generally (or at least for X86 as well), as I sketched in D131972 <https://reviews.llvm.org/D131972>. It would probably also make sense to only do it for out-of-order CPUs for now.

But the general assumption remains: if LV didn't decide to interleave to increase parallelism, it is extremely unlikely that LoopUnroll will make a better-informed decision later on.

>> Note that we already add metadata to disable runtime unrolling to the scalar loop after vectorization.
>
> We currently only add no-unroll metadata to the remainder when no runtime checks are added. I wouldn't be surprised if a lot of the codesize gain from this patch is due to this patch adding no-unroll to the remainder, not to the vector body. It is quite easy to construct cases where the remainder really should be allowed to unroll.
>
> LLVM has never had very good end-to-end testing for this kind of thing and has relied upon benchmarks like the llvm-test-suite to fill that gap. The noise makes it difficult, but I would expect a decent amount of benchmarking to prove a change like this is better overall and to catch a lot of the cases where it is not.

I updated the patch to leave the metadata of the scalar loop untouched and to add metadata that disables any unrolling of the vector loop. This should effectively have the same effect as D131972 <https://reviews.llvm.org/D131972>, although compile-time is slightly better because LoopUnroll exits earlier (geomean reduction with this patch for -O3 is -1.38% vs -1.18%).
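As a rough sketch of the intended result (the metadata strings are the ones LLVM already uses; block and node names here are made up for illustration, and the exact node layout the patch emits may differ), the two loops after vectorization would carry metadata along these lines:

```llvm
; Hypothetical post-vectorization IR: the vector body now carries
; llvm.loop.unroll.disable, while the scalar remainder keeps its
; existing metadata (llvm.loop.isvectorized, plus
; llvm.loop.unroll.runtime.disable when no runtime checks were added).

vector.body:                                      ; vector loop latch
  ; ...
  br i1 %done.v, label %middle.block, label %vector.body, !llvm.loop !0

loop:                                             ; scalar remainder latch
  ; ...
  br i1 %done.s, label %exit, label %loop, !llvm.loop !3

!0 = distinct !{!0, !1, !2}
!1 = !{!"llvm.loop.isvectorized", i32 1}
!2 = !{!"llvm.loop.unroll.disable"}
!3 = distinct !{!3, !1, !4}
!4 = !{!"llvm.loop.unroll.runtime.disable"}
```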

If we decide to predicate this on out-of-order vs in-order, I am currently leaning towards making this decision in getUnrollingPreferences in TTI.

There is further feedback indicating that aggressive unrolling of vector code on X86 isn't beneficial: https://github.com/llvm/llvm-project/issues/42332

https://llvm-compile-time-tracker.com/compare.php?from=a4a2ac5d1878177b57b76b109fda3820c6939a28&to=dca1dc4332b14064cf7f7618de58f2407b52c805&stat=instructions
https://llvm-compile-time-tracker.com/compare.php?from=94d21a94d90db8bc0e983bde672790843f81ddca&to=a908a561c4639f45d29b43dd921fee0b24b42dfb&stat=instructions


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D115261/new/

https://reviews.llvm.org/D115261

Files:
  llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
  llvm/test/Transforms/LoopVectorize/AArch64/extractvalue-no-scalarization-required.ll
  llvm/test/Transforms/LoopVectorize/ARM/pointer_iv.ll
  llvm/test/Transforms/LoopVectorize/ARM/tail-folding-loop-hint.ll
  llvm/test/Transforms/LoopVectorize/X86/already-vectorized.ll
  llvm/test/Transforms/LoopVectorize/X86/float-induction-x86.ll
  llvm/test/Transforms/LoopVectorize/X86/gather_scatter.ll
  llvm/test/Transforms/LoopVectorize/X86/invariant-store-vectorization.ll
  llvm/test/Transforms/LoopVectorize/X86/masked_load_store.ll
  llvm/test/Transforms/LoopVectorize/X86/metadata-enable.ll
  llvm/test/Transforms/LoopVectorize/X86/nontemporal.ll
  llvm/test/Transforms/LoopVectorize/X86/tail_loop_folding.ll
  llvm/test/Transforms/LoopVectorize/X86/uniform_mem_op.ll
  llvm/test/Transforms/LoopVectorize/X86/vect.omp.force.small-tc.ll
  llvm/test/Transforms/LoopVectorize/followup.ll
  llvm/test/Transforms/LoopVectorize/if-pred-non-void.ll
  llvm/test/Transforms/LoopVectorize/induction.ll
  llvm/test/Transforms/LoopVectorize/interleaved-accesses.ll
  llvm/test/Transforms/LoopVectorize/invariant-store-vectorization-2.ll
  llvm/test/Transforms/LoopVectorize/invariant-store-vectorization.ll
  llvm/test/Transforms/LoopVectorize/memdep-fold-tail.ll
  llvm/test/Transforms/LoopVectorize/optsize.ll
  llvm/test/Transforms/LoopVectorize/pointer-select-runtime-checks.ll
  llvm/test/Transforms/LoopVectorize/reduction-with-invariant-store.ll
  llvm/test/Transforms/LoopVectorize/vectorize-once.ll
  llvm/test/Transforms/PhaseOrdering/X86/excessive-unrolling.ll


