[PATCH] D99750: [LV, VP]VP intrinsics support for the Loop Vectorizer

Mon Sep 25 08:17:44 PDT 2023

ABataev added inline comments.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:8112
   // When not folding the tail, use nullptr to model all-true mask.
-  if (!CM.foldTailByMasking()) {
+  if (!CM.foldTailByMasking() || CM.useVPIVectorization()) {
     BlockMaskCache[Header] = nullptr;
----------------
fhahn wrote:
> ABataev wrote:
> > fhahn wrote:
> > > Better to replace the mask together with introducing EVL to make sure EVL gets added when the mask gets removed?
> > Currently it will require some extra work. We'll need to handle both cases, with active-lane intrinsics and direct comparison. Would it be possible to keep it for now and fix it once you land emission of the active-lane intrinsic in the VPlan-to-VPlan transform?
> With the latest version, can the `useVPWithVPEVLVectorization` part be dropped (if the transform is updated to remove the mask from load/stores)?
Not quite, it will require an extra VPValue, something like VPAllTrueMask, which should replace IV <= BTC. Shall I add it?
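
For illustration only (a standalone sketch, not code from this patch): the snippet below models why an all-true mask combined with an explicit vector length can stand in for the header mask IV <= BTC. The helper names are hypothetical, EVL is assumed to be min(VF, TC - IV), and integer overflow is ignored.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper: lanes enabled by the tail-folding header mask,
// i.e. lane I is active when (IV + I) <= BTC.
std::vector<bool> headerMaskLanes(uint64_t IV, uint64_t BTC, unsigned VF) {
  std::vector<bool> Lanes(VF);
  for (unsigned I = 0; I < VF; ++I)
    Lanes[I] = (IV + I) <= BTC;
  return Lanes;
}

// Hypothetical helper: lanes enabled by an all-true mask plus an explicit
// vector length. Assumption: EVL = min(VF, TC - IV).
std::vector<bool> evlLanes(uint64_t IV, uint64_t TC, unsigned VF) {
  uint64_t EVL = std::min<uint64_t>(VF, TC - IV);
  std::vector<bool> Lanes(VF);
  for (unsigned I = 0; I < VF; ++I)
    Lanes[I] = I < EVL;
  return Lanes;
}

// Walks the vector loop with a constant step of VF and checks that both
// schemes enable exactly the same lanes in every iteration.
bool sameLanesForWholeLoop(uint64_t TC, unsigned VF) {
  if (TC == 0)
    return true;
  uint64_t BTC = TC - 1; // backedge-taken count
  for (uint64_t IV = 0; IV < TC; IV += VF)
    if (headerMaskLanes(IV, BTC, VF) != evlLanes(IV, TC, VF))
      return false;
  return true;
}
```

Under these assumptions the two lane sets agree in every iteration, which is the property a VPAllTrueMask-style value would rely on once EVL is in place.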

================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:832
+      continue;
+    auto *NewInst =
+        new VPInstruction(VPInstruction::ExplicitVectorLengthIVIncrement,
----------------
fhahn wrote:
> ABataev wrote:
> > fhahn wrote:
> > > I think turning the step of the canonical induction non-loop-invariant technically turns the canonical IV into a phi that's not a canonical IV any more (which is guaranteed to step the same amount each iteration). Would it work to keep the increment unchanged and keep rounding up the trip count as with regular tail folding initially? Further down the line, the canonical IV issue may be resolved by also replacing the canonical IV node with a regular scalar phi when doing the replacement here.
> > I'll try to improve this.
> Did you get a chance to try this out yet? 
> 
> 97687b7aea17 landed, it would probably be good to also remove the header mask from load/store recipes here, to make clear that this optimizes the tail-folded loop?
Already did. The loop is countable, adding a new phi won't give anything, just some extra work without any effect.
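
For illustration only (a standalone sketch, not code from this patch): the two stepping schemes discussed in this thread can be modelled as below. The function names are hypothetical, and EVL is assumed to be min(VF, remaining iterations); a real target may choose EVL differently.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Regular tail folding: the IV steps by a constant VF and the loop runs
// until the trip count rounded up to a multiple of VF.
std::vector<uint64_t> constantSteps(uint64_t TC, unsigned VF) {
  uint64_t RoundedTC = (TC + VF - 1) / VF * VF;
  std::vector<uint64_t> Steps;
  for (uint64_t IV = 0; IV < RoundedTC; IV += VF)
    Steps.push_back(VF);
  return Steps;
}

// EVL-based increment: the IV steps by the EVL of each iteration, so the
// step is no longer loop-invariant. Assumption: EVL = min(VF, TC - IV).
std::vector<uint64_t> evlSteps(uint64_t TC, unsigned VF) {
  std::vector<uint64_t> Steps;
  for (uint64_t IV = 0; IV < TC;) {
    uint64_t EVL = std::min<uint64_t>(VF, TC - IV);
    Steps.push_back(EVL);
    IV += EVL;
  }
  return Steps;
}
```

With TC = 10 and VF = 4, the constant scheme steps by 4 three times and ends at 12, while the EVL scheme steps by 4, 4, 2 and ends exactly at 10. Both run the same number of vector iterations in this model.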

================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.h:70

+  /// Replace (ICMP_ULE, wide canonical IV, backedge-taken-count) checks with a
+  /// Vector Predicated instructions.
----------------
fhahn wrote:
> It looks like the current implementation needs to be updated to actually replace the checks. It also adjusts the induction increment; it would be good to check if that is actually needed in the initial version, as per comments elsewhere.
1. Yes, need to add the VPAllTrueMask VPValue.
2. It is required!

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D99750/new/

https://reviews.llvm.org/D99750