[PATCH] D149903: [VPlan] Replace IR based truncateToMinimalBitwidths with VPlan version.

Florian Hahn via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Nov 16 14:15:43 PST 2023


fhahn added inline comments.


================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:3396
-    // If the value wasn't vectorized, we must maintain the original scalar
-    // type. The absence of the value from State indicates that it
-    // wasn't vectorized.
----------------
Ayal wrote:
> Retain a comment explaining why replicate recipes are not truncated?
Retained when skipping VPReplicateRecipe.
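A minimal sketch of where the comment now lives (shape assumed from the current diff, not the exact code):

    if (isa<VPReplicateRecipe>(&R)) {
      // Sketch: if the value wasn't vectorized, we must maintain the
      // original scalar type; replicate recipes keep producing values of
      // that type, so they are not truncated.
      continue;
    }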


================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:3435
-        // considered undefined behavior. So, we can't unconditionally copy
-        // arithmetic wrapping flags to NewI.
-        cast<BinaryOperator>(NewI)->copyIRFlags(I, /*IncludeWrapFlags=*/false);
----------------
Ayal wrote:
> Retain this comment regarding dropping wrapping flags?
Done, thanks!


================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:3475
-      } else if (isa<LoadInst>(I) || isa<PHINode>(I)) {
-        // Don't do anything with the operands, just extend the result.
-        continue;
----------------
Ayal wrote:
> (If nothing is done to the operands, what is the result extended to?)
It stays the same; there's no extend in that case.


================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:883
+    unsigned NewResSizeInBits = MinBWs.lookup(LiveInInst);
+    if (!LiveInInst || !NewResSizeInBits)
+      continue;
----------------
Ayal wrote:
> Suffice to ask `if (!NewResSizeInBits)`?
This code has now been removed; LiveIns are now handled when truncating the other operands of an instruction, since handling them separately would leave the type info in an inconsistent state.
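Roughly, when a recipe is shrunk, its live-in operands are truncated in the same pass as the recipe-defined ones; a sketch, with NewResTy and the loop shape assumed from the patch:

    for (unsigned Idx = 0; Idx != R.getNumOperands(); ++Idx) {
      VPValue *Op = R.getOperand(Idx);
      // Sketch: truncate every wide operand, live-ins included, so the
      // shrunk recipe sees consistent types.
      auto *Shrunk = new VPWidenCastRecipe(Instruction::Trunc, Op, NewResTy);
      Shrunk->insertBefore(&R);
      R.setOperand(Idx, Shrunk);
    }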


================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:887
+    Type *ResTy = LiveInInst->getType();
+    if (!ResTy->isIntegerTy())
+      continue;
----------------
Ayal wrote:
> assert "MinBW member must be integer" rather than continue - thereby skipping a MinBW member.
Turned into an assert, thanks!


================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:903
+           vp_depth_first_deep(Plan.getEntry()))) {
+    for (VPRecipeBase &R : make_early_inc_range(*VPBB)) {
+      if (auto *Mem = dyn_cast<VPWidenMemoryInstructionRecipe>(&R)) {
----------------
Ayal wrote:
> Can skip phi's, none are included in MinBWs.
There's an early continue now that skips phis and other unsupported recipes.
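Something along these lines, with the exact recipe list assumed:

    // Sketch: skip recipes the transform doesn't shrink; phis in
    // particular never appear in MinBWs.
    if (!isa<VPWidenRecipe, VPWidenCastRecipe, VPReplicateRecipe,
             VPWidenSelectRecipe>(&R))
      continue;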


================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:904
+    for (VPRecipeBase &R : make_early_inc_range(*VPBB)) {
+      if (auto *Mem = dyn_cast<VPWidenMemoryInstructionRecipe>(&R)) {
+#ifndef NDEBUG
----------------
Ayal wrote:
> Are any loads included in MinBWs, or is this dead code? Stores of course are irrelevant.
Nope, looks like this is not needed in the latest version.


================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:917
+      unsigned NewResSizeInBits = MinBWs.lookup(UI);
+      if (!UI || !NewResSizeInBits)
+        continue;
----------------
Ayal wrote:
> Suffice to ask `if (!NewResSizeInBits)`?
Simplified, thanks!


================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:926
+      // for replicate recipes in MinBWs. Skip those here, after incrementing
+      // ProcessedRecipes.
+      if (isa<VPReplicateRecipe>(&R))
----------------
Ayal wrote:
> Should replicate recipes be handled next to handling widen memory recipes above?
We still need to count them for verification.
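The count only feeds the debug-build check that no MinBWs entry was missed; a sketch:

    #ifndef NDEBUG
      ++ProcessedRecipes; // counted before skipping replicate recipes
    #endif
      // Sketch: replicate recipes aren't shrunk, but must still count
      // towards the final assert(ProcessedRecipes == MinBWs.size()).
      if (isa<VPReplicateRecipe>(&R))
        continue;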


================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:930
+      unsigned ResSizeInBits = getTypeSizeInBits(ResultVPV);
+      Type *ResTy = UI->getType();
+      assert(ResTy->isIntegerTy() && "only integer types supported");
----------------
Ayal wrote:
> nit: `ResTy` >> `OldResTy`, `ResSizeInBits` >> `OldResSizeInBits`
Renamed, thanks!


================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:933
+      if (ResSizeInBits == NewResSizeInBits)
+        continue;
+
----------------
Ayal wrote:
> `assert(ResSizeInBits > NewResSizeInBits && "Nothing to shrink?");` here instead of below?
Done, and also removed the continue.


================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:939
+      // Try to replace wider SExt/ZExts with narrower ones if possible.
+      if (auto *VPC = dyn_cast<VPWidenCastRecipe>(&R)) {
+        unsigned Opc = VPC->getOpcode();
----------------
Ayal wrote:
> nit: `VPC` >> `OldExt`, `Opc` >> `OldOpc`?
This code is now gone, handled by recipe simplification.


================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:943
+          assert(ResSizeInBits > NewResSizeInBits && "Nothing to shrink?");
+          // SExt/Zext is redundant - stick with its operand.
+          Instruction::CastOps Opcode = VPC->getOpcode();
----------------
Ayal wrote:
> Comment is obsolete here - dealt with new type being equal to operand type, which should result in replacing the SExt/ZExt with its operand, see below.
Code is gone now.


================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:944
+          // SExt/Zext is redundant - stick with its operand.
+          Instruction::CastOps Opcode = VPC->getOpcode();
+          VPValue *Op = R.getOperand(0);
----------------
Ayal wrote:
> ?
Code now gone.


================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:948
+            Opcode = Instruction::Trunc;
+          auto *C = new VPWidenCastRecipe(Opcode, Op, NewResTy);
+          C->insertBefore(VPC);
----------------
Ayal wrote:
> nit: `C` >> `NewCast`?
> 
> If getTypeSizeInBits(Op) == NewResSizeInBits should C be set to Op (w/o inserting it) instead of creating a redundant cast?
Code gone now.


================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:815
+  unsigned ProcessedRecipes = 0;
+  for (VPValue *VPV : Plan.getLiveIns()) {
+    auto *UI = dyn_cast<Instruction>(VPV->getLiveInIRValue());
----------------
Ayal wrote:
> Ayal wrote:
> > (Future) Thought: wonder if instead of iterating over all live-ins looking to truncate any, it may be better to iterate over MinBWs and check if any are live-ins. Or lookup MinBWs upon construction of a live-in.
> Thoughts about the above? Hopefully avoids exposing getLiveIns(), at the expense of holding a mapping between Values and LiveIns, as in LiveOuts.
LiveIns are now handled directly when truncating other operands; getLiveIns has been removed.


================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:849
+      VPValue *ResultVPV = R.getVPSingleValue();
+      auto *UI = cast_or_null<Instruction>(ResultVPV->getUnderlyingValue());
+      auto I = MinBWs.find(UI);
----------------
Ayal wrote:
> Ayal wrote:
> > (Future) Thought: this is an awkward way of retrieving "the" recipe that corresponds to each member of MinBWs - look through all recipes for those having the desired "underlying" insn. Perhaps better lookup MinBWs upon construction of a recipe for an Instruction.
> > Or migrate the analysis that builds MinBWs to run on VPlan.
> Thoughts about the above?
I think it would be best to have the analysis based on VPlan. Building MinBWs early would probably require extra work to update/invalidate it during transforms.


================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:790
+          if (GetSizeInBits(R.getOperand(0)) >= NewResSizeInBits)
+            break;
+          auto *C = new VPWidenCastRecipe(cast<CastInst>(UI)->getOpcode(),
----------------
Ayal wrote:
> fhahn wrote:
> > Ayal wrote:
> > > fhahn wrote:
> > > > Ayal wrote:
> > > > > OK, operand < ResTy due to SExt/ZExt,
> > > > > and NewResTy < ResTy due to MinBW.
> > > > > NewResTy == ResTy cases should arguably be excluded from MinBWs? (independent of this patch)
> > > > > Now if operand < NewResTy (< ResTy) then we SExt/ZExt the operand directly to NewResTy instead, and `continue` - why is the "Extend result to original width" part skipped in this case?
> > > > > If OTOH operand > NewResTy a Trunc is needed rather than an Extend, and provided by subsequent code which is reached by `break`, followed by ZExt back to ResTy.
> > > > > Otherwise if operand == NewResTy, the SExt/ZExt could be dropped, but we keep it and end up generating a redundant ZExt from R to ResTy - which have same sizes? It's probably ok because the knowledge that NewResTy bits suffice is already there, but would be good to clarify/clean up.
> > > > > Now if operand < NewResTy (< ResTy) then we SExt/ZExt the operand directly to NewResTy instead, and continue - why is the "Extend result to original width" part skipped in this case?
> > > > 
> > > > In that case, the original (wider) cast is replaced by a new (narrower) cast and there's no need to truncate.
> > > > 
> > > > > If OTOH operand > NewResTy a Trunc is needed rather than an Extend, and provided by subsequent code which is reached by break, followed by ZExt back to ResTy.
> > > > 
> > > > Yep.
> > > > 
> > > > > Otherwise if operand == NewResTy, the SExt/ZExt could be dropped, but we keep it and end up generating a redundant ZExt from R to ResTy - which have same sizes? It's probably ok because the knowledge that NewResTy bits suffice is already there, but would be good to clarify/clean up.
> > > > 
> > > > Yes we would at the moment generate redundant extend/trunc chains, which would indeed be good to clean up. I think we could fold those as follow-up.
> > > > 
> > > > 
> > > >> Now if operand < NewResTy (< ResTy) then we SExt/ZExt the operand directly to NewResTy instead, and continue - why is the "Extend result to original width" part skipped in this case?
> > > 
> > > > In that case, the original (wider) cast is replaced by a new (narrower) cast and there's no need to truncate.
> > > 
> > > Yes, the extend-to-Res is replaced by a narrower extend-to-NewRes, but w/o another extend-back-to-Res to provide the original width, might it feed a user, say, a binary operation with mismatched size operands - where the other operand can also shrink to NewRes (as guaranteed by MinBWs) but was extended-back-to-Res? I.e., should all shrunks extend-back-to-Res, or none of them? May need better test coverage.
> > Hm, I am not sure, but if MinBWs is set to a specific bit width, wouldn't this require all users to have the same minimal bit width for the value?
> Agreed - MinBW should specify a consistent minimal bit width for all users, and for all operands, but there seems to be some discrepancy that is confusing:
> 
> A. Instructions whose operands and return value are all of a single type (excluding condition operand of selects) are converted to operate on a narrower type by (a) shrinking their operands to the narrower type and (b) extending their result from the narrower type to their original type. Instructions that feed values to such instructions or use their values, continue to feed and use values of the original type.
> A pair of such instructions where one feeds the other will be added a zext-trunc pair between them which will later be folded.
> 
> B. Instructions that convert between two distinct types, continue to digest the original source type but are updated to produce values of the new destination type. Their users, when reached subsequently, need to check if any of their operands have been narrowed. But if this is the case, why bother expanding results in (b) above? OTOH, the narrowed results of conversion instructions can also be expanded (to be folded later), keeping the treatment consistent? Always expecting the new type to be strictly smaller than the current one. Perhaps conversion instructions could be skipped now and handled by subsequent folding pass - looking for trunc-trunc and sext-trunc pairs in addition to zext-trunc ones?
> 
> C. Loads are ignored - excluded from MinBWs? They could potentially be narrowed to load only the required bits, though it's unclear if a strided narrow load is better than a unit-strided wider load and trunc - as in an interleave-group(?)
> 
> D. Phis are ignored - excluded from MinBWs. Truncated header induction phi's are handled separately. Other phi's may deserve narrowing(?)
The latest version doesn't have special treatment for casts, they remain unchanged and VPlan recipe simplification will take care of folding them if possible.
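The fold recipe simplification performs is roughly of this shape (a sketch with assumed names; TypeInfo is a VPTypeAnalysis instance, see the VPTypeAnalysis comment below):

    // Sketch: fold trunc(zext/sext X) -> X when X already has the
    // truncated type.
    if (auto *Trunc = dyn_cast<VPWidenCastRecipe>(&R)) {
      auto *Ext = dyn_cast_or_null<VPWidenCastRecipe>(
          Trunc->getOperand(0)->getDefiningRecipe());
      if (Trunc->getOpcode() == Instruction::Trunc && Ext &&
          (Ext->getOpcode() == Instruction::ZExt ||
           Ext->getOpcode() == Instruction::SExt) &&
          TypeInfo.inferScalarType(Trunc) ==
              TypeInfo.inferScalarType(Ext->getOperand(0)))
        Trunc->replaceAllUsesWith(Ext->getOperand(0));
    }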


================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:760
+    VPlan &Plan, const MapVector<Instruction *, uint64_t> &MinBWs) {
+  auto GetType = [](VPValue *Op) {
+    auto *UV = Op->getUnderlyingValue();
----------------
Ayal wrote:
> fhahn wrote:
> > Ayal wrote:
> > > nit: can return the type size in bits, as that is what is needed here. Op >> VPV?
> > > 
> > > Thought: worth introducing as a member of VPValue, to be overridden by VPWidenCastRecipe? Note that this is Element/Scalar Type. 
> > Adjusted to return size in bits to simplify code, thanks!
> > 
> > > Thought: worth introducing as a member of VPValue, to be overridden by VPWidenCastRecipe? Note that this is Element/Scalar Type.
> > 
> > Effectively adding scalar type info to all VPValues? Might be good to investigate separately, although the current use-cases would probably be very limited
> >> Thought: worth introducing as a member of VPValue, to be overridden by VPWidenCastRecipe? Note that this is Element/Scalar Type.
> > Effectively adding scalar type info to all VPValues? Might be good to investigate separately, although the current use-cases would probably be very limited
> 
> Very well.
This has been updated to now use VPTypeAnalysis.
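With VPTypeAnalysis the size query becomes roughly (the constructor argument is assumed; it has changed across revisions):

    // Sketch: CanonicalIVType stands in for whatever the constructor takes.
    VPTypeAnalysis TypeInfo(CanonicalIVType);
    unsigned OpSizeInBits =
        TypeInfo.inferScalarType(Op)->getScalarSizeInBits();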


================
Comment at: llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll:159
 ; CHECK-NEXT:  iter.check:
+; CHECK-NEXT:    [[CONV10:%.*]] = zext i16 [[B]] to i32
 ; CHECK-NEXT:    br i1 false, label [[VEC_EPILOG_SCALAR_PH:%.*]], label [[VECTOR_MAIN_LOOP_ITER_CHECK:%.*]]
----------------
Ayal wrote:
> This testcase stores the 2nd least significant byte of a 32b product (of two invariant values, one 16b and the other 32b) checking that computing 16b product suffices. But more optimizations should take place: the expansion of the multipliers to 32b should be eliminated (along with their truncation to 16b), and the invariant multiplication-lshr-trunc sequence should be hoisted out of the loop.
Still more work to do :)

Arguably the invariant instructions are artificial; in the regular pipeline, no invariant instructions should remain.


================
Comment at: llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll:167
+; CHECK-NEXT:    [[TMP0:%.*]] = trunc <16 x i32> [[BROADCAST_SPLAT]] to <16 x i16>
+; CHECK-NEXT:    [[TMP1:%.*]] = trunc <16 x i32> [[BROADCAST_SPLAT]] to <16 x i16>
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <16 x i32> poison, i32 [[A]], i64 0
----------------
Ayal wrote:
> BROADCAST_SPLAT is (still) trunc'ed twice due to UF=2?
The latest version avoids truncating the same value twice.
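The deduplication is a per-plan cache keyed on the truncated value; a sketch, with the map name assumed:

    // Sketch: declared once outside the recipe loop, so a single trunc per
    // value is reused instead of each unrolled part emitting its own copy.
    DenseMap<VPValue *, VPWidenCastRecipe *> ProcessedTruncs;

    // Inside the loop, when operand Op needs shrinking to NewResTy:
    VPWidenCastRecipe *&NewOp = ProcessedTruncs[Op];
    if (!NewOp) {
      NewOp = new VPWidenCastRecipe(Instruction::Trunc, Op, NewResTy);
      NewOp->insertBefore(&R);
    }
    R.setOperand(Idx, NewOp);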


================
Comment at: llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll:168
+; CHECK-NEXT:    [[TMP1:%.*]] = trunc <16 x i32> [[BROADCAST_SPLAT]] to <16 x i16>
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <16 x i32> poison, i32 [[A]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT2:%.*]] = shufflevector <16 x i32> [[BROADCAST_SPLATINSERT1]], <16 x i32> poison, <16 x i32> zeroinitializer
----------------
Ayal wrote:
> Both insertelement's now use poison.
I think the use of undef is a leftover that wasn't updated; it should be poison.


================
Comment at: llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll:174
+; CHECK-NEXT:    [[TMP2:%.*]] = trunc <16 x i32> [[BROADCAST_SPLAT2]] to <16 x i16>
+; CHECK-NEXT:    [[TMP3:%.*]] = trunc <16 x i32> [[BROADCAST_SPLAT2]] to <16 x i16>
+; CHECK-NEXT:    [[TMP4:%.*]] = mul <16 x i16> [[TMP2]], [[TMP0]]
----------------
Ayal wrote:
> BROADCAST_SPLAT2 is (still) trunc'ed twice due to UF=2?
The latest version avoids truncating the same value twice.



================
Comment at: llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll:308
 ; CHECK-NEXT:    [[TMP3:%.*]] = getelementptr inbounds i8, ptr [[TMP2]], i32 0
 ; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <16 x i8>, ptr [[TMP3]], align 1
 ; CHECK-NEXT:    [[TMP4:%.*]] = zext <16 x i8> [[WIDE_LOAD]] to <16 x i16>
----------------
Ayal wrote:
> We now fold a trunc-zext of zext'ed WIDE_LOAD from <16 x i16> => <16 x i32> => <16 x i16>,
> but fail to fold a similar one following the add-2's?
Folding now happens entirely in simplifyRecipes, which should handle this now.


================
Comment at: llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll:484
 ; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <16 x i32> poison, i32 [[CONV13]], i64 0
-; CHECK-NEXT:    [[TMP1:%.*]] = trunc <16 x i32> [[BROADCAST_SPLATINSERT]] to <16 x i8>
-; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <16 x i8> [[TMP1]], <16 x i8> poison, <16 x i32> zeroinitializer
-; CHECK-NEXT:    [[TMP2:%.*]] = zext <16 x i8> [[BROADCAST_SPLAT]] to <16 x i32>
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <16 x i32> [[BROADCAST_SPLATINSERT]], <16 x i32> poison, <16 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP1:%.*]] = trunc <16 x i32> [[BROADCAST_SPLAT]] to <16 x i8>
----------------
Ayal wrote:
> Hmm, before we narrowed these two sufflevectors to operate on <16 x i8> and zext-trunc their result, now we let them operate on original <16 x i32> and truncate the result?
I think there's nothing we can do about that; we first need to splat the value when generating code, but InstCombine should take care of that.


================
Comment at: llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll:498
+; CHECK-NEXT:    [[TMP7:%.*]] = zext <16 x i8> [[TMP6]] to <16 x i32>
+; CHECK-NEXT:    [[TMP8:%.*]] = trunc <16 x i32> [[TMP7]] to <16 x i8>
+; CHECK-NEXT:    [[TMP9:%.*]] = add <16 x i8> [[TMP8]], <i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32>
----------------
Ayal wrote:
> Many zext-trunc pairs left to collect.
These should be cleaned up better now.


================
Comment at: llvm/test/Transforms/LoopVectorize/trunc-shifts.ll:334
+; CHECK-NEXT:    [[TMP7:%.*]] = trunc <4 x i32> [[TMP6]] to <4 x i16>
+; CHECK-NEXT:    [[TMP8:%.*]] = trunc <4 x i16> [[TMP7]] to <4 x i8>
+; CHECK-NEXT:    store <4 x i8> [[TMP8]], ptr [[TMP3]], align 8
----------------
Ayal wrote:
> We now get rid of a pair of <4 x i16> => <4 x i32> => <4 x i16> before the lshr (so this is not an NFC patch), but still retain the pair/triple of <4 x i16> => <4 x i32> => <4 x i16> => <4 x i8> after it - missed MinBW opportunity?
Trunc/ext pairs should be cleaned up better in the latest version.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D149903/new/

https://reviews.llvm.org/D149903


