[PATCH] D149903: [VPlan] Replace IR based truncateToMinimalBitwidths with VPlan version.

Ayal Zaks via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Sat Oct 21 14:02:58 PDT 2023


Ayal added a comment.

Various comments, also trying to reason about how this patch changes tests.



================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:3396
-    // If the value wasn't vectorized, we must maintain the original scalar
-    // type. The absence of the value from State indicates that it
-    // wasn't vectorized.
----------------
Retain a comment explaining why replicate recipes are not truncated?


================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:3435
-        // considered undefined behavior. So, we can't unconditionally copy
-        // arithmetic wrapping flags to NewI.
-        cast<BinaryOperator>(NewI)->copyIRFlags(I, /*IncludeWrapFlags=*/false);
----------------
Retain this comment regarding dropping wrapping flags?


================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:3450
-        case Instruction::Trunc:
-          NewI = ShrinkOperand(CI->getOperand(0));
-          break;
----------------
A Trunc is handled by shrinking its operand.


================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:3475
-      } else if (isa<LoadInst>(I) || isa<PHINode>(I)) {
-        // Don't do anything with the operands, just extend the result.
-        continue;
----------------
(If nothing is done to the operands, what is the result extended to?)


================
Comment at: llvm/lib/Transforms/Vectorize/VPlan.h:280
-  }
-
   bool hasScalarValue(VPValue *Def, VPIteration Instance) {
----------------
fhahn wrote:
> Ayal wrote:
> > How/Is this removal related?
> The last user of this function has been removed in the patch.
Very well!


================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:883
+    unsigned NewResSizeInBits = MinBWs.lookup(LiveInInst);
+    if (!LiveInInst || !NewResSizeInBits)
+      continue;
----------------
Suffice to ask `if (!NewResSizeInBits)`?
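
E.g. (a sketch; relies on `lookup()` returning a default-constructed 0 for an
absent key - and null is an ordinary, absent key for pointer-keyed maps, as
suggested by the existing code already calling `lookup()` before the null check):

  unsigned NewResSizeInBits = MinBWs.lookup(LiveInInst);
  if (!NewResSizeInBits) // subsumes the !LiveInInst check
    continue;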


================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:887
+    Type *ResTy = LiveInInst->getType();
+    if (!ResTy->isIntegerTy())
+      continue;
----------------
assert "MinBW member must be integer" rather than continue - thereby skipping a MinBW member.


================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:903
+           vp_depth_first_deep(Plan.getEntry()))) {
+    for (VPRecipeBase &R : make_early_inc_range(*VPBB)) {
+      if (auto *Mem = dyn_cast<VPWidenMemoryInstructionRecipe>(&R)) {
----------------
Can skip phi's, none are included in MinBWs.
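
E.g. (a sketch, assuming `VPRecipeBase::isPhi()` covers all phi-like recipes here):

  // Phis have no MinBWs entries, so skip them up front.
  if (R.isPhi())
    continue;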


================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:904
+    for (VPRecipeBase &R : make_early_inc_range(*VPBB)) {
+      if (auto *Mem = dyn_cast<VPWidenMemoryInstructionRecipe>(&R)) {
+#ifndef NDEBUG
----------------
Are any loads included in MinBWs, or is this dead code? Stores of course are irrelevant.


================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:917
+      unsigned NewResSizeInBits = MinBWs.lookup(UI);
+      if (!UI || !NewResSizeInBits)
+        continue;
----------------
Suffice to ask `if (!NewResSizeInBits)`?


================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:926
+      // for replicate recipes in MinBWs. Skip those here, after incrementing
+      // ProcessedRecipes.
+      if (isa<VPReplicateRecipe>(&R))
----------------
Should replicate recipes be handled next to handling widen memory recipes above?
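
Roughly (a sketch of the suggested restructuring; the NDEBUG bookkeeping for
ProcessedRecipes would move along with it):

  // Neither widen-memory nor replicate recipes are shrunk; skip both up front.
  if (isa<VPWidenMemoryInstructionRecipe, VPReplicateRecipe>(&R)) {
    // ... NDEBUG accounting for members of MinBWs ...
    continue;
  }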


================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:930
+      unsigned ResSizeInBits = getTypeSizeInBits(ResultVPV);
+      Type *ResTy = UI->getType();
+      assert(ResTy->isIntegerTy() && "only integer types supported");
----------------
nit: `ResTy` >> `OldResTy`, `ResSizeInBits` >> `OldResSizeInBits`


================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:933
+      if (ResSizeInBits == NewResSizeInBits)
+        continue;
+
----------------
`assert(ResSizeInBits > NewResSizeInBits && "Nothing to shrink?");` here instead of below?
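
I.e. (sketch, using the renaming suggested above):

  if (OldResSizeInBits == NewResSizeInBits)
    continue;
  assert(OldResSizeInBits > NewResSizeInBits && "Nothing to shrink?");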


================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:939
+      // Try to replace wider SExt/ZExts with narrower ones if possible.
+      if (auto *VPC = dyn_cast<VPWidenCastRecipe>(&R)) {
+        unsigned Opc = VPC->getOpcode();
----------------
nit: `VPC` >> `OldExt`, `Opc` >> `OldOpc`?


================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:943
+          assert(ResSizeInBits > NewResSizeInBits && "Nothing to shrink?");
+          // SExt/Zext is redundant - stick with its operand.
+          Instruction::CastOps Opcode = VPC->getOpcode();
----------------
Comment is obsolete here - it dealt with the new type being equal to the operand type, which should result in replacing the SExt/ZExt with its operand, see below.


================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:944
+          // SExt/Zext is redundant - stick with its operand.
+          Instruction::CastOps Opcode = VPC->getOpcode();
+          VPValue *Op = R.getOperand(0);
----------------
?


================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:948
+            Opcode = Instruction::Trunc;
+          auto *C = new VPWidenCastRecipe(Opcode, Op, NewResTy);
+          C->insertBefore(VPC);
----------------
nit: `C` >> `NewCast`?

If getTypeSizeInBits(Op) == NewResSizeInBits, should C be set to Op (w/o inserting it) instead of creating a redundant cast?
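
I.e., something like (sketch; `NewCast`/`OldExt` per the nits above):

  VPValue *NewOp = Op;
  if (getTypeSizeInBits(Op) != NewResSizeInBits) {
    auto *NewCast = new VPWidenCastRecipe(Opcode, Op, NewResTy);
    NewCast->insertBefore(OldExt);
    NewOp = NewCast;
  }
  // ... continue with NewOp instead of unconditionally creating a cast.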


================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:815
+  unsigned ProcessedRecipes = 0;
+  for (VPValue *VPV : Plan.getLiveIns()) {
+    auto *UI = dyn_cast<Instruction>(VPV->getLiveInIRValue());
----------------
Ayal wrote:
> (Future) Thought: wonder if instead of iterating over all live-ins looking to truncate any, it may be better to iterate over MinBWs and check if any are live-ins. Or lookup MinBWs upon construction of a live-in.
Thoughts about the above? Hopefully avoids exposing getLiveIns(), at the expense of holding a mapping between Values and LiveIns, as in LiveOuts.
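
A hypothetical shape of the MinBWs-first direction (`getLiveIn()` is an assumed
lookup, mirroring how LiveOuts are mapped; `shrinkLiveIn()` is a made-up helper):

  for (const auto &[Inst, MinBW] : MinBWs)
    if (VPValue *LiveIn = Plan.getLiveIn(Inst)) // assumed lookup
      shrinkLiveIn(LiveIn, MinBW);              // truncate just this live-in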


================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:849
+      VPValue *ResultVPV = R.getVPSingleValue();
+      auto *UI = cast_or_null<Instruction>(ResultVPV->getUnderlyingValue());
+      auto I = MinBWs.find(UI);
----------------
Ayal wrote:
> (Future) Thought: this is an awkward way of retrieving "the" recipe that corresponds to each member of MinBWs - look through all recipes for those having the desired "underlying" insn. Perhaps better lookup MinBWs upon construction of a recipe for an Instruction.
> Or migrate the analysis that builds MinBWs to run on VPlan.
Thoughts about the above?


================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:871
+        default:
+          break;
+        case Instruction::SExt:
----------------
fhahn wrote:
> Ayal wrote:
> > This deals only with ZExt/SExt, easier to check directly if Opcode is one or the other?
> > 
> > OTOH, better handle Trunc here as well? Is it handled well below?
> Thanks, changed to `if`. I don't think Trunc is handled explicitly in the latest version.
Does Trunc (which can truncate to a smaller bitwidth) implicitly fall through and have its operand shrunk to the smaller bitwidth, effectively turning it into a ZExt?


================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:896
+        unsigned OpSizeInBits = GetSizeInBits(Op);
+        if (OpSizeInBits == NewResSizeInBits)
+          continue;
----------------
fhahn wrote:
> Ayal wrote:
> > This means the size of all operands is equal to NewResSizeInBits, can this be? 
> There are cases where a ZExt narrowed earlier is used as an operand here, so the type is already adjusted.
Maybe worth a comment.


================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:790
+          if (GetSizeInBits(R.getOperand(0)) >= NewResSizeInBits)
+            break;
+          auto *C = new VPWidenCastRecipe(cast<CastInst>(UI)->getOpcode(),
----------------
fhahn wrote:
> Ayal wrote:
> > fhahn wrote:
> > > Ayal wrote:
> > > > OK, operand < ResTy due to SExt/ZExt,
> > > > and NewResTy < ResTy due to MinBW.
> > > > NewResTy == ResTy cases should arguably be excluded from MinBWs? (independent of this patch)
> > > > Now if operand < NewResTy (< ResTy) then we SExt/ZExt the operand directly to NewResTy instead, and `continue` - why is the "Extend result to original width" part skipped in this case?
> > > > If OTOH operand > NewResTy a Trunc is needed rather than an Extend, and provided by subsequent code which is reached by `break`, followed by ZExt back to ResTy.
> > > > Otherwise if operand == NewResTy, the SExt/ZExt could be dropped, but we keep it and end up generating a redundant ZExt from R to ResTy - which have same sizes? It's probably ok because the knowledge that NewResTy bits suffice is already there, but would be good to clarify/clean up.
> > > > Now if operand < NewResTy (< ResTy) then we SExt/ZExt the operand directly to NewResTy instead, and continue - why is the "Extend result to original width" part skipped in this case?
> > > 
> > > In that case, the original (wider) cast is replaced by a new (narrower) cast and there's no need to truncate.
> > > 
> > > > If OTOH operand > NewResTy a Trunc is needed rather than an Extend, and provided by subsequent code which is reached by break, followed by ZExt back to ResTy.
> > > 
> > > Yep.
> > > 
> > > > Otherwise if operand == NewResTy, the SExt/ZExt could be dropped, but we keep it and end up generating a redundant ZExt from R to ResTy - which have same sizes? It's probably ok because the knowledge that NewResTy bits suffice is already there, but would be good to clarify/clean up.
> > > 
> > > Yes we would at the moment generate redundant extend/trunc chains, which would indeed be good to clean up. I think we could fold those as follow-up.
> > > 
> > > 
> > >> Now if operand < NewResTy (< ResTy) then we SExt/ZExt the operand directly to NewResTy instead, and continue - why is the "Extend result to original width" part skipped in this case?
> > 
> > > In that case, the original (wider) cast is replaced by a new (narrower) cast and there's no need to truncate.
> > 
> > Yes, the extend-to-Res is replaced by a narrower extend-to-NewRes, but w/o another extend-back-to-Res to provide the original width, might it feed a user, say, a binary operation with mismatched size operands - where the other operand can also shrink to NewRes (as guaranteed by MinBWs) but was extended-back-to-Res? I.e., should all shrunk values extend back to Res, or none of them? May need better test coverage.
> Hm I am not sure, but if MinBWs is set to a specific bit width, wouldn't this require all users to have the same minimal bit width for the value?
Agreed - MinBW should specify a consistent minimal bit width for all users, and for all operands, but there seems to be some discrepancy that is confusing:

A. Instructions whose operands and return value are all of a single type (excluding the condition operand of selects) are converted to operate on a narrower type by (a) shrinking their operands to the narrower type and (b) extending their result from the narrower type to their original type. Instructions that feed values to such instructions or use their values continue to feed and use values of the original type.
For a pair of such instructions where one feeds the other, a zext-trunc pair will be inserted between them, to be folded later.

B. Instructions that convert between two distinct types continue to digest the original source type but are updated to produce values of the new destination type. Their users, when reached subsequently, need to check if any of their operands have been narrowed. But if this is the case, why bother extending results in (b) above? OTOH, the narrowed results of conversion instructions can also be extended (to be folded later), keeping the treatment consistent? Always expecting the new type to be strictly smaller than the current one. Perhaps conversion instructions could be skipped now and handled by a subsequent folding pass - looking for trunc-trunc and sext-trunc pairs in addition to zext-trunc ones?

C. Loads are ignored - excluded from MinBWs? They could potentially be narrowed to load only the required bits, though it's unclear if a strided narrow load is better than a unit-strided wider load and trunc - as in an interleave group(?)

D. Phis are ignored - excluded from MinBWs. Truncated header induction phi's are handled separately. Other phi's may deserve narrowing(?)
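
For concreteness, the (A) treatment amounts to roughly the following VPlan
surgery (a sketch following my reading of the patch, with the renamings
suggested above; the actual code differs in details):

  // (a) Shrink each still-wide operand to the narrow type.
  for (unsigned Idx = 0; Idx != R.getNumOperands(); ++Idx) {
    VPValue *Op = R.getOperand(Idx);
    if (getTypeSizeInBits(Op) == NewResSizeInBits)
      continue; // e.g., an operand narrowed earlier
    auto *Shrunk = new VPWidenCastRecipe(Instruction::Trunc, Op, NewResTy);
    Shrunk->insertBefore(&R);
    R.setOperand(Idx, Shrunk);
  }
  // (b) Extend the narrow result back to the original type for all users.
  auto *Ext = new VPWidenCastRecipe(Instruction::ZExt, ResultVPV, OldResTy);
  Ext->insertAfter(&R);
  ResultVPV->replaceAllUsesWith(Ext);
  Ext->setOperand(0, ResultVPV); // undo the RAUW on the extend itself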


================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:760
+    VPlan &Plan, const MapVector<Instruction *, uint64_t> &MinBWs) {
+  auto GetType = [](VPValue *Op) {
+    auto *UV = Op->getUnderlyingValue();
----------------
fhahn wrote:
> Ayal wrote:
> > nit: can return the type size in bits, as that is what is needed here. Op >> VPV?
> > 
> > Thought: worth introducing as a member of VPValue, to be overridden by VPWidenCastRecipe? Note that this is Element/Scalar Type. 
> Adjusted to return size in bits to simplify code, thanks!
> 
> > Thought: worth introducing as a member of VPValue, to be overridden by VPWidenCastRecipe? Note that this is Element/Scalar Type.
> 
> Effectively adding scalar type info to all VPValues? Might be good to investigate separately, although the current use-cases would probably be very limited

Very well.


================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp:785
+
+      if (!isa<VPWidenRecipe, VPWidenSelectRecipe, VPWidenCastRecipe>(&R))
+        continue;
----------------
fhahn wrote:
> Ayal wrote:
> > nit: this can be checked first, instead of checking for single defined value.
> > 
> > Thought: could/should each MinBW be attached to its recipe asap - when the latter is created, considering it depends on associated underlying instruction?
> Moved the check up, thanks!
> 
> > Thought: could/should each MinBW be attached to its recipe asap - when the latter is created, considering it depends on associated underlying instruction?
> 
> Might be a potential follow-up, but we would still potentially need to update MinBWs on each recipe replacement?

Sure, like updating any other property of a recipe when replaced.


================
Comment at: llvm/lib/Transforms/Vectorize/VPlanTransforms.h:72
+  /// Insert truncates and extends for any truncated instructions as hints to
+  /// InstCombine.
+  static void
----------------
Ayal wrote:
> nit: a VPlan transform should fold redundant ZExt-Trunc pairs rather than leaving them ("as hints") to `InstCombine`.
> 
> Being a public method, which does not need SE, should the caller of optimize() precede its call with a direct call to truncateToMinimalBitwidths(), rather than pass MinBWs to optimize()?
Thoughts on the above?
Better to truncate to minimal bitwidths asap, as it relies on IR information? Conceptually a scalar transform.
Does "as hints to InstCombine" below still hold?


================
Comment at: llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll:41
 ; CHECK-NEXT:    [[WIDE_LOAD3:%.*]] = load <16 x i8>, ptr [[TMP8]], align 1
 ; CHECK-NEXT:    [[TMP9:%.*]] = zext <16 x i8> [[WIDE_LOAD3]] to <16 x i16>
+; CHECK-NEXT:    [[TMP10:%.*]] = mul nuw <16 x i16> [[TMP9]], [[TMP2]]
----------------
Hmm, we now spot and remove the redundant duplicate zext of WIDE_LOAD from <16 x i8> to <16 x i16>, which originally appeared as both TMP4 and TMP10.


================
Comment at: llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll:74
+; CHECK-NEXT:    [[WIDE_LOAD10:%.*]] = load <8 x i8>, ptr [[TMP21]], align 1
+; CHECK-NEXT:    [[TMP22:%.*]] = zext <8 x i8> [[WIDE_LOAD10]] to <8 x i16>
+; CHECK-NEXT:    [[TMP23:%.*]] = mul nuw <8 x i16> [[TMP22]], [[TMP15]]
----------------
Spotted and removed duplicate zext of WIDE_LOAD8.


================
Comment at: llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll:159
 ; CHECK-NEXT:  iter.check:
+; CHECK-NEXT:    [[CONV10:%.*]] = zext i16 [[B]] to i32
 ; CHECK-NEXT:    br i1 false, label [[VEC_EPILOG_SCALAR_PH:%.*]], label [[VECTOR_MAIN_LOOP_ITER_CHECK:%.*]]
----------------
This testcase stores the 2nd least significant byte of a 32b product (of two invariant values, one 16b and the other 32b), checking that computing a 16b product suffices. But more optimizations should take place: the expansion of the multiplicands to 32b should be eliminated (along with their truncation back to 16b), and the invariant multiplication-lshr-trunc sequence should be hoisted out of the loop.


================
Comment at: llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll:167
+; CHECK-NEXT:    [[TMP0:%.*]] = trunc <16 x i32> [[BROADCAST_SPLAT]] to <16 x i16>
+; CHECK-NEXT:    [[TMP1:%.*]] = trunc <16 x i32> [[BROADCAST_SPLAT]] to <16 x i16>
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <16 x i32> poison, i32 [[A]], i64 0
----------------
BROADCAST_SPLAT is (still) trunc'ed twice due to UF=2?


================
Comment at: llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll:168
+; CHECK-NEXT:    [[TMP1:%.*]] = trunc <16 x i32> [[BROADCAST_SPLAT]] to <16 x i16>
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <16 x i32> poison, i32 [[A]], i64 0
+; CHECK-NEXT:    [[BROADCAST_SPLAT2:%.*]] = shufflevector <16 x i32> [[BROADCAST_SPLATINSERT1]], <16 x i32> poison, <16 x i32> zeroinitializer
----------------
Both insertelement's now use poison.


================
Comment at: llvm/test/Transforms/LoopVectorize/AArch64/deterministic-type-shrinkage.ll:174
+; CHECK-NEXT:    [[TMP2:%.*]] = trunc <16 x i32> [[BROADCAST_SPLAT2]] to <16 x i16>
+; CHECK-NEXT:    [[TMP3:%.*]] = trunc <16 x i32> [[BROADCAST_SPLAT2]] to <16 x i16>
+; CHECK-NEXT:    [[TMP4:%.*]] = mul <16 x i16> [[TMP2]], [[TMP0]]
----------------
BROADCAST_SPLAT2 is (still) trunc'ed twice due to UF=2?


================
Comment at: llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll:308
 ; CHECK-NEXT:    [[TMP3:%.*]] = getelementptr inbounds i8, ptr [[TMP2]], i32 0
 ; CHECK-NEXT:    [[WIDE_LOAD:%.*]] = load <16 x i8>, ptr [[TMP3]], align 1
 ; CHECK-NEXT:    [[TMP4:%.*]] = zext <16 x i8> [[WIDE_LOAD]] to <16 x i16>
----------------
We now fold a trunc-zext of zext'ed WIDE_LOAD from <16 x i16> => <16 x i32> => <16 x i16>,
but fail to fold a similar one following the add-2's?


================
Comment at: llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll:338
+; CHECK-NEXT:    [[TMP14:%.*]] = zext <8 x i8> [[WIDE_LOAD6]] to <8 x i16>
+; CHECK-NEXT:    [[TMP15:%.*]] = add <8 x i16> [[TMP14]], <i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2, i16 2>
+; CHECK-NEXT:    [[TMP16:%.*]] = zext <8 x i16> [[TMP15]] to <8 x i32>
----------------
We now get rid of a pair of <8 x i16> => <8 x i32> => <8 x i16> before the add-2's (so this is not an NFC patch), but still retain the pair of <8 x i16> => <8 x i32> => <8 x i16> after it - missed MinBW/trunc-zext opportunity?


================
Comment at: llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll:484
 ; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <16 x i32> poison, i32 [[CONV13]], i64 0
-; CHECK-NEXT:    [[TMP1:%.*]] = trunc <16 x i32> [[BROADCAST_SPLATINSERT]] to <16 x i8>
-; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <16 x i8> [[TMP1]], <16 x i8> poison, <16 x i32> zeroinitializer
-; CHECK-NEXT:    [[TMP2:%.*]] = zext <16 x i8> [[BROADCAST_SPLAT]] to <16 x i32>
+; CHECK-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <16 x i32> [[BROADCAST_SPLATINSERT]], <16 x i32> poison, <16 x i32> zeroinitializer
+; CHECK-NEXT:    [[TMP1:%.*]] = trunc <16 x i32> [[BROADCAST_SPLAT]] to <16 x i8>
----------------
Hmm, before, we narrowed these two shufflevectors to operate on <16 x i8> and zext-trunc'ed their result; now we let them operate on the original <16 x i32> and truncate the result?


================
Comment at: llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll:498
+; CHECK-NEXT:    [[TMP7:%.*]] = zext <16 x i8> [[TMP6]] to <16 x i32>
+; CHECK-NEXT:    [[TMP8:%.*]] = trunc <16 x i32> [[TMP7]] to <16 x i8>
+; CHECK-NEXT:    [[TMP9:%.*]] = add <16 x i8> [[TMP8]], <i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32, i8 32>
----------------
Many zext-trunc pairs left to fold.


================
Comment at: llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll:513
-; CHECK-NEXT:    [[TMP19:%.*]] = trunc <16 x i32> [[TMP13]] to <16 x i8>
-; CHECK-NEXT:    [[TMP20:%.*]] = trunc <16 x i32> [[TMP2]] to <16 x i8>
-; CHECK-NEXT:    [[TMP21:%.*]] = and <16 x i8> [[TMP19]], [[TMP20]]
----------------
Above trunc of TMP2 is redundant along with its zext in the ph.


================
Comment at: llvm/test/Transforms/LoopVectorize/AArch64/loop-vectorization-factors.ll:520
-; CHECK-NEXT:    [[TMP26:%.*]] = trunc <16 x i32> [[TMP25]] to <16 x i8>
-; CHECK-NEXT:    [[TMP27:%.*]] = trunc <16 x i32> [[TMP4]] to <16 x i8>
-; CHECK-NEXT:    [[TMP28:%.*]] = xor <16 x i8> [[TMP26]], [[TMP27]]
----------------
Above trunc of TMP4 is redundant along with its zext in the ph.


================
Comment at: llvm/test/Transforms/LoopVectorize/trunc-shifts.ll:334
+; CHECK-NEXT:    [[TMP7:%.*]] = trunc <4 x i32> [[TMP6]] to <4 x i16>
+; CHECK-NEXT:    [[TMP8:%.*]] = trunc <4 x i16> [[TMP7]] to <4 x i8>
+; CHECK-NEXT:    store <4 x i8> [[TMP8]], ptr [[TMP3]], align 8
----------------
We now get rid of a pair of <4 x i16> => <4 x i32> => <4 x i16> before the lshr (so this is not an NFC patch), but still retain the pair/triple of <4 x i16> => <4 x i32> => <4 x i16> => <4 x i8> after it - missed MinBW opportunity?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D149903/new/

https://reviews.llvm.org/D149903


