[llvm] r366799 - [TargetLowering] Add SimplifyMultipleUseDemandedBits

Jordan Rupprecht via llvm-commits llvm-commits at lists.llvm.org
Mon Jul 29 06:35:36 PDT 2019


Hi Simon,
Thanks! That appears to at least fix the msan+O2 issue. It'll take a bit
more time to verify the other failure we're seeing, but I'm hopeful.

Anyway, not sure if you still need it, but here's what creduce finished
with over the weekend for the msan + -O2 test case:
$ clang -O2 -fsanitize=memory -c p224-64.c -o p224-64.o
=> Takes ~.2-.5s w/o -fsanitize=memory, ~31s w/ -fsanitize=memory (>100x
slowdown)
=> After rL367171, no difference :)
$ cat p224-64.c
typedef long a[4];
typedef __uint128_t b[7];
long *c;
long d;
__uint128_t e, l;
a n;
void u(b);
void t() {
  a q;
  b r;
  u(r);
  *r = c[2];
  long *f = n;
  __uint128_t *g = r;
  __uint128_t j = g[3];
  l += j << 40;
  f[3] = l;
  {
    __uint128_t *f = r;
    long *g = n;
    f[5] = g[3];
    f[6] = g[3] * g[3];
  }
  {
    j = g[2];
    l = g[3] = g[6] << 40;
    j -= g[6];
    l += j += g[5] << 40;
    f[3] = l;
    __uint128_t *f = r;
    long *g = n;
    f[2] = f[5] = g[3];
    f[6] = g[3] * g[3];
  }
  {
    f = q;
    j = g[2];
    l = g[6] << 40;
    j -= g[6];
    l += j += g[5] << 40;
    f[3] = l;
  }
  for (long i = 0; i < 11; ++i) {
    __uint128_t *f = r;
    long *g = q;
    f[2] = f[5] = g[3];
    f[6] = g[3] * g[3];
    {
      long *f = q;
      __uint128_t *g = r;
      __uint128_t m, j = g[2];
      m = g[3] += g[6] << 40;
      j -= g[6];
      m += j += g[5] << 40;
      f[2] = f[3] = m;
    }
  }
  {
    __uint128_t *f = r;
    long *k = q;
    f[3] = d * c[1] * k[2];
    __uint128_t *s = r;
    e = s[3] << 40;
  }
}
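
In case it helps anyone reading along, here is a minimal sketch of the
and/or/xor look-through rule from the quoted patch below, using plain 64-bit
masks instead of APInt/KnownBits. The names Known, Opcode, isSubsetOf and
lookThrough are made up for illustration and are not LLVM API, and it assumes
C++17 for std::optional:

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stand-in for llvm::KnownBits on a 64-bit scalar.
struct Known {
  uint64_t Zero; // bits known to be 0
  uint64_t One;  // bits known to be 1
};

// Hypothetical opcode enum covering only the three ops the patch handles.
enum class Opcode { And, Or, Xor };

// True if every bit set in A is also set in B (same idea as APInt::isSubsetOf).
static bool isSubsetOf(uint64_t A, uint64_t B) { return (A & ~B) == 0; }

// Returns the index (0 or 1) of the operand that yields the same demanded
// bits as the whole op, or std::nullopt if neither operand can be used alone.
std::optional<int> lookThrough(Opcode Opc, uint64_t Demanded, Known LHS,
                               Known RHS) {
  switch (Opc) {
  case Opcode::And:
    // Each demanded bit is either 0 in LHS or 1 in RHS, so LHS is enough.
    if (isSubsetOf(Demanded, LHS.Zero | RHS.One))
      return 0;
    if (isSubsetOf(Demanded, RHS.Zero | LHS.One))
      return 1;
    break;
  case Opcode::Or:
    // Each demanded bit is either 1 in LHS or 0 in RHS, so LHS is enough.
    if (isSubsetOf(Demanded, LHS.One | RHS.Zero))
      return 0;
    if (isSubsetOf(Demanded, RHS.One | LHS.Zero))
      return 1;
    break;
  case Opcode::Xor:
    // XOR with a known-zero bit leaves the other operand's bit unchanged.
    if (isSubsetOf(Demanded, RHS.Zero))
      return 0;
    if (isSubsetOf(Demanded, LHS.Zero))
      return 1;
    break;
  }
  return std::nullopt;
}
```

For x & 0xFF00 with only bits 8-15 demanded, lookThrough returns 0 (use x
directly); with bits 0-15 demanded it returns std::nullopt, since the mask
still matters for the low byte.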

On Sat, Jul 27, 2019 at 9:44 AM Simon Pilgrim <llvm-dev at redking.me.uk>
wrote:

> Should be fixed at rL367171 - please can you confirm?
> On 27/07/2019 01:54, Jordan Rupprecht wrote:
>
> We're also seeing failures in some tests that use the LLVM API (possibly
> broken by a different commit): they now time out, and when we kill them
> we see this stack trace:
>
>     @     0x55e2fdc55b13        832  llvm::SelectionDAG::computeKnownBits()
>     @     0x55e2fdc55ad2        832  llvm::SelectionDAG::computeKnownBits()
>     @     0x55e2fdc54c35         96  llvm::SelectionDAG::computeKnownBits()
>     @     0x55e2fdc574c5        832  llvm::SelectionDAG::computeKnownBits()
>     @     0x55e2fdc54c35         96  llvm::SelectionDAG::computeKnownBits()
>     @     0x55e2fdc560d2        832  llvm::SelectionDAG::computeKnownBits()
>     @     0x55e2fdc55ad2        832  llvm::SelectionDAG::computeKnownBits()
>     @     0x55e2fdc563fa        832  llvm::SelectionDAG::computeKnownBits()
>     @     0x55e2fdc56df8        832  llvm::SelectionDAG::computeKnownBits()
>     @     0x55e2fdc55d65        832  llvm::SelectionDAG::computeKnownBits()
>     @     0x55e2fdc5643d        832  llvm::SelectionDAG::computeKnownBits()
>     @     0x55e2fdc55ed1        832  llvm::SelectionDAG::computeKnownBits()
>     @     0x55e2fdc56187        832  llvm::SelectionDAG::computeKnownBits()
>     @     0x55e2fdc55d21        832  llvm::SelectionDAG::computeKnownBits()
>     @     0x55e2fdc5643d        832  llvm::SelectionDAG::computeKnownBits()
>     @     0x55e2fdc55a8a        832  llvm::SelectionDAG::computeKnownBits()
>     @     0x55e2fdc55ad2        832  llvm::SelectionDAG::computeKnownBits()
>     @     0x55e2fdc55ad2        832  llvm::SelectionDAG::computeKnownBits()
>     @     0x55e2fdc55ad2        832  llvm::SelectionDAG::computeKnownBits()
>     @     0x55e2fdc55ad2        832  llvm::SelectionDAG::computeKnownBits()
>     @     0x55e2fdc55a8a        832  llvm::SelectionDAG::computeKnownBits()
>     @     0x55e2fdc55a8a        832  llvm::SelectionDAG::computeKnownBits()
>     @     0x55e2fdc563fa        832  llvm::SelectionDAG::computeKnownBits()
>     @     0x55e2fdc55ed1        832  llvm::SelectionDAG::computeKnownBits()
>     @     0x55e2fdc56187        832  llvm::SelectionDAG::computeKnownBits()
>     @     0x55e2fdc55d21        832  llvm::SelectionDAG::computeKnownBits()
>     @     0x55e2fdc5643d        832  llvm::SelectionDAG::computeKnownBits()
>     @     0x55e2fdc55a8a        832  llvm::SelectionDAG::computeKnownBits()
>     @     0x55e2fdc563fa        832  llvm::SelectionDAG::computeKnownBits()
>     @ ... and at least 221 more frames
>
> So it looks like something related to the depth checking is not working
> correctly.
>
> On Fri, Jul 26, 2019 at 3:53 PM Jordan Rupprecht <rupprecht at google.com>
> wrote:
>
>> Sorry for the late follow-up. It looks like this might be causing
>> compiler timeouts with msan & -O2 (& maybe other opts) on this file:
>> https://boringssl.googlesource.com/boringssl/+/2623/crypto/ec/p224-64.c
>>
>> I'm making some progress on a reduced test case now
>>
>> On Tue, Jul 23, 2019 at 5:38 AM Simon Pilgrim via llvm-commits <
>> llvm-commits at lists.llvm.org> wrote:
>>
>>> Author: rksimon
>>> Date: Tue Jul 23 05:39:08 2019
>>> New Revision: 366799
>>>
>>> URL: http://llvm.org/viewvc/llvm-project?rev=366799&view=rev
>>> Log:
>>> [TargetLowering] Add SimplifyMultipleUseDemandedBits
>>>
>>> This patch introduces the DAG version of
>>> SimplifyMultipleUseDemandedBits, which attempts to peek through ops (mainly
>>> and/or/xor so far) that don't contribute to the demandedbits/elts of a node.
>>> This means we can do this even in cases where we have multiple uses of
>>> an op, which normally requires us to demand all bits/elts. The intention
>>> is to remove a similar function - SelectionDAG::GetDemandedBits - once
>>> SimplifyMultipleUseDemandedBits has matured.
>>>
>>> The InstCombine version of SimplifyMultipleUseDemandedBits can constant
>>> fold, which I haven't added here yet, and so far I've only wired this up to
>>> some basic binops (and/or/xor/add/sub/mul) to demonstrate its use.
>>>
>>> We do see a couple of regressions that need to be addressed:
>>>
>>>     AMDGPU unsigned dot product codegen retains an AND mask (for
>>> ZERO_EXTEND) that it previously removed (but otherwise the dotproduct
>>> codegen is a lot better).
>>>
>>>     X86/AVX2 has poor handling of vector
>>> ANY_EXTEND/ANY_EXTEND_VECTOR_INREG - it prematurely gets converted to
>>> ZERO_EXTEND_VECTOR_INREG.
>>>
>>> The code owners have confirmed it's OK for these cases to be fixed up in
>>> future patches.
>>>
>>> Differential Revision: https://reviews.llvm.org/D63281
>>>
>>> Modified:
>>>     llvm/trunk/include/llvm/CodeGen/TargetLowering.h
>>>     llvm/trunk/lib/CodeGen/SelectionDAG/TargetLowering.cpp
>>>     llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp
>>>     llvm/trunk/test/CodeGen/AArch64/bitfield-insert.ll
>>>     llvm/trunk/test/CodeGen/AMDGPU/idot4s.ll
>>>     llvm/trunk/test/CodeGen/AMDGPU/idot4u.ll
>>>     llvm/trunk/test/CodeGen/AMDGPU/idot8s.ll
>>>     llvm/trunk/test/CodeGen/AMDGPU/idot8u.ll
>>>     llvm/trunk/test/CodeGen/AMDGPU/sdiv.ll
>>>     llvm/trunk/test/CodeGen/SystemZ/store_nonbytesized_vecs.ll
>>>     llvm/trunk/test/CodeGen/X86/2012-08-07-CmpISelBug.ll
>>>     llvm/trunk/test/CodeGen/X86/vector-fshl-128.ll
>>>     llvm/trunk/test/CodeGen/X86/vector-reduce-mul-widen.ll
>>>     llvm/trunk/test/CodeGen/X86/vector-reduce-mul.ll
>>>
>>> Modified: llvm/trunk/include/llvm/CodeGen/TargetLowering.h
>>> URL:
>>> http://llvm.org/viewvc/llvm-project/llvm/trunk/include/llvm/CodeGen/TargetLowering.h?rev=366799&r1=366798&r2=366799&view=diff
>>>
>>> ==============================================================================
>>> --- llvm/trunk/include/llvm/CodeGen/TargetLowering.h (original)
>>> +++ llvm/trunk/include/llvm/CodeGen/TargetLowering.h Tue Jul 23 05:39:08
>>> 2019
>>> @@ -3065,6 +3065,14 @@ public:
>>>    bool SimplifyDemandedBits(SDValue Op, const APInt &DemandedMask,
>>>                              DAGCombinerInfo &DCI) const;
>>>
>>> +  /// More limited version of SimplifyDemandedBits that can be used to
>>> "look
>>> +  /// through" ops that don't contribute to the
>>> DemandedBits/DemandedElts -
>>> +  /// bitwise ops etc.
>>> +  SDValue SimplifyMultipleUseDemandedBits(SDValue Op, const APInt
>>> &DemandedBits,
>>> +                                          const APInt &DemandedElts,
>>> +                                          SelectionDAG &DAG,
>>> +                                          unsigned Depth) const;
>>> +
>>>    /// Look at Vector Op. At this point, we know that only the
>>> DemandedElts
>>>    /// elements of the result of Op are ever used downstream.  If we can
>>> use
>>>    /// this information to simplify Op, create a new simplified DAG node
>>> and
>>> @@ -3139,6 +3147,13 @@ public:
>>>                                                   TargetLoweringOpt &TLO,
>>>                                                   unsigned Depth = 0)
>>> const;
>>>
>>> +  /// More limited version of SimplifyDemandedBits that can be used to
>>> "look
>>> +  /// through" ops that don't contribute to the
>>> DemandedBits/DemandedElts -
>>> +  /// bitwise ops etc.
>>> +  virtual SDValue SimplifyMultipleUseDemandedBitsForTargetNode(
>>> +      SDValue Op, const APInt &DemandedBits, const APInt &DemandedElts,
>>> +      SelectionDAG &DAG, unsigned Depth) const;
>>> +
>>>    /// This method returns the constant pool value that will be loaded
>>> by LD.
>>>    /// NOTE: You must check for implicit extensions of the constant by
>>> LD.
>>>    virtual const Constant *getTargetConstantFromLoad(LoadSDNode *LD)
>>> const;
>>>
>>> Modified: llvm/trunk/lib/CodeGen/SelectionDAG/TargetLowering.cpp
>>> URL:
>>> http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/CodeGen/SelectionDAG/TargetLowering.cpp?rev=366799&r1=366798&r2=366799&view=diff
>>>
>>> ==============================================================================
>>> --- llvm/trunk/lib/CodeGen/SelectionDAG/TargetLowering.cpp (original)
>>> +++ llvm/trunk/lib/CodeGen/SelectionDAG/TargetLowering.cpp Tue Jul 23
>>> 05:39:08 2019
>>> @@ -564,6 +564,61 @@ bool TargetLowering::SimplifyDemandedBit
>>>                                AssumeSingleUse);
>>>  }
>>>
>>> +// TODO: Can we merge SelectionDAG::GetDemandedBits into this?
>>> +// TODO: Under what circumstances can we create nodes? BITCAST?
>>> Constant?
>>> +SDValue TargetLowering::SimplifyMultipleUseDemandedBits(
>>> +    SDValue Op, const APInt &DemandedBits, const APInt &DemandedElts,
>>> +    SelectionDAG &DAG, unsigned Depth) const {
>>> +  KnownBits LHSKnown, RHSKnown;
>>> +  switch (Op.getOpcode()) {
>>> +  case ISD::AND: {
>>> +    LHSKnown = DAG.computeKnownBits(Op.getOperand(0), DemandedElts,
>>> Depth + 1);
>>> +    RHSKnown = DAG.computeKnownBits(Op.getOperand(1), DemandedElts,
>>> Depth + 1);
>>> +
>>> +    // If all of the demanded bits are known 1 on one side, return the
>>> other.
>>> +    // These bits cannot contribute to the result of the 'and' in this
>>> +    // context.
>>> +    if (DemandedBits.isSubsetOf(LHSKnown.Zero | RHSKnown.One))
>>> +      return Op.getOperand(0);
>>> +    if (DemandedBits.isSubsetOf(RHSKnown.Zero | LHSKnown.One))
>>> +      return Op.getOperand(1);
>>> +    break;
>>> +  }
>>> +  case ISD::OR: {
>>> +    LHSKnown = DAG.computeKnownBits(Op.getOperand(0), DemandedElts,
>>> Depth + 1);
>>> +    RHSKnown = DAG.computeKnownBits(Op.getOperand(1), DemandedElts,
>>> Depth + 1);
>>> +
>>> +    // If all of the demanded bits are known zero on one side, return
>>> the
>>> +    // other.  These bits cannot contribute to the result of the 'or'
>>> in this
>>> +    // context.
>>> +    if (DemandedBits.isSubsetOf(LHSKnown.One | RHSKnown.Zero))
>>> +      return Op.getOperand(0);
>>> +    if (DemandedBits.isSubsetOf(RHSKnown.One | LHSKnown.Zero))
>>> +      return Op.getOperand(1);
>>> +    break;
>>> +  }
>>> +  case ISD::XOR: {
>>> +    LHSKnown = DAG.computeKnownBits(Op.getOperand(0), DemandedElts,
>>> Depth + 1);
>>> +    RHSKnown = DAG.computeKnownBits(Op.getOperand(1), DemandedElts,
>>> Depth + 1);
>>> +
>>> +    // If all of the demanded bits are known zero on one side, return
>>> the
>>> +    // other.
>>> +    if (DemandedBits.isSubsetOf(RHSKnown.Zero))
>>> +      return Op.getOperand(0);
>>> +    if (DemandedBits.isSubsetOf(LHSKnown.Zero))
>>> +      return Op.getOperand(1);
>>> +    break;
>>> +  }
>>> +  default:
>>> +    if (Op.getOpcode() >= ISD::BUILTIN_OP_END)
>>> +      if (SDValue V = SimplifyMultipleUseDemandedBitsForTargetNode(
>>> +              Op, DemandedBits, DemandedElts, DAG, Depth))
>>> +        return V;
>>> +    break;
>>> +  }
>>> +  return SDValue();
>>> +}
>>> +
>>>  /// Look at Op. At this point, we know that only the
>>> OriginalDemandedBits of the
>>>  /// result of Op are ever used downstream. If we can use this
>>> information to
>>>  /// simplify Op, create a new simplified DAG node and return true,
>>> returning the
>>> @@ -834,6 +889,20 @@ bool TargetLowering::SimplifyDemandedBit
>>>        return true;
>>>      assert(!Known2.hasConflict() && "Bits known to be one AND zero?");
>>>
>>> +    // Attempt to avoid multi-use ops if we don't need anything from
>>> them.
>>> +    if (!DemandedBits.isAllOnesValue() ||
>>> !DemandedElts.isAllOnesValue()) {
>>> +      SDValue DemandedOp0 = SimplifyMultipleUseDemandedBits(
>>> +          Op0, DemandedBits, DemandedElts, TLO.DAG, Depth + 1);
>>> +      SDValue DemandedOp1 = SimplifyMultipleUseDemandedBits(
>>> +          Op1, DemandedBits, DemandedElts, TLO.DAG, Depth + 1);
>>> +      if (DemandedOp0 || DemandedOp1) {
>>> +        Op0 = DemandedOp0 ? DemandedOp0 : Op0;
>>> +        Op1 = DemandedOp1 ? DemandedOp1 : Op1;
>>> +        SDValue NewOp = TLO.DAG.getNode(Op.getOpcode(), dl, VT, Op0,
>>> Op1);
>>> +        return TLO.CombineTo(Op, NewOp);
>>> +      }
>>> +    }
>>> +
>>>      // If all of the demanded bits are known one on one side, return
>>> the other.
>>>      // These bits cannot contribute to the result of the 'and'.
>>>      if (DemandedBits.isSubsetOf(Known2.Zero | Known.One))
>>> @@ -869,6 +938,20 @@ bool TargetLowering::SimplifyDemandedBit
>>>        return true;
>>>      assert(!Known2.hasConflict() && "Bits known to be one AND zero?");
>>>
>>> +    // Attempt to avoid multi-use ops if we don't need anything from
>>> them.
>>> +    if (!DemandedBits.isAllOnesValue() ||
>>> !DemandedElts.isAllOnesValue()) {
>>> +      SDValue DemandedOp0 = SimplifyMultipleUseDemandedBits(
>>> +          Op0, DemandedBits, DemandedElts, TLO.DAG, Depth + 1);
>>> +      SDValue DemandedOp1 = SimplifyMultipleUseDemandedBits(
>>> +          Op1, DemandedBits, DemandedElts, TLO.DAG, Depth + 1);
>>> +      if (DemandedOp0 || DemandedOp1) {
>>> +        Op0 = DemandedOp0 ? DemandedOp0 : Op0;
>>> +        Op1 = DemandedOp1 ? DemandedOp1 : Op1;
>>> +        SDValue NewOp = TLO.DAG.getNode(Op.getOpcode(), dl, VT, Op0,
>>> Op1);
>>> +        return TLO.CombineTo(Op, NewOp);
>>> +      }
>>> +    }
>>> +
>>>      // If all of the demanded bits are known zero on one side, return
>>> the other.
>>>      // These bits cannot contribute to the result of the 'or'.
>>>      if (DemandedBits.isSubsetOf(Known2.One | Known.Zero))
>>> @@ -901,6 +984,20 @@ bool TargetLowering::SimplifyDemandedBit
>>>        return true;
>>>      assert(!Known2.hasConflict() && "Bits known to be one AND zero?");
>>>
>>> +    // Attempt to avoid multi-use ops if we don't need anything from
>>> them.
>>> +    if (!DemandedBits.isAllOnesValue() ||
>>> !DemandedElts.isAllOnesValue()) {
>>> +      SDValue DemandedOp0 = SimplifyMultipleUseDemandedBits(
>>> +          Op0, DemandedBits, DemandedElts, TLO.DAG, Depth + 1);
>>> +      SDValue DemandedOp1 = SimplifyMultipleUseDemandedBits(
>>> +          Op1, DemandedBits, DemandedElts, TLO.DAG, Depth + 1);
>>> +      if (DemandedOp0 || DemandedOp1) {
>>> +        Op0 = DemandedOp0 ? DemandedOp0 : Op0;
>>> +        Op1 = DemandedOp1 ? DemandedOp1 : Op1;
>>> +        SDValue NewOp = TLO.DAG.getNode(Op.getOpcode(), dl, VT, Op0,
>>> Op1);
>>> +        return TLO.CombineTo(Op, NewOp);
>>> +      }
>>> +    }
>>> +
>>>      // If all of the demanded bits are known zero on one side, return
>>> the other.
>>>      // These bits cannot contribute to the result of the 'xor'.
>>>      if (DemandedBits.isSubsetOf(Known.Zero))
>>> @@ -1663,6 +1760,7 @@ bool TargetLowering::SimplifyDemandedBit
>>>      // Add, Sub, and Mul don't demand any bits in positions beyond that
>>>      // of the highest bit demanded of them.
>>>      SDValue Op0 = Op.getOperand(0), Op1 = Op.getOperand(1);
>>> +    SDNodeFlags Flags = Op.getNode()->getFlags();
>>>      unsigned DemandedBitsLZ = DemandedBits.countLeadingZeros();
>>>      APInt LoMask = APInt::getLowBitsSet(BitWidth, BitWidth -
>>> DemandedBitsLZ);
>>>      if (SimplifyDemandedBits(Op0, LoMask, DemandedElts, Known2, TLO,
>>> @@ -1671,7 +1769,6 @@ bool TargetLowering::SimplifyDemandedBit
>>>                               Depth + 1) ||
>>>          // See if the operation should be performed at a smaller bit
>>> width.
>>>          ShrinkDemandedOp(Op, BitWidth, DemandedBits, TLO)) {
>>> -      SDNodeFlags Flags = Op.getNode()->getFlags();
>>>        if (Flags.hasNoSignedWrap() || Flags.hasNoUnsignedWrap()) {
>>>          // Disable the nsw and nuw flags. We can no longer guarantee
>>> that we
>>>          // won't wrap after simplification.
>>> @@ -1684,6 +1781,23 @@ bool TargetLowering::SimplifyDemandedBit
>>>        return true;
>>>      }
>>>
>>> +    // Attempt to avoid multi-use ops if we don't need anything from
>>> them.
>>> +    if (!LoMask.isAllOnesValue() || !DemandedElts.isAllOnesValue()) {
>>> +      SDValue DemandedOp0 = SimplifyMultipleUseDemandedBits(
>>> +          Op0, LoMask, DemandedElts, TLO.DAG, Depth + 1);
>>> +      SDValue DemandedOp1 = SimplifyMultipleUseDemandedBits(
>>> +          Op1, LoMask, DemandedElts, TLO.DAG, Depth + 1);
>>> +      if (DemandedOp0 || DemandedOp1) {
>>> +        Flags.setNoSignedWrap(false);
>>> +        Flags.setNoUnsignedWrap(false);
>>> +        Op0 = DemandedOp0 ? DemandedOp0 : Op0;
>>> +        Op1 = DemandedOp1 ? DemandedOp1 : Op1;
>>> +        SDValue NewOp =
>>> +            TLO.DAG.getNode(Op.getOpcode(), dl, VT, Op0, Op1, Flags);
>>> +        return TLO.CombineTo(Op, NewOp);
>>> +      }
>>> +    }
>>> +
>>>      // If we have a constant operand, we may be able to turn it into -1
>>> if we
>>>      // do not demand the high bits. This can make the constant smaller
>>> to
>>>      // encode, allow more general folding, or match specialized
>>> instruction
>>> @@ -2357,6 +2471,19 @@ bool TargetLowering::SimplifyDemandedBit
>>>    return false;
>>>  }
>>>
>>> +SDValue TargetLowering::SimplifyMultipleUseDemandedBitsForTargetNode(
>>> +    SDValue Op, const APInt &DemandedBits, const APInt &DemandedElts,
>>> +    SelectionDAG &DAG, unsigned Depth) const {
>>> +  assert(
>>> +      (Op.getOpcode() >= ISD::BUILTIN_OP_END ||
>>> +       Op.getOpcode() == ISD::INTRINSIC_WO_CHAIN ||
>>> +       Op.getOpcode() == ISD::INTRINSIC_W_CHAIN ||
>>> +       Op.getOpcode() == ISD::INTRINSIC_VOID) &&
>>> +      "Should use SimplifyMultipleUseDemandedBits if you don't know
>>> whether Op"
>>> +      " is a target node!");
>>> +  return SDValue();
>>> +}
>>> +
>>>  const Constant *TargetLowering::getTargetConstantFromLoad(LoadSDNode*)
>>> const {
>>>    return nullptr;
>>>  }
>>>
>>> Modified: llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp
>>> URL:
>>> http://llvm.org/viewvc/llvm-project/llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp?rev=366799&r1=366798&r2=366799&view=diff
>>>
>>> ==============================================================================
>>> --- llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp (original)
>>> +++ llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp Tue Jul 23
>>> 05:39:08 2019
>>> @@ -4308,6 +4308,7 @@ bool BoUpSLP::BlockScheduling::trySchedu
>>>      resetSchedule();
>>>      initialFillReadyList(ReadyInsts);
>>>    }
>>> +  assert(Bundle && "Failed to find schedule bundle");
>>>
>>>    LLVM_DEBUG(dbgs() << "SLP: try schedule bundle " << *Bundle << " in
>>> block "
>>>                      << BB->getName() << "\n");
>>>
>>> Modified: llvm/trunk/test/CodeGen/AArch64/bitfield-insert.ll
>>> URL:
>>> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/AArch64/bitfield-insert.ll?rev=366799&r1=366798&r2=366799&view=diff
>>>
>>> ==============================================================================
>>> --- llvm/trunk/test/CodeGen/AArch64/bitfield-insert.ll (original)
>>> +++ llvm/trunk/test/CodeGen/AArch64/bitfield-insert.ll Tue Jul 23
>>> 05:39:08 2019
>>> @@ -265,8 +265,7 @@ define void @test_32bit_opnd1_better(i32
>>>  define i32 @test_nouseful_bits(i8 %a, i32 %b) {
>>>  ; CHECK-LABEL: test_nouseful_bits:
>>>  ; CHECK:       // %bb.0:
>>> -; CHECK-NEXT:    mov w8, w0
>>> -; CHECK-NEXT:    bfi w8, w8, #8, #24
>>> +; CHECK-NEXT:    orr w8, w0, w8, lsl #8
>>>  ; CHECK-NEXT:    mov w9, w0
>>>  ; CHECK-NEXT:    bfi w9, w8, #8, #24
>>>  ; CHECK-NEXT:    bfi w0, w9, #8, #24
>>>
>>> Modified: llvm/trunk/test/CodeGen/AMDGPU/idot4s.ll
>>> URL:
>>> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/AMDGPU/idot4s.ll?rev=366799&r1=366798&r2=366799&view=diff
>>>
>>> ==============================================================================
>>> --- llvm/trunk/test/CodeGen/AMDGPU/idot4s.ll (original)
>>> +++ llvm/trunk/test/CodeGen/AMDGPU/idot4s.ll Tue Jul 23 05:39:08 2019
>>> @@ -899,41 +899,28 @@ define amdgpu_kernel void @idot4_acc16_v
>>>  ; GFX7-NEXT:    s_load_dwordx2 s[0:1], s[0:1], 0xd
>>>  ; GFX7-NEXT:    s_mov_b32 s3, 0xf000
>>>  ; GFX7-NEXT:    s_mov_b32 s2, -1
>>> -; GFX7-NEXT:    s_mov_b32 s8, 0xffff
>>>  ; GFX7-NEXT:    s_waitcnt lgkmcnt(0)
>>>  ; GFX7-NEXT:    s_load_dword s4, s[4:5], 0x0
>>>  ; GFX7-NEXT:    buffer_load_ushort v0, off, s[0:3], 0
>>>  ; GFX7-NEXT:    s_load_dword s5, s[6:7], 0x0
>>>  ; GFX7-NEXT:    s_waitcnt lgkmcnt(0)
>>> -; GFX7-NEXT:    s_sext_i32_i8 s6, s4
>>> -; GFX7-NEXT:    s_bfe_i32 s7, s4, 0x80008
>>> -; GFX7-NEXT:    s_sext_i32_i8 s10, s5
>>> +; GFX7-NEXT:    s_ashr_i32 s6, s4, 24
>>> +; GFX7-NEXT:    s_bfe_i32 s7, s4, 0x80010
>>> +; GFX7-NEXT:    s_bfe_i32 s10, s5, 0x80010
>>>  ; GFX7-NEXT:    s_bfe_i32 s11, s5, 0x80008
>>> -; GFX7-NEXT:    s_bfe_i32 s12, s5, 0x80010
>>> -; GFX7-NEXT:    s_ashr_i32 s5, s5, 24
>>> -; GFX7-NEXT:    v_mov_b32_e32 v3, s11
>>> -; GFX7-NEXT:    v_mov_b32_e32 v4, s10
>>> -; GFX7-NEXT:    s_bfe_i32 s9, s4, 0x80010
>>> -; GFX7-NEXT:    v_mov_b32_e32 v2, s12
>>> -; GFX7-NEXT:    s_ashr_i32 s4, s4, 24
>>> +; GFX7-NEXT:    s_ashr_i32 s9, s5, 24
>>> +; GFX7-NEXT:    s_sext_i32_i8 s5, s5
>>> +; GFX7-NEXT:    s_bfe_i32 s8, s4, 0x80008
>>> +; GFX7-NEXT:    s_sext_i32_i8 s4, s4
>>>  ; GFX7-NEXT:    v_mov_b32_e32 v1, s5
>>> -; GFX7-NEXT:    v_mul_i32_i24_e32 v1, s4, v1
>>> -; GFX7-NEXT:    v_mul_i32_i24_e32 v2, s9, v2
>>> -; GFX7-NEXT:    v_mul_i32_i24_e32 v3, s7, v3
>>> -; GFX7-NEXT:    v_mul_i32_i24_e32 v4, s6, v4
>>> -; GFX7-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
>>> -; GFX7-NEXT:    v_and_b32_e32 v2, s8, v2
>>> -; GFX7-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
>>> -; GFX7-NEXT:    v_and_b32_e32 v4, s8, v4
>>> -; GFX7-NEXT:    v_or_b32_e32 v1, v2, v1
>>> -; GFX7-NEXT:    v_or_b32_e32 v2, v4, v3
>>> -; GFX7-NEXT:    v_lshrrev_b32_e32 v3, 16, v2
>>> -; GFX7-NEXT:    v_lshrrev_b32_e32 v4, 16, v1
>>> +; GFX7-NEXT:    v_mov_b32_e32 v2, s11
>>> +; GFX7-NEXT:    v_mov_b32_e32 v3, s10
>>>  ; GFX7-NEXT:    s_waitcnt vmcnt(0)
>>> -; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v0, v2
>>> -; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v3, v0
>>> -; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v0, v1
>>> -; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v4, v0
>>> +; GFX7-NEXT:    v_mad_i32_i24 v0, s4, v1, v0
>>> +; GFX7-NEXT:    v_mad_i32_i24 v0, s8, v2, v0
>>> +; GFX7-NEXT:    v_mad_i32_i24 v0, s7, v3, v0
>>> +; GFX7-NEXT:    v_mov_b32_e32 v1, s9
>>> +; GFX7-NEXT:    v_mad_i32_i24 v0, s6, v1, v0
>>>  ; GFX7-NEXT:    buffer_store_short v0, off, s[0:3], 0
>>>  ; GFX7-NEXT:    s_endpgm
>>>  ;
>>>
>>> Modified: llvm/trunk/test/CodeGen/AMDGPU/idot4u.ll
>>> URL:
>>> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/AMDGPU/idot4u.ll?rev=366799&r1=366798&r2=366799&view=diff
>>>
>>> ==============================================================================
>>> --- llvm/trunk/test/CodeGen/AMDGPU/idot4u.ll (original)
>>> +++ llvm/trunk/test/CodeGen/AMDGPU/idot4u.ll Tue Jul 23 05:39:08 2019
>>> @@ -1802,33 +1802,23 @@ define amdgpu_kernel void @udot4_acc16_v
>>>  ; GFX7-NEXT:    buffer_load_ushort v0, off, s[0:3], 0
>>>  ; GFX7-NEXT:    s_load_dword s5, s[6:7], 0x0
>>>  ; GFX7-NEXT:    s_waitcnt lgkmcnt(0)
>>> -; GFX7-NEXT:    s_and_b32 s11, s4, s8
>>> -; GFX7-NEXT:    s_bfe_u32 s6, s4, 0x80008
>>> -; GFX7-NEXT:    s_bfe_u32 s9, s5, 0x80008
>>> -; GFX7-NEXT:    s_lshr_b32 s10, s5, 24
>>> -; GFX7-NEXT:    s_and_b32 s8, s5, s8
>>> -; GFX7-NEXT:    v_mov_b32_e32 v4, s9
>>> -; GFX7-NEXT:    s_lshr_b32 s7, s4, 24
>>> -; GFX7-NEXT:    v_mov_b32_e32 v2, s10
>>> -; GFX7-NEXT:    s_bfe_u32 s5, s5, 0x80010
>>> -; GFX7-NEXT:    v_mov_b32_e32 v3, s8
>>> -; GFX7-NEXT:    v_mul_u32_u24_e32 v2, s7, v2
>>> -; GFX7-NEXT:    v_mul_u32_u24_e32 v4, s6, v4
>>> -; GFX7-NEXT:    s_bfe_u32 s4, s4, 0x80010
>>> +; GFX7-NEXT:    s_lshr_b32 s6, s4, 24
>>> +; GFX7-NEXT:    s_bfe_u32 s7, s4, 0x80008
>>> +; GFX7-NEXT:    s_bfe_u32 s10, s5, 0x80008
>>> +; GFX7-NEXT:    s_bfe_u32 s12, s5, 0x80010
>>> +; GFX7-NEXT:    s_lshr_b32 s9, s5, 24
>>> +; GFX7-NEXT:    s_and_b32 s5, s5, s8
>>> +; GFX7-NEXT:    s_bfe_u32 s11, s4, 0x80010
>>> +; GFX7-NEXT:    s_and_b32 s4, s4, s8
>>>  ; GFX7-NEXT:    v_mov_b32_e32 v1, s5
>>> -; GFX7-NEXT:    v_mul_u32_u24_e32 v1, s4, v1
>>> -; GFX7-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
>>> -; GFX7-NEXT:    v_mul_u32_u24_e32 v3, s11, v3
>>> -; GFX7-NEXT:    v_lshlrev_b32_e32 v4, 16, v4
>>> -; GFX7-NEXT:    v_or_b32_e32 v1, v1, v2
>>> -; GFX7-NEXT:    v_or_b32_e32 v2, v3, v4
>>> -; GFX7-NEXT:    v_lshrrev_b32_e32 v3, 16, v2
>>> -; GFX7-NEXT:    v_lshrrev_b32_e32 v4, 16, v1
>>> +; GFX7-NEXT:    v_mov_b32_e32 v2, s10
>>> +; GFX7-NEXT:    v_mov_b32_e32 v3, s12
>>>  ; GFX7-NEXT:    s_waitcnt vmcnt(0)
>>> -; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v0, v2
>>> -; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v3, v0
>>> -; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v0, v1
>>> -; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v4, v0
>>> +; GFX7-NEXT:    v_mad_u32_u24 v0, s4, v1, v0
>>> +; GFX7-NEXT:    v_mad_u32_u24 v0, s7, v2, v0
>>> +; GFX7-NEXT:    v_mad_u32_u24 v0, s11, v3, v0
>>> +; GFX7-NEXT:    v_mov_b32_e32 v1, s9
>>> +; GFX7-NEXT:    v_mad_u32_u24 v0, s6, v1, v0
>>>  ; GFX7-NEXT:    buffer_store_short v0, off, s[0:3], 0
>>>  ; GFX7-NEXT:    s_endpgm
>>>  ;
>>> @@ -2023,23 +2013,23 @@ define amdgpu_kernel void @udot4_acc8_ve
>>>  ; GFX7-NEXT:    v_mul_u32_u24_e32 v1, s9, v1
>>>  ; GFX7-NEXT:    v_mul_u32_u24_e32 v2, s7, v2
>>>  ; GFX7-NEXT:    v_mul_u32_u24_e32 v3, s6, v3
>>> -; GFX7-NEXT:    s_and_b32 s4, s4, s8
>>> +; GFX7-NEXT:    s_and_b32 s5, s4, s8
>>>  ; GFX7-NEXT:    v_lshlrev_b32_e32 v1, 8, v1
>>>  ; GFX7-NEXT:    v_and_b32_e32 v2, s8, v2
>>>  ; GFX7-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
>>>  ; GFX7-NEXT:    v_or_b32_e32 v1, v2, v1
>>> -; GFX7-NEXT:    v_or_b32_e32 v2, s4, v3
>>> +; GFX7-NEXT:    v_or_b32_e32 v2, s5, v3
>>>  ; GFX7-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
>>>  ; GFX7-NEXT:    v_and_b32_e32 v2, 0xffff, v2
>>>  ; GFX7-NEXT:    v_or_b32_e32 v1, v2, v1
>>>  ; GFX7-NEXT:    v_lshrrev_b32_e32 v2, 8, v1
>>>  ; GFX7-NEXT:    v_lshrrev_b32_e32 v3, 16, v1
>>> -; GFX7-NEXT:    v_lshrrev_b32_e32 v4, 24, v1
>>> +; GFX7-NEXT:    v_lshrrev_b32_e32 v1, 24, v1
>>>  ; GFX7-NEXT:    s_waitcnt vmcnt(0)
>>> +; GFX7-NEXT:    v_add_i32_e32 v0, vcc, s4, v0
>>> +; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v0, v2
>>> +; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v0, v3
>>>  ; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v0, v1
>>> -; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v2, v0
>>> -; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v3, v0
>>> -; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v4, v0
>>>  ; GFX7-NEXT:    buffer_store_byte v0, off, s[0:3], 0
>>>  ; GFX7-NEXT:    s_endpgm
>>>  ;
>>> @@ -2055,31 +2045,32 @@ define amdgpu_kernel void @udot4_acc8_ve
>>>  ; GFX8-NEXT:    s_load_dword s0, s[4:5], 0x0
>>>  ; GFX8-NEXT:    s_load_dword s1, s[6:7], 0x0
>>>  ; GFX8-NEXT:    s_waitcnt lgkmcnt(0)
>>> +; GFX8-NEXT:    v_mov_b32_e32 v3, s0
>>> +; GFX8-NEXT:    v_mov_b32_e32 v4, s1
>>> +; GFX8-NEXT:    s_and_b32 s7, s1, s8
>>>  ; GFX8-NEXT:    s_lshr_b32 s2, s0, 24
>>>  ; GFX8-NEXT:    s_lshr_b32 s3, s1, 24
>>>  ; GFX8-NEXT:    s_bfe_u32 s6, s1, 0x80010
>>> -; GFX8-NEXT:    s_and_b32 s7, s1, s8
>>> -; GFX8-NEXT:    v_mov_b32_e32 v3, s0
>>> -; GFX8-NEXT:    v_mov_b32_e32 v4, s1
>>>  ; GFX8-NEXT:    v_mul_u32_u24_sdwa v3, v3, v4 dst_sel:BYTE_1
>>> dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:BYTE_1
>>> -; GFX8-NEXT:    s_bfe_u32 s4, s0, 0x80010
>>> -; GFX8-NEXT:    v_mov_b32_e32 v5, s6
>>>  ; GFX8-NEXT:    s_and_b32 s5, s0, s8
>>>  ; GFX8-NEXT:    v_mov_b32_e32 v4, s7
>>> +; GFX8-NEXT:    v_mul_u32_u24_e32 v4, s5, v4
>>> +; GFX8-NEXT:    s_bfe_u32 s4, s0, 0x80010
>>> +; GFX8-NEXT:    v_mov_b32_e32 v5, s6
>>>  ; GFX8-NEXT:    v_mov_b32_e32 v6, s3
>>>  ; GFX8-NEXT:    v_mov_b32_e32 v7, s2
>>> -; GFX8-NEXT:    v_mul_u32_u24_e32 v4, s5, v4
>>> +; GFX8-NEXT:    v_or_b32_sdwa v3, v4, v3 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
>>>  ; GFX8-NEXT:    v_mul_u32_u24_e32 v5, s4, v5
>>>  ; GFX8-NEXT:    v_mul_u32_u24_sdwa v6, v7, v6 dst_sel:BYTE_1
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> +; GFX8-NEXT:    v_and_b32_e32 v3, 0xffff, v3
>>>  ; GFX8-NEXT:    v_or_b32_sdwa v5, v5, v6 dst_sel:WORD_1
>>> dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
>>> -; GFX8-NEXT:    v_or_b32_sdwa v3, v4, v3 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
>>> -; GFX8-NEXT:    v_or_b32_sdwa v3, v3, v5 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
>>> -; GFX8-NEXT:    v_lshrrev_b32_e32 v4, 8, v3
>>> +; GFX8-NEXT:    v_or_b32_e32 v4, v3, v5
>>> +; GFX8-NEXT:    v_lshrrev_b32_e32 v5, 8, v4
>>>  ; GFX8-NEXT:    s_waitcnt vmcnt(0)
>>>  ; GFX8-NEXT:    v_add_u32_e32 v2, vcc, v2, v3
>>> -; GFX8-NEXT:    v_add_u32_e32 v2, vcc, v4, v2
>>> -; GFX8-NEXT:    v_add_u32_sdwa v2, vcc, v3, v2 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:DWORD
>>> -; GFX8-NEXT:    v_add_u32_sdwa v2, vcc, v3, v2 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:BYTE_3 src1_sel:DWORD
>>> +; GFX8-NEXT:    v_add_u32_e32 v2, vcc, v2, v5
>>> +; GFX8-NEXT:    v_add_u32_sdwa v2, vcc, v2, v4 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
>>> +; GFX8-NEXT:    v_add_u32_sdwa v2, vcc, v2, v4 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_3
>>>  ; GFX8-NEXT:    flat_store_byte v[0:1], v2
>>>  ; GFX8-NEXT:    s_endpgm
>>>  ;
>>> @@ -2101,20 +2092,21 @@ define amdgpu_kernel void @udot4_acc8_ve
>>>  ; GFX9-NODL-NEXT:    s_lshr_b32 s4, s3, 24
>>>  ; GFX9-NODL-NEXT:    v_mul_lo_u16_e32 v3, s2, v3
>>>  ; GFX9-NODL-NEXT:    v_mul_lo_u16_sdwa v4, s2, v4 dst_sel:BYTE_1
>>> dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:BYTE_1
>>> -; GFX9-NODL-NEXT:    v_mov_b32_e32 v5, s1
>>>  ; GFX9-NODL-NEXT:    v_or_b32_sdwa v3, v3, v4 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
>>> +; GFX9-NODL-NEXT:    v_mov_b32_e32 v5, s1
>>>  ; GFX9-NODL-NEXT:    s_lshr_b32 s5, s2, 24
>>>  ; GFX9-NODL-NEXT:    v_mov_b32_e32 v4, s4
>>>  ; GFX9-NODL-NEXT:    v_mul_lo_u16_sdwa v4, s5, v4 dst_sel:BYTE_1
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>>  ; GFX9-NODL-NEXT:    v_mul_lo_u16_e32 v5, s0, v5
>>> +; GFX9-NODL-NEXT:    v_and_b32_e32 v3, 0xffff, v3
>>>  ; GFX9-NODL-NEXT:    v_or_b32_sdwa v4, v5, v4 dst_sel:WORD_1
>>> dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
>>> -; GFX9-NODL-NEXT:    v_or_b32_sdwa v3, v3, v4 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
>>> -; GFX9-NODL-NEXT:    v_lshrrev_b32_e32 v4, 8, v3
>>> +; GFX9-NODL-NEXT:    v_or_b32_e32 v4, v3, v4
>>> +; GFX9-NODL-NEXT:    v_lshrrev_b32_e32 v5, 8, v4
>>>  ; GFX9-NODL-NEXT:    s_waitcnt vmcnt(0)
>>>  ; GFX9-NODL-NEXT:    v_add_u32_e32 v2, v3, v2
>>> -; GFX9-NODL-NEXT:    v_add_u32_e32 v2, v2, v4
>>> -; GFX9-NODL-NEXT:    v_add_u32_sdwa v2, v2, v3 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
>>> -; GFX9-NODL-NEXT:    v_add_u32_sdwa v2, v2, v3 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_3
>>> +; GFX9-NODL-NEXT:    v_add_u32_e32 v2, v2, v5
>>> +; GFX9-NODL-NEXT:    v_add_u32_sdwa v2, v2, v4 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
>>> +; GFX9-NODL-NEXT:    v_add_u32_sdwa v2, v2, v4 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_3
>>>  ; GFX9-NODL-NEXT:    global_store_byte v[0:1], v2, off
>>>  ; GFX9-NODL-NEXT:    s_endpgm
>>>  ;
>>> @@ -2136,20 +2128,21 @@ define amdgpu_kernel void @udot4_acc8_ve
>>>  ; GFX9-DL-NEXT:    s_lshr_b32 s4, s3, 24
>>>  ; GFX9-DL-NEXT:    v_mul_lo_u16_e32 v3, s2, v3
>>>  ; GFX9-DL-NEXT:    v_mul_lo_u16_sdwa v4, s2, v4 dst_sel:BYTE_1
>>> dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:BYTE_1
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v5, s1
>>>  ; GFX9-DL-NEXT:    v_or_b32_sdwa v3, v3, v4 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
>>> +; GFX9-DL-NEXT:    v_mov_b32_e32 v5, s1
>>>  ; GFX9-DL-NEXT:    s_lshr_b32 s5, s2, 24
>>>  ; GFX9-DL-NEXT:    v_mov_b32_e32 v4, s4
>>>  ; GFX9-DL-NEXT:    v_mul_lo_u16_sdwa v4, s5, v4 dst_sel:BYTE_1
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>>  ; GFX9-DL-NEXT:    v_mul_lo_u16_e32 v5, s0, v5
>>> +; GFX9-DL-NEXT:    v_and_b32_e32 v3, 0xffff, v3
>>>  ; GFX9-DL-NEXT:    v_or_b32_sdwa v4, v5, v4 dst_sel:WORD_1
>>> dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
>>> -; GFX9-DL-NEXT:    v_or_b32_sdwa v3, v3, v4 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
>>> -; GFX9-DL-NEXT:    v_lshrrev_b32_e32 v4, 8, v3
>>> +; GFX9-DL-NEXT:    v_or_b32_e32 v4, v3, v4
>>> +; GFX9-DL-NEXT:    v_lshrrev_b32_e32 v5, 8, v4
>>>  ; GFX9-DL-NEXT:    s_waitcnt vmcnt(0)
>>>  ; GFX9-DL-NEXT:    v_add_u32_e32 v2, v3, v2
>>> -; GFX9-DL-NEXT:    v_add_u32_e32 v2, v2, v4
>>> -; GFX9-DL-NEXT:    v_add_u32_sdwa v2, v2, v3 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
>>> -; GFX9-DL-NEXT:    v_add_u32_sdwa v2, v2, v3 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_3
>>> +; GFX9-DL-NEXT:    v_add_u32_e32 v2, v2, v5
>>> +; GFX9-DL-NEXT:    v_add_u32_sdwa v2, v2, v4 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
>>> +; GFX9-DL-NEXT:    v_add_u32_sdwa v2, v2, v4 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_3
>>>  ; GFX9-DL-NEXT:    global_store_byte v[0:1], v2, off
>>>  ; GFX9-DL-NEXT:    s_endpgm
>>>  ;
>>> @@ -2167,27 +2160,28 @@ define amdgpu_kernel void @udot4_acc8_ve
>>>  ; GFX10-DL-NEXT:    v_mov_b32_e32 v1, s1
>>>  ; GFX10-DL-NEXT:    global_load_ubyte v3, v[0:1], off
>>>  ; GFX10-DL-NEXT:    s_waitcnt lgkmcnt(0)
>>> -; GFX10-DL-NEXT:    s_lshr_b32 s0, s3, 24
>>> -; GFX10-DL-NEXT:    s_lshr_b32 s5, s4, 24
>>> -; GFX10-DL-NEXT:    s_lshr_b32 s1, s3, 16
>>> -; GFX10-DL-NEXT:    s_lshr_b32 s6, s4, 16
>>>  ; GFX10-DL-NEXT:    v_and_b32_sdwa v4, s3, v2 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
>>>  ; GFX10-DL-NEXT:    v_and_b32_sdwa v5, s4, v2 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:BYTE_1 src1_sel:DWORD
>>> +; GFX10-DL-NEXT:    s_lshr_b32 s0, s3, 24
>>> +; GFX10-DL-NEXT:    s_lshr_b32 s1, s3, 16
>>>  ; GFX10-DL-NEXT:    v_mul_lo_u16_e64 v6, s3, s4
>>> -; GFX10-DL-NEXT:    v_mul_lo_u16_e64 v7, s0, s5
>>> -; GFX10-DL-NEXT:    v_mul_lo_u16_e64 v8, s1, s6
>>> +; GFX10-DL-NEXT:    s_lshr_b32 s3, s4, 16
>>>  ; GFX10-DL-NEXT:    v_mul_lo_u16_e64 v4, v4, v5
>>> +; GFX10-DL-NEXT:    s_lshr_b32 s4, s4, 24
>>>  ; GFX10-DL-NEXT:    v_and_b32_sdwa v5, v6, s2 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
>>> -; GFX10-DL-NEXT:    v_and_b32_sdwa v6, v7, v2 dst_sel:BYTE_1
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> -; GFX10-DL-NEXT:    v_and_b32_sdwa v7, v8, s2 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
>>> -; GFX10-DL-NEXT:    v_and_b32_sdwa v2, v4, v2 dst_sel:BYTE_1
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> -; GFX10-DL-NEXT:    v_or_b32_sdwa v4, v7, v6 dst_sel:WORD_1
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_0
>>> -; GFX10-DL-NEXT:    v_or_b32_sdwa v2, v5, v2 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_0
>>> -; GFX10-DL-NEXT:    v_or_b32_sdwa v2, v2, v4 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
>>> -; GFX10-DL-NEXT:    v_lshrrev_b32_e32 v4, 8, v2
>>> +; GFX10-DL-NEXT:    v_mul_lo_u16_e64 v6, s1, s3
>>> +; GFX10-DL-NEXT:    v_and_b32_sdwa v4, v4, v2 dst_sel:BYTE_1
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> +; GFX10-DL-NEXT:    v_mul_lo_u16_e64 v7, s0, s4
>>> +; GFX10-DL-NEXT:    v_or_b32_sdwa v4, v5, v4 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_0
>>> +; GFX10-DL-NEXT:    v_and_b32_sdwa v5, v6, s2 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
>>> +; GFX10-DL-NEXT:    v_and_b32_sdwa v2, v7, v2 dst_sel:BYTE_1
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> +; GFX10-DL-NEXT:    v_and_b32_e32 v4, 0xffff, v4
>>> +; GFX10-DL-NEXT:    v_or_b32_sdwa v2, v5, v2 dst_sel:WORD_1
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_0
>>> +; GFX10-DL-NEXT:    v_or_b32_e32 v2, v4, v2
>>> +; GFX10-DL-NEXT:    v_lshrrev_b32_e32 v5, 8, v2
>>>  ; GFX10-DL-NEXT:    s_waitcnt vmcnt(0)
>>> -; GFX10-DL-NEXT:    v_add_nc_u32_e32 v3, v2, v3
>>> -; GFX10-DL-NEXT:    v_add_nc_u32_e32 v3, v3, v4
>>> +; GFX10-DL-NEXT:    v_add_nc_u32_e32 v3, v4, v3
>>> +; GFX10-DL-NEXT:    v_add_nc_u32_e32 v3, v3, v5
>>>  ; GFX10-DL-NEXT:    v_add_nc_u32_sdwa v3, v3, v2 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
>>>  ; GFX10-DL-NEXT:    v_add_nc_u32_sdwa v2, v3, v2 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_3
>>>  ; GFX10-DL-NEXT:    global_store_byte v[0:1], v2, off
>>>
>>> Modified: llvm/trunk/test/CodeGen/AMDGPU/idot8s.ll
>>> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/AMDGPU/idot8s.ll?rev=366799&r1=366798&r2=366799&view=diff
>>>
>>> ==============================================================================
>>> --- llvm/trunk/test/CodeGen/AMDGPU/idot8s.ll (original)
>>> +++ llvm/trunk/test/CodeGen/AMDGPU/idot8s.ll Tue Jul 23 05:39:08 2019
>>> @@ -331,39 +331,38 @@ define amdgpu_kernel void @idot8_acc16(<
>>>  ; GFX8-NEXT:    s_bfe_i32 s1, s4, 0x40000
>>>  ; GFX8-NEXT:    v_mov_b32_e32 v3, s1
>>>  ; GFX8-NEXT:    s_bfe_i32 s5, s4, 0x40004
>>> -; GFX8-NEXT:    s_bfe_i32 s6, s4, 0x40008
>>>  ; GFX8-NEXT:    s_lshr_b32 s1, s2, 12
>>> -; GFX8-NEXT:    s_lshr_b32 s7, s4, 12
>>> -; GFX8-NEXT:    s_bfe_i32 s8, s2, 0x40004
>>> -; GFX8-NEXT:    s_bfe_i32 s9, s2, 0x40008
>>> -; GFX8-NEXT:    v_mov_b32_e32 v4, s6
>>> -; GFX8-NEXT:    v_mov_b32_e32 v7, s5
>>> +; GFX8-NEXT:    s_lshr_b32 s6, s4, 12
>>> +; GFX8-NEXT:    s_bfe_i32 s8, s4, 0x40008
>>> +; GFX8-NEXT:    v_mov_b32_e32 v4, s5
>>> +; GFX8-NEXT:    s_bfe_i32 s7, s2, 0x40004
>>>  ; GFX8-NEXT:    v_lshlrev_b16_e64 v5, 12, s1
>>> -; GFX8-NEXT:    v_lshlrev_b16_e64 v6, 12, s7
>>> -; GFX8-NEXT:    v_mul_i32_i24_e32 v4, s9, v4
>>> -; GFX8-NEXT:    s_bfe_i32 s10, s4, 0x40010
>>> +; GFX8-NEXT:    v_lshlrev_b16_e64 v6, 12, s6
>>> +; GFX8-NEXT:    v_mov_b32_e32 v7, s8
>>> +; GFX8-NEXT:    s_bfe_i32 s5, s2, 0x40008
>>> +; GFX8-NEXT:    s_bfe_i32 s1, s4, 0x40010
>>>  ; GFX8-NEXT:    v_ashrrev_i16_e32 v5, 12, v5
>>>  ; GFX8-NEXT:    v_ashrrev_i16_e32 v6, 12, v6
>>> -; GFX8-NEXT:    s_bfe_i32 s12, s4, 0x40014
>>> -; GFX8-NEXT:    s_bfe_i32 s11, s2, 0x40010
>>> -; GFX8-NEXT:    v_mov_b32_e32 v8, s10
>>> -; GFX8-NEXT:    s_bfe_i32 s14, s4, 0x40018
>>> -; GFX8-NEXT:    s_bfe_i32 s13, s2, 0x40014
>>> -; GFX8-NEXT:    v_mov_b32_e32 v9, s12
>>> -; GFX8-NEXT:    s_bfe_i32 s15, s2, 0x40018
>>> +; GFX8-NEXT:    s_bfe_i32 s8, s4, 0x40014
>>> +; GFX8-NEXT:    v_mov_b32_e32 v8, s1
>>> +; GFX8-NEXT:    s_bfe_i32 s6, s2, 0x40010
>>> +; GFX8-NEXT:    s_bfe_i32 s9, s4, 0x40018
>>> +; GFX8-NEXT:    v_mov_b32_e32 v9, s8
>>> +; GFX8-NEXT:    s_bfe_i32 s1, s2, 0x40014
>>> +; GFX8-NEXT:    s_bfe_i32 s8, s2, 0x40018
>>>  ; GFX8-NEXT:    s_ashr_i32 s4, s4, 28
>>> -; GFX8-NEXT:    v_mov_b32_e32 v10, s14
>>> +; GFX8-NEXT:    v_mov_b32_e32 v10, s9
>>>  ; GFX8-NEXT:    s_ashr_i32 s2, s2, 28
>>> +; GFX8-NEXT:    v_mov_b32_e32 v11, s4
>>>  ; GFX8-NEXT:    s_waitcnt vmcnt(0)
>>>  ; GFX8-NEXT:    v_mad_i32_i24 v2, s0, v3, v2
>>> -; GFX8-NEXT:    v_mad_i32_i24 v2, s8, v7, v2
>>> -; GFX8-NEXT:    v_add_u32_sdwa v2, vcc, v4, v2 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:WORD_0
>>> +; GFX8-NEXT:    v_mad_i32_i24 v2, s7, v4, v2
>>> +; GFX8-NEXT:    v_mad_i32_i24 v2, s5, v7, v2
>>>  ; GFX8-NEXT:    v_mad_u32_u24 v2, v5, v6, v2
>>> -; GFX8-NEXT:    v_mad_i32_i24 v2, s11, v8, v2
>>> -; GFX8-NEXT:    v_mad_i32_i24 v2, s13, v9, v2
>>> -; GFX8-NEXT:    v_mad_i32_i24 v2, s15, v10, v2
>>> -; GFX8-NEXT:    v_mov_b32_e32 v3, s4
>>> -; GFX8-NEXT:    v_mad_i32_i24 v2, s2, v3, v2
>>> +; GFX8-NEXT:    v_mad_i32_i24 v2, s6, v8, v2
>>> +; GFX8-NEXT:    v_mad_i32_i24 v2, s1, v9, v2
>>> +; GFX8-NEXT:    v_mad_i32_i24 v2, s8, v10, v2
>>> +; GFX8-NEXT:    v_mad_i32_i24 v2, s2, v11, v2
>>>  ; GFX8-NEXT:    flat_store_short v[0:1], v2
>>>  ; GFX8-NEXT:    s_endpgm
>>>  ;
>>> @@ -382,39 +381,38 @@ define amdgpu_kernel void @idot8_acc16(<
>>>  ; GFX9-NEXT:    s_bfe_i32 s1, s4, 0x40000
>>>  ; GFX9-NEXT:    v_mov_b32_e32 v3, s1
>>>  ; GFX9-NEXT:    s_bfe_i32 s5, s4, 0x40004
>>> -; GFX9-NEXT:    s_bfe_i32 s6, s4, 0x40008
>>>  ; GFX9-NEXT:    s_lshr_b32 s1, s2, 12
>>> -; GFX9-NEXT:    s_lshr_b32 s7, s4, 12
>>> -; GFX9-NEXT:    s_bfe_i32 s8, s2, 0x40004
>>> -; GFX9-NEXT:    s_bfe_i32 s9, s2, 0x40008
>>> -; GFX9-NEXT:    v_mov_b32_e32 v4, s6
>>> -; GFX9-NEXT:    v_mov_b32_e32 v7, s5
>>> +; GFX9-NEXT:    s_lshr_b32 s6, s4, 12
>>> +; GFX9-NEXT:    s_bfe_i32 s8, s4, 0x40008
>>> +; GFX9-NEXT:    v_mov_b32_e32 v4, s5
>>> +; GFX9-NEXT:    s_bfe_i32 s7, s2, 0x40004
>>>  ; GFX9-NEXT:    v_lshlrev_b16_e64 v5, 12, s1
>>> -; GFX9-NEXT:    v_lshlrev_b16_e64 v6, 12, s7
>>> -; GFX9-NEXT:    v_mul_i32_i24_e32 v4, s9, v4
>>> -; GFX9-NEXT:    s_bfe_i32 s10, s4, 0x40010
>>> +; GFX9-NEXT:    v_lshlrev_b16_e64 v6, 12, s6
>>> +; GFX9-NEXT:    v_mov_b32_e32 v7, s8
>>> +; GFX9-NEXT:    s_bfe_i32 s5, s2, 0x40008
>>> +; GFX9-NEXT:    s_bfe_i32 s1, s4, 0x40010
>>>  ; GFX9-NEXT:    v_ashrrev_i16_e32 v5, 12, v5
>>>  ; GFX9-NEXT:    v_ashrrev_i16_e32 v6, 12, v6
>>> -; GFX9-NEXT:    s_bfe_i32 s12, s4, 0x40014
>>> -; GFX9-NEXT:    s_bfe_i32 s11, s2, 0x40010
>>> -; GFX9-NEXT:    v_mov_b32_e32 v8, s10
>>> -; GFX9-NEXT:    s_bfe_i32 s14, s4, 0x40018
>>> -; GFX9-NEXT:    s_bfe_i32 s13, s2, 0x40014
>>> -; GFX9-NEXT:    v_mov_b32_e32 v9, s12
>>> -; GFX9-NEXT:    s_bfe_i32 s15, s2, 0x40018
>>> +; GFX9-NEXT:    s_bfe_i32 s8, s4, 0x40014
>>> +; GFX9-NEXT:    v_mov_b32_e32 v8, s1
>>> +; GFX9-NEXT:    s_bfe_i32 s6, s2, 0x40010
>>> +; GFX9-NEXT:    s_bfe_i32 s9, s4, 0x40018
>>> +; GFX9-NEXT:    v_mov_b32_e32 v9, s8
>>> +; GFX9-NEXT:    s_bfe_i32 s1, s2, 0x40014
>>> +; GFX9-NEXT:    s_bfe_i32 s8, s2, 0x40018
>>>  ; GFX9-NEXT:    s_ashr_i32 s4, s4, 28
>>> -; GFX9-NEXT:    v_mov_b32_e32 v10, s14
>>> +; GFX9-NEXT:    v_mov_b32_e32 v10, s9
>>>  ; GFX9-NEXT:    s_ashr_i32 s2, s2, 28
>>> +; GFX9-NEXT:    v_mov_b32_e32 v11, s4
>>>  ; GFX9-NEXT:    s_waitcnt vmcnt(0)
>>>  ; GFX9-NEXT:    v_mad_i32_i24 v2, s0, v3, v2
>>> -; GFX9-NEXT:    v_mad_i32_i24 v2, s8, v7, v2
>>> -; GFX9-NEXT:    v_add_u32_sdwa v2, v2, v4 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:WORD_0
>>> +; GFX9-NEXT:    v_mad_i32_i24 v2, s7, v4, v2
>>> +; GFX9-NEXT:    v_mad_i32_i24 v2, s5, v7, v2
>>>  ; GFX9-NEXT:    v_mad_u32_u24 v2, v5, v6, v2
>>> -; GFX9-NEXT:    v_mad_i32_i24 v2, s11, v8, v2
>>> -; GFX9-NEXT:    v_mad_i32_i24 v2, s13, v9, v2
>>> -; GFX9-NEXT:    v_mad_i32_i24 v2, s15, v10, v2
>>> -; GFX9-NEXT:    v_mov_b32_e32 v3, s4
>>> -; GFX9-NEXT:    v_mad_i32_i24 v2, s2, v3, v2
>>> +; GFX9-NEXT:    v_mad_i32_i24 v2, s6, v8, v2
>>> +; GFX9-NEXT:    v_mad_i32_i24 v2, s1, v9, v2
>>> +; GFX9-NEXT:    v_mad_i32_i24 v2, s8, v10, v2
>>> +; GFX9-NEXT:    v_mad_i32_i24 v2, s2, v11, v2
>>>  ; GFX9-NEXT:    global_store_short v[0:1], v2, off
>>>  ; GFX9-NEXT:    s_endpgm
>>>  ;
>>> @@ -433,39 +431,38 @@ define amdgpu_kernel void @idot8_acc16(<
>>>  ; GFX9-DL-NEXT:    s_bfe_i32 s1, s4, 0x40000
>>>  ; GFX9-DL-NEXT:    v_mov_b32_e32 v3, s1
>>>  ; GFX9-DL-NEXT:    s_bfe_i32 s5, s4, 0x40004
>>> -; GFX9-DL-NEXT:    s_bfe_i32 s6, s4, 0x40008
>>>  ; GFX9-DL-NEXT:    s_lshr_b32 s1, s2, 12
>>> -; GFX9-DL-NEXT:    s_lshr_b32 s7, s4, 12
>>> -; GFX9-DL-NEXT:    s_bfe_i32 s8, s2, 0x40004
>>> -; GFX9-DL-NEXT:    s_bfe_i32 s9, s2, 0x40008
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v4, s6
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v7, s5
>>> +; GFX9-DL-NEXT:    s_lshr_b32 s6, s4, 12
>>> +; GFX9-DL-NEXT:    s_bfe_i32 s8, s4, 0x40008
>>> +; GFX9-DL-NEXT:    v_mov_b32_e32 v4, s5
>>> +; GFX9-DL-NEXT:    s_bfe_i32 s7, s2, 0x40004
>>>  ; GFX9-DL-NEXT:    v_lshlrev_b16_e64 v5, 12, s1
>>> -; GFX9-DL-NEXT:    v_lshlrev_b16_e64 v6, 12, s7
>>> -; GFX9-DL-NEXT:    v_mul_i32_i24_e32 v4, s9, v4
>>> -; GFX9-DL-NEXT:    s_bfe_i32 s10, s4, 0x40010
>>> +; GFX9-DL-NEXT:    v_lshlrev_b16_e64 v6, 12, s6
>>> +; GFX9-DL-NEXT:    v_mov_b32_e32 v7, s8
>>> +; GFX9-DL-NEXT:    s_bfe_i32 s5, s2, 0x40008
>>> +; GFX9-DL-NEXT:    s_bfe_i32 s1, s4, 0x40010
>>>  ; GFX9-DL-NEXT:    v_ashrrev_i16_e32 v5, 12, v5
>>>  ; GFX9-DL-NEXT:    v_ashrrev_i16_e32 v6, 12, v6
>>> -; GFX9-DL-NEXT:    s_bfe_i32 s12, s4, 0x40014
>>> -; GFX9-DL-NEXT:    s_bfe_i32 s11, s2, 0x40010
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v8, s10
>>> -; GFX9-DL-NEXT:    s_bfe_i32 s14, s4, 0x40018
>>> -; GFX9-DL-NEXT:    s_bfe_i32 s13, s2, 0x40014
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v9, s12
>>> -; GFX9-DL-NEXT:    s_bfe_i32 s15, s2, 0x40018
>>> +; GFX9-DL-NEXT:    s_bfe_i32 s8, s4, 0x40014
>>> +; GFX9-DL-NEXT:    v_mov_b32_e32 v8, s1
>>> +; GFX9-DL-NEXT:    s_bfe_i32 s6, s2, 0x40010
>>> +; GFX9-DL-NEXT:    s_bfe_i32 s9, s4, 0x40018
>>> +; GFX9-DL-NEXT:    v_mov_b32_e32 v9, s8
>>> +; GFX9-DL-NEXT:    s_bfe_i32 s1, s2, 0x40014
>>> +; GFX9-DL-NEXT:    s_bfe_i32 s8, s2, 0x40018
>>>  ; GFX9-DL-NEXT:    s_ashr_i32 s4, s4, 28
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v10, s14
>>> +; GFX9-DL-NEXT:    v_mov_b32_e32 v10, s9
>>>  ; GFX9-DL-NEXT:    s_ashr_i32 s2, s2, 28
>>> +; GFX9-DL-NEXT:    v_mov_b32_e32 v11, s4
>>>  ; GFX9-DL-NEXT:    s_waitcnt vmcnt(0)
>>>  ; GFX9-DL-NEXT:    v_mad_i32_i24 v2, s0, v3, v2
>>> -; GFX9-DL-NEXT:    v_mad_i32_i24 v2, s8, v7, v2
>>> -; GFX9-DL-NEXT:    v_add_u32_sdwa v2, v2, v4 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:WORD_0
>>> +; GFX9-DL-NEXT:    v_mad_i32_i24 v2, s7, v4, v2
>>> +; GFX9-DL-NEXT:    v_mad_i32_i24 v2, s5, v7, v2
>>>  ; GFX9-DL-NEXT:    v_mad_u32_u24 v2, v5, v6, v2
>>> -; GFX9-DL-NEXT:    v_mad_i32_i24 v2, s11, v8, v2
>>> -; GFX9-DL-NEXT:    v_mad_i32_i24 v2, s13, v9, v2
>>> -; GFX9-DL-NEXT:    v_mad_i32_i24 v2, s15, v10, v2
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v3, s4
>>> -; GFX9-DL-NEXT:    v_mad_i32_i24 v2, s2, v3, v2
>>> +; GFX9-DL-NEXT:    v_mad_i32_i24 v2, s6, v8, v2
>>> +; GFX9-DL-NEXT:    v_mad_i32_i24 v2, s1, v9, v2
>>> +; GFX9-DL-NEXT:    v_mad_i32_i24 v2, s8, v10, v2
>>> +; GFX9-DL-NEXT:    v_mad_i32_i24 v2, s2, v11, v2
>>>  ; GFX9-DL-NEXT:    global_store_short v[0:1], v2, off
>>>  ; GFX9-DL-NEXT:    s_endpgm
>>>  ;
>>> @@ -496,26 +493,25 @@ define amdgpu_kernel void @idot8_acc16(<
>>>  ; GFX10-DL-NEXT:    v_and_b32_e32 v5, v5, v2
>>>  ; GFX10-DL-NEXT:    s_bfe_i32 s9, s2, 0x40010
>>>  ; GFX10-DL-NEXT:    s_bfe_i32 s10, s4, 0x40010
>>> -; GFX10-DL-NEXT:    v_mul_i32_i24_e64 v6, s1, s8
>>> +; GFX10-DL-NEXT:    s_bfe_i32 s11, s2, 0x40014
>>>  ; GFX10-DL-NEXT:    v_ashrrev_i16_e64 v4, 12, v4
>>>  ; GFX10-DL-NEXT:    v_ashrrev_i16_e64 v5, 12, v5
>>> -; GFX10-DL-NEXT:    s_bfe_i32 s1, s2, 0x40014
>>> -; GFX10-DL-NEXT:    s_bfe_i32 s8, s4, 0x40014
>>> -; GFX10-DL-NEXT:    s_bfe_i32 s11, s2, 0x40018
>>> +; GFX10-DL-NEXT:    s_bfe_i32 s12, s4, 0x40014
>>> +; GFX10-DL-NEXT:    s_bfe_i32 s13, s2, 0x40018
>>> +; GFX10-DL-NEXT:    s_bfe_i32 s14, s4, 0x40018
>>>  ; GFX10-DL-NEXT:    v_and_b32_e32 v4, v4, v2
>>>  ; GFX10-DL-NEXT:    v_and_b32_e32 v2, v5, v2
>>> -; GFX10-DL-NEXT:    s_bfe_i32 s12, s4, 0x40018
>>> -; GFX10-DL-NEXT:    s_ashr_i32 s2, s2, 28
>>> -; GFX10-DL-NEXT:    s_ashr_i32 s4, s4, 28
>>>  ; GFX10-DL-NEXT:    s_waitcnt vmcnt(0)
>>>  ; GFX10-DL-NEXT:    v_mad_i32_i24 v3, s5, s6, v3
>>>  ; GFX10-DL-NEXT:    v_mad_i32_i24 v3, s7, s0, v3
>>> -; GFX10-DL-NEXT:    v_add_nc_u32_sdwa v3, v3, v6 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:WORD_0
>>> +; GFX10-DL-NEXT:    s_ashr_i32 s0, s2, 28
>>> +; GFX10-DL-NEXT:    v_mad_i32_i24 v3, s1, s8, v3
>>> +; GFX10-DL-NEXT:    s_ashr_i32 s1, s4, 28
>>>  ; GFX10-DL-NEXT:    v_mad_u32_u24 v2, v4, v2, v3
>>>  ; GFX10-DL-NEXT:    v_mad_i32_i24 v2, s9, s10, v2
>>> -; GFX10-DL-NEXT:    v_mad_i32_i24 v2, s1, s8, v2
>>>  ; GFX10-DL-NEXT:    v_mad_i32_i24 v2, s11, s12, v2
>>> -; GFX10-DL-NEXT:    v_mad_i32_i24 v2, s2, s4, v2
>>> +; GFX10-DL-NEXT:    v_mad_i32_i24 v2, s13, s14, v2
>>> +; GFX10-DL-NEXT:    v_mad_i32_i24 v2, s0, s1, v2
>>>  ; GFX10-DL-NEXT:    global_store_short v[0:1], v2, off
>>>  ; GFX10-DL-NEXT:    s_endpgm
>>>                                         <8 x i4> addrspace(1)* %src2,
>>> @@ -668,21 +664,20 @@ define amdgpu_kernel void @idot8_acc8(<8
>>>  ; GFX8-NEXT:    s_bfe_i32 s7, s1, 0x40000
>>>  ; GFX8-NEXT:    s_lshr_b32 s5, s1, 12
>>>  ; GFX8-NEXT:    s_bfe_i32 s9, s1, 0x40004
>>> -; GFX8-NEXT:    s_bfe_i32 s11, s1, 0x40008
>>>  ; GFX8-NEXT:    s_bfe_i32 s6, s0, 0x40000
>>> -; GFX8-NEXT:    v_mov_b32_e32 v6, s7
>>> -; GFX8-NEXT:    v_lshlrev_b16_e64 v4, 12, s4
>>> -; GFX8-NEXT:    v_lshlrev_b16_e64 v5, 12, s5
>>> +; GFX8-NEXT:    v_mov_b32_e32 v5, s7
>>> +; GFX8-NEXT:    v_lshlrev_b16_e64 v3, 12, s4
>>> +; GFX8-NEXT:    v_lshlrev_b16_e64 v4, 12, s5
>>> +; GFX8-NEXT:    s_bfe_i32 s11, s1, 0x40008
>>>  ; GFX8-NEXT:    s_bfe_i32 s8, s0, 0x40004
>>> -; GFX8-NEXT:    s_bfe_i32 s10, s0, 0x40008
>>> -; GFX8-NEXT:    v_mov_b32_e32 v3, s11
>>> -; GFX8-NEXT:    v_mov_b32_e32 v7, s9
>>> +; GFX8-NEXT:    v_mov_b32_e32 v6, s9
>>> +; GFX8-NEXT:    v_ashrrev_i16_e32 v3, 12, v3
>>>  ; GFX8-NEXT:    v_ashrrev_i16_e32 v4, 12, v4
>>> -; GFX8-NEXT:    v_ashrrev_i16_e32 v5, 12, v5
>>> -; GFX8-NEXT:    v_mul_i32_i24_e32 v3, s10, v3
>>> +; GFX8-NEXT:    s_bfe_i32 s10, s0, 0x40008
>>> +; GFX8-NEXT:    v_mov_b32_e32 v7, s11
>>>  ; GFX8-NEXT:    s_bfe_i32 s13, s1, 0x40010
>>> +; GFX8-NEXT:    v_and_b32_e32 v3, s2, v3
>>>  ; GFX8-NEXT:    v_and_b32_e32 v4, s2, v4
>>> -; GFX8-NEXT:    v_and_b32_e32 v5, s2, v5
>>>  ; GFX8-NEXT:    s_bfe_i32 s15, s1, 0x40014
>>>  ; GFX8-NEXT:    s_bfe_i32 s12, s0, 0x40010
>>>  ; GFX8-NEXT:    v_mov_b32_e32 v8, s13
>>> @@ -694,10 +689,10 @@ define amdgpu_kernel void @idot8_acc8(<8
>>>  ; GFX8-NEXT:    v_mov_b32_e32 v10, s17
>>>  ; GFX8-NEXT:    s_ashr_i32 s0, s0, 28
>>>  ; GFX8-NEXT:    s_waitcnt vmcnt(0)
>>> -; GFX8-NEXT:    v_mad_i32_i24 v2, s6, v6, v2
>>> -; GFX8-NEXT:    v_mad_i32_i24 v2, s8, v7, v2
>>> -; GFX8-NEXT:    v_add_u32_sdwa v2, vcc, v3, v2 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:BYTE_0
>>> -; GFX8-NEXT:    v_mad_u32_u24 v2, v4, v5, v2
>>> +; GFX8-NEXT:    v_mad_i32_i24 v2, s6, v5, v2
>>> +; GFX8-NEXT:    v_mad_i32_i24 v2, s8, v6, v2
>>> +; GFX8-NEXT:    v_mad_i32_i24 v2, s10, v7, v2
>>> +; GFX8-NEXT:    v_mad_u32_u24 v2, v3, v4, v2
>>>  ; GFX8-NEXT:    v_mad_i32_i24 v2, s12, v8, v2
>>>  ; GFX8-NEXT:    v_mad_i32_i24 v2, s14, v9, v2
>>>  ; GFX8-NEXT:    v_mad_i32_i24 v2, s16, v10, v2
>>> @@ -722,21 +717,20 @@ define amdgpu_kernel void @idot8_acc8(<8
>>>  ; GFX9-NEXT:    s_bfe_i32 s7, s1, 0x40000
>>>  ; GFX9-NEXT:    s_lshr_b32 s5, s1, 12
>>>  ; GFX9-NEXT:    s_bfe_i32 s9, s1, 0x40004
>>> -; GFX9-NEXT:    s_bfe_i32 s11, s1, 0x40008
>>>  ; GFX9-NEXT:    s_bfe_i32 s6, s0, 0x40000
>>> -; GFX9-NEXT:    v_mov_b32_e32 v6, s7
>>> -; GFX9-NEXT:    v_lshlrev_b16_e64 v4, 12, s4
>>> -; GFX9-NEXT:    v_lshlrev_b16_e64 v5, 12, s5
>>> +; GFX9-NEXT:    v_mov_b32_e32 v5, s7
>>> +; GFX9-NEXT:    v_lshlrev_b16_e64 v3, 12, s4
>>> +; GFX9-NEXT:    v_lshlrev_b16_e64 v4, 12, s5
>>> +; GFX9-NEXT:    s_bfe_i32 s11, s1, 0x40008
>>>  ; GFX9-NEXT:    s_bfe_i32 s8, s0, 0x40004
>>> -; GFX9-NEXT:    s_bfe_i32 s10, s0, 0x40008
>>> -; GFX9-NEXT:    v_mov_b32_e32 v3, s11
>>> -; GFX9-NEXT:    v_mov_b32_e32 v7, s9
>>> +; GFX9-NEXT:    v_mov_b32_e32 v6, s9
>>> +; GFX9-NEXT:    v_ashrrev_i16_e32 v3, 12, v3
>>>  ; GFX9-NEXT:    v_ashrrev_i16_e32 v4, 12, v4
>>> -; GFX9-NEXT:    v_ashrrev_i16_e32 v5, 12, v5
>>> -; GFX9-NEXT:    v_mul_i32_i24_e32 v3, s10, v3
>>> +; GFX9-NEXT:    s_bfe_i32 s10, s0, 0x40008
>>> +; GFX9-NEXT:    v_mov_b32_e32 v7, s11
>>>  ; GFX9-NEXT:    s_bfe_i32 s13, s1, 0x40010
>>> +; GFX9-NEXT:    v_and_b32_e32 v3, s2, v3
>>>  ; GFX9-NEXT:    v_and_b32_e32 v4, s2, v4
>>> -; GFX9-NEXT:    v_and_b32_e32 v5, s2, v5
>>>  ; GFX9-NEXT:    s_bfe_i32 s15, s1, 0x40014
>>>  ; GFX9-NEXT:    s_bfe_i32 s12, s0, 0x40010
>>>  ; GFX9-NEXT:    v_mov_b32_e32 v8, s13
>>> @@ -748,10 +742,10 @@ define amdgpu_kernel void @idot8_acc8(<8
>>>  ; GFX9-NEXT:    v_mov_b32_e32 v10, s17
>>>  ; GFX9-NEXT:    s_ashr_i32 s0, s0, 28
>>>  ; GFX9-NEXT:    s_waitcnt vmcnt(0)
>>> -; GFX9-NEXT:    v_mad_i32_i24 v2, s6, v6, v2
>>> -; GFX9-NEXT:    v_mad_i32_i24 v2, s8, v7, v2
>>> -; GFX9-NEXT:    v_add_u32_sdwa v2, v2, v3 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:BYTE_0
>>> -; GFX9-NEXT:    v_mad_u32_u24 v2, v4, v5, v2
>>> +; GFX9-NEXT:    v_mad_i32_i24 v2, s6, v5, v2
>>> +; GFX9-NEXT:    v_mad_i32_i24 v2, s8, v6, v2
>>> +; GFX9-NEXT:    v_mad_i32_i24 v2, s10, v7, v2
>>> +; GFX9-NEXT:    v_mad_u32_u24 v2, v3, v4, v2
>>>  ; GFX9-NEXT:    v_mad_i32_i24 v2, s12, v8, v2
>>>  ; GFX9-NEXT:    v_mad_i32_i24 v2, s14, v9, v2
>>>  ; GFX9-NEXT:    v_mad_i32_i24 v2, s16, v10, v2
>>> @@ -776,21 +770,20 @@ define amdgpu_kernel void @idot8_acc8(<8
>>>  ; GFX9-DL-NEXT:    s_bfe_i32 s7, s1, 0x40000
>>>  ; GFX9-DL-NEXT:    s_lshr_b32 s5, s1, 12
>>>  ; GFX9-DL-NEXT:    s_bfe_i32 s9, s1, 0x40004
>>> -; GFX9-DL-NEXT:    s_bfe_i32 s11, s1, 0x40008
>>>  ; GFX9-DL-NEXT:    s_bfe_i32 s6, s0, 0x40000
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v6, s7
>>> -; GFX9-DL-NEXT:    v_lshlrev_b16_e64 v4, 12, s4
>>> -; GFX9-DL-NEXT:    v_lshlrev_b16_e64 v5, 12, s5
>>> +; GFX9-DL-NEXT:    v_mov_b32_e32 v5, s7
>>> +; GFX9-DL-NEXT:    v_lshlrev_b16_e64 v3, 12, s4
>>> +; GFX9-DL-NEXT:    v_lshlrev_b16_e64 v4, 12, s5
>>> +; GFX9-DL-NEXT:    s_bfe_i32 s11, s1, 0x40008
>>>  ; GFX9-DL-NEXT:    s_bfe_i32 s8, s0, 0x40004
>>> -; GFX9-DL-NEXT:    s_bfe_i32 s10, s0, 0x40008
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v3, s11
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v7, s9
>>> +; GFX9-DL-NEXT:    v_mov_b32_e32 v6, s9
>>> +; GFX9-DL-NEXT:    v_ashrrev_i16_e32 v3, 12, v3
>>>  ; GFX9-DL-NEXT:    v_ashrrev_i16_e32 v4, 12, v4
>>> -; GFX9-DL-NEXT:    v_ashrrev_i16_e32 v5, 12, v5
>>> -; GFX9-DL-NEXT:    v_mul_i32_i24_e32 v3, s10, v3
>>> +; GFX9-DL-NEXT:    s_bfe_i32 s10, s0, 0x40008
>>> +; GFX9-DL-NEXT:    v_mov_b32_e32 v7, s11
>>>  ; GFX9-DL-NEXT:    s_bfe_i32 s13, s1, 0x40010
>>> +; GFX9-DL-NEXT:    v_and_b32_e32 v3, s2, v3
>>>  ; GFX9-DL-NEXT:    v_and_b32_e32 v4, s2, v4
>>> -; GFX9-DL-NEXT:    v_and_b32_e32 v5, s2, v5
>>>  ; GFX9-DL-NEXT:    s_bfe_i32 s15, s1, 0x40014
>>>  ; GFX9-DL-NEXT:    s_bfe_i32 s12, s0, 0x40010
>>>  ; GFX9-DL-NEXT:    v_mov_b32_e32 v8, s13
>>> @@ -802,10 +795,10 @@ define amdgpu_kernel void @idot8_acc8(<8
>>>  ; GFX9-DL-NEXT:    v_mov_b32_e32 v10, s17
>>>  ; GFX9-DL-NEXT:    s_ashr_i32 s0, s0, 28
>>>  ; GFX9-DL-NEXT:    s_waitcnt vmcnt(0)
>>> -; GFX9-DL-NEXT:    v_mad_i32_i24 v2, s6, v6, v2
>>> -; GFX9-DL-NEXT:    v_mad_i32_i24 v2, s8, v7, v2
>>> -; GFX9-DL-NEXT:    v_add_u32_sdwa v2, v2, v3 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:BYTE_0
>>> -; GFX9-DL-NEXT:    v_mad_u32_u24 v2, v4, v5, v2
>>> +; GFX9-DL-NEXT:    v_mad_i32_i24 v2, s6, v5, v2
>>> +; GFX9-DL-NEXT:    v_mad_i32_i24 v2, s8, v6, v2
>>> +; GFX9-DL-NEXT:    v_mad_i32_i24 v2, s10, v7, v2
>>> +; GFX9-DL-NEXT:    v_mad_u32_u24 v2, v3, v4, v2
>>>  ; GFX9-DL-NEXT:    v_mad_i32_i24 v2, s12, v8, v2
>>>  ; GFX9-DL-NEXT:    v_mad_i32_i24 v2, s14, v9, v2
>>>  ; GFX9-DL-NEXT:    v_mad_i32_i24 v2, s16, v10, v2
>>> @@ -842,26 +835,25 @@ define amdgpu_kernel void @idot8_acc8(<8
>>>  ; GFX10-DL-NEXT:    v_and_b32_e32 v2, v5, v2
>>>  ; GFX10-DL-NEXT:    s_bfe_i32 s10, s4, 0x40010
>>>  ; GFX10-DL-NEXT:    s_bfe_i32 s11, s5, 0x40010
>>> -; GFX10-DL-NEXT:    v_mul_i32_i24_e64 v5, s1, s9
>>> +; GFX10-DL-NEXT:    s_bfe_i32 s12, s4, 0x40014
>>>  ; GFX10-DL-NEXT:    v_ashrrev_i16_e64 v4, 12, v4
>>>  ; GFX10-DL-NEXT:    v_ashrrev_i16_e64 v2, 12, v2
>>> -; GFX10-DL-NEXT:    s_bfe_i32 s1, s4, 0x40014
>>> -; GFX10-DL-NEXT:    s_bfe_i32 s9, s5, 0x40014
>>> -; GFX10-DL-NEXT:    s_bfe_i32 s12, s4, 0x40018
>>> +; GFX10-DL-NEXT:    s_bfe_i32 s13, s5, 0x40014
>>> +; GFX10-DL-NEXT:    s_bfe_i32 s14, s4, 0x40018
>>> +; GFX10-DL-NEXT:    s_bfe_i32 s15, s5, 0x40018
>>>  ; GFX10-DL-NEXT:    v_and_b32_sdwa v4, v4, s2 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
>>>  ; GFX10-DL-NEXT:    v_and_b32_sdwa v2, v2, s2 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
>>> -; GFX10-DL-NEXT:    s_bfe_i32 s2, s5, 0x40018
>>> -; GFX10-DL-NEXT:    s_ashr_i32 s4, s4, 28
>>> -; GFX10-DL-NEXT:    s_ashr_i32 s5, s5, 28
>>>  ; GFX10-DL-NEXT:    s_waitcnt vmcnt(0)
>>>  ; GFX10-DL-NEXT:    v_mad_i32_i24 v3, s6, s7, v3
>>>  ; GFX10-DL-NEXT:    v_mad_i32_i24 v3, s8, s0, v3
>>> -; GFX10-DL-NEXT:    v_add_nc_u32_sdwa v3, v3, v5 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:BYTE_0
>>> +; GFX10-DL-NEXT:    s_ashr_i32 s0, s4, 28
>>> +; GFX10-DL-NEXT:    v_mad_i32_i24 v3, s1, s9, v3
>>> +; GFX10-DL-NEXT:    s_ashr_i32 s1, s5, 28
>>>  ; GFX10-DL-NEXT:    v_mad_u32_u24 v2, v4, v2, v3
>>>  ; GFX10-DL-NEXT:    v_mad_i32_i24 v2, s10, s11, v2
>>> -; GFX10-DL-NEXT:    v_mad_i32_i24 v2, s1, s9, v2
>>> -; GFX10-DL-NEXT:    v_mad_i32_i24 v2, s12, s2, v2
>>> -; GFX10-DL-NEXT:    v_mad_i32_i24 v2, s4, s5, v2
>>> +; GFX10-DL-NEXT:    v_mad_i32_i24 v2, s12, s13, v2
>>> +; GFX10-DL-NEXT:    v_mad_i32_i24 v2, s14, s15, v2
>>> +; GFX10-DL-NEXT:    v_mad_i32_i24 v2, s0, s1, v2
>>>  ; GFX10-DL-NEXT:    global_store_byte v[0:1], v2, off
>>>  ; GFX10-DL-NEXT:    s_endpgm
>>>                                         <8 x i4> addrspace(1)* %src2,
>>> @@ -1582,69 +1574,57 @@ define amdgpu_kernel void @idot8_acc16_v
>>>  ; GFX7-NEXT:    s_load_dwordx2 s[4:5], s[0:1], 0xd
>>>  ; GFX7-NEXT:    s_mov_b32 s7, 0xf000
>>>  ; GFX7-NEXT:    s_mov_b32 s6, -1
>>> -; GFX7-NEXT:    s_mov_b32 s0, 0xffff
>>> +; GFX7-NEXT:    s_mov_b32 s2, 0xffff
>>>  ; GFX7-NEXT:    s_waitcnt lgkmcnt(0)
>>> -; GFX7-NEXT:    s_load_dword s1, s[8:9], 0x0
>>> +; GFX7-NEXT:    s_load_dword s0, s[8:9], 0x0
>>>  ; GFX7-NEXT:    buffer_load_ushort v0, off, s[4:7], 0
>>> -; GFX7-NEXT:    s_load_dword s2, s[10:11], 0x0
>>> +; GFX7-NEXT:    s_load_dword s1, s[10:11], 0x0
>>>  ; GFX7-NEXT:    s_waitcnt lgkmcnt(0)
>>> -; GFX7-NEXT:    s_bfe_i32 s8, s1, 0x40010
>>> -; GFX7-NEXT:    s_bfe_i32 s9, s1, 0x40014
>>> -; GFX7-NEXT:    s_bfe_i32 s15, s2, 0x40010
>>> -; GFX7-NEXT:    s_bfe_i32 s16, s2, 0x40014
>>> -; GFX7-NEXT:    s_bfe_i32 s17, s2, 0x40018
>>> -; GFX7-NEXT:    s_ashr_i32 s18, s2, 28
>>> -; GFX7-NEXT:    s_bfe_i32 s19, s2, 0x40000
>>> -; GFX7-NEXT:    s_bfe_i32 s20, s2, 0x40004
>>> -; GFX7-NEXT:    s_bfe_i32 s21, s2, 0x40008
>>> -; GFX7-NEXT:    s_bfe_i32 s2, s2, 0x4000c
>>> -; GFX7-NEXT:    s_bfe_i32 s10, s1, 0x40018
>>> -; GFX7-NEXT:    s_ashr_i32 s11, s1, 28
>>> -; GFX7-NEXT:    s_bfe_i32 s12, s1, 0x40000
>>> +; GFX7-NEXT:    s_ashr_i32 s8, s0, 28
>>> +; GFX7-NEXT:    s_bfe_i32 s9, s0, 0x40018
>>> +; GFX7-NEXT:    s_bfe_i32 s16, s1, 0x40018
>>> +; GFX7-NEXT:    s_bfe_i32 s17, s1, 0x40014
>>> +; GFX7-NEXT:    s_bfe_i32 s18, s1, 0x40010
>>> +; GFX7-NEXT:    s_bfe_i32 s19, s1, 0x40000
>>> +; GFX7-NEXT:    s_bfe_i32 s20, s1, 0x40004
>>> +; GFX7-NEXT:    s_bfe_i32 s21, s1, 0x40008
>>> +; GFX7-NEXT:    s_ashr_i32 s15, s1, 28
>>> +; GFX7-NEXT:    s_bfe_i32 s1, s1, 0x4000c
>>> +; GFX7-NEXT:    s_bfe_i32 s10, s0, 0x40014
>>> +; GFX7-NEXT:    s_bfe_i32 s11, s0, 0x40010
>>> +; GFX7-NEXT:    s_bfe_i32 s12, s0, 0x40000
>>>  ; GFX7-NEXT:    v_mov_b32_e32 v4, s19
>>> -; GFX7-NEXT:    s_bfe_i32 s13, s1, 0x40004
>>> +; GFX7-NEXT:    s_bfe_i32 s13, s0, 0x40004
>>>  ; GFX7-NEXT:    v_mov_b32_e32 v3, s20
>>> -; GFX7-NEXT:    s_bfe_i32 s14, s1, 0x40008
>>> +; GFX7-NEXT:    s_bfe_i32 s14, s0, 0x40008
>>>  ; GFX7-NEXT:    v_mov_b32_e32 v2, s21
>>> -; GFX7-NEXT:    s_bfe_i32 s1, s1, 0x4000c
>>> -; GFX7-NEXT:    v_mov_b32_e32 v1, s2
>>> -; GFX7-NEXT:    v_mov_b32_e32 v5, s18
>>> -; GFX7-NEXT:    v_mov_b32_e32 v6, s17
>>> -; GFX7-NEXT:    v_mul_i32_i24_e32 v1, s1, v1
>>> -; GFX7-NEXT:    v_mul_i32_i24_e32 v2, s14, v2
>>> +; GFX7-NEXT:    s_bfe_i32 s0, s0, 0x4000c
>>> +; GFX7-NEXT:    v_mov_b32_e32 v1, s1
>>> +; GFX7-NEXT:    v_mul_i32_i24_e32 v1, s0, v1
>>> +; GFX7-NEXT:    v_mul_i32_i24_e32 v8, s14, v2
>>>  ; GFX7-NEXT:    v_mul_i32_i24_e32 v3, s13, v3
>>>  ; GFX7-NEXT:    v_mul_i32_i24_e32 v4, s12, v4
>>> -; GFX7-NEXT:    v_mul_i32_i24_e32 v5, s11, v5
>>> -; GFX7-NEXT:    v_mul_i32_i24_e32 v6, s10, v6
>>>  ; GFX7-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
>>> -; GFX7-NEXT:    v_and_b32_e32 v2, s0, v2
>>> +; GFX7-NEXT:    v_and_b32_e32 v8, s2, v8
>>>  ; GFX7-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
>>> -; GFX7-NEXT:    v_and_b32_e32 v4, s0, v4
>>> -; GFX7-NEXT:    v_or_b32_e32 v1, v2, v1
>>> -; GFX7-NEXT:    v_or_b32_e32 v2, v4, v3
>>> -; GFX7-NEXT:    v_lshlrev_b32_e32 v5, 16, v5
>>> -; GFX7-NEXT:    v_and_b32_e32 v6, s0, v6
>>> +; GFX7-NEXT:    v_and_b32_e32 v4, s2, v4
>>> +; GFX7-NEXT:    v_or_b32_e32 v3, v4, v3
>>> +; GFX7-NEXT:    v_or_b32_e32 v1, v8, v1
>>> +; GFX7-NEXT:    v_alignbit_b32 v4, v1, v3, 16
>>> +; GFX7-NEXT:    v_lshrrev_b32_e32 v1, 16, v1
>>> +; GFX7-NEXT:    v_mov_b32_e32 v5, s18
>>> +; GFX7-NEXT:    v_mov_b32_e32 v6, s17
>>>  ; GFX7-NEXT:    v_mov_b32_e32 v7, s16
>>> -; GFX7-NEXT:    v_mov_b32_e32 v8, s15
>>> -; GFX7-NEXT:    v_or_b32_e32 v3, v6, v5
>>> -; GFX7-NEXT:    v_alignbit_b32 v5, v1, v2, 16
>>> -; GFX7-NEXT:    v_mul_i32_i24_e32 v7, s9, v7
>>> -; GFX7-NEXT:    v_mul_i32_i24_e32 v8, s8, v8
>>> -; GFX7-NEXT:    v_lshlrev_b32_e32 v7, 16, v7
>>> -; GFX7-NEXT:    v_and_b32_e32 v8, s0, v8
>>> -; GFX7-NEXT:    v_lshrrev_b32_e32 v6, 16, v1
>>> -; GFX7-NEXT:    v_or_b32_e32 v4, v8, v7
>>> -; GFX7-NEXT:    v_lshrrev_b32_e32 v7, 16, v4
>>> -; GFX7-NEXT:    v_lshrrev_b32_e32 v8, 16, v3
>>>  ; GFX7-NEXT:    s_waitcnt vmcnt(0)
>>> -; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v0, v2
>>> -; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v5, v0
>>> -; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v0, v1
>>> -; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v6, v0
>>> +; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v0, v3
>>>  ; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v4, v0
>>> -; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v7, v0
>>> -; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v3, v0
>>> -; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v8, v0
>>> +; GFX7-NEXT:    v_mad_i32_i24 v0, s14, v2, v0
>>> +; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v0, v1
>>> +; GFX7-NEXT:    v_mad_i32_i24 v0, s11, v5, v0
>>> +; GFX7-NEXT:    v_mad_i32_i24 v0, s10, v6, v0
>>> +; GFX7-NEXT:    v_mad_i32_i24 v0, s9, v7, v0
>>> +; GFX7-NEXT:    v_mov_b32_e32 v1, s15
>>> +; GFX7-NEXT:    v_mad_i32_i24 v0, s8, v1, v0
>>>  ; GFX7-NEXT:    buffer_store_short v0, off, s[4:7], 0
>>>  ; GFX7-NEXT:    s_endpgm
>>>  ;
>>> @@ -1662,26 +1642,25 @@ define amdgpu_kernel void @idot8_acc16_v
>>>  ; GFX8-NEXT:    v_lshlrev_b16_e64 v3, 12, s2
>>>  ; GFX8-NEXT:    v_lshlrev_b16_e64 v4, 12, s4
>>>  ; GFX8-NEXT:    s_lshr_b32 s0, s2, 4
>>> -; GFX8-NEXT:    s_lshr_b32 s1, s2, 8
>>> -; GFX8-NEXT:    s_lshr_b32 s5, s4, 4
>>> +; GFX8-NEXT:    s_lshr_b32 s1, s4, 4
>>> +; GFX8-NEXT:    v_lshlrev_b16_e64 v5, 12, s0
>>> +; GFX8-NEXT:    v_lshlrev_b16_e64 v6, 12, s1
>>> +; GFX8-NEXT:    s_lshr_b32 s5, s2, 8
>>>  ; GFX8-NEXT:    s_lshr_b32 s6, s4, 8
>>> -; GFX8-NEXT:    v_lshlrev_b16_e64 v5, 12, s1
>>> -; GFX8-NEXT:    v_lshlrev_b16_e64 v6, 12, s0
>>> -; GFX8-NEXT:    v_lshlrev_b16_e64 v7, 12, s6
>>> -; GFX8-NEXT:    v_lshlrev_b16_e64 v8, 12, s5
>>>  ; GFX8-NEXT:    v_ashrrev_i16_e32 v3, 12, v3
>>>  ; GFX8-NEXT:    v_ashrrev_i16_e32 v4, 12, v4
>>> +; GFX8-NEXT:    v_lshlrev_b16_e64 v7, 12, s5
>>> +; GFX8-NEXT:    v_lshlrev_b16_e64 v8, 12, s6
>>>  ; GFX8-NEXT:    s_lshr_b32 s0, s2, 12
>>>  ; GFX8-NEXT:    s_lshr_b32 s1, s4, 12
>>> -; GFX8-NEXT:    v_ashrrev_i16_e32 v6, 12, v6
>>>  ; GFX8-NEXT:    v_ashrrev_i16_e32 v5, 12, v5
>>> -; GFX8-NEXT:    v_ashrrev_i16_e32 v7, 12, v7
>>> -; GFX8-NEXT:    v_ashrrev_i16_e32 v8, 12, v8
>>> +; GFX8-NEXT:    v_ashrrev_i16_e32 v6, 12, v6
>>>  ; GFX8-NEXT:    v_lshlrev_b16_e64 v9, 12, s0
>>>  ; GFX8-NEXT:    v_lshlrev_b16_e64 v10, 12, s1
>>>  ; GFX8-NEXT:    s_lshr_b32 s5, s2, 16
>>>  ; GFX8-NEXT:    s_lshr_b32 s6, s4, 16
>>> -; GFX8-NEXT:    v_mul_u32_u24_e32 v5, v5, v7
>>> +; GFX8-NEXT:    v_ashrrev_i16_e32 v7, 12, v7
>>> +; GFX8-NEXT:    v_ashrrev_i16_e32 v8, 12, v8
>>>  ; GFX8-NEXT:    v_lshlrev_b16_e64 v11, 12, s5
>>>  ; GFX8-NEXT:    v_lshlrev_b16_e64 v12, 12, s6
>>>  ; GFX8-NEXT:    s_lshr_b32 s0, s2, 20
>>> @@ -1695,26 +1674,26 @@ define amdgpu_kernel void @idot8_acc16_v
>>>  ; GFX8-NEXT:    v_ashrrev_i16_e32 v11, 12, v11
>>>  ; GFX8-NEXT:    v_ashrrev_i16_e32 v12, 12, v12
>>>  ; GFX8-NEXT:    v_lshlrev_b16_e64 v15, 12, s5
>>> -; GFX8-NEXT:    v_lshlrev_b16_e64 v17, 12, s6
>>> +; GFX8-NEXT:    v_lshlrev_b16_e64 v16, 12, s6
>>>  ; GFX8-NEXT:    s_lshr_b32 s0, s2, 28
>>>  ; GFX8-NEXT:    s_lshr_b32 s1, s4, 28
>>>  ; GFX8-NEXT:    v_ashrrev_i16_e32 v13, 12, v13
>>>  ; GFX8-NEXT:    v_ashrrev_i16_e32 v14, 12, v14
>>> -; GFX8-NEXT:    v_lshlrev_b16_e64 v16, 12, s0
>>> +; GFX8-NEXT:    v_lshlrev_b16_e64 v17, 12, s0
>>>  ; GFX8-NEXT:    v_lshlrev_b16_e64 v18, 12, s1
>>>  ; GFX8-NEXT:    v_ashrrev_i16_e32 v15, 12, v15
>>> -; GFX8-NEXT:    v_ashrrev_i16_e32 v17, 12, v17
>>>  ; GFX8-NEXT:    v_ashrrev_i16_e32 v16, 12, v16
>>> +; GFX8-NEXT:    v_ashrrev_i16_e32 v17, 12, v17
>>>  ; GFX8-NEXT:    v_ashrrev_i16_e32 v18, 12, v18
>>>  ; GFX8-NEXT:    s_waitcnt vmcnt(0)
>>>  ; GFX8-NEXT:    v_mad_u32_u24 v2, v3, v4, v2
>>> -; GFX8-NEXT:    v_mad_u32_u24 v2, v6, v8, v2
>>> -; GFX8-NEXT:    v_add_u32_sdwa v2, vcc, v5, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:WORD_0
>>> +; GFX8-NEXT:    v_mad_u32_u24 v2, v5, v6, v2
>>> +; GFX8-NEXT:    v_mad_u32_u24 v2, v7, v8, v2
>>>  ; GFX8-NEXT:    v_mad_u32_u24 v2, v9, v10, v2
>>>  ; GFX8-NEXT:    v_mad_u32_u24 v2, v11, v12, v2
>>>  ; GFX8-NEXT:    v_mad_u32_u24 v2, v13, v14, v2
>>> -; GFX8-NEXT:    v_mad_u32_u24 v2, v15, v17, v2
>>> -; GFX8-NEXT:    v_mad_u32_u24 v2, v16, v18, v2
>>> +; GFX8-NEXT:    v_mad_u32_u24 v2, v15, v16, v2
>>> +; GFX8-NEXT:    v_mad_u32_u24 v2, v17, v18, v2
>>>  ; GFX8-NEXT:    flat_store_short v[0:1], v2
>>>  ; GFX8-NEXT:    s_endpgm
>>>  ;
>>> @@ -1776,7 +1755,7 @@ define amdgpu_kernel void @idot8_acc16_v
>>>  ; GFX9-NEXT:    s_waitcnt vmcnt(0)
>>>  ; GFX9-NEXT:    v_add_u32_e32 v2, v3, v2
>>>  ; GFX9-NEXT:    v_add_u32_sdwa v2, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
>>> -; GFX9-NEXT:    v_add_u32_sdwa v2, v2, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:WORD_0
>>> +; GFX9-NEXT:    v_add_u32_sdwa v2, v2, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_0
>>>  ; GFX9-NEXT:    v_add_u32_sdwa v2, v2, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
>>>  ; GFX9-NEXT:    v_add_u32_e32 v2, v2, v5
>>>  ; GFX9-NEXT:    v_add_u32_sdwa v2, v2, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
>>> @@ -1843,7 +1822,7 @@ define amdgpu_kernel void @idot8_acc16_v
>>>  ; GFX9-DL-NEXT:    s_waitcnt vmcnt(0)
>>>  ; GFX9-DL-NEXT:    v_add_u32_e32 v2, v3, v2
>>>  ; GFX9-DL-NEXT:    v_add_u32_sdwa v2, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
>>> -; GFX9-DL-NEXT:    v_add_u32_sdwa v2, v2, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:WORD_0
>>> +; GFX9-DL-NEXT:    v_add_u32_sdwa v2, v2, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_0
>>>  ; GFX9-DL-NEXT:    v_add_u32_sdwa v2, v2, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
>>>  ; GFX9-DL-NEXT:    v_add_u32_e32 v2, v2, v5
>>>  ; GFX9-DL-NEXT:    v_add_u32_sdwa v2, v2, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
>>> @@ -1911,7 +1890,7 @@ define amdgpu_kernel void @idot8_acc16_v
>>>  ; GFX10-DL-NEXT:    s_waitcnt vmcnt(0)
>>>  ; GFX10-DL-NEXT:    v_add_nc_u32_e32 v2, v3, v2
>>>  ; GFX10-DL-NEXT:    v_add_nc_u32_sdwa v2, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
>>> -; GFX10-DL-NEXT:    v_add_nc_u32_sdwa v2, v2, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:WORD_0
>>> +; GFX10-DL-NEXT:    v_add_nc_u32_sdwa v2, v2, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_0
>>>  ; GFX10-DL-NEXT:    v_add_nc_u32_sdwa v2, v2, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
>>>  ; GFX10-DL-NEXT:    v_add_nc_u32_e32 v2, v2, v5
>>>  ; GFX10-DL-NEXT:    v_add_nc_u32_sdwa v2, v2, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
>>> @@ -1994,7 +1973,7 @@ define amdgpu_kernel void @idot8_acc8_ve
>>>  ; GFX7-NEXT:    v_mul_i32_i24_e32 v1, s2, v1
>>>  ; GFX7-NEXT:    v_mul_i32_i24_e32 v2, s15, v2
>>>  ; GFX7-NEXT:    v_mul_i32_i24_e32 v3, s14, v3
>>> -; GFX7-NEXT:    v_mul_i32_i24_e32 v4, s13, v4
>>> +; GFX7-NEXT:    v_mul_i32_i24_e32 v9, s13, v4
>>>  ; GFX7-NEXT:    v_mul_i32_i24_e32 v5, s12, v5
>>>  ; GFX7-NEXT:    v_mul_i32_i24_e32 v6, s11, v6
>>>  ; GFX7-NEXT:    v_mul_i32_i24_e32 v7, s10, v7
>>> @@ -2002,36 +1981,36 @@ define amdgpu_kernel void @idot8_acc8_ve
>>>  ; GFX7-NEXT:    v_lshlrev_b32_e32 v1, 8, v1
>>>  ; GFX7-NEXT:    v_and_b32_e32 v2, s0, v2
>>>  ; GFX7-NEXT:    v_lshlrev_b32_e32 v3, 8, v3
>>> -; GFX7-NEXT:    v_and_b32_e32 v4, s0, v4
>>> +; GFX7-NEXT:    v_and_b32_e32 v9, s0, v9
>>>  ; GFX7-NEXT:    v_lshlrev_b32_e32 v5, 8, v5
>>>  ; GFX7-NEXT:    v_and_b32_e32 v6, s0, v6
>>>  ; GFX7-NEXT:    v_lshlrev_b32_e32 v7, 8, v7
>>>  ; GFX7-NEXT:    v_and_b32_e32 v8, s0, v8
>>>  ; GFX7-NEXT:    v_or_b32_e32 v1, v2, v1
>>> -; GFX7-NEXT:    v_or_b32_e32 v2, v4, v3
>>> +; GFX7-NEXT:    v_or_b32_e32 v2, v9, v3
>>>  ; GFX7-NEXT:    v_or_b32_e32 v3, v6, v5
>>> -; GFX7-NEXT:    v_or_b32_e32 v4, v8, v7
>>> +; GFX7-NEXT:    v_or_b32_e32 v5, v8, v7
>>>  ; GFX7-NEXT:    v_lshlrev_b32_e32 v1, 16, v1
>>>  ; GFX7-NEXT:    v_and_b32_e32 v2, s1, v2
>>>  ; GFX7-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
>>> -; GFX7-NEXT:    v_and_b32_e32 v4, s1, v4
>>> +; GFX7-NEXT:    v_and_b32_e32 v5, s1, v5
>>>  ; GFX7-NEXT:    v_or_b32_e32 v1, v2, v1
>>> -; GFX7-NEXT:    v_or_b32_e32 v2, v4, v3
>>> +; GFX7-NEXT:    v_or_b32_e32 v2, v5, v3
>>>  ; GFX7-NEXT:    v_alignbit_b32 v3, v1, v2, 8
>>> -; GFX7-NEXT:    v_alignbit_b32 v4, v1, v2, 16
>>> -; GFX7-NEXT:    v_lshrrev_b32_e32 v5, 24, v2
>>> -; GFX7-NEXT:    v_lshrrev_b32_e32 v6, 8, v1
>>> -; GFX7-NEXT:    v_lshrrev_b32_e32 v7, 16, v1
>>> -; GFX7-NEXT:    v_lshrrev_b32_e32 v8, 24, v1
>>> +; GFX7-NEXT:    v_alignbit_b32 v5, v1, v2, 16
>>> +; GFX7-NEXT:    v_lshrrev_b32_e32 v6, 24, v2
>>> +; GFX7-NEXT:    v_lshrrev_b32_e32 v7, 8, v1
>>> +; GFX7-NEXT:    v_lshrrev_b32_e32 v8, 16, v1
>>> +; GFX7-NEXT:    v_lshrrev_b32_e32 v1, 24, v1
>>>  ; GFX7-NEXT:    s_waitcnt vmcnt(0)
>>>  ; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v0, v2
>>>  ; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v3, v0
>>> -; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v4, v0
>>>  ; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v5, v0
>>> -; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v0, v1
>>>  ; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v6, v0
>>> -; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v7, v0
>>> -; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v8, v0
>>> +; GFX7-NEXT:    v_mad_i32_i24 v0, s13, v4, v0
>>> +; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v0, v7
>>> +; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v0, v8
>>> +; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v0, v1
>>>  ; GFX7-NEXT:    buffer_store_byte v0, off, s[4:7], 0
>>>  ; GFX7-NEXT:    s_endpgm
>>>  ;
>>> @@ -2068,55 +2047,56 @@ define amdgpu_kernel void @idot8_acc8_ve
>>>  ; GFX8-NEXT:    v_ashrrev_i16_e32 v9, 12, v9
>>>  ; GFX8-NEXT:    v_ashrrev_i16_e32 v6, 12, v6
>>>  ; GFX8-NEXT:    v_ashrrev_i16_e32 v10, 12, v10
>>> -; GFX8-NEXT:    v_mul_u32_u24_sdwa v3, v3, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:BYTE_0
>>> -; GFX8-NEXT:    v_mul_u32_u24_sdwa v4, v4, v8 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:BYTE_0
>>> -; GFX8-NEXT:    v_mul_u32_u24_sdwa v5, v5, v9 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:BYTE_0
>>> -; GFX8-NEXT:    v_mul_u32_u24_sdwa v6, v6, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:BYTE_0
>>> +; GFX8-NEXT:    s_lshr_b32 s5, s4, 20
>>> +; GFX8-NEXT:    s_lshr_b32 s6, s4, 16
>>>  ; GFX8-NEXT:    s_lshr_b32 s0, s2, 20
>>> +; GFX8-NEXT:    v_mul_u32_u24_sdwa v6, v6, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:BYTE_0
>>> +; GFX8-NEXT:    v_mul_u32_u24_sdwa v5, v5, v9 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:BYTE_0
>>>  ; GFX8-NEXT:    s_lshr_b32 s1, s2, 16
>>> -; GFX8-NEXT:    s_lshr_b32 s5, s2, 28
>>> -; GFX8-NEXT:    s_lshr_b32 s2, s2, 24
>>> -; GFX8-NEXT:    s_lshr_b32 s6, s4, 20
>>> -; GFX8-NEXT:    s_lshr_b32 s7, s4, 16
>>> +; GFX8-NEXT:    v_mul_u32_u24_sdwa v3, v3, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:BYTE_0
>>> +; GFX8-NEXT:    v_mul_u32_u24_sdwa v4, v4, v8 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:BYTE_0
>>> +; GFX8-NEXT:    s_lshr_b32 s7, s2, 28
>>>  ; GFX8-NEXT:    s_lshr_b32 s8, s4, 28
>>> +; GFX8-NEXT:    v_lshlrev_b16_e64 v9, 12, s1
>>> +; GFX8-NEXT:    v_lshlrev_b16_e64 v10, 12, s0
>>> +; GFX8-NEXT:    v_lshlrev_b16_e64 v13, 12, s6
>>> +; GFX8-NEXT:    v_lshlrev_b16_e64 v14, 12, s5
>>> +; GFX8-NEXT:    s_lshr_b32 s2, s2, 24
>>>  ; GFX8-NEXT:    s_lshr_b32 s4, s4, 24
>>>  ; GFX8-NEXT:    v_or_b32_sdwa v5, v6, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
>>>  ; GFX8-NEXT:    v_or_b32_sdwa v3, v3, v4 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
>>> -; GFX8-NEXT:    v_or_b32_sdwa v3, v5, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
>>>  ; GFX8-NEXT:    v_lshlrev_b16_e64 v7, 12, s2
>>> -; GFX8-NEXT:    v_lshlrev_b16_e64 v8, 12, s5
>>> -; GFX8-NEXT:    v_lshlrev_b16_e64 v9, 12, s1
>>> -; GFX8-NEXT:    v_lshlrev_b16_e64 v10, 12, s0
>>> +; GFX8-NEXT:    v_lshlrev_b16_e64 v8, 12, s7
>>>  ; GFX8-NEXT:    v_lshlrev_b16_e64 v11, 12, s4
>>>  ; GFX8-NEXT:    v_lshlrev_b16_e64 v12, 12, s8
>>> -; GFX8-NEXT:    v_lshlrev_b16_e64 v13, 12, s7
>>> -; GFX8-NEXT:    v_lshlrev_b16_e64 v14, 12, s6
>>> -; GFX8-NEXT:    v_ashrrev_i16_e32 v7, 12, v7
>>> -; GFX8-NEXT:    v_ashrrev_i16_e32 v11, 12, v11
>>> -; GFX8-NEXT:    v_ashrrev_i16_e32 v8, 12, v8
>>> -; GFX8-NEXT:    v_ashrrev_i16_e32 v12, 12, v12
>>> +; GFX8-NEXT:    v_or_b32_sdwa v3, v5, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
>>>  ; GFX8-NEXT:    v_ashrrev_i16_e32 v9, 12, v9
>>>  ; GFX8-NEXT:    v_ashrrev_i16_e32 v13, 12, v13
>>>  ; GFX8-NEXT:    v_ashrrev_i16_e32 v10, 12, v10
>>>  ; GFX8-NEXT:    v_ashrrev_i16_e32 v14, 12, v14
>>> -; GFX8-NEXT:    v_lshrrev_b32_e32 v5, 8, v3
>>> -; GFX8-NEXT:    v_mul_u32_u24_sdwa v7, v7, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:BYTE_0
>>> -; GFX8-NEXT:    v_mul_u32_u24_sdwa v8, v8, v12 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:BYTE_0
>>> +; GFX8-NEXT:    v_ashrrev_i16_e32 v7, 12, v7
>>> +; GFX8-NEXT:    v_ashrrev_i16_e32 v11, 12, v11
>>> +; GFX8-NEXT:    v_ashrrev_i16_e32 v8, 12, v8
>>> +; GFX8-NEXT:    v_ashrrev_i16_e32 v12, 12, v12
>>>  ; GFX8-NEXT:    v_mul_u32_u24_sdwa v9, v9, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:BYTE_0
>>>  ; GFX8-NEXT:    v_mul_u32_u24_sdwa v10, v10, v14 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:BYTE_0
>>> +; GFX8-NEXT:    v_lshrrev_b32_e32 v6, 8, v3
>>>  ; GFX8-NEXT:    v_or_b32_sdwa v9, v9, v10 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
>>> +; GFX8-NEXT:    v_mul_u32_u24_sdwa v7, v7, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:BYTE_0
>>> +; GFX8-NEXT:    v_mul_u32_u24_sdwa v8, v8, v12 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:BYTE_0
>>>  ; GFX8-NEXT:    v_or_b32_sdwa v7, v7, v8 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
>>> -; GFX8-NEXT:    v_or_b32_sdwa v4, v9, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
>>> -; GFX8-NEXT:    v_lshrrev_b32_e32 v6, 8, v4
>>> +; GFX8-NEXT:    v_and_b32_e32 v4, 0xffff, v9
>>> +; GFX8-NEXT:    v_or_b32_e32 v5, v4, v7
>>> +; GFX8-NEXT:    v_lshrrev_b32_e32 v7, 8, v5
>>>  ; GFX8-NEXT:    s_waitcnt vmcnt(0)
>>>  ; GFX8-NEXT:    v_add_u32_e32 v2, vcc, v2, v3
>>> -; GFX8-NEXT:    v_add_u32_e32 v2, vcc, v5, v2
>>> -; GFX8-NEXT:    v_add_u32_sdwa v2, vcc, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:BYTE_2
>>> +; GFX8-NEXT:    v_add_u32_e32 v2, vcc, v6, v2
>>> +; GFX8-NEXT:    v_add_u32_sdwa v2, vcc, v3, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:DWORD
>>>  ; GFX8-NEXT:    v_add_u32_sdwa v2, vcc, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_3
>>>  ; GFX8-NEXT:    v_add_u32_e32 v2, vcc, v2, v4
>>> -; GFX8-NEXT:    v_add_u32_e32 v2, vcc, v2, v6
>>> -; GFX8-NEXT:    v_add_u32_sdwa v2, vcc, v2, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
>>> -; GFX8-NEXT:    v_add_u32_sdwa v2, vcc, v2, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_3
>>> +; GFX8-NEXT:    v_add_u32_e32 v2, vcc, v7, v2
>>> +; GFX8-NEXT:    v_add_u32_sdwa v2, vcc, v5, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:DWORD
>>> +; GFX8-NEXT:    v_add_u32_sdwa v2, vcc, v5, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_3 src1_sel:DWORD
>>>  ; GFX8-NEXT:    flat_store_byte v[0:1], v2
>>>  ; GFX8-NEXT:    s_endpgm
>>>  ;
>>> @@ -2153,55 +2133,56 @@ define amdgpu_kernel void @idot8_acc8_ve
>>>  ; GFX9-NEXT:    v_ashrrev_i16_e32 v9, 12, v9
>>>  ; GFX9-NEXT:    v_ashrrev_i16_e32 v6, 12, v6
>>>  ; GFX9-NEXT:    v_ashrrev_i16_e32 v10, 12, v10
>>> +; GFX9-NEXT:    s_lshr_b32 s0, s2, 20
>>> +; GFX9-NEXT:    s_lshr_b32 s5, s4, 20
>>> +; GFX9-NEXT:    s_lshr_b32 s6, s4, 16
>>> +; GFX9-NEXT:    s_lshr_b32 s1, s2, 16
>>>  ; GFX9-NEXT:    v_mul_lo_u16_e32 v6, v6, v10
>>>  ; GFX9-NEXT:    v_mul_lo_u16_sdwa v5, v5, v9 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>>  ; GFX9-NEXT:    v_mul_lo_u16_sdwa v4, v4, v8 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>>  ; GFX9-NEXT:    v_mul_lo_u16_e32 v3, v3, v7
>>> -; GFX9-NEXT:    s_lshr_b32 s0, s2, 20
>>> -; GFX9-NEXT:    s_lshr_b32 s1, s2, 16
>>> -; GFX9-NEXT:    s_lshr_b32 s5, s2, 28
>>> -; GFX9-NEXT:    s_lshr_b32 s2, s2, 24
>>> -; GFX9-NEXT:    s_lshr_b32 s6, s4, 20
>>> -; GFX9-NEXT:    s_lshr_b32 s7, s4, 16
>>> +; GFX9-NEXT:    s_lshr_b32 s7, s2, 28
>>>  ; GFX9-NEXT:    s_lshr_b32 s8, s4, 28
>>> +; GFX9-NEXT:    v_lshlrev_b16_e64 v10, 12, s1
>>> +; GFX9-NEXT:    v_lshlrev_b16_e64 v11, 12, s0
>>> +; GFX9-NEXT:    v_lshlrev_b16_e64 v14, 12, s6
>>> +; GFX9-NEXT:    v_lshlrev_b16_e64 v15, 12, s5
>>> +; GFX9-NEXT:    s_lshr_b32 s2, s2, 24
>>>  ; GFX9-NEXT:    s_lshr_b32 s4, s4, 24
>>>  ; GFX9-NEXT:    v_or_b32_sdwa v5, v6, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
>>>  ; GFX9-NEXT:    v_or_b32_sdwa v3, v3, v4 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
>>> +; GFX9-NEXT:    v_lshlrev_b16_e64 v8, 12, s2
>>> +; GFX9-NEXT:    v_lshlrev_b16_e64 v9, 12, s7
>>> +; GFX9-NEXT:    v_lshlrev_b16_e64 v12, 12, s4
>>> +; GFX9-NEXT:    v_lshlrev_b16_e64 v13, 12, s8
>>>  ; GFX9-NEXT:    v_or_b32_sdwa v3, v5, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
>>> -; GFX9-NEXT:    v_lshlrev_b16_e64 v9, 12, s2
>>> -; GFX9-NEXT:    v_lshlrev_b16_e64 v10, 12, s5
>>> -; GFX9-NEXT:    v_lshlrev_b16_e64 v11, 12, s1
>>> -; GFX9-NEXT:    v_lshlrev_b16_e64 v12, 12, s0
>>> -; GFX9-NEXT:    v_lshlrev_b16_e64 v13, 12, s4
>>> -; GFX9-NEXT:    v_lshlrev_b16_e64 v14, 12, s8
>>> -; GFX9-NEXT:    v_lshlrev_b16_e64 v15, 12, s7
>>> -; GFX9-NEXT:    v_lshlrev_b16_e64 v16, 12, s6
>>> -; GFX9-NEXT:    v_ashrrev_i16_e32 v9, 12, v9
>>> -; GFX9-NEXT:    v_ashrrev_i16_e32 v13, 12, v13
>>>  ; GFX9-NEXT:    v_ashrrev_i16_e32 v10, 12, v10
>>>  ; GFX9-NEXT:    v_ashrrev_i16_e32 v14, 12, v14
>>>  ; GFX9-NEXT:    v_ashrrev_i16_e32 v11, 12, v11
>>>  ; GFX9-NEXT:    v_ashrrev_i16_e32 v15, 12, v15
>>> +; GFX9-NEXT:    v_ashrrev_i16_e32 v8, 12, v8
>>>  ; GFX9-NEXT:    v_ashrrev_i16_e32 v12, 12, v12
>>> -; GFX9-NEXT:    v_ashrrev_i16_e32 v16, 12, v16
>>> -; GFX9-NEXT:    v_lshrrev_b32_e32 v5, 8, v3
>>> -; GFX9-NEXT:    v_mul_lo_u16_sdwa v12, v12, v16 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> -; GFX9-NEXT:    v_mul_lo_u16_e32 v11, v11, v15
>>> -; GFX9-NEXT:    v_mul_lo_u16_sdwa v10, v10, v14 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> -; GFX9-NEXT:    v_mul_lo_u16_e32 v9, v9, v13
>>> -; GFX9-NEXT:    v_or_b32_sdwa v7, v11, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
>>> -; GFX9-NEXT:    v_or_b32_sdwa v8, v9, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
>>> -; GFX9-NEXT:    v_or_b32_sdwa v4, v7, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
>>> +; GFX9-NEXT:    v_ashrrev_i16_e32 v9, 12, v9
>>> +; GFX9-NEXT:    v_ashrrev_i16_e32 v13, 12, v13
>>> +; GFX9-NEXT:    v_mul_lo_u16_sdwa v11, v11, v15 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> +; GFX9-NEXT:    v_mul_lo_u16_e32 v10, v10, v14
>>> +; GFX9-NEXT:    v_lshrrev_b32_e32 v6, 8, v3
>>> +; GFX9-NEXT:    v_or_b32_sdwa v7, v10, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
>>> +; GFX9-NEXT:    v_mul_lo_u16_sdwa v9, v9, v13 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> +; GFX9-NEXT:    v_mul_lo_u16_e32 v8, v8, v12
>>> +; GFX9-NEXT:    v_or_b32_sdwa v8, v8, v9 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
>>> +; GFX9-NEXT:    v_and_b32_e32 v4, 0xffff, v7
>>> +; GFX9-NEXT:    v_or_b32_e32 v5, v4, v8
>>>  ; GFX9-NEXT:    s_waitcnt vmcnt(0)
>>>  ; GFX9-NEXT:    v_add_u32_e32 v2, v3, v2
>>> -; GFX9-NEXT:    v_add_u32_e32 v2, v2, v5
>>> -; GFX9-NEXT:    v_add_u32_sdwa v2, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:BYTE_2
>>> +; GFX9-NEXT:    v_add_u32_e32 v2, v2, v6
>>> +; GFX9-NEXT:    v_add_u32_sdwa v2, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
>>>  ; GFX9-NEXT:    v_add_u32_sdwa v2, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_3
>>>  ; GFX9-NEXT:    v_add_u32_e32 v2, v2, v4
>>> -; GFX9-NEXT:    v_lshrrev_b32_e32 v3, 8, v4
>>> +; GFX9-NEXT:    v_lshrrev_b32_e32 v3, 8, v5
>>>  ; GFX9-NEXT:    v_add_u32_e32 v2, v2, v3
>>> -; GFX9-NEXT:    v_add_u32_sdwa v2, v2, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
>>> -; GFX9-NEXT:    v_add_u32_sdwa v2, v2, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_3
>>> +; GFX9-NEXT:    v_add_u32_sdwa v2, v2, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
>>> +; GFX9-NEXT:    v_add_u32_sdwa v2, v2, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_3
>>>  ; GFX9-NEXT:    global_store_byte v[0:1], v2, off
>>>  ; GFX9-NEXT:    s_endpgm
>>>  ;
>>> @@ -2238,55 +2219,56 @@ define amdgpu_kernel void @idot8_acc8_ve
>>>  ; GFX9-DL-NEXT:    v_ashrrev_i16_e32 v9, 12, v9
>>>  ; GFX9-DL-NEXT:    v_ashrrev_i16_e32 v6, 12, v6
>>>  ; GFX9-DL-NEXT:    v_ashrrev_i16_e32 v10, 12, v10
>>> +; GFX9-DL-NEXT:    s_lshr_b32 s0, s2, 20
>>> +; GFX9-DL-NEXT:    s_lshr_b32 s5, s4, 20
>>> +; GFX9-DL-NEXT:    s_lshr_b32 s6, s4, 16
>>> +; GFX9-DL-NEXT:    s_lshr_b32 s1, s2, 16
>>>  ; GFX9-DL-NEXT:    v_mul_lo_u16_e32 v6, v6, v10
>>>  ; GFX9-DL-NEXT:    v_mul_lo_u16_sdwa v5, v5, v9 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>>  ; GFX9-DL-NEXT:    v_mul_lo_u16_sdwa v4, v4, v8 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>>  ; GFX9-DL-NEXT:    v_mul_lo_u16_e32 v3, v3, v7
>>> -; GFX9-DL-NEXT:    s_lshr_b32 s0, s2, 20
>>> -; GFX9-DL-NEXT:    s_lshr_b32 s1, s2, 16
>>> -; GFX9-DL-NEXT:    s_lshr_b32 s5, s2, 28
>>> -; GFX9-DL-NEXT:    s_lshr_b32 s2, s2, 24
>>> -; GFX9-DL-NEXT:    s_lshr_b32 s6, s4, 20
>>> -; GFX9-DL-NEXT:    s_lshr_b32 s7, s4, 16
>>> +; GFX9-DL-NEXT:    s_lshr_b32 s7, s2, 28
>>>  ; GFX9-DL-NEXT:    s_lshr_b32 s8, s4, 28
>>> +; GFX9-DL-NEXT:    v_lshlrev_b16_e64 v10, 12, s1
>>> +; GFX9-DL-NEXT:    v_lshlrev_b16_e64 v11, 12, s0
>>> +; GFX9-DL-NEXT:    v_lshlrev_b16_e64 v14, 12, s6
>>> +; GFX9-DL-NEXT:    v_lshlrev_b16_e64 v15, 12, s5
>>> +; GFX9-DL-NEXT:    s_lshr_b32 s2, s2, 24
>>>  ; GFX9-DL-NEXT:    s_lshr_b32 s4, s4, 24
>>>  ; GFX9-DL-NEXT:    v_or_b32_sdwa v5, v6, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
>>>  ; GFX9-DL-NEXT:    v_or_b32_sdwa v3, v3, v4 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
>>> +; GFX9-DL-NEXT:    v_lshlrev_b16_e64 v8, 12, s2
>>> +; GFX9-DL-NEXT:    v_lshlrev_b16_e64 v9, 12, s7
>>> +; GFX9-DL-NEXT:    v_lshlrev_b16_e64 v12, 12, s4
>>> +; GFX9-DL-NEXT:    v_lshlrev_b16_e64 v13, 12, s8
>>>  ; GFX9-DL-NEXT:    v_or_b32_sdwa v3, v5, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
>>> -; GFX9-DL-NEXT:    v_lshlrev_b16_e64 v9, 12, s2
>>> -; GFX9-DL-NEXT:    v_lshlrev_b16_e64 v10, 12, s5
>>> -; GFX9-DL-NEXT:    v_lshlrev_b16_e64 v11, 12, s1
>>> -; GFX9-DL-NEXT:    v_lshlrev_b16_e64 v12, 12, s0
>>> -; GFX9-DL-NEXT:    v_lshlrev_b16_e64 v13, 12, s4
>>> -; GFX9-DL-NEXT:    v_lshlrev_b16_e64 v14, 12, s8
>>> -; GFX9-DL-NEXT:    v_lshlrev_b16_e64 v15, 12, s7
>>> -; GFX9-DL-NEXT:    v_lshlrev_b16_e64 v16, 12, s6
>>> -; GFX9-DL-NEXT:    v_ashrrev_i16_e32 v9, 12, v9
>>> -; GFX9-DL-NEXT:    v_ashrrev_i16_e32 v13, 12, v13
>>>  ; GFX9-DL-NEXT:    v_ashrrev_i16_e32 v10, 12, v10
>>>  ; GFX9-DL-NEXT:    v_ashrrev_i16_e32 v14, 12, v14
>>>  ; GFX9-DL-NEXT:    v_ashrrev_i16_e32 v11, 12, v11
>>>  ; GFX9-DL-NEXT:    v_ashrrev_i16_e32 v15, 12, v15
>>> +; GFX9-DL-NEXT:    v_ashrrev_i16_e32 v8, 12, v8
>>>  ; GFX9-DL-NEXT:    v_ashrrev_i16_e32 v12, 12, v12
>>> -; GFX9-DL-NEXT:    v_ashrrev_i16_e32 v16, 12, v16
>>> -; GFX9-DL-NEXT:    v_lshrrev_b32_e32 v5, 8, v3
>>> -; GFX9-DL-NEXT:    v_mul_lo_u16_sdwa v12, v12, v16 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> -; GFX9-DL-NEXT:    v_mul_lo_u16_e32 v11, v11, v15
>>> -; GFX9-DL-NEXT:    v_mul_lo_u16_sdwa v10, v10, v14 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> -; GFX9-DL-NEXT:    v_mul_lo_u16_e32 v9, v9, v13
>>> -; GFX9-DL-NEXT:    v_or_b32_sdwa v7, v11, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
>>> -; GFX9-DL-NEXT:    v_or_b32_sdwa v8, v9, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
>>> -; GFX9-DL-NEXT:    v_or_b32_sdwa v4, v7, v8 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
>>> +; GFX9-DL-NEXT:    v_ashrrev_i16_e32 v9, 12, v9
>>> +; GFX9-DL-NEXT:    v_ashrrev_i16_e32 v13, 12, v13
>>> +; GFX9-DL-NEXT:    v_mul_lo_u16_sdwa v11, v11, v15 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> +; GFX9-DL-NEXT:    v_mul_lo_u16_e32 v10, v10, v14
>>> +; GFX9-DL-NEXT:    v_lshrrev_b32_e32 v6, 8, v3
>>> +; GFX9-DL-NEXT:    v_or_b32_sdwa v7, v10, v11 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
>>> +; GFX9-DL-NEXT:    v_mul_lo_u16_sdwa v9, v9, v13 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> +; GFX9-DL-NEXT:    v_mul_lo_u16_e32 v8, v8, v12
>>> +; GFX9-DL-NEXT:    v_or_b32_sdwa v8, v8, v9 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:DWORD
>>> +; GFX9-DL-NEXT:    v_and_b32_e32 v4, 0xffff, v7
>>> +; GFX9-DL-NEXT:    v_or_b32_e32 v5, v4, v8
>>>  ; GFX9-DL-NEXT:    s_waitcnt vmcnt(0)
>>>  ; GFX9-DL-NEXT:    v_add_u32_e32 v2, v3, v2
>>> -; GFX9-DL-NEXT:    v_add_u32_e32 v2, v2, v5
>>> -; GFX9-DL-NEXT:    v_add_u32_sdwa v2, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:BYTE_2
>>> +; GFX9-DL-NEXT:    v_add_u32_e32 v2, v2, v6
>>> +; GFX9-DL-NEXT:    v_add_u32_sdwa v2, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
>>>  ; GFX9-DL-NEXT:    v_add_u32_sdwa v2, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_3
>>>  ; GFX9-DL-NEXT:    v_add_u32_e32 v2, v2, v4
>>> -; GFX9-DL-NEXT:    v_lshrrev_b32_e32 v3, 8, v4
>>> +; GFX9-DL-NEXT:    v_lshrrev_b32_e32 v3, 8, v5
>>>  ; GFX9-DL-NEXT:    v_add_u32_e32 v2, v2, v3
>>> -; GFX9-DL-NEXT:    v_add_u32_sdwa v2, v2, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
>>> -; GFX9-DL-NEXT:    v_add_u32_sdwa v2, v2, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_3
>>> +; GFX9-DL-NEXT:    v_add_u32_sdwa v2, v2, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
>>> +; GFX9-DL-NEXT:    v_add_u32_sdwa v2, v2, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_3
>>>  ; GFX9-DL-NEXT:    global_store_byte v[0:1], v2, off
>>>  ; GFX9-DL-NEXT:    s_endpgm
>>>  ;
>>> @@ -2332,86 +2314,87 @@ define amdgpu_kernel void @idot8_acc8_ve
>>>  ; GFX10-DL-NEXT:    v_ashrrev_i16_e64 v9, 12, v9
>>>  ; GFX10-DL-NEXT:    v_ashrrev_i16_e64 v19, 12, v10
>>>  ; GFX10-DL-NEXT:    v_ashrrev_i16_e64 v11, 12, v11
>>> +; GFX10-DL-NEXT:    s_lshr_b32 s0, s4, 16
>>> +; GFX10-DL-NEXT:    s_lshr_b32 s1, s4, 20
>>> +; GFX10-DL-NEXT:    s_lshr_b32 s7, s5, 20
>>>  ; GFX10-DL-NEXT:    v_ashrrev_i16_e64 v15, 12, v6
>>>  ; GFX10-DL-NEXT:    v_ashrrev_i16_e64 v7, 12, v7
>>> -; GFX10-DL-NEXT:    s_lshr_b32 s1, s4, 20
>>> -; GFX10-DL-NEXT:    s_lshr_b32 s6, s4, 24
>>> -; GFX10-DL-NEXT:    s_lshr_b32 s0, s4, 16
>>> +; GFX10-DL-NEXT:    s_lshr_b32 s6, s5, 16
>>> +; GFX10-DL-NEXT:    s_lshr_b32 s8, s4, 24
>>> +; GFX10-DL-NEXT:    v_lshlrev_b16_e64 v12, 12, s1
>>> +; GFX10-DL-NEXT:    v_lshlrev_b16_e64 v13, 12, s0
>>>  ; GFX10-DL-NEXT:    s_lshr_b32 s4, s4, 28
>>> -; GFX10-DL-NEXT:    s_lshr_b32 s8, s5, 20
>>>  ; GFX10-DL-NEXT:    s_lshr_b32 s9, s5, 24
>>> -; GFX10-DL-NEXT:    s_lshr_b32 s7, s5, 16
>>>  ; GFX10-DL-NEXT:    s_lshr_b32 s5, s5, 28
>>> -; GFX10-DL-NEXT:    v_and_b32_e32 v23, v15, v2
>>> +; GFX10-DL-NEXT:    v_and_b32_e32 v27, v15, v2
>>> +; GFX10-DL-NEXT:    v_lshlrev_b16_e64 v15, 12, s6
>>>  ; GFX10-DL-NEXT:    v_and_b32_e32 v10, v19, v2
>>> +; GFX10-DL-NEXT:    v_lshlrev_b16_e64 v23, 12, s7
>>>  ; GFX10-DL-NEXT:    v_and_b32_e32 v5, v5, v2
>>>  ; GFX10-DL-NEXT:    v_and_b32_e32 v8, v8, v2
>>>  ; GFX10-DL-NEXT:    v_and_b32_e32 v4, v4, v2
>>>  ; GFX10-DL-NEXT:    v_and_b32_e32 v9, v9, v2
>>>  ; GFX10-DL-NEXT:    v_and_b32_e32 v22, v7, v2
>>>  ; GFX10-DL-NEXT:    v_and_b32_e32 v11, v11, v2
>>> +; GFX10-DL-NEXT:    v_and_b32_e32 v15, v15, v2
>>>  ; GFX10-DL-NEXT:    v_mul_lo_u16_e64 v5, v5, v8
>>> -; GFX10-DL-NEXT:    v_lshlrev_b16_e64 v12, 12, s4
>>>  ; GFX10-DL-NEXT:    v_mul_lo_u16_e64 v4, v4, v9
>>> -; GFX10-DL-NEXT:    v_lshlrev_b16_e64 v13, 12, s6
>>> +; GFX10-DL-NEXT:    v_lshlrev_b16_e64 v16, 12, s4
>>>  ; GFX10-DL-NEXT:    v_mul_lo_u16_e64 v7, v22, v11
>>> -; GFX10-DL-NEXT:    v_lshlrev_b16_e64 v15, 12, s0
>>> -; GFX10-DL-NEXT:    v_lshlrev_b16_e64 v31, 12, s8
>>> -; GFX10-DL-NEXT:    v_mul_lo_u16_e64 v23, v23, v10
>>> -; GFX10-DL-NEXT:    v_lshlrev_b16_e64 v27, 12, s1
>>> -; GFX10-DL-NEXT:    v_lshlrev_b16_e64 v16, 12, s5
>>> -; GFX10-DL-NEXT:    v_lshlrev_b16_e64 v17, 12, s9
>>> -; GFX10-DL-NEXT:    v_lshlrev_b16_e64 v19, 12, s7
>>> -; GFX10-DL-NEXT:    v_and_b32_e32 v8, v12, v2
>>> -; GFX10-DL-NEXT:    v_and_b32_e32 v9, v13, v2
>>> -; GFX10-DL-NEXT:    v_and_b32_e32 v11, v15, v2
>>> -; GFX10-DL-NEXT:    v_and_b32_e32 v12, v16, v2
>>> -; GFX10-DL-NEXT:    v_and_b32_e32 v13, v17, v2
>>> -; GFX10-DL-NEXT:    v_and_b32_e32 v15, v19, v2
>>> -; GFX10-DL-NEXT:    v_and_b32_e32 v10, v27, v2
>>> -; GFX10-DL-NEXT:    v_and_b32_e32 v14, v31, v2
>>> -; GFX10-DL-NEXT:    v_and_b32_sdwa v6, v23, v2 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> +; GFX10-DL-NEXT:    v_lshlrev_b16_e64 v17, 12, s8
>>> +; GFX10-DL-NEXT:    v_and_b32_e32 v13, v13, v2
>>> +; GFX10-DL-NEXT:    v_and_b32_e32 v14, v23, v2
>>> +; GFX10-DL-NEXT:    v_mul_lo_u16_e64 v23, v27, v10
>>> +; GFX10-DL-NEXT:    v_and_b32_e32 v12, v12, v2
>>> +; GFX10-DL-NEXT:    v_lshlrev_b16_e64 v31, 12, s5
>>> +; GFX10-DL-NEXT:    v_lshlrev_b16_e64 v19, 12, s9
>>> +; GFX10-DL-NEXT:    v_and_b32_e32 v8, v16, v2
>>> +; GFX10-DL-NEXT:    v_and_b32_e32 v9, v17, v2
>>> +; GFX10-DL-NEXT:    v_ashrrev_i16_e64 v27, 12, v12
>>> +; GFX10-DL-NEXT:    v_and_b32_e32 v12, v31, v2
>>> +; GFX10-DL-NEXT:    v_ashrrev_i16_e64 v15, 12, v15
>>> +; GFX10-DL-NEXT:    v_ashrrev_i16_e64 v11, 12, v13
>>> +; GFX10-DL-NEXT:    v_and_b32_e32 v13, v19, v2
>>> +; GFX10-DL-NEXT:    v_ashrrev_i16_e64 v19, 12, v14
>>>  ; GFX10-DL-NEXT:    v_and_b32_sdwa v7, v7, s2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
>>> +; GFX10-DL-NEXT:    v_and_b32_sdwa v6, v23, v2 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>>  ; GFX10-DL-NEXT:    v_and_b32_sdwa v4, v4, s2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
>>>  ; GFX10-DL-NEXT:    v_and_b32_sdwa v5, v5, v2 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> +; GFX10-DL-NEXT:    v_ashrrev_i16_e64 v9, 12, v9
>>>  ; GFX10-DL-NEXT:    v_ashrrev_i16_e64 v12, 12, v12
>>> -; GFX10-DL-NEXT:    v_ashrrev_i16_e64 v13, 12, v13
>>>  ; GFX10-DL-NEXT:    v_ashrrev_i16_e64 v8, 12, v8
>>> -; GFX10-DL-NEXT:    v_ashrrev_i16_e64 v9, 12, v9
>>> -; GFX10-DL-NEXT:    v_or_b32_sdwa v6, v7, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_0
>>> +; GFX10-DL-NEXT:    v_and_b32_e32 v10, v27, v2
>>>  ; GFX10-DL-NEXT:    v_or_b32_sdwa v4, v4, v5 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_0
>>> -; GFX10-DL-NEXT:    v_ashrrev_i16_e64 v19, 12, v10
>>> -; GFX10-DL-NEXT:    v_ashrrev_i16_e64 v15, 12, v15
>>> -; GFX10-DL-NEXT:    v_ashrrev_i16_e64 v11, 12, v11
>>> -; GFX10-DL-NEXT:    v_ashrrev_i16_e64 v23, 12, v14
>>> -; GFX10-DL-NEXT:    v_and_b32_e32 v5, v8, v2
>>> -; GFX10-DL-NEXT:    v_and_b32_e32 v7, v9, v2
>>> -; GFX10-DL-NEXT:    v_and_b32_e32 v13, v13, v2
>>> -; GFX10-DL-NEXT:    v_and_b32_e32 v9, v11, v2
>>> -; GFX10-DL-NEXT:    v_and_b32_e32 v12, v12, v2
>>> +; GFX10-DL-NEXT:    v_and_b32_e32 v5, v19, v2
>>> +; GFX10-DL-NEXT:    v_or_b32_sdwa v6, v7, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_0
>>> +; GFX10-DL-NEXT:    v_ashrrev_i16_e64 v13, 12, v13
>>> +; GFX10-DL-NEXT:    v_and_b32_e32 v14, v11, v2
>>> +; GFX10-DL-NEXT:    v_and_b32_e32 v15, v15, v2
>>> +; GFX10-DL-NEXT:    v_mul_lo_u16_e64 v5, v10, v5
>>>  ; GFX10-DL-NEXT:    v_or_b32_sdwa v4, v6, v4 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
>>> -; GFX10-DL-NEXT:    v_and_b32_e32 v10, v15, v2
>>> -; GFX10-DL-NEXT:    v_and_b32_e32 v8, v19, v2
>>> -; GFX10-DL-NEXT:    v_and_b32_e32 v11, v23, v2
>>> -; GFX10-DL-NEXT:    v_mul_lo_u16_e64 v7, v7, v13
>>> -; GFX10-DL-NEXT:    v_mul_lo_u16_e64 v5, v5, v12
>>> -; GFX10-DL-NEXT:    v_mul_lo_u16_e64 v6, v9, v10
>>> +; GFX10-DL-NEXT:    v_and_b32_e32 v23, v8, v2
>>> +; GFX10-DL-NEXT:    v_and_b32_e32 v9, v9, v2
>>> +; GFX10-DL-NEXT:    v_and_b32_e32 v7, v13, v2
>>> +; GFX10-DL-NEXT:    v_mul_lo_u16_e64 v11, v14, v15
>>> +; GFX10-DL-NEXT:    v_and_b32_e32 v12, v12, v2
>>> +; GFX10-DL-NEXT:    v_and_b32_sdwa v5, v5, v2 dst_sel:BYTE_1
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> +; GFX10-DL-NEXT:    v_mul_lo_u16_e64 v6, v9, v7
>>> +; GFX10-DL-NEXT:    v_and_b32_sdwa v8, v11, s2 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
>>> +; GFX10-DL-NEXT:    v_mul_lo_u16_e64 v7, v23, v12
>>>  ; GFX10-DL-NEXT:    v_lshrrev_b32_e32 v9, 8, v4
>>> -; GFX10-DL-NEXT:    v_mul_lo_u16_e64 v8, v8, v11
>>> -; GFX10-DL-NEXT:    v_and_b32_sdwa v7, v7, s2 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
>>> -; GFX10-DL-NEXT:    v_and_b32_sdwa v6, v6, s2 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
>>> -; GFX10-DL-NEXT:    v_and_b32_sdwa v8, v8, v2 dst_sel:BYTE_1
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> -; GFX10-DL-NEXT:    v_and_b32_sdwa v2, v5, v2 dst_sel:BYTE_1
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> -; GFX10-DL-NEXT:    v_or_b32_sdwa v5, v6, v8 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_0
>>> -; GFX10-DL-NEXT:    v_or_b32_sdwa v2, v7, v2 dst_sel:WORD_1
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_0
>>> -; GFX10-DL-NEXT:    v_or_b32_sdwa v2, v5, v2 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
>>> +; GFX10-DL-NEXT:    v_and_b32_sdwa v11, v6, s2 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
>>> +; GFX10-DL-NEXT:    v_or_b32_sdwa v5, v8, v5 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_0
>>> +; GFX10-DL-NEXT:    v_and_b32_sdwa v2, v7, v2 dst_sel:BYTE_1
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> +; GFX10-DL-NEXT:    v_and_b32_e32 v5, 0xffff, v5
>>> +; GFX10-DL-NEXT:    v_or_b32_sdwa v2, v11, v2 dst_sel:WORD_1
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_0
>>> +; GFX10-DL-NEXT:    v_or_b32_e32 v2, v5, v2
>>>  ; GFX10-DL-NEXT:    s_waitcnt vmcnt(0)
>>>  ; GFX10-DL-NEXT:    v_add_nc_u32_e32 v3, v4, v3
>>>  ; GFX10-DL-NEXT:    v_add_nc_u32_e32 v3, v3, v9
>>> -; GFX10-DL-NEXT:    v_add_nc_u32_sdwa v3, v3, v4 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:BYTE_2
>>> +; GFX10-DL-NEXT:    v_add_nc_u32_sdwa v3, v3, v4 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
>>>  ; GFX10-DL-NEXT:    v_add_nc_u32_sdwa v3, v3, v4 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_3
>>>  ; GFX10-DL-NEXT:    v_lshrrev_b32_e32 v4, 8, v2
>>> -; GFX10-DL-NEXT:    v_add_nc_u32_e32 v3, v3, v2
>>> +; GFX10-DL-NEXT:    v_add_nc_u32_e32 v3, v3, v5
>>>  ; GFX10-DL-NEXT:    v_add_nc_u32_e32 v3, v3, v4
>>>  ; GFX10-DL-NEXT:    v_add_nc_u32_sdwa v3, v3, v2 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
>>>  ; GFX10-DL-NEXT:    v_add_nc_u32_sdwa v2, v3, v2 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_3
>>>
>>> Modified: llvm/trunk/test/CodeGen/AMDGPU/idot8u.ll
>>> URL:
>>> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/AMDGPU/idot8u.ll?rev=366799&r1=366798&r2=366799&view=diff
>>>
>>> ==============================================================================
>>> --- llvm/trunk/test/CodeGen/AMDGPU/idot8u.ll (original)
>>> +++ llvm/trunk/test/CodeGen/AMDGPU/idot8u.ll Tue Jul 23 05:39:08 2019
>>> @@ -314,35 +314,34 @@ define amdgpu_kernel void @udot8_acc16(<
>>>  ; GFX8-NEXT:    s_and_b32 s1, s4, 15
>>>  ; GFX8-NEXT:    v_mov_b32_e32 v3, s1
>>>  ; GFX8-NEXT:    s_bfe_u32 s5, s4, 0x40004
>>> +; GFX8-NEXT:    s_bfe_u32 s6, s4, 0x40008
>>>  ; GFX8-NEXT:    v_mov_b32_e32 v4, s5
>>>  ; GFX8-NEXT:    s_bfe_u32 s1, s2, 0x40004
>>> -; GFX8-NEXT:    s_bfe_u32 s5, s4, 0x40008
>>> +; GFX8-NEXT:    s_bfe_u32 s7, s4, 0x4000c
>>> +; GFX8-NEXT:    v_mov_b32_e32 v5, s6
>>> +; GFX8-NEXT:    s_bfe_u32 s5, s2, 0x40008
>>>  ; GFX8-NEXT:    s_bfe_u32 s8, s4, 0x40010
>>> -; GFX8-NEXT:    s_bfe_u32 s10, s4, 0x40014
>>> -; GFX8-NEXT:    s_bfe_u32 s12, s4, 0x40018
>>> -; GFX8-NEXT:    s_lshr_b32 s14, s4, 28
>>> -; GFX8-NEXT:    s_bfe_u32 s4, s4, 0x4000c
>>> -; GFX8-NEXT:    s_bfe_u32 s6, s2, 0x40008
>>> -; GFX8-NEXT:    v_mov_b32_e32 v5, s5
>>> -; GFX8-NEXT:    s_bfe_u32 s7, s2, 0x4000c
>>> -; GFX8-NEXT:    v_mov_b32_e32 v6, s4
>>> -; GFX8-NEXT:    s_bfe_u32 s9, s2, 0x40010
>>> +; GFX8-NEXT:    v_mov_b32_e32 v6, s7
>>> +; GFX8-NEXT:    s_bfe_u32 s6, s2, 0x4000c
>>> +; GFX8-NEXT:    s_bfe_u32 s9, s4, 0x40014
>>>  ; GFX8-NEXT:    v_mov_b32_e32 v7, s8
>>> -; GFX8-NEXT:    s_bfe_u32 s11, s2, 0x40014
>>> -; GFX8-NEXT:    v_mov_b32_e32 v8, s10
>>> -; GFX8-NEXT:    s_bfe_u32 s13, s2, 0x40018
>>> -; GFX8-NEXT:    v_mov_b32_e32 v9, s12
>>> +; GFX8-NEXT:    s_bfe_u32 s7, s2, 0x40010
>>> +; GFX8-NEXT:    s_bfe_u32 s10, s4, 0x40018
>>> +; GFX8-NEXT:    v_mov_b32_e32 v8, s9
>>> +; GFX8-NEXT:    s_bfe_u32 s8, s2, 0x40014
>>> +; GFX8-NEXT:    s_bfe_u32 s9, s2, 0x40018
>>> +; GFX8-NEXT:    s_lshr_b32 s4, s4, 28
>>> +; GFX8-NEXT:    v_mov_b32_e32 v9, s10
>>>  ; GFX8-NEXT:    s_lshr_b32 s2, s2, 28
>>>  ; GFX8-NEXT:    s_waitcnt vmcnt(0)
>>>  ; GFX8-NEXT:    v_mad_u32_u24 v2, s0, v3, v2
>>>  ; GFX8-NEXT:    v_mad_u32_u24 v2, s1, v4, v2
>>> -; GFX8-NEXT:    v_and_b32_e32 v2, 0xffff, v2
>>> -; GFX8-NEXT:    v_mad_u32_u24 v2, s6, v5, v2
>>> -; GFX8-NEXT:    v_mad_u32_u24 v2, s7, v6, v2
>>> -; GFX8-NEXT:    v_mad_u32_u24 v2, s9, v7, v2
>>> -; GFX8-NEXT:    v_mad_u32_u24 v2, s11, v8, v2
>>> -; GFX8-NEXT:    v_mad_u32_u24 v2, s13, v9, v2
>>> -; GFX8-NEXT:    v_mov_b32_e32 v3, s14
>>> +; GFX8-NEXT:    v_mad_u32_u24 v2, s5, v5, v2
>>> +; GFX8-NEXT:    v_mad_u32_u24 v2, s6, v6, v2
>>> +; GFX8-NEXT:    v_mad_u32_u24 v2, s7, v7, v2
>>> +; GFX8-NEXT:    v_mad_u32_u24 v2, s8, v8, v2
>>> +; GFX8-NEXT:    v_mad_u32_u24 v2, s9, v9, v2
>>> +; GFX8-NEXT:    v_mov_b32_e32 v3, s4
>>>  ; GFX8-NEXT:    v_mad_u32_u24 v2, s2, v3, v2
>>>  ; GFX8-NEXT:    flat_store_short v[0:1], v2
>>>  ; GFX8-NEXT:    s_endpgm
>>> @@ -362,35 +361,34 @@ define amdgpu_kernel void @udot8_acc16(<
>>>  ; GFX9-NEXT:    s_and_b32 s1, s4, 15
>>>  ; GFX9-NEXT:    v_mov_b32_e32 v3, s1
>>>  ; GFX9-NEXT:    s_bfe_u32 s5, s4, 0x40004
>>> +; GFX9-NEXT:    s_bfe_u32 s6, s4, 0x40008
>>>  ; GFX9-NEXT:    v_mov_b32_e32 v4, s5
>>>  ; GFX9-NEXT:    s_bfe_u32 s1, s2, 0x40004
>>> -; GFX9-NEXT:    s_bfe_u32 s5, s4, 0x40008
>>> +; GFX9-NEXT:    s_bfe_u32 s7, s4, 0x4000c
>>> +; GFX9-NEXT:    v_mov_b32_e32 v5, s6
>>> +; GFX9-NEXT:    s_bfe_u32 s5, s2, 0x40008
>>>  ; GFX9-NEXT:    s_bfe_u32 s8, s4, 0x40010
>>> -; GFX9-NEXT:    s_bfe_u32 s10, s4, 0x40014
>>> -; GFX9-NEXT:    s_bfe_u32 s12, s4, 0x40018
>>> -; GFX9-NEXT:    s_lshr_b32 s14, s4, 28
>>> -; GFX9-NEXT:    s_bfe_u32 s4, s4, 0x4000c
>>> -; GFX9-NEXT:    s_bfe_u32 s6, s2, 0x40008
>>> -; GFX9-NEXT:    v_mov_b32_e32 v5, s5
>>> -; GFX9-NEXT:    s_bfe_u32 s7, s2, 0x4000c
>>> -; GFX9-NEXT:    v_mov_b32_e32 v6, s4
>>> -; GFX9-NEXT:    s_bfe_u32 s9, s2, 0x40010
>>> +; GFX9-NEXT:    v_mov_b32_e32 v6, s7
>>> +; GFX9-NEXT:    s_bfe_u32 s6, s2, 0x4000c
>>> +; GFX9-NEXT:    s_bfe_u32 s9, s4, 0x40014
>>>  ; GFX9-NEXT:    v_mov_b32_e32 v7, s8
>>> -; GFX9-NEXT:    s_bfe_u32 s11, s2, 0x40014
>>> -; GFX9-NEXT:    v_mov_b32_e32 v8, s10
>>> -; GFX9-NEXT:    s_bfe_u32 s13, s2, 0x40018
>>> -; GFX9-NEXT:    v_mov_b32_e32 v9, s12
>>> +; GFX9-NEXT:    s_bfe_u32 s7, s2, 0x40010
>>> +; GFX9-NEXT:    s_bfe_u32 s10, s4, 0x40018
>>> +; GFX9-NEXT:    v_mov_b32_e32 v8, s9
>>> +; GFX9-NEXT:    s_bfe_u32 s8, s2, 0x40014
>>> +; GFX9-NEXT:    s_bfe_u32 s9, s2, 0x40018
>>> +; GFX9-NEXT:    s_lshr_b32 s4, s4, 28
>>> +; GFX9-NEXT:    v_mov_b32_e32 v9, s10
>>>  ; GFX9-NEXT:    s_lshr_b32 s2, s2, 28
>>>  ; GFX9-NEXT:    s_waitcnt vmcnt(0)
>>>  ; GFX9-NEXT:    v_mad_u32_u24 v2, s0, v3, v2
>>>  ; GFX9-NEXT:    v_mad_u32_u24 v2, s1, v4, v2
>>> -; GFX9-NEXT:    v_and_b32_e32 v2, 0xffff, v2
>>> -; GFX9-NEXT:    v_mad_u32_u24 v2, s6, v5, v2
>>> -; GFX9-NEXT:    v_mad_u32_u24 v2, s7, v6, v2
>>> -; GFX9-NEXT:    v_mad_u32_u24 v2, s9, v7, v2
>>> -; GFX9-NEXT:    v_mad_u32_u24 v2, s11, v8, v2
>>> -; GFX9-NEXT:    v_mad_u32_u24 v2, s13, v9, v2
>>> -; GFX9-NEXT:    v_mov_b32_e32 v3, s14
>>> +; GFX9-NEXT:    v_mad_u32_u24 v2, s5, v5, v2
>>> +; GFX9-NEXT:    v_mad_u32_u24 v2, s6, v6, v2
>>> +; GFX9-NEXT:    v_mad_u32_u24 v2, s7, v7, v2
>>> +; GFX9-NEXT:    v_mad_u32_u24 v2, s8, v8, v2
>>> +; GFX9-NEXT:    v_mad_u32_u24 v2, s9, v9, v2
>>> +; GFX9-NEXT:    v_mov_b32_e32 v3, s4
>>>  ; GFX9-NEXT:    v_mad_u32_u24 v2, s2, v3, v2
>>>  ; GFX9-NEXT:    global_store_short v[0:1], v2, off
>>>  ; GFX9-NEXT:    s_endpgm
>>> @@ -406,81 +404,26 @@ define amdgpu_kernel void @udot8_acc16(<
>>>  ; GFX9-DL-NEXT:    v_mov_b32_e32 v1, s1
>>>  ; GFX9-DL-NEXT:    global_load_ushort v2, v[0:1], off
>>>  ; GFX9-DL-NEXT:    s_waitcnt lgkmcnt(0)
>>> -; GFX9-DL-NEXT:    s_and_b32 s0, s2, 15
>>> -; GFX9-DL-NEXT:    s_and_b32 s1, s4, 15
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v3, s1
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s5, s4, 0x40004
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v4, s5
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s1, s2, 0x40004
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s5, s4, 0x40008
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s8, s4, 0x40010
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s10, s4, 0x40014
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s12, s4, 0x40018
>>> -; GFX9-DL-NEXT:    s_lshr_b32 s14, s4, 28
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s4, s4, 0x4000c
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s6, s2, 0x40008
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v5, s5
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s7, s2, 0x4000c
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v6, s4
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s9, s2, 0x40010
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v7, s8
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s11, s2, 0x40014
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v8, s10
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s13, s2, 0x40018
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v9, s12
>>> -; GFX9-DL-NEXT:    s_lshr_b32 s2, s2, 28
>>> +; GFX9-DL-NEXT:    v_mov_b32_e32 v3, s4
>>>  ; GFX9-DL-NEXT:    s_waitcnt vmcnt(0)
>>> -; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s0, v3, v2
>>> -; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s1, v4, v2
>>> -; GFX9-DL-NEXT:    v_and_b32_e32 v2, 0xffff, v2
>>> -; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s6, v5, v2
>>> -; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s7, v6, v2
>>> -; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s9, v7, v2
>>> -; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s11, v8, v2
>>> -; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s13, v9, v2
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v3, s14
>>> -; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s2, v3, v2
>>> +; GFX9-DL-NEXT:    v_dot8_u32_u4 v2, s2, v3, v2
>>>  ; GFX9-DL-NEXT:    global_store_short v[0:1], v2, off
>>>  ; GFX9-DL-NEXT:    s_endpgm
>>>  ;
>>>  ; GFX10-DL-LABEL: udot8_acc16:
>>>  ; GFX10-DL:       ; %bb.0: ; %entry
>>> -; GFX10-DL-NEXT:    s_load_dwordx4 s[4:7], s[0:1], 0x24
>>> -; GFX10-DL-NEXT:    s_load_dwordx2 s[0:1], s[0:1], 0x34
>>> +; GFX10-DL-NEXT:    s_load_dwordx2 s[4:5], s[0:1], 0x34
>>>  ; GFX10-DL-NEXT:    ; implicit-def: $vcc_hi
>>>  ; GFX10-DL-NEXT:    s_waitcnt lgkmcnt(0)
>>> -; GFX10-DL-NEXT:    s_load_dword s2, s[4:5], 0x0
>>> -; GFX10-DL-NEXT:    s_load_dword s4, s[6:7], 0x0
>>> -; GFX10-DL-NEXT:    v_mov_b32_e32 v0, s0
>>> -; GFX10-DL-NEXT:    v_mov_b32_e32 v1, s1
>>> +; GFX10-DL-NEXT:    v_mov_b32_e32 v0, s4
>>> +; GFX10-DL-NEXT:    v_mov_b32_e32 v1, s5
>>> +; GFX10-DL-NEXT:    s_load_dwordx4 s[4:7], s[0:1], 0x24
>>>  ; GFX10-DL-NEXT:    global_load_ushort v2, v[0:1], off
>>>  ; GFX10-DL-NEXT:    s_waitcnt lgkmcnt(0)
>>> -; GFX10-DL-NEXT:    s_and_b32 s0, s2, 15
>>> -; GFX10-DL-NEXT:    s_and_b32 s1, s4, 15
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s5, s2, 0x40004
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s6, s4, 0x40004
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s7, s2, 0x40008
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s8, s4, 0x40008
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s9, s2, 0x4000c
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s10, s4, 0x4000c
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s11, s2, 0x40010
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s12, s4, 0x40010
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s13, s2, 0x40014
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s14, s4, 0x40014
>>> -; GFX10-DL-NEXT:    s_waitcnt vmcnt(0)
>>> -; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s0, s1, v2
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s0, s2, 0x40018
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s1, s4, 0x40018
>>> -; GFX10-DL-NEXT:    s_lshr_b32 s2, s2, 28
>>> -; GFX10-DL-NEXT:    s_lshr_b32 s4, s4, 28
>>> -; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s5, s6, v2
>>> -; GFX10-DL-NEXT:    v_and_b32_e32 v2, 0xffff, v2
>>> -; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s7, s8, v2
>>> -; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s9, s10, v2
>>> -; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s11, s12, v2
>>> -; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s13, s14, v2
>>> -; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s0, s1, v2
>>> -; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s2, s4, v2
>>> +; GFX10-DL-NEXT:    s_load_dword s0, s[4:5], 0x0
>>> +; GFX10-DL-NEXT:    s_load_dword s1, s[6:7], 0x0
>>> +; GFX10-DL-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
>>> +; GFX10-DL-NEXT:    v_dot8_u32_u4 v2, s0, s1, v2
>>>  ; GFX10-DL-NEXT:    global_store_short v[0:1], v2, off
>>>  ; GFX10-DL-NEXT:    s_endpgm
>>>                                         <8 x i4> addrspace(1)* %src2,
>>> @@ -616,35 +559,34 @@ define amdgpu_kernel void @udot8_acc8(<8
>>>  ; GFX8-NEXT:    s_and_b32 s1, s4, 15
>>>  ; GFX8-NEXT:    v_mov_b32_e32 v3, s1
>>>  ; GFX8-NEXT:    s_bfe_u32 s5, s4, 0x40004
>>> +; GFX8-NEXT:    s_bfe_u32 s6, s4, 0x40008
>>>  ; GFX8-NEXT:    v_mov_b32_e32 v4, s5
>>>  ; GFX8-NEXT:    s_bfe_u32 s1, s2, 0x40004
>>> -; GFX8-NEXT:    s_bfe_u32 s5, s4, 0x40008
>>> +; GFX8-NEXT:    s_bfe_u32 s7, s4, 0x4000c
>>> +; GFX8-NEXT:    v_mov_b32_e32 v5, s6
>>> +; GFX8-NEXT:    s_bfe_u32 s5, s2, 0x40008
>>>  ; GFX8-NEXT:    s_bfe_u32 s8, s4, 0x40010
>>> -; GFX8-NEXT:    s_bfe_u32 s10, s4, 0x40014
>>> -; GFX8-NEXT:    s_bfe_u32 s12, s4, 0x40018
>>> -; GFX8-NEXT:    s_lshr_b32 s14, s4, 28
>>> -; GFX8-NEXT:    s_bfe_u32 s4, s4, 0x4000c
>>> -; GFX8-NEXT:    s_bfe_u32 s6, s2, 0x40008
>>> -; GFX8-NEXT:    v_mov_b32_e32 v5, s5
>>> -; GFX8-NEXT:    s_bfe_u32 s7, s2, 0x4000c
>>> -; GFX8-NEXT:    v_mov_b32_e32 v6, s4
>>> -; GFX8-NEXT:    s_bfe_u32 s9, s2, 0x40010
>>> +; GFX8-NEXT:    v_mov_b32_e32 v6, s7
>>> +; GFX8-NEXT:    s_bfe_u32 s6, s2, 0x4000c
>>> +; GFX8-NEXT:    s_bfe_u32 s9, s4, 0x40014
>>>  ; GFX8-NEXT:    v_mov_b32_e32 v7, s8
>>> -; GFX8-NEXT:    s_bfe_u32 s11, s2, 0x40014
>>> -; GFX8-NEXT:    v_mov_b32_e32 v8, s10
>>> -; GFX8-NEXT:    s_bfe_u32 s13, s2, 0x40018
>>> -; GFX8-NEXT:    v_mov_b32_e32 v9, s12
>>> +; GFX8-NEXT:    s_bfe_u32 s7, s2, 0x40010
>>> +; GFX8-NEXT:    s_bfe_u32 s10, s4, 0x40018
>>> +; GFX8-NEXT:    v_mov_b32_e32 v8, s9
>>> +; GFX8-NEXT:    s_bfe_u32 s8, s2, 0x40014
>>> +; GFX8-NEXT:    s_bfe_u32 s9, s2, 0x40018
>>> +; GFX8-NEXT:    s_lshr_b32 s4, s4, 28
>>> +; GFX8-NEXT:    v_mov_b32_e32 v9, s10
>>>  ; GFX8-NEXT:    s_lshr_b32 s2, s2, 28
>>>  ; GFX8-NEXT:    s_waitcnt vmcnt(0)
>>>  ; GFX8-NEXT:    v_mad_u32_u24 v2, s0, v3, v2
>>>  ; GFX8-NEXT:    v_mad_u32_u24 v2, s1, v4, v2
>>> -; GFX8-NEXT:    v_and_b32_e32 v2, 0xff, v2
>>> -; GFX8-NEXT:    v_mad_u32_u24 v2, s6, v5, v2
>>> -; GFX8-NEXT:    v_mad_u32_u24 v2, s7, v6, v2
>>> -; GFX8-NEXT:    v_mad_u32_u24 v2, s9, v7, v2
>>> -; GFX8-NEXT:    v_mad_u32_u24 v2, s11, v8, v2
>>> -; GFX8-NEXT:    v_mad_u32_u24 v2, s13, v9, v2
>>> -; GFX8-NEXT:    v_mov_b32_e32 v3, s14
>>> +; GFX8-NEXT:    v_mad_u32_u24 v2, s5, v5, v2
>>> +; GFX8-NEXT:    v_mad_u32_u24 v2, s6, v6, v2
>>> +; GFX8-NEXT:    v_mad_u32_u24 v2, s7, v7, v2
>>> +; GFX8-NEXT:    v_mad_u32_u24 v2, s8, v8, v2
>>> +; GFX8-NEXT:    v_mad_u32_u24 v2, s9, v9, v2
>>> +; GFX8-NEXT:    v_mov_b32_e32 v3, s4
>>>  ; GFX8-NEXT:    v_mad_u32_u24 v2, s2, v3, v2
>>>  ; GFX8-NEXT:    flat_store_byte v[0:1], v2
>>>  ; GFX8-NEXT:    s_endpgm
>>> @@ -664,35 +606,34 @@ define amdgpu_kernel void @udot8_acc8(<8
>>>  ; GFX9-NEXT:    s_and_b32 s1, s4, 15
>>>  ; GFX9-NEXT:    v_mov_b32_e32 v3, s1
>>>  ; GFX9-NEXT:    s_bfe_u32 s5, s4, 0x40004
>>> +; GFX9-NEXT:    s_bfe_u32 s6, s4, 0x40008
>>>  ; GFX9-NEXT:    v_mov_b32_e32 v4, s5
>>>  ; GFX9-NEXT:    s_bfe_u32 s1, s2, 0x40004
>>> -; GFX9-NEXT:    s_bfe_u32 s5, s4, 0x40008
>>> +; GFX9-NEXT:    s_bfe_u32 s7, s4, 0x4000c
>>> +; GFX9-NEXT:    v_mov_b32_e32 v5, s6
>>> +; GFX9-NEXT:    s_bfe_u32 s5, s2, 0x40008
>>>  ; GFX9-NEXT:    s_bfe_u32 s8, s4, 0x40010
>>> -; GFX9-NEXT:    s_bfe_u32 s10, s4, 0x40014
>>> -; GFX9-NEXT:    s_bfe_u32 s12, s4, 0x40018
>>> -; GFX9-NEXT:    s_lshr_b32 s14, s4, 28
>>> -; GFX9-NEXT:    s_bfe_u32 s4, s4, 0x4000c
>>> -; GFX9-NEXT:    s_bfe_u32 s6, s2, 0x40008
>>> -; GFX9-NEXT:    v_mov_b32_e32 v5, s5
>>> -; GFX9-NEXT:    s_bfe_u32 s7, s2, 0x4000c
>>> -; GFX9-NEXT:    v_mov_b32_e32 v6, s4
>>> -; GFX9-NEXT:    s_bfe_u32 s9, s2, 0x40010
>>> +; GFX9-NEXT:    v_mov_b32_e32 v6, s7
>>> +; GFX9-NEXT:    s_bfe_u32 s6, s2, 0x4000c
>>> +; GFX9-NEXT:    s_bfe_u32 s9, s4, 0x40014
>>>  ; GFX9-NEXT:    v_mov_b32_e32 v7, s8
>>> -; GFX9-NEXT:    s_bfe_u32 s11, s2, 0x40014
>>> -; GFX9-NEXT:    v_mov_b32_e32 v8, s10
>>> -; GFX9-NEXT:    s_bfe_u32 s13, s2, 0x40018
>>> -; GFX9-NEXT:    v_mov_b32_e32 v9, s12
>>> +; GFX9-NEXT:    s_bfe_u32 s7, s2, 0x40010
>>> +; GFX9-NEXT:    s_bfe_u32 s10, s4, 0x40018
>>> +; GFX9-NEXT:    v_mov_b32_e32 v8, s9
>>> +; GFX9-NEXT:    s_bfe_u32 s8, s2, 0x40014
>>> +; GFX9-NEXT:    s_bfe_u32 s9, s2, 0x40018
>>> +; GFX9-NEXT:    s_lshr_b32 s4, s4, 28
>>> +; GFX9-NEXT:    v_mov_b32_e32 v9, s10
>>>  ; GFX9-NEXT:    s_lshr_b32 s2, s2, 28
>>>  ; GFX9-NEXT:    s_waitcnt vmcnt(0)
>>>  ; GFX9-NEXT:    v_mad_u32_u24 v2, s0, v3, v2
>>>  ; GFX9-NEXT:    v_mad_u32_u24 v2, s1, v4, v2
>>> -; GFX9-NEXT:    v_and_b32_e32 v2, 0xff, v2
>>> -; GFX9-NEXT:    v_mad_u32_u24 v2, s6, v5, v2
>>> -; GFX9-NEXT:    v_mad_u32_u24 v2, s7, v6, v2
>>> -; GFX9-NEXT:    v_mad_u32_u24 v2, s9, v7, v2
>>> -; GFX9-NEXT:    v_mad_u32_u24 v2, s11, v8, v2
>>> -; GFX9-NEXT:    v_mad_u32_u24 v2, s13, v9, v2
>>> -; GFX9-NEXT:    v_mov_b32_e32 v3, s14
>>> +; GFX9-NEXT:    v_mad_u32_u24 v2, s5, v5, v2
>>> +; GFX9-NEXT:    v_mad_u32_u24 v2, s6, v6, v2
>>> +; GFX9-NEXT:    v_mad_u32_u24 v2, s7, v7, v2
>>> +; GFX9-NEXT:    v_mad_u32_u24 v2, s8, v8, v2
>>> +; GFX9-NEXT:    v_mad_u32_u24 v2, s9, v9, v2
>>> +; GFX9-NEXT:    v_mov_b32_e32 v3, s4
>>>  ; GFX9-NEXT:    v_mad_u32_u24 v2, s2, v3, v2
>>>  ; GFX9-NEXT:    global_store_byte v[0:1], v2, off
>>>  ; GFX9-NEXT:    s_endpgm
>>> @@ -708,81 +649,26 @@ define amdgpu_kernel void @udot8_acc8(<8
>>>  ; GFX9-DL-NEXT:    v_mov_b32_e32 v1, s1
>>>  ; GFX9-DL-NEXT:    global_load_ubyte v2, v[0:1], off
>>>  ; GFX9-DL-NEXT:    s_waitcnt lgkmcnt(0)
>>> -; GFX9-DL-NEXT:    s_and_b32 s0, s2, 15
>>> -; GFX9-DL-NEXT:    s_and_b32 s1, s4, 15
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v3, s1
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s5, s4, 0x40004
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v4, s5
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s1, s2, 0x40004
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s5, s4, 0x40008
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s8, s4, 0x40010
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s10, s4, 0x40014
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s12, s4, 0x40018
>>> -; GFX9-DL-NEXT:    s_lshr_b32 s14, s4, 28
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s4, s4, 0x4000c
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s6, s2, 0x40008
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v5, s5
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s7, s2, 0x4000c
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v6, s4
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s9, s2, 0x40010
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v7, s8
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s11, s2, 0x40014
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v8, s10
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s13, s2, 0x40018
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v9, s12
>>> -; GFX9-DL-NEXT:    s_lshr_b32 s2, s2, 28
>>> +; GFX9-DL-NEXT:    v_mov_b32_e32 v3, s4
>>>  ; GFX9-DL-NEXT:    s_waitcnt vmcnt(0)
>>> -; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s0, v3, v2
>>> -; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s1, v4, v2
>>> -; GFX9-DL-NEXT:    v_and_b32_e32 v2, 0xff, v2
>>> -; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s6, v5, v2
>>> -; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s7, v6, v2
>>> -; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s9, v7, v2
>>> -; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s11, v8, v2
>>> -; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s13, v9, v2
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v3, s14
>>> -; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s2, v3, v2
>>> +; GFX9-DL-NEXT:    v_dot8_u32_u4 v2, s2, v3, v2
>>>  ; GFX9-DL-NEXT:    global_store_byte v[0:1], v2, off
>>>  ; GFX9-DL-NEXT:    s_endpgm
>>>  ;
>>>  ; GFX10-DL-LABEL: udot8_acc8:
>>>  ; GFX10-DL:       ; %bb.0: ; %entry
>>> -; GFX10-DL-NEXT:    s_load_dwordx4 s[4:7], s[0:1], 0x24
>>> -; GFX10-DL-NEXT:    s_load_dwordx2 s[0:1], s[0:1], 0x34
>>> +; GFX10-DL-NEXT:    s_load_dwordx2 s[4:5], s[0:1], 0x34
>>>  ; GFX10-DL-NEXT:    ; implicit-def: $vcc_hi
>>>  ; GFX10-DL-NEXT:    s_waitcnt lgkmcnt(0)
>>> -; GFX10-DL-NEXT:    s_load_dword s2, s[4:5], 0x0
>>> -; GFX10-DL-NEXT:    s_load_dword s4, s[6:7], 0x0
>>> -; GFX10-DL-NEXT:    v_mov_b32_e32 v0, s0
>>> -; GFX10-DL-NEXT:    v_mov_b32_e32 v1, s1
>>> +; GFX10-DL-NEXT:    v_mov_b32_e32 v0, s4
>>> +; GFX10-DL-NEXT:    v_mov_b32_e32 v1, s5
>>> +; GFX10-DL-NEXT:    s_load_dwordx4 s[4:7], s[0:1], 0x24
>>>  ; GFX10-DL-NEXT:    global_load_ubyte v2, v[0:1], off
>>>  ; GFX10-DL-NEXT:    s_waitcnt lgkmcnt(0)
>>> -; GFX10-DL-NEXT:    s_and_b32 s0, s2, 15
>>> -; GFX10-DL-NEXT:    s_and_b32 s1, s4, 15
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s5, s2, 0x40004
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s6, s4, 0x40004
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s7, s2, 0x40008
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s8, s4, 0x40008
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s9, s2, 0x4000c
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s10, s4, 0x4000c
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s11, s2, 0x40010
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s12, s4, 0x40010
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s13, s2, 0x40014
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s14, s4, 0x40014
>>> -; GFX10-DL-NEXT:    s_waitcnt vmcnt(0)
>>> -; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s0, s1, v2
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s0, s2, 0x40018
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s1, s4, 0x40018
>>> -; GFX10-DL-NEXT:    s_lshr_b32 s2, s2, 28
>>> -; GFX10-DL-NEXT:    s_lshr_b32 s4, s4, 28
>>> -; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s5, s6, v2
>>> -; GFX10-DL-NEXT:    v_and_b32_e32 v2, 0xff, v2
>>> -; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s7, s8, v2
>>> -; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s9, s10, v2
>>> -; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s11, s12, v2
>>> -; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s13, s14, v2
>>> -; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s0, s1, v2
>>> -; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s2, s4, v2
>>> +; GFX10-DL-NEXT:    s_load_dword s0, s[4:5], 0x0
>>> +; GFX10-DL-NEXT:    s_load_dword s1, s[6:7], 0x0
>>> +; GFX10-DL-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
>>> +; GFX10-DL-NEXT:    v_dot8_u32_u4 v2, s0, s1, v2
>>>  ; GFX10-DL-NEXT:    global_store_byte v[0:1], v2, off
>>>  ; GFX10-DL-NEXT:    s_endpgm
>>>                                        <8 x i4> addrspace(1)* %src2,
>>> @@ -920,35 +806,32 @@ define amdgpu_kernel void @udot8_acc4(<8
>>>  ; GFX8-NEXT:    v_mov_b32_e32 v3, s1
>>>  ; GFX8-NEXT:    s_bfe_u32 s5, s4, 0x40004
>>>  ; GFX8-NEXT:    s_bfe_u32 s6, s4, 0x40008
>>> -; GFX8-NEXT:    s_bfe_u32 s7, s4, 0x4000c
>>>  ; GFX8-NEXT:    v_mov_b32_e32 v4, s5
>>>  ; GFX8-NEXT:    s_bfe_u32 s1, s2, 0x40004
>>> +; GFX8-NEXT:    s_bfe_u32 s7, s4, 0x4000c
>>> +; GFX8-NEXT:    v_mov_b32_e32 v5, s6
>>>  ; GFX8-NEXT:    s_bfe_u32 s5, s2, 0x40008
>>> -; GFX8-NEXT:    s_bfe_u32 s8, s2, 0x4000c
>>> -; GFX8-NEXT:    v_mov_b32_e32 v5, s7
>>> -; GFX8-NEXT:    v_mov_b32_e32 v6, s6
>>> -; GFX8-NEXT:    v_mul_u32_u24_e32 v5, s8, v5
>>> -; GFX8-NEXT:    s_bfe_u32 s9, s4, 0x40010
>>> -; GFX8-NEXT:    v_and_b32_e32 v5, 15, v5
>>> -; GFX8-NEXT:    s_bfe_u32 s11, s4, 0x40014
>>> -; GFX8-NEXT:    s_bfe_u32 s10, s2, 0x40010
>>> -; GFX8-NEXT:    v_mov_b32_e32 v7, s9
>>> -; GFX8-NEXT:    s_bfe_u32 s13, s4, 0x40018
>>> -; GFX8-NEXT:    s_bfe_u32 s12, s2, 0x40014
>>> -; GFX8-NEXT:    v_mov_b32_e32 v8, s11
>>> -; GFX8-NEXT:    s_bfe_u32 s14, s2, 0x40018
>>> +; GFX8-NEXT:    s_bfe_u32 s8, s4, 0x40010
>>> +; GFX8-NEXT:    v_mov_b32_e32 v6, s7
>>> +; GFX8-NEXT:    s_bfe_u32 s6, s2, 0x4000c
>>> +; GFX8-NEXT:    s_bfe_u32 s9, s4, 0x40014
>>> +; GFX8-NEXT:    v_mov_b32_e32 v7, s8
>>> +; GFX8-NEXT:    s_bfe_u32 s7, s2, 0x40010
>>> +; GFX8-NEXT:    s_bfe_u32 s10, s4, 0x40018
>>> +; GFX8-NEXT:    v_mov_b32_e32 v8, s9
>>> +; GFX8-NEXT:    s_bfe_u32 s8, s2, 0x40014
>>> +; GFX8-NEXT:    s_bfe_u32 s9, s2, 0x40018
>>>  ; GFX8-NEXT:    s_lshr_b32 s4, s4, 28
>>> -; GFX8-NEXT:    v_mov_b32_e32 v9, s13
>>> +; GFX8-NEXT:    v_mov_b32_e32 v9, s10
>>>  ; GFX8-NEXT:    s_lshr_b32 s2, s2, 28
>>>  ; GFX8-NEXT:    s_waitcnt vmcnt(0)
>>>  ; GFX8-NEXT:    v_mad_u32_u24 v2, s0, v3, v2
>>>  ; GFX8-NEXT:    v_mad_u32_u24 v2, s1, v4, v2
>>> -; GFX8-NEXT:    v_mad_u32_u24 v2, s5, v6, v2
>>> -; GFX8-NEXT:    v_and_b32_e32 v2, 15, v2
>>> -; GFX8-NEXT:    v_add_u32_e32 v2, vcc, v5, v2
>>> -; GFX8-NEXT:    v_mad_u32_u24 v2, s10, v7, v2
>>> -; GFX8-NEXT:    v_mad_u32_u24 v2, s12, v8, v2
>>> -; GFX8-NEXT:    v_mad_u32_u24 v2, s14, v9, v2
>>> +; GFX8-NEXT:    v_mad_u32_u24 v2, s5, v5, v2
>>> +; GFX8-NEXT:    v_mad_u32_u24 v2, s6, v6, v2
>>> +; GFX8-NEXT:    v_mad_u32_u24 v2, s7, v7, v2
>>> +; GFX8-NEXT:    v_mad_u32_u24 v2, s8, v8, v2
>>> +; GFX8-NEXT:    v_mad_u32_u24 v2, s9, v9, v2
>>>  ; GFX8-NEXT:    v_mov_b32_e32 v3, s4
>>>  ; GFX8-NEXT:    v_mad_u32_u24 v2, s2, v3, v2
>>>  ; GFX8-NEXT:    v_and_b32_e32 v2, 15, v2
>>> @@ -971,35 +854,32 @@ define amdgpu_kernel void @udot8_acc4(<8
>>>  ; GFX9-NEXT:    v_mov_b32_e32 v3, s1
>>>  ; GFX9-NEXT:    s_bfe_u32 s5, s4, 0x40004
>>>  ; GFX9-NEXT:    s_bfe_u32 s6, s4, 0x40008
>>> -; GFX9-NEXT:    s_bfe_u32 s7, s4, 0x4000c
>>>  ; GFX9-NEXT:    v_mov_b32_e32 v4, s5
>>>  ; GFX9-NEXT:    s_bfe_u32 s1, s2, 0x40004
>>> +; GFX9-NEXT:    s_bfe_u32 s7, s4, 0x4000c
>>> +; GFX9-NEXT:    v_mov_b32_e32 v5, s6
>>>  ; GFX9-NEXT:    s_bfe_u32 s5, s2, 0x40008
>>> -; GFX9-NEXT:    s_bfe_u32 s8, s2, 0x4000c
>>> -; GFX9-NEXT:    v_mov_b32_e32 v5, s7
>>> -; GFX9-NEXT:    v_mov_b32_e32 v6, s6
>>> -; GFX9-NEXT:    v_mul_u32_u24_e32 v5, s8, v5
>>> -; GFX9-NEXT:    s_bfe_u32 s9, s4, 0x40010
>>> -; GFX9-NEXT:    v_and_b32_e32 v5, 15, v5
>>> -; GFX9-NEXT:    s_bfe_u32 s11, s4, 0x40014
>>> -; GFX9-NEXT:    s_bfe_u32 s10, s2, 0x40010
>>> -; GFX9-NEXT:    v_mov_b32_e32 v7, s9
>>> -; GFX9-NEXT:    s_bfe_u32 s13, s4, 0x40018
>>> -; GFX9-NEXT:    s_bfe_u32 s12, s2, 0x40014
>>> -; GFX9-NEXT:    v_mov_b32_e32 v8, s11
>>> -; GFX9-NEXT:    s_bfe_u32 s14, s2, 0x40018
>>> +; GFX9-NEXT:    s_bfe_u32 s8, s4, 0x40010
>>> +; GFX9-NEXT:    v_mov_b32_e32 v6, s7
>>> +; GFX9-NEXT:    s_bfe_u32 s6, s2, 0x4000c
>>> +; GFX9-NEXT:    s_bfe_u32 s9, s4, 0x40014
>>> +; GFX9-NEXT:    v_mov_b32_e32 v7, s8
>>> +; GFX9-NEXT:    s_bfe_u32 s7, s2, 0x40010
>>> +; GFX9-NEXT:    s_bfe_u32 s10, s4, 0x40018
>>> +; GFX9-NEXT:    v_mov_b32_e32 v8, s9
>>> +; GFX9-NEXT:    s_bfe_u32 s8, s2, 0x40014
>>> +; GFX9-NEXT:    s_bfe_u32 s9, s2, 0x40018
>>>  ; GFX9-NEXT:    s_lshr_b32 s4, s4, 28
>>> -; GFX9-NEXT:    v_mov_b32_e32 v9, s13
>>> +; GFX9-NEXT:    v_mov_b32_e32 v9, s10
>>>  ; GFX9-NEXT:    s_lshr_b32 s2, s2, 28
>>>  ; GFX9-NEXT:    s_waitcnt vmcnt(0)
>>>  ; GFX9-NEXT:    v_mad_u32_u24 v2, s0, v3, v2
>>>  ; GFX9-NEXT:    v_mad_u32_u24 v2, s1, v4, v2
>>> -; GFX9-NEXT:    v_mad_u32_u24 v2, s5, v6, v2
>>> -; GFX9-NEXT:    v_and_b32_e32 v2, 15, v2
>>> -; GFX9-NEXT:    v_add_u32_e32 v2, v2, v5
>>> -; GFX9-NEXT:    v_mad_u32_u24 v2, s10, v7, v2
>>> -; GFX9-NEXT:    v_mad_u32_u24 v2, s12, v8, v2
>>> -; GFX9-NEXT:    v_mad_u32_u24 v2, s14, v9, v2
>>> +; GFX9-NEXT:    v_mad_u32_u24 v2, s5, v5, v2
>>> +; GFX9-NEXT:    v_mad_u32_u24 v2, s6, v6, v2
>>> +; GFX9-NEXT:    v_mad_u32_u24 v2, s7, v7, v2
>>> +; GFX9-NEXT:    v_mad_u32_u24 v2, s8, v8, v2
>>> +; GFX9-NEXT:    v_mad_u32_u24 v2, s9, v9, v2
>>>  ; GFX9-NEXT:    v_mov_b32_e32 v3, s4
>>>  ; GFX9-NEXT:    v_mad_u32_u24 v2, s2, v3, v2
>>>  ; GFX9-NEXT:    v_and_b32_e32 v2, 15, v2
>>> @@ -1017,86 +897,27 @@ define amdgpu_kernel void @udot8_acc4(<8
>>>  ; GFX9-DL-NEXT:    v_mov_b32_e32 v1, s1
>>>  ; GFX9-DL-NEXT:    global_load_ubyte v2, v[0:1], off
>>>  ; GFX9-DL-NEXT:    s_waitcnt lgkmcnt(0)
>>> -; GFX9-DL-NEXT:    s_and_b32 s0, s2, 15
>>> -; GFX9-DL-NEXT:    s_and_b32 s1, s4, 15
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v3, s1
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s5, s4, 0x40004
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s6, s4, 0x40008
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s7, s4, 0x4000c
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v4, s5
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s1, s2, 0x40004
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s5, s2, 0x40008
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s8, s2, 0x4000c
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v5, s7
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v6, s6
>>> -; GFX9-DL-NEXT:    v_mul_u32_u24_e32 v5, s8, v5
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s9, s4, 0x40010
>>> -; GFX9-DL-NEXT:    v_and_b32_e32 v5, 15, v5
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s11, s4, 0x40014
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s10, s2, 0x40010
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v7, s9
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s13, s4, 0x40018
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s12, s2, 0x40014
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v8, s11
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s14, s2, 0x40018
>>> -; GFX9-DL-NEXT:    s_lshr_b32 s4, s4, 28
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v9, s13
>>> -; GFX9-DL-NEXT:    s_lshr_b32 s2, s2, 28
>>> -; GFX9-DL-NEXT:    s_waitcnt vmcnt(0)
>>> -; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s0, v3, v2
>>> -; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s1, v4, v2
>>> -; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s5, v6, v2
>>> -; GFX9-DL-NEXT:    v_and_b32_e32 v2, 15, v2
>>> -; GFX9-DL-NEXT:    v_add_u32_e32 v2, v2, v5
>>> -; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s10, v7, v2
>>> -; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s12, v8, v2
>>> -; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s14, v9, v2
>>>  ; GFX9-DL-NEXT:    v_mov_b32_e32 v3, s4
>>> -; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s2, v3, v2
>>> +; GFX9-DL-NEXT:    s_waitcnt vmcnt(0)
>>> +; GFX9-DL-NEXT:    v_dot8_u32_u4 v2, s2, v3, v2
>>>  ; GFX9-DL-NEXT:    v_and_b32_e32 v2, 15, v2
>>>  ; GFX9-DL-NEXT:    global_store_byte v[0:1], v2, off
>>>  ; GFX9-DL-NEXT:    s_endpgm
>>>  ;
>>>  ; GFX10-DL-LABEL: udot8_acc4:
>>>  ; GFX10-DL:       ; %bb.0: ; %entry
>>> -; GFX10-DL-NEXT:    s_load_dwordx4 s[4:7], s[0:1], 0x24
>>> -; GFX10-DL-NEXT:    s_load_dwordx2 s[0:1], s[0:1], 0x34
>>> +; GFX10-DL-NEXT:    s_load_dwordx2 s[4:5], s[0:1], 0x34
>>>  ; GFX10-DL-NEXT:    ; implicit-def: $vcc_hi
>>>  ; GFX10-DL-NEXT:    s_waitcnt lgkmcnt(0)
>>> -; GFX10-DL-NEXT:    s_load_dword s2, s[4:5], 0x0
>>> -; GFX10-DL-NEXT:    s_load_dword s4, s[6:7], 0x0
>>> -; GFX10-DL-NEXT:    v_mov_b32_e32 v0, s0
>>> -; GFX10-DL-NEXT:    v_mov_b32_e32 v1, s1
>>> +; GFX10-DL-NEXT:    v_mov_b32_e32 v0, s4
>>> +; GFX10-DL-NEXT:    v_mov_b32_e32 v1, s5
>>> +; GFX10-DL-NEXT:    s_load_dwordx4 s[4:7], s[0:1], 0x24
>>>  ; GFX10-DL-NEXT:    global_load_ubyte v2, v[0:1], off
>>>  ; GFX10-DL-NEXT:    s_waitcnt lgkmcnt(0)
>>> -; GFX10-DL-NEXT:    s_and_b32 s0, s2, 15
>>> -; GFX10-DL-NEXT:    s_and_b32 s1, s4, 15
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s5, s2, 0x40004
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s6, s4, 0x40004
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s7, s2, 0x40008
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s8, s2, 0x4000c
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s9, s4, 0x40008
>>> -; GFX10-DL-NEXT:    s_waitcnt vmcnt(0)
>>> -; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s0, s1, v2
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s0, s4, 0x4000c
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s1, s4, 0x40010
>>> -; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s5, s6, v2
>>> -; GFX10-DL-NEXT:    v_mul_u32_u24_e64 v3, s8, s0
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s0, s2, 0x40010
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s5, s2, 0x40014
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s6, s4, 0x40014
>>> -; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s7, s9, v2
>>> -; GFX10-DL-NEXT:    v_and_b32_e32 v3, 15, v3
>>> -; GFX10-DL-NEXT:    v_and_b32_e32 v2, 15, v2
>>> -; GFX10-DL-NEXT:    v_add_nc_u32_e32 v2, v2, v3
>>> -; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s0, s1, v2
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s0, s2, 0x40018
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s1, s4, 0x40018
>>> -; GFX10-DL-NEXT:    s_lshr_b32 s2, s2, 28
>>> -; GFX10-DL-NEXT:    s_lshr_b32 s4, s4, 28
>>> -; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s5, s6, v2
>>> -; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s0, s1, v2
>>> -; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s2, s4, v2
>>> +; GFX10-DL-NEXT:    s_load_dword s0, s[4:5], 0x0
>>> +; GFX10-DL-NEXT:    s_load_dword s1, s[6:7], 0x0
>>> +; GFX10-DL-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
>>> +; GFX10-DL-NEXT:    v_dot8_u32_u4 v2, s0, s1, v2
>>>  ; GFX10-DL-NEXT:    v_and_b32_e32 v2, 15, v2
>>>  ; GFX10-DL-NEXT:    global_store_byte v[0:1], v2, off
>>>  ; GFX10-DL-NEXT:    s_endpgm
>>> @@ -1219,35 +1040,32 @@ define amdgpu_kernel void @udot8_Commuta
>>>  ; GFX8-NEXT:    v_mov_b32_e32 v3, s1
>>>  ; GFX8-NEXT:    s_bfe_u32 s5, s4, 0x40004
>>>  ; GFX8-NEXT:    s_bfe_u32 s6, s4, 0x40008
>>> -; GFX8-NEXT:    s_bfe_u32 s7, s4, 0x4000c
>>>  ; GFX8-NEXT:    v_mov_b32_e32 v4, s5
>>>  ; GFX8-NEXT:    s_bfe_u32 s1, s2, 0x40004
>>> +; GFX8-NEXT:    s_bfe_u32 s7, s4, 0x4000c
>>> +; GFX8-NEXT:    v_mov_b32_e32 v5, s6
>>>  ; GFX8-NEXT:    s_bfe_u32 s5, s2, 0x40008
>>> -; GFX8-NEXT:    s_bfe_u32 s8, s2, 0x4000c
>>> -; GFX8-NEXT:    v_mov_b32_e32 v5, s7
>>> -; GFX8-NEXT:    v_mov_b32_e32 v6, s6
>>> -; GFX8-NEXT:    v_mul_u32_u24_e32 v5, s8, v5
>>> -; GFX8-NEXT:    s_bfe_u32 s9, s4, 0x40010
>>> -; GFX8-NEXT:    v_and_b32_e32 v5, 15, v5
>>> -; GFX8-NEXT:    s_bfe_u32 s11, s4, 0x40014
>>> -; GFX8-NEXT:    s_bfe_u32 s10, s2, 0x40010
>>> -; GFX8-NEXT:    v_mov_b32_e32 v7, s9
>>> -; GFX8-NEXT:    s_bfe_u32 s13, s4, 0x40018
>>> -; GFX8-NEXT:    s_bfe_u32 s12, s2, 0x40014
>>> -; GFX8-NEXT:    v_mov_b32_e32 v8, s11
>>> -; GFX8-NEXT:    s_bfe_u32 s14, s2, 0x40018
>>> +; GFX8-NEXT:    s_bfe_u32 s8, s4, 0x40010
>>> +; GFX8-NEXT:    v_mov_b32_e32 v6, s7
>>> +; GFX8-NEXT:    s_bfe_u32 s6, s2, 0x4000c
>>> +; GFX8-NEXT:    s_bfe_u32 s9, s4, 0x40014
>>> +; GFX8-NEXT:    v_mov_b32_e32 v7, s8
>>> +; GFX8-NEXT:    s_bfe_u32 s7, s2, 0x40010
>>> +; GFX8-NEXT:    s_bfe_u32 s10, s4, 0x40018
>>> +; GFX8-NEXT:    v_mov_b32_e32 v8, s9
>>> +; GFX8-NEXT:    s_bfe_u32 s8, s2, 0x40014
>>> +; GFX8-NEXT:    s_bfe_u32 s9, s2, 0x40018
>>>  ; GFX8-NEXT:    s_lshr_b32 s4, s4, 28
>>> -; GFX8-NEXT:    v_mov_b32_e32 v9, s13
>>> +; GFX8-NEXT:    v_mov_b32_e32 v9, s10
>>>  ; GFX8-NEXT:    s_lshr_b32 s2, s2, 28
>>>  ; GFX8-NEXT:    s_waitcnt vmcnt(0)
>>>  ; GFX8-NEXT:    v_mad_u32_u24 v2, s0, v3, v2
>>>  ; GFX8-NEXT:    v_mad_u32_u24 v2, s1, v4, v2
>>> -; GFX8-NEXT:    v_mad_u32_u24 v2, s5, v6, v2
>>> -; GFX8-NEXT:    v_and_b32_e32 v2, 15, v2
>>> -; GFX8-NEXT:    v_add_u32_e32 v2, vcc, v2, v5
>>> -; GFX8-NEXT:    v_mad_u32_u24 v2, s10, v7, v2
>>> -; GFX8-NEXT:    v_mad_u32_u24 v2, s12, v8, v2
>>> -; GFX8-NEXT:    v_mad_u32_u24 v2, s14, v9, v2
>>> +; GFX8-NEXT:    v_mad_u32_u24 v2, s5, v5, v2
>>> +; GFX8-NEXT:    v_mad_u32_u24 v2, s6, v6, v2
>>> +; GFX8-NEXT:    v_mad_u32_u24 v2, s7, v7, v2
>>> +; GFX8-NEXT:    v_mad_u32_u24 v2, s8, v8, v2
>>> +; GFX8-NEXT:    v_mad_u32_u24 v2, s9, v9, v2
>>>  ; GFX8-NEXT:    v_mov_b32_e32 v3, s4
>>>  ; GFX8-NEXT:    v_mad_u32_u24 v2, s2, v3, v2
>>>  ; GFX8-NEXT:    v_and_b32_e32 v2, 15, v2
>>> @@ -1270,35 +1088,32 @@ define amdgpu_kernel void @udot8_Commuta
>>>  ; GFX9-NEXT:    v_mov_b32_e32 v3, s1
>>>  ; GFX9-NEXT:    s_bfe_u32 s5, s4, 0x40004
>>>  ; GFX9-NEXT:    s_bfe_u32 s6, s4, 0x40008
>>> -; GFX9-NEXT:    s_bfe_u32 s7, s4, 0x4000c
>>>  ; GFX9-NEXT:    v_mov_b32_e32 v4, s5
>>>  ; GFX9-NEXT:    s_bfe_u32 s1, s2, 0x40004
>>> +; GFX9-NEXT:    s_bfe_u32 s7, s4, 0x4000c
>>> +; GFX9-NEXT:    v_mov_b32_e32 v5, s6
>>>  ; GFX9-NEXT:    s_bfe_u32 s5, s2, 0x40008
>>> -; GFX9-NEXT:    s_bfe_u32 s8, s2, 0x4000c
>>> -; GFX9-NEXT:    v_mov_b32_e32 v5, s7
>>> -; GFX9-NEXT:    v_mov_b32_e32 v6, s6
>>> -; GFX9-NEXT:    v_mul_u32_u24_e32 v5, s8, v5
>>> -; GFX9-NEXT:    s_bfe_u32 s9, s4, 0x40010
>>> -; GFX9-NEXT:    v_and_b32_e32 v5, 15, v5
>>> -; GFX9-NEXT:    s_bfe_u32 s11, s4, 0x40014
>>> -; GFX9-NEXT:    s_bfe_u32 s10, s2, 0x40010
>>> -; GFX9-NEXT:    v_mov_b32_e32 v7, s9
>>> -; GFX9-NEXT:    s_bfe_u32 s13, s4, 0x40018
>>> -; GFX9-NEXT:    s_bfe_u32 s12, s2, 0x40014
>>> -; GFX9-NEXT:    v_mov_b32_e32 v8, s11
>>> -; GFX9-NEXT:    s_bfe_u32 s14, s2, 0x40018
>>> +; GFX9-NEXT:    s_bfe_u32 s8, s4, 0x40010
>>> +; GFX9-NEXT:    v_mov_b32_e32 v6, s7
>>> +; GFX9-NEXT:    s_bfe_u32 s6, s2, 0x4000c
>>> +; GFX9-NEXT:    s_bfe_u32 s9, s4, 0x40014
>>> +; GFX9-NEXT:    v_mov_b32_e32 v7, s8
>>> +; GFX9-NEXT:    s_bfe_u32 s7, s2, 0x40010
>>> +; GFX9-NEXT:    s_bfe_u32 s10, s4, 0x40018
>>> +; GFX9-NEXT:    v_mov_b32_e32 v8, s9
>>> +; GFX9-NEXT:    s_bfe_u32 s8, s2, 0x40014
>>> +; GFX9-NEXT:    s_bfe_u32 s9, s2, 0x40018
>>>  ; GFX9-NEXT:    s_lshr_b32 s4, s4, 28
>>> -; GFX9-NEXT:    v_mov_b32_e32 v9, s13
>>> +; GFX9-NEXT:    v_mov_b32_e32 v9, s10
>>>  ; GFX9-NEXT:    s_lshr_b32 s2, s2, 28
>>>  ; GFX9-NEXT:    s_waitcnt vmcnt(0)
>>>  ; GFX9-NEXT:    v_mad_u32_u24 v2, s0, v3, v2
>>>  ; GFX9-NEXT:    v_mad_u32_u24 v2, s1, v4, v2
>>> -; GFX9-NEXT:    v_mad_u32_u24 v2, s5, v6, v2
>>> -; GFX9-NEXT:    v_and_b32_e32 v2, 15, v2
>>> -; GFX9-NEXT:    v_add_u32_e32 v2, v5, v2
>>> -; GFX9-NEXT:    v_mad_u32_u24 v2, s10, v7, v2
>>> -; GFX9-NEXT:    v_mad_u32_u24 v2, s12, v8, v2
>>> -; GFX9-NEXT:    v_mad_u32_u24 v2, s14, v9, v2
>>> +; GFX9-NEXT:    v_mad_u32_u24 v2, s5, v5, v2
>>> +; GFX9-NEXT:    v_mad_u32_u24 v2, s6, v6, v2
>>> +; GFX9-NEXT:    v_mad_u32_u24 v2, s7, v7, v2
>>> +; GFX9-NEXT:    v_mad_u32_u24 v2, s8, v8, v2
>>> +; GFX9-NEXT:    v_mad_u32_u24 v2, s9, v9, v2
>>>  ; GFX9-NEXT:    v_mov_b32_e32 v3, s4
>>>  ; GFX9-NEXT:    v_mad_u32_u24 v2, s2, v3, v2
>>>  ; GFX9-NEXT:    v_and_b32_e32 v2, 15, v2
>>> @@ -1321,35 +1136,32 @@ define amdgpu_kernel void @udot8_Commuta
>>>  ; GFX9-DL-NEXT:    v_mov_b32_e32 v3, s1
>>>  ; GFX9-DL-NEXT:    s_bfe_u32 s5, s4, 0x40004
>>>  ; GFX9-DL-NEXT:    s_bfe_u32 s6, s4, 0x40008
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s7, s4, 0x4000c
>>>  ; GFX9-DL-NEXT:    v_mov_b32_e32 v4, s5
>>>  ; GFX9-DL-NEXT:    s_bfe_u32 s1, s2, 0x40004
>>> +; GFX9-DL-NEXT:    s_bfe_u32 s7, s4, 0x4000c
>>> +; GFX9-DL-NEXT:    v_mov_b32_e32 v5, s6
>>>  ; GFX9-DL-NEXT:    s_bfe_u32 s5, s2, 0x40008
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s8, s2, 0x4000c
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v5, s7
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v6, s6
>>> -; GFX9-DL-NEXT:    v_mul_u32_u24_e32 v5, s8, v5
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s9, s4, 0x40010
>>> -; GFX9-DL-NEXT:    v_and_b32_e32 v5, 15, v5
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s11, s4, 0x40014
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s10, s2, 0x40010
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v7, s9
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s13, s4, 0x40018
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s12, s2, 0x40014
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v8, s11
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s14, s2, 0x40018
>>> +; GFX9-DL-NEXT:    s_bfe_u32 s8, s4, 0x40010
>>> +; GFX9-DL-NEXT:    v_mov_b32_e32 v6, s7
>>> +; GFX9-DL-NEXT:    s_bfe_u32 s6, s2, 0x4000c
>>> +; GFX9-DL-NEXT:    s_bfe_u32 s9, s4, 0x40014
>>> +; GFX9-DL-NEXT:    v_mov_b32_e32 v7, s8
>>> +; GFX9-DL-NEXT:    s_bfe_u32 s7, s2, 0x40010
>>> +; GFX9-DL-NEXT:    s_bfe_u32 s10, s4, 0x40018
>>> +; GFX9-DL-NEXT:    v_mov_b32_e32 v8, s9
>>> +; GFX9-DL-NEXT:    s_bfe_u32 s8, s2, 0x40014
>>> +; GFX9-DL-NEXT:    s_bfe_u32 s9, s2, 0x40018
>>>  ; GFX9-DL-NEXT:    s_lshr_b32 s4, s4, 28
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v9, s13
>>> +; GFX9-DL-NEXT:    v_mov_b32_e32 v9, s10
>>>  ; GFX9-DL-NEXT:    s_lshr_b32 s2, s2, 28
>>>  ; GFX9-DL-NEXT:    s_waitcnt vmcnt(0)
>>>  ; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s0, v3, v2
>>>  ; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s1, v4, v2
>>> -; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s5, v6, v2
>>> -; GFX9-DL-NEXT:    v_and_b32_e32 v2, 15, v2
>>> -; GFX9-DL-NEXT:    v_add_u32_e32 v2, v5, v2
>>> -; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s10, v7, v2
>>> -; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s12, v8, v2
>>> -; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s14, v9, v2
>>> +; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s5, v5, v2
>>> +; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s6, v6, v2
>>> +; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s7, v7, v2
>>> +; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s8, v8, v2
>>> +; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s9, v9, v2
>>>  ; GFX9-DL-NEXT:    v_mov_b32_e32 v3, s4
>>>  ; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s2, v3, v2
>>>  ; GFX9-DL-NEXT:    v_and_b32_e32 v2, 15, v2
>>> @@ -1373,27 +1185,24 @@ define amdgpu_kernel void @udot8_Commuta
>>>  ; GFX10-DL-NEXT:    s_bfe_u32 s5, s2, 0x40004
>>>  ; GFX10-DL-NEXT:    s_bfe_u32 s6, s4, 0x40004
>>>  ; GFX10-DL-NEXT:    s_bfe_u32 s7, s2, 0x40008
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s8, s2, 0x4000c
>>> +; GFX10-DL-NEXT:    s_bfe_u32 s8, s4, 0x40008
>>> +; GFX10-DL-NEXT:    s_bfe_u32 s9, s2, 0x4000c
>>> +; GFX10-DL-NEXT:    s_bfe_u32 s10, s4, 0x4000c
>>> +; GFX10-DL-NEXT:    s_bfe_u32 s11, s2, 0x40010
>>> +; GFX10-DL-NEXT:    s_bfe_u32 s12, s4, 0x40010
>>> +; GFX10-DL-NEXT:    s_bfe_u32 s13, s2, 0x40014
>>> +; GFX10-DL-NEXT:    s_bfe_u32 s14, s4, 0x40014
>>>  ; GFX10-DL-NEXT:    s_waitcnt vmcnt(0)
>>>  ; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s0, s1, v2
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s1, s4, 0x4000c
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s0, s4, 0x40008
>>> -; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s5, s6, v2
>>> -; GFX10-DL-NEXT:    v_mul_u32_u24_e64 v3, s8, s1
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s1, s4, 0x40010
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s5, s2, 0x40014
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s6, s4, 0x40014
>>> -; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s7, s0, v2
>>> -; GFX10-DL-NEXT:    v_and_b32_e32 v3, 15, v3
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s0, s2, 0x40010
>>> -; GFX10-DL-NEXT:    v_and_b32_e32 v2, 15, v2
>>> -; GFX10-DL-NEXT:    v_add_nc_u32_e32 v2, v3, v2
>>> -; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s0, s1, v2
>>>  ; GFX10-DL-NEXT:    s_bfe_u32 s0, s2, 0x40018
>>>  ; GFX10-DL-NEXT:    s_bfe_u32 s1, s4, 0x40018
>>>  ; GFX10-DL-NEXT:    s_lshr_b32 s2, s2, 28
>>>  ; GFX10-DL-NEXT:    s_lshr_b32 s4, s4, 28
>>>  ; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s5, s6, v2
>>> +; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s7, s8, v2
>>> +; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s9, s10, v2
>>> +; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s11, s12, v2
>>> +; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s13, s14, v2
>>>  ; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s0, s1, v2
>>>  ; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s2, s4, v2
>>>  ; GFX10-DL-NEXT:    v_and_b32_e32 v2, 15, v2
>>> @@ -1987,53 +1796,43 @@ define amdgpu_kernel void @udot8_acc16_v
>>>  ; GFX7-NEXT:    s_bfe_u32 s20, s1, 0x4000c
>>>  ; GFX7-NEXT:    v_mov_b32_e32 v2, s20
>>>  ; GFX7-NEXT:    v_mov_b32_e32 v4, s18
>>> -; GFX7-NEXT:    s_bfe_u32 s14, s1, 0x40014
>>> -; GFX7-NEXT:    s_bfe_u32 s15, s1, 0x40010
>>> -; GFX7-NEXT:    s_lshr_b32 s16, s1, 28
>>> -; GFX7-NEXT:    s_bfe_u32 s17, s1, 0x40018
>>> +; GFX7-NEXT:    s_bfe_u32 s15, s1, 0x40018
>>> +; GFX7-NEXT:    s_bfe_u32 s16, s1, 0x40014
>>> +; GFX7-NEXT:    s_bfe_u32 s17, s1, 0x40010
>>>  ; GFX7-NEXT:    s_and_b32 s19, s1, 15
>>> +; GFX7-NEXT:    s_lshr_b32 s14, s1, 28
>>>  ; GFX7-NEXT:    s_bfe_u32 s1, s1, 0x40008
>>>  ; GFX7-NEXT:    v_mul_u32_u24_e32 v2, s13, v2
>>>  ; GFX7-NEXT:    v_mul_u32_u24_e32 v4, s11, v4
>>> -; GFX7-NEXT:    s_bfe_u32 s2, s0, 0x40014
>>> -; GFX7-NEXT:    s_bfe_u32 s8, s0, 0x40010
>>> -; GFX7-NEXT:    s_lshr_b32 s9, s0, 28
>>> -; GFX7-NEXT:    v_mov_b32_e32 v6, s16
>>> -; GFX7-NEXT:    s_bfe_u32 s10, s0, 0x40018
>>> +; GFX7-NEXT:    s_lshr_b32 s2, s0, 28
>>> +; GFX7-NEXT:    s_bfe_u32 s8, s0, 0x40018
>>> +; GFX7-NEXT:    s_bfe_u32 s9, s0, 0x40014
>>> +; GFX7-NEXT:    s_bfe_u32 s10, s0, 0x40010
>>>  ; GFX7-NEXT:    s_and_b32 s12, s0, 15
>>>  ; GFX7-NEXT:    v_mov_b32_e32 v3, s19
>>>  ; GFX7-NEXT:    s_bfe_u32 s0, s0, 0x40008
>>>  ; GFX7-NEXT:    v_mov_b32_e32 v1, s1
>>> -; GFX7-NEXT:    v_mov_b32_e32 v5, s17
>>> -; GFX7-NEXT:    v_mul_u32_u24_e32 v6, s9, v6
>>> -; GFX7-NEXT:    v_mul_u32_u24_e32 v1, s0, v1
>>> +; GFX7-NEXT:    v_mul_u32_u24_e32 v8, s0, v1
>>>  ; GFX7-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
>>>  ; GFX7-NEXT:    v_mul_u32_u24_e32 v3, s12, v3
>>>  ; GFX7-NEXT:    v_lshlrev_b32_e32 v4, 16, v4
>>> -; GFX7-NEXT:    v_or_b32_e32 v1, v1, v2
>>> -; GFX7-NEXT:    v_or_b32_e32 v2, v3, v4
>>> -; GFX7-NEXT:    v_mul_u32_u24_e32 v5, s10, v5
>>> -; GFX7-NEXT:    v_lshlrev_b32_e32 v6, 16, v6
>>> -; GFX7-NEXT:    v_mov_b32_e32 v8, s14
>>> -; GFX7-NEXT:    v_or_b32_e32 v3, v5, v6
>>> -; GFX7-NEXT:    v_alignbit_b32 v5, v1, v2, 16
>>> +; GFX7-NEXT:    v_or_b32_e32 v3, v3, v4
>>> +; GFX7-NEXT:    v_or_b32_e32 v2, v8, v2
>>> +; GFX7-NEXT:    v_alignbit_b32 v4, v2, v3, 16
>>> +; GFX7-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
>>> +; GFX7-NEXT:    v_mov_b32_e32 v5, s17
>>> +; GFX7-NEXT:    v_mov_b32_e32 v6, s16
>>>  ; GFX7-NEXT:    v_mov_b32_e32 v7, s15
>>> -; GFX7-NEXT:    v_mul_u32_u24_e32 v8, s2, v8
>>> -; GFX7-NEXT:    v_mul_u32_u24_e32 v7, s8, v7
>>> -; GFX7-NEXT:    v_lshlrev_b32_e32 v8, 16, v8
>>> -; GFX7-NEXT:    v_lshrrev_b32_e32 v6, 16, v1
>>> -; GFX7-NEXT:    v_or_b32_e32 v4, v7, v8
>>> -; GFX7-NEXT:    v_lshrrev_b32_e32 v7, 16, v4
>>> -; GFX7-NEXT:    v_lshrrev_b32_e32 v8, 16, v3
>>>  ; GFX7-NEXT:    s_waitcnt vmcnt(0)
>>> -; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v0, v2
>>> -; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v5, v0
>>> -; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v0, v1
>>> -; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v6, v0
>>> +; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v0, v3
>>>  ; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v4, v0
>>> -; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v7, v0
>>> -; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v3, v0
>>> -; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v8, v0
>>> +; GFX7-NEXT:    v_mad_u32_u24 v0, s0, v1, v0
>>> +; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v0, v2
>>> +; GFX7-NEXT:    v_mad_u32_u24 v0, s10, v5, v0
>>> +; GFX7-NEXT:    v_mad_u32_u24 v0, s9, v6, v0
>>> +; GFX7-NEXT:    v_mad_u32_u24 v0, s8, v7, v0
>>> +; GFX7-NEXT:    v_mov_b32_e32 v1, s14
>>> +; GFX7-NEXT:    v_mad_u32_u24 v0, s2, v1, v0
>>>  ; GFX7-NEXT:    buffer_store_short v0, off, s[4:7], 0
>>>  ; GFX7-NEXT:    s_endpgm
>>>  ;
>>> @@ -2052,35 +1851,34 @@ define amdgpu_kernel void @udot8_acc16_v
>>>  ; GFX8-NEXT:    s_and_b32 s1, s4, 15
>>>  ; GFX8-NEXT:    v_mov_b32_e32 v3, s1
>>>  ; GFX8-NEXT:    s_bfe_u32 s5, s4, 0x40004
>>> +; GFX8-NEXT:    s_bfe_u32 s6, s4, 0x40008
>>>  ; GFX8-NEXT:    v_mov_b32_e32 v4, s5
>>>  ; GFX8-NEXT:    s_bfe_u32 s1, s2, 0x40004
>>> -; GFX8-NEXT:    s_bfe_u32 s5, s4, 0x40008
>>> +; GFX8-NEXT:    s_bfe_u32 s7, s4, 0x4000c
>>> +; GFX8-NEXT:    v_mov_b32_e32 v5, s6
>>> +; GFX8-NEXT:    s_bfe_u32 s5, s2, 0x40008
>>>  ; GFX8-NEXT:    s_bfe_u32 s8, s4, 0x40010
>>> -; GFX8-NEXT:    s_bfe_u32 s10, s4, 0x40014
>>> -; GFX8-NEXT:    s_bfe_u32 s12, s4, 0x40018
>>> -; GFX8-NEXT:    s_lshr_b32 s14, s4, 28
>>> -; GFX8-NEXT:    s_bfe_u32 s4, s4, 0x4000c
>>> -; GFX8-NEXT:    s_bfe_u32 s6, s2, 0x40008
>>> -; GFX8-NEXT:    v_mov_b32_e32 v5, s5
>>> -; GFX8-NEXT:    s_bfe_u32 s7, s2, 0x4000c
>>> -; GFX8-NEXT:    v_mov_b32_e32 v6, s4
>>> -; GFX8-NEXT:    s_bfe_u32 s9, s2, 0x40010
>>> +; GFX8-NEXT:    v_mov_b32_e32 v6, s7
>>> +; GFX8-NEXT:    s_bfe_u32 s6, s2, 0x4000c
>>> +; GFX8-NEXT:    s_bfe_u32 s9, s4, 0x40014
>>>  ; GFX8-NEXT:    v_mov_b32_e32 v7, s8
>>> -; GFX8-NEXT:    s_bfe_u32 s11, s2, 0x40014
>>> -; GFX8-NEXT:    v_mov_b32_e32 v8, s10
>>> -; GFX8-NEXT:    s_bfe_u32 s13, s2, 0x40018
>>> -; GFX8-NEXT:    v_mov_b32_e32 v9, s12
>>> +; GFX8-NEXT:    s_bfe_u32 s7, s2, 0x40010
>>> +; GFX8-NEXT:    s_bfe_u32 s10, s4, 0x40018
>>> +; GFX8-NEXT:    v_mov_b32_e32 v8, s9
>>> +; GFX8-NEXT:    s_bfe_u32 s8, s2, 0x40014
>>> +; GFX8-NEXT:    s_bfe_u32 s9, s2, 0x40018
>>> +; GFX8-NEXT:    s_lshr_b32 s4, s4, 28
>>> +; GFX8-NEXT:    v_mov_b32_e32 v9, s10
>>>  ; GFX8-NEXT:    s_lshr_b32 s2, s2, 28
>>>  ; GFX8-NEXT:    s_waitcnt vmcnt(0)
>>>  ; GFX8-NEXT:    v_mad_u32_u24 v2, s0, v3, v2
>>>  ; GFX8-NEXT:    v_mad_u32_u24 v2, s1, v4, v2
>>> -; GFX8-NEXT:    v_and_b32_e32 v2, 0xffff, v2
>>> -; GFX8-NEXT:    v_mad_u32_u24 v2, s6, v5, v2
>>> -; GFX8-NEXT:    v_mad_u32_u24 v2, s7, v6, v2
>>> -; GFX8-NEXT:    v_mad_u32_u24 v2, s9, v7, v2
>>> -; GFX8-NEXT:    v_mad_u32_u24 v2, s11, v8, v2
>>> -; GFX8-NEXT:    v_mad_u32_u24 v2, s13, v9, v2
>>> -; GFX8-NEXT:    v_mov_b32_e32 v3, s14
>>> +; GFX8-NEXT:    v_mad_u32_u24 v2, s5, v5, v2
>>> +; GFX8-NEXT:    v_mad_u32_u24 v2, s6, v6, v2
>>> +; GFX8-NEXT:    v_mad_u32_u24 v2, s7, v7, v2
>>> +; GFX8-NEXT:    v_mad_u32_u24 v2, s8, v8, v2
>>> +; GFX8-NEXT:    v_mad_u32_u24 v2, s9, v9, v2
>>> +; GFX8-NEXT:    v_mov_b32_e32 v3, s4
>>>  ; GFX8-NEXT:    v_mad_u32_u24 v2, s2, v3, v2
>>>  ; GFX8-NEXT:    flat_store_short v[0:1], v2
>>>  ; GFX8-NEXT:    s_endpgm
>>> @@ -2131,7 +1929,7 @@ define amdgpu_kernel void @udot8_acc16_v
>>>  ; GFX9-NEXT:    s_waitcnt vmcnt(0)
>>>  ; GFX9-NEXT:    v_add_u32_e32 v2, v3, v2
>>>  ; GFX9-NEXT:    v_add_u32_sdwa v2, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
>>> -; GFX9-NEXT:    v_add_u32_sdwa v2, v2, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:BYTE_0
>>> +; GFX9-NEXT:    v_add_u32_sdwa v2, v2, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
>>>  ; GFX9-NEXT:    v_add_u32_sdwa v2, v2, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
>>>  ; GFX9-NEXT:    v_add_u32_e32 v2, v2, v5
>>>  ; GFX9-NEXT:    v_add_u32_sdwa v2, v2, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
>>> @@ -2186,7 +1984,7 @@ define amdgpu_kernel void @udot8_acc16_v
>>>  ; GFX9-DL-NEXT:    s_waitcnt vmcnt(0)
>>>  ; GFX9-DL-NEXT:    v_add_u32_e32 v2, v3, v2
>>>  ; GFX9-DL-NEXT:    v_add_u32_sdwa v2, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
>>> -; GFX9-DL-NEXT:    v_add_u32_sdwa v2, v2, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:BYTE_0
>>> +; GFX9-DL-NEXT:    v_add_u32_sdwa v2, v2, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
>>>  ; GFX9-DL-NEXT:    v_add_u32_sdwa v2, v2, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
>>>  ; GFX9-DL-NEXT:    v_add_u32_e32 v2, v2, v5
>>>  ; GFX9-DL-NEXT:    v_add_u32_sdwa v2, v2, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
>>> @@ -2237,7 +2035,7 @@ define amdgpu_kernel void @udot8_acc16_v
>>>  ; GFX10-DL-NEXT:    v_add_nc_u32_sdwa v2, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
>>>  ; GFX10-DL-NEXT:    v_pk_mul_lo_u16 v3, s1, s5
>>>  ; GFX10-DL-NEXT:    s_pack_ll_b32_b16 s1, s6, s4
>>> -; GFX10-DL-NEXT:    v_add_nc_u32_sdwa v2, v2, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:BYTE_0
>>> +; GFX10-DL-NEXT:    v_add_nc_u32_sdwa v2, v2, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
>>>  ; GFX10-DL-NEXT:    v_add_nc_u32_sdwa v2, v2, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
>>>  ; GFX10-DL-NEXT:    v_pk_mul_lo_u16 v4, s0, s1
>>>  ; GFX10-DL-NEXT:    v_add_nc_u32_e32 v2, v2, v3
>>> @@ -2293,64 +2091,64 @@ define amdgpu_kernel void @udot8_acc8_ve
>>>  ; GFX7-NEXT:    s_load_dword s1, s[10:11], 0x0
>>>  ; GFX7-NEXT:    s_waitcnt lgkmcnt(0)
>>>  ; GFX7-NEXT:    s_bfe_u32 s2, s0, 0x4000c
>>> -; GFX7-NEXT:    s_lshr_b32 s11, s0, 28
>>> +; GFX7-NEXT:    s_bfe_u32 s9, s0, 0x40004
>>>  ; GFX7-NEXT:    s_bfe_u32 s14, s1, 0x4000c
>>> +; GFX7-NEXT:    s_bfe_u32 s16, s1, 0x40004
>>>  ; GFX7-NEXT:    s_lshr_b32 s18, s1, 28
>>> -; GFX7-NEXT:    s_bfe_u32 s20, s1, 0x40014
>>> -; GFX7-NEXT:    v_mov_b32_e32 v4, s18
>>> +; GFX7-NEXT:    v_mov_b32_e32 v6, s16
>>>  ; GFX7-NEXT:    v_mov_b32_e32 v8, s14
>>>  ; GFX7-NEXT:    s_bfe_u32 s15, s1, 0x40008
>>> -; GFX7-NEXT:    s_bfe_u32 s16, s1, 0x40004
>>>  ; GFX7-NEXT:    s_and_b32 s17, s1, 15
>>>  ; GFX7-NEXT:    s_bfe_u32 s19, s1, 0x40018
>>> -; GFX7-NEXT:    s_bfe_u32 s1, s1, 0x40010
>>> -; GFX7-NEXT:    s_bfe_u32 s13, s0, 0x40014
>>> -; GFX7-NEXT:    v_mov_b32_e32 v2, s20
>>> -; GFX7-NEXT:    v_mul_u32_u24_e32 v2, s13, v2
>>> +; GFX7-NEXT:    s_bfe_u32 s20, s1, 0x40014
>>> +; GFX7-NEXT:    s_lshr_b32 s11, s0, 28
>>> +; GFX7-NEXT:    v_mov_b32_e32 v4, s18
>>>  ; GFX7-NEXT:    v_mul_u32_u24_e32 v4, s11, v4
>>> +; GFX7-NEXT:    v_mul_u32_u24_e32 v6, s9, v6
>>>  ; GFX7-NEXT:    v_mul_u32_u24_e32 v8, s2, v8
>>> +; GFX7-NEXT:    s_bfe_u32 s1, s1, 0x40010
>>>  ; GFX7-NEXT:    s_bfe_u32 s8, s0, 0x40008
>>>  ; GFX7-NEXT:    v_mov_b32_e32 v7, s15
>>> -; GFX7-NEXT:    s_bfe_u32 s9, s0, 0x40004
>>> -; GFX7-NEXT:    v_mov_b32_e32 v6, s16
>>>  ; GFX7-NEXT:    s_and_b32 s10, s0, 15
>>> +; GFX7-NEXT:    v_mov_b32_e32 v5, s17
>>>  ; GFX7-NEXT:    s_bfe_u32 s12, s0, 0x40018
>>>  ; GFX7-NEXT:    v_mov_b32_e32 v3, s19
>>> +; GFX7-NEXT:    s_bfe_u32 s13, s0, 0x40014
>>> +; GFX7-NEXT:    v_mov_b32_e32 v2, s20
>>> +; GFX7-NEXT:    v_mul_u32_u24_e32 v2, s13, v2
>>>  ; GFX7-NEXT:    s_bfe_u32 s0, s0, 0x40010
>>>  ; GFX7-NEXT:    v_mov_b32_e32 v1, s1
>>> -; GFX7-NEXT:    v_mov_b32_e32 v5, s17
>>> -; GFX7-NEXT:    v_mul_u32_u24_e32 v6, s9, v6
>>> -; GFX7-NEXT:    v_mul_u32_u24_e32 v1, s0, v1
>>> -; GFX7-NEXT:    v_lshlrev_b32_e32 v2, 8, v2
>>>  ; GFX7-NEXT:    v_mul_u32_u24_e32 v3, s12, v3
>>> -; GFX7-NEXT:    v_mul_u32_u24_e32 v7, s8, v7
>>>  ; GFX7-NEXT:    v_lshlrev_b32_e32 v4, 8, v4
>>> -; GFX7-NEXT:    v_lshlrev_b32_e32 v8, 8, v8
>>> -; GFX7-NEXT:    v_or_b32_e32 v1, v1, v2
>>> -; GFX7-NEXT:    v_or_b32_e32 v2, v3, v4
>>> -; GFX7-NEXT:    v_or_b32_e32 v4, v7, v8
>>>  ; GFX7-NEXT:    v_mul_u32_u24_e32 v5, s10, v5
>>> +; GFX7-NEXT:    v_mul_u32_u24_e32 v7, s8, v7
>>>  ; GFX7-NEXT:    v_lshlrev_b32_e32 v6, 8, v6
>>> -; GFX7-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
>>> -; GFX7-NEXT:    v_or_b32_e32 v3, v5, v6
>>> -; GFX7-NEXT:    v_lshlrev_b32_e32 v4, 16, v4
>>> -; GFX7-NEXT:    v_or_b32_e32 v1, v1, v2
>>> -; GFX7-NEXT:    v_or_b32_e32 v2, v3, v4
>>> -; GFX7-NEXT:    v_alignbit_b32 v3, v1, v2, 8
>>> -; GFX7-NEXT:    v_alignbit_b32 v4, v1, v2, 16
>>> -; GFX7-NEXT:    v_lshrrev_b32_e32 v5, 24, v2
>>> -; GFX7-NEXT:    v_lshrrev_b32_e32 v6, 8, v1
>>> -; GFX7-NEXT:    v_lshrrev_b32_e32 v7, 16, v1
>>> -; GFX7-NEXT:    v_lshrrev_b32_e32 v8, 24, v1
>>> +; GFX7-NEXT:    v_lshlrev_b32_e32 v8, 8, v8
>>> +; GFX7-NEXT:    v_or_b32_e32 v3, v3, v4
>>> +; GFX7-NEXT:    v_or_b32_e32 v4, v5, v6
>>> +; GFX7-NEXT:    v_or_b32_e32 v5, v7, v8
>>> +; GFX7-NEXT:    v_mul_u32_u24_e32 v9, s0, v1
>>> +; GFX7-NEXT:    v_lshlrev_b32_e32 v2, 8, v2
>>> +; GFX7-NEXT:    v_or_b32_e32 v2, v9, v2
>>> +; GFX7-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
>>> +; GFX7-NEXT:    v_lshlrev_b32_e32 v5, 16, v5
>>> +; GFX7-NEXT:    v_or_b32_e32 v2, v2, v3
>>> +; GFX7-NEXT:    v_or_b32_e32 v3, v4, v5
>>> +; GFX7-NEXT:    v_alignbit_b32 v4, v2, v3, 8
>>> +; GFX7-NEXT:    v_alignbit_b32 v5, v2, v3, 16
>>> +; GFX7-NEXT:    v_lshrrev_b32_e32 v6, 24, v3
>>> +; GFX7-NEXT:    v_lshrrev_b32_e32 v7, 8, v2
>>> +; GFX7-NEXT:    v_lshrrev_b32_e32 v8, 16, v2
>>> +; GFX7-NEXT:    v_lshrrev_b32_e32 v2, 24, v2
>>>  ; GFX7-NEXT:    s_waitcnt vmcnt(0)
>>> -; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v0, v2
>>> -; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v3, v0
>>> +; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v0, v3
>>>  ; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v4, v0
>>>  ; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v5, v0
>>> -; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v0, v1
>>>  ; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v6, v0
>>> -; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v7, v0
>>> -; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v8, v0
>>> +; GFX7-NEXT:    v_mad_u32_u24 v0, s0, v1, v0
>>> +; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v0, v7
>>> +; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v0, v8
>>> +; GFX7-NEXT:    v_add_i32_e32 v0, vcc, v0, v2
>>>  ; GFX7-NEXT:    buffer_store_byte v0, off, s[4:7], 0
>>>  ; GFX7-NEXT:    s_endpgm
>>>  ;
>>> @@ -2383,41 +2181,42 @@ define amdgpu_kernel void @udot8_acc8_ve
>>>  ; GFX8-NEXT:    v_mul_u32_u24_e32 v3, s10, v3
>>>  ; GFX8-NEXT:    v_mul_u32_u24_e32 v5, s9, v6
>>>  ; GFX8-NEXT:    v_mul_u32_u24_sdwa v6, v8, v7 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> +; GFX8-NEXT:    s_bfe_u32 s0, s2, 0x40014
>>> +; GFX8-NEXT:    s_bfe_u32 s1, s4, 0x40014
>>> +; GFX8-NEXT:    s_bfe_u32 s5, s4, 0x40010
>>> +; GFX8-NEXT:    s_lshr_b32 s7, s2, 28
>>> +; GFX8-NEXT:    s_lshr_b32 s8, s4, 28
>>>  ; GFX8-NEXT:    v_or_b32_e32 v5, v5, v6
>>>  ; GFX8-NEXT:    v_or_b32_sdwa v3, v3, v4 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>>  ; GFX8-NEXT:    v_or_b32_sdwa v3, v5, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
>>> -; GFX8-NEXT:    s_bfe_u32 s0, s2, 0x40014
>>> -; GFX8-NEXT:    s_lshr_b32 s1, s2, 28
>>> -; GFX8-NEXT:    s_bfe_u32 s5, s4, 0x40014
>>> -; GFX8-NEXT:    s_bfe_u32 s6, s4, 0x40010
>>> -; GFX8-NEXT:    s_lshr_b32 s7, s4, 28
>>> +; GFX8-NEXT:    s_bfe_u32 s6, s2, 0x40010
>>>  ; GFX8-NEXT:    s_bfe_u32 s4, s4, 0x40018
>>> -; GFX8-NEXT:    s_bfe_u32 s8, s2, 0x40010
>>> -; GFX8-NEXT:    s_bfe_u32 s2, s2, 0x40018
>>> -; GFX8-NEXT:    v_mov_b32_e32 v6, s4
>>> +; GFX8-NEXT:    v_mov_b32_e32 v6, s8
>>>  ; GFX8-NEXT:    v_mov_b32_e32 v7, s7
>>> -; GFX8-NEXT:    v_mov_b32_e32 v8, s1
>>> -; GFX8-NEXT:    v_mov_b32_e32 v9, s6
>>> -; GFX8-NEXT:    v_mov_b32_e32 v10, s5
>>> -; GFX8-NEXT:    v_mov_b32_e32 v11, s0
>>> +; GFX8-NEXT:    v_mov_b32_e32 v8, s5
>>> +; GFX8-NEXT:    v_mov_b32_e32 v9, s1
>>> +; GFX8-NEXT:    v_mov_b32_e32 v10, s0
>>> +; GFX8-NEXT:    v_mul_u32_u24_sdwa v6, v7, v6 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> +; GFX8-NEXT:    v_mul_u32_u24_e32 v7, s6, v8
>>> +; GFX8-NEXT:    v_mul_u32_u24_sdwa v8, v10, v9 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> +; GFX8-NEXT:    s_bfe_u32 s2, s2, 0x40018
>>> +; GFX8-NEXT:    v_mov_b32_e32 v4, s4
>>>  ; GFX8-NEXT:    v_lshrrev_b32_e32 v5, 8, v3
>>> -; GFX8-NEXT:    v_mul_u32_u24_sdwa v7, v8, v7 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> -; GFX8-NEXT:    v_mul_u32_u24_e32 v6, s2, v6
>>> -; GFX8-NEXT:    v_mul_u32_u24_e32 v8, s8, v9
>>> -; GFX8-NEXT:    v_mul_u32_u24_sdwa v9, v11, v10 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> -; GFX8-NEXT:    v_or_b32_e32 v8, v8, v9
>>> -; GFX8-NEXT:    v_or_b32_sdwa v6, v6, v7 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> -; GFX8-NEXT:    v_or_b32_sdwa v4, v8, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
>>> -; GFX8-NEXT:    v_lshrrev_b32_e32 v6, 8, v4
>>> +; GFX8-NEXT:    v_mul_u32_u24_e32 v4, s2, v4
>>> +; GFX8-NEXT:    v_or_b32_e32 v7, v7, v8
>>> +; GFX8-NEXT:    v_or_b32_sdwa v4, v4, v6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> +; GFX8-NEXT:    v_and_b32_e32 v6, 0xffff, v7
>>> +; GFX8-NEXT:    v_or_b32_e32 v4, v6, v4
>>> +; GFX8-NEXT:    v_lshrrev_b32_e32 v7, 8, v4
>>>  ; GFX8-NEXT:    s_waitcnt vmcnt(0)
>>>  ; GFX8-NEXT:    v_add_u32_e32 v2, vcc, v2, v3
>>>  ; GFX8-NEXT:    v_add_u32_e32 v2, vcc, v5, v2
>>> -; GFX8-NEXT:    v_add_u32_sdwa v2, vcc, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:BYTE_2
>>> +; GFX8-NEXT:    v_add_u32_sdwa v2, vcc, v3, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:DWORD
>>>  ; GFX8-NEXT:    v_add_u32_sdwa v2, vcc, v2, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_3
>>> -; GFX8-NEXT:    v_add_u32_e32 v2, vcc, v2, v4
>>>  ; GFX8-NEXT:    v_add_u32_e32 v2, vcc, v2, v6
>>> -; GFX8-NEXT:    v_add_u32_sdwa v2, vcc, v2, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
>>> -; GFX8-NEXT:    v_add_u32_sdwa v2, vcc, v2, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_3
>>> +; GFX8-NEXT:    v_add_u32_e32 v2, vcc, v7, v2
>>> +; GFX8-NEXT:    v_add_u32_sdwa v2, vcc, v4, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:DWORD
>>> +; GFX8-NEXT:    v_add_u32_sdwa v2, vcc, v4, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:BYTE_3 src1_sel:DWORD
>>>  ; GFX8-NEXT:    flat_store_byte v[0:1], v2
>>>  ; GFX8-NEXT:    s_endpgm
>>>  ;
>>> @@ -2448,35 +2247,36 @@ define amdgpu_kernel void @udot8_acc8_ve
>>>  ; GFX9-NEXT:    v_mul_lo_u16_sdwa v4, s8, v4 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>>  ; GFX9-NEXT:    v_mul_lo_u16_e32 v5, s9, v5
>>>  ; GFX9-NEXT:    v_mul_lo_u16_sdwa v6, s10, v6 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> +; GFX9-NEXT:    s_bfe_u32 s0, s4, 0x40010
>>> +; GFX9-NEXT:    s_bfe_u32 s1, s4, 0x40014
>>>  ; GFX9-NEXT:    v_or_b32_e32 v3, v3, v4
>>>  ; GFX9-NEXT:    v_or_b32_sdwa v4, v5, v6 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> -; GFX9-NEXT:    s_bfe_u32 s1, s4, 0x40014
>>> -; GFX9-NEXT:    s_bfe_u32 s5, s4, 0x40018
>>> -; GFX9-NEXT:    s_bfe_u32 s0, s4, 0x40010
>>> -; GFX9-NEXT:    s_lshr_b32 s4, s4, 28
>>> +; GFX9-NEXT:    s_bfe_u32 s7, s4, 0x40018
>>>  ; GFX9-NEXT:    v_or_b32_sdwa v3, v3, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
>>> -; GFX9-NEXT:    s_bfe_u32 s6, s2, 0x40010
>>> -; GFX9-NEXT:    v_mov_b32_e32 v4, s0
>>> -; GFX9-NEXT:    s_bfe_u32 s7, s2, 0x40014
>>> -; GFX9-NEXT:    v_mov_b32_e32 v5, s1
>>> -; GFX9-NEXT:    s_bfe_u32 s8, s2, 0x40018
>>> -; GFX9-NEXT:    v_mov_b32_e32 v6, s5
>>> -; GFX9-NEXT:    s_lshr_b32 s2, s2, 28
>>> -; GFX9-NEXT:    v_mov_b32_e32 v7, s4
>>> -; GFX9-NEXT:    v_mul_lo_u16_e32 v4, s6, v4
>>> -; GFX9-NEXT:    v_mul_lo_u16_sdwa v5, s7, v5 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> -; GFX9-NEXT:    v_mul_lo_u16_e32 v6, s8, v6
>>> -; GFX9-NEXT:    v_mul_lo_u16_sdwa v7, s2, v7 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> -; GFX9-NEXT:    v_or_b32_e32 v4, v4, v5
>>> -; GFX9-NEXT:    v_or_b32_sdwa v5, v6, v7 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> -; GFX9-NEXT:    v_or_b32_sdwa v4, v4, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
>>> -; GFX9-NEXT:    v_lshrrev_b32_e32 v5, 8, v3
>>> +; GFX9-NEXT:    s_lshr_b32 s4, s4, 28
>>> +; GFX9-NEXT:    v_mov_b32_e32 v7, s0
>>> +; GFX9-NEXT:    s_bfe_u32 s5, s2, 0x40010
>>> +; GFX9-NEXT:    v_mov_b32_e32 v8, s1
>>> +; GFX9-NEXT:    s_bfe_u32 s6, s2, 0x40014
>>> +; GFX9-NEXT:    s_bfe_u32 s0, s2, 0x40018
>>> +; GFX9-NEXT:    v_mov_b32_e32 v9, s7
>>> +; GFX9-NEXT:    s_lshr_b32 s1, s2, 28
>>> +; GFX9-NEXT:    v_mov_b32_e32 v10, s4
>>> +; GFX9-NEXT:    v_mul_lo_u16_e32 v7, s5, v7
>>> +; GFX9-NEXT:    v_mul_lo_u16_sdwa v8, s6, v8 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> +; GFX9-NEXT:    v_lshrrev_b32_e32 v6, 8, v3
>>> +; GFX9-NEXT:    v_or_b32_e32 v7, v7, v8
>>> +; GFX9-NEXT:    v_mul_lo_u16_e32 v9, s0, v9
>>> +; GFX9-NEXT:    v_mul_lo_u16_sdwa v10, s1, v10 dst_sel:BYTE_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> +; GFX9-NEXT:    v_or_b32_sdwa v8, v9, v10 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> +; GFX9-NEXT:    v_and_b32_e32 v5, 0xffff, v7
>>> +; GFX9-NEXT:    v_or_b32_e32 v4, v5, v8
>>>  ; GFX9-NEXT:    s_waitcnt vmcnt(0)
>>>  ; GFX9-NEXT:    v_add_u32_e32 v2, v3, v2
>>> -; GFX9-NEXT:    v_add_u32_e32 v2, v2, v5
>>> -; GFX9-NEXT:    v_add_u32_sdwa v2, v2, v3 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:BYTE_2
>>> +; GFX9-NEXT:    v_add_u32_e32 v2, v2, v6
>>> +; GFX9-NEXT:    v_add_u32_sdwa v2, v2, v3 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
>>>  ; GFX9-NEXT:    v_add_u32_sdwa v2, v2, v3 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_3
>>> -; GFX9-NEXT:    v_add_u32_e32 v2, v2, v4
>>> +; GFX9-NEXT:    v_add_u32_e32 v2, v2, v5
>>>  ; GFX9-NEXT:    v_lshrrev_b32_e32 v3, 8, v4
>>>  ; GFX9-NEXT:    v_add_u32_e32 v2, v2, v3
>>>  ; GFX9-NEXT:    v_add_u32_sdwa v2, v2, v4 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
>>> @@ -2511,35 +2311,36 @@ define amdgpu_kernel void @udot8_acc8_ve
>>>  ; GFX9-DL-NEXT:    v_mul_lo_u16_sdwa v4, s8, v4 dst_sel:BYTE_1
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>>  ; GFX9-DL-NEXT:    v_mul_lo_u16_e32 v5, s9, v5
>>>  ; GFX9-DL-NEXT:    v_mul_lo_u16_sdwa v6, s10, v6 dst_sel:BYTE_1
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> +; GFX9-DL-NEXT:    s_bfe_u32 s0, s4, 0x40010
>>> +; GFX9-DL-NEXT:    s_bfe_u32 s1, s4, 0x40014
>>>  ; GFX9-DL-NEXT:    v_or_b32_e32 v3, v3, v4
>>>  ; GFX9-DL-NEXT:    v_or_b32_sdwa v4, v5, v6 dst_sel:WORD_1
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s1, s4, 0x40014
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s5, s4, 0x40018
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s0, s4, 0x40010
>>> -; GFX9-DL-NEXT:    s_lshr_b32 s4, s4, 28
>>> +; GFX9-DL-NEXT:    s_bfe_u32 s7, s4, 0x40018
>>>  ; GFX9-DL-NEXT:    v_or_b32_sdwa v3, v3, v4 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s6, s2, 0x40010
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v4, s0
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s7, s2, 0x40014
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v5, s1
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s8, s2, 0x40018
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v6, s5
>>> -; GFX9-DL-NEXT:    s_lshr_b32 s2, s2, 28
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v7, s4
>>> -; GFX9-DL-NEXT:    v_mul_lo_u16_e32 v4, s6, v4
>>> -; GFX9-DL-NEXT:    v_mul_lo_u16_sdwa v5, s7, v5 dst_sel:BYTE_1
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> -; GFX9-DL-NEXT:    v_mul_lo_u16_e32 v6, s8, v6
>>> -; GFX9-DL-NEXT:    v_mul_lo_u16_sdwa v7, s2, v7 dst_sel:BYTE_1
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> -; GFX9-DL-NEXT:    v_or_b32_e32 v4, v4, v5
>>> -; GFX9-DL-NEXT:    v_or_b32_sdwa v5, v6, v7 dst_sel:WORD_1
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> -; GFX9-DL-NEXT:    v_or_b32_sdwa v4, v4, v5 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
>>> -; GFX9-DL-NEXT:    v_lshrrev_b32_e32 v5, 8, v3
>>> +; GFX9-DL-NEXT:    s_lshr_b32 s4, s4, 28
>>> +; GFX9-DL-NEXT:    v_mov_b32_e32 v7, s0
>>> +; GFX9-DL-NEXT:    s_bfe_u32 s5, s2, 0x40010
>>> +; GFX9-DL-NEXT:    v_mov_b32_e32 v8, s1
>>> +; GFX9-DL-NEXT:    s_bfe_u32 s6, s2, 0x40014
>>> +; GFX9-DL-NEXT:    s_bfe_u32 s0, s2, 0x40018
>>> +; GFX9-DL-NEXT:    v_mov_b32_e32 v9, s7
>>> +; GFX9-DL-NEXT:    s_lshr_b32 s1, s2, 28
>>> +; GFX9-DL-NEXT:    v_mov_b32_e32 v10, s4
>>> +; GFX9-DL-NEXT:    v_mul_lo_u16_e32 v7, s5, v7
>>> +; GFX9-DL-NEXT:    v_mul_lo_u16_sdwa v8, s6, v8 dst_sel:BYTE_1
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> +; GFX9-DL-NEXT:    v_lshrrev_b32_e32 v6, 8, v3
>>> +; GFX9-DL-NEXT:    v_or_b32_e32 v7, v7, v8
>>> +; GFX9-DL-NEXT:    v_mul_lo_u16_e32 v9, s0, v9
>>> +; GFX9-DL-NEXT:    v_mul_lo_u16_sdwa v10, s1, v10 dst_sel:BYTE_1
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> +; GFX9-DL-NEXT:    v_or_b32_sdwa v8, v9, v10 dst_sel:WORD_1
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> +; GFX9-DL-NEXT:    v_and_b32_e32 v5, 0xffff, v7
>>> +; GFX9-DL-NEXT:    v_or_b32_e32 v4, v5, v8
>>>  ; GFX9-DL-NEXT:    s_waitcnt vmcnt(0)
>>>  ; GFX9-DL-NEXT:    v_add_u32_e32 v2, v3, v2
>>> -; GFX9-DL-NEXT:    v_add_u32_e32 v2, v2, v5
>>> -; GFX9-DL-NEXT:    v_add_u32_sdwa v2, v2, v3 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:BYTE_2
>>> +; GFX9-DL-NEXT:    v_add_u32_e32 v2, v2, v6
>>> +; GFX9-DL-NEXT:    v_add_u32_sdwa v2, v2, v3 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
>>>  ; GFX9-DL-NEXT:    v_add_u32_sdwa v2, v2, v3 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_3
>>> -; GFX9-DL-NEXT:    v_add_u32_e32 v2, v2, v4
>>> +; GFX9-DL-NEXT:    v_add_u32_e32 v2, v2, v5
>>>  ; GFX9-DL-NEXT:    v_lshrrev_b32_e32 v3, 8, v4
>>>  ; GFX9-DL-NEXT:    v_add_u32_e32 v2, v2, v3
>>>  ; GFX9-DL-NEXT:    v_add_u32_sdwa v2, v2, v4 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
>>> @@ -2575,34 +2376,35 @@ define amdgpu_kernel void @udot8_acc8_ve
>>>  ; GFX10-DL-NEXT:    s_bfe_u32 s0, s2, 0x40014
>>>  ; GFX10-DL-NEXT:    v_and_b32_sdwa v5, v5, v2 dst_sel:BYTE_1
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>>  ; GFX10-DL-NEXT:    v_mul_lo_u16_e64 v7, s8, s10
>>> -; GFX10-DL-NEXT:    s_lshr_b32 s1, s2, 28
>>> +; GFX10-DL-NEXT:    s_bfe_u32 s1, s4, 0x40014
>>>  ; GFX10-DL-NEXT:    v_or_b32_sdwa v4, v6, v4 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:WORD_0
>>> -; GFX10-DL-NEXT:    s_lshr_b32 s6, s4, 28
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s5, s4, 0x40014
>>> +; GFX10-DL-NEXT:    s_bfe_u32 s5, s2, 0x40010
>>> +; GFX10-DL-NEXT:    s_lshr_b32 s6, s2, 28
>>>  ; GFX10-DL-NEXT:    v_or_b32_sdwa v5, v7, v5 dst_sel:WORD_1
>>> dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:WORD_0
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s7, s2, 0x40010
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s8, s4, 0x40010
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s2, s2, 0x40018
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s4, s4, 0x40018
>>> +; GFX10-DL-NEXT:    v_mul_lo_u16_e64 v7, s0, s1
>>> +; GFX10-DL-NEXT:    s_bfe_u32 s7, s4, 0x40010
>>> +; GFX10-DL-NEXT:    s_lshr_b32 s8, s4, 28
>>> +; GFX10-DL-NEXT:    s_bfe_u32 s0, s2, 0x40018
>>>  ; GFX10-DL-NEXT:    v_or_b32_sdwa v4, v4, v5 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
>>> -; GFX10-DL-NEXT:    v_mul_lo_u16_e64 v5, s0, s5
>>> -; GFX10-DL-NEXT:    v_mul_lo_u16_e64 v11, s1, s6
>>> -; GFX10-DL-NEXT:    v_mul_lo_u16_e64 v8, s7, s8
>>> -; GFX10-DL-NEXT:    v_mul_lo_u16_e64 v9, s2, s4
>>> -; GFX10-DL-NEXT:    v_lshrrev_b32_e32 v7, 8, v4
>>> -; GFX10-DL-NEXT:    v_and_b32_sdwa v5, v5, v2 dst_sel:BYTE_1
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> -; GFX10-DL-NEXT:    v_and_b32_sdwa v2, v11, v2 dst_sel:BYTE_1
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> -; GFX10-DL-NEXT:    v_or_b32_sdwa v5, v8, v5 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:WORD_0
>>> +; GFX10-DL-NEXT:    s_bfe_u32 s1, s4, 0x40018
>>> +; GFX10-DL-NEXT:    v_and_b32_sdwa v6, v7, v2 dst_sel:BYTE_1
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> +; GFX10-DL-NEXT:    v_mul_lo_u16_e64 v5, s5, s7
>>> +; GFX10-DL-NEXT:    v_mul_lo_u16_e64 v7, s6, s8
>>> +; GFX10-DL-NEXT:    v_lshrrev_b32_e32 v8, 8, v4
>>> +; GFX10-DL-NEXT:    v_mul_lo_u16_e64 v9, s0, s1
>>> +; GFX10-DL-NEXT:    v_or_b32_sdwa v5, v5, v6 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:WORD_0
>>> +; GFX10-DL-NEXT:    v_and_b32_sdwa v2, v7, v2 dst_sel:BYTE_1
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:DWORD
>>> +; GFX10-DL-NEXT:    v_and_b32_e32 v5, 0xffff, v5
>>>  ; GFX10-DL-NEXT:    v_or_b32_sdwa v2, v9, v2 dst_sel:WORD_1
>>> dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:WORD_0
>>> -; GFX10-DL-NEXT:    v_or_b32_sdwa v2, v5, v2 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
>>> +; GFX10-DL-NEXT:    v_or_b32_e32 v2, v5, v2
>>> +; GFX10-DL-NEXT:    v_lshrrev_b32_e32 v6, 8, v2
>>>  ; GFX10-DL-NEXT:    s_waitcnt vmcnt(0)
>>> -; GFX10-DL-NEXT:    v_add_nc_u32_e32 v6, v4, v3
>>> -; GFX10-DL-NEXT:    v_add_nc_u32_e32 v3, v6, v7
>>> -; GFX10-DL-NEXT:    v_add_nc_u32_sdwa v3, v3, v4 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:BYTE_0 src1_sel:BYTE_2
>>> +; GFX10-DL-NEXT:    v_add_nc_u32_e32 v3, v4, v3
>>> +; GFX10-DL-NEXT:    v_add_nc_u32_e32 v3, v3, v8
>>> +; GFX10-DL-NEXT:    v_add_nc_u32_sdwa v3, v3, v4 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
>>>  ; GFX10-DL-NEXT:    v_add_nc_u32_sdwa v3, v3, v4 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_3
>>> -; GFX10-DL-NEXT:    v_lshrrev_b32_e32 v4, 8, v2
>>> -; GFX10-DL-NEXT:    v_add_nc_u32_e32 v3, v3, v2
>>> -; GFX10-DL-NEXT:    v_add_nc_u32_e32 v3, v3, v4
>>> +; GFX10-DL-NEXT:    v_add_nc_u32_e32 v3, v3, v5
>>> +; GFX10-DL-NEXT:    v_add_nc_u32_e32 v3, v3, v6
>>>  ; GFX10-DL-NEXT:    v_add_nc_u32_sdwa v3, v3, v2 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
>>>  ; GFX10-DL-NEXT:    v_add_nc_u32_sdwa v2, v3, v2 dst_sel:DWORD
>>> dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_3
>>>  ; GFX10-DL-NEXT:    global_store_byte v[0:1], v2, off
>>> @@ -2706,35 +2508,32 @@ define amdgpu_kernel void @udot8_acc4_ve
>>>  ; GFX8-NEXT:    v_mov_b32_e32 v3, s1
>>>  ; GFX8-NEXT:    s_bfe_u32 s5, s4, 0x40004
>>>  ; GFX8-NEXT:    s_bfe_u32 s6, s4, 0x40008
>>> -; GFX8-NEXT:    s_bfe_u32 s7, s4, 0x4000c
>>>  ; GFX8-NEXT:    v_mov_b32_e32 v4, s5
>>>  ; GFX8-NEXT:    s_bfe_u32 s1, s2, 0x40004
>>> +; GFX8-NEXT:    s_bfe_u32 s7, s4, 0x4000c
>>> +; GFX8-NEXT:    v_mov_b32_e32 v5, s6
>>>  ; GFX8-NEXT:    s_bfe_u32 s5, s2, 0x40008
>>> -; GFX8-NEXT:    s_bfe_u32 s8, s2, 0x4000c
>>> -; GFX8-NEXT:    v_mov_b32_e32 v5, s7
>>> -; GFX8-NEXT:    v_mov_b32_e32 v6, s6
>>> -; GFX8-NEXT:    v_mul_u32_u24_e32 v5, s8, v5
>>> -; GFX8-NEXT:    s_bfe_u32 s9, s4, 0x40010
>>> -; GFX8-NEXT:    v_and_b32_e32 v5, 15, v5
>>> -; GFX8-NEXT:    s_bfe_u32 s11, s4, 0x40014
>>> -; GFX8-NEXT:    s_bfe_u32 s10, s2, 0x40010
>>> -; GFX8-NEXT:    v_mov_b32_e32 v7, s9
>>> -; GFX8-NEXT:    s_bfe_u32 s13, s4, 0x40018
>>> -; GFX8-NEXT:    s_bfe_u32 s12, s2, 0x40014
>>> -; GFX8-NEXT:    v_mov_b32_e32 v8, s11
>>> -; GFX8-NEXT:    s_bfe_u32 s14, s2, 0x40018
>>> +; GFX8-NEXT:    s_bfe_u32 s8, s4, 0x40010
>>> +; GFX8-NEXT:    v_mov_b32_e32 v6, s7
>>> +; GFX8-NEXT:    s_bfe_u32 s6, s2, 0x4000c
>>> +; GFX8-NEXT:    s_bfe_u32 s9, s4, 0x40014
>>> +; GFX8-NEXT:    v_mov_b32_e32 v7, s8
>>> +; GFX8-NEXT:    s_bfe_u32 s7, s2, 0x40010
>>> +; GFX8-NEXT:    s_bfe_u32 s10, s4, 0x40018
>>> +; GFX8-NEXT:    v_mov_b32_e32 v8, s9
>>> +; GFX8-NEXT:    s_bfe_u32 s8, s2, 0x40014
>>> +; GFX8-NEXT:    s_bfe_u32 s9, s2, 0x40018
>>>  ; GFX8-NEXT:    s_lshr_b32 s4, s4, 28
>>> -; GFX8-NEXT:    v_mov_b32_e32 v9, s13
>>> +; GFX8-NEXT:    v_mov_b32_e32 v9, s10
>>>  ; GFX8-NEXT:    s_lshr_b32 s2, s2, 28
>>>  ; GFX8-NEXT:    s_waitcnt vmcnt(0)
>>>  ; GFX8-NEXT:    v_mad_u32_u24 v2, s0, v3, v2
>>>  ; GFX8-NEXT:    v_mad_u32_u24 v2, s1, v4, v2
>>> -; GFX8-NEXT:    v_mad_u32_u24 v2, s5, v6, v2
>>> -; GFX8-NEXT:    v_and_b32_e32 v2, 15, v2
>>> -; GFX8-NEXT:    v_add_u32_e32 v2, vcc, v5, v2
>>> -; GFX8-NEXT:    v_mad_u32_u24 v2, s10, v7, v2
>>> -; GFX8-NEXT:    v_mad_u32_u24 v2, s12, v8, v2
>>> -; GFX8-NEXT:    v_mad_u32_u24 v2, s14, v9, v2
>>> +; GFX8-NEXT:    v_mad_u32_u24 v2, s5, v5, v2
>>> +; GFX8-NEXT:    v_mad_u32_u24 v2, s6, v6, v2
>>> +; GFX8-NEXT:    v_mad_u32_u24 v2, s7, v7, v2
>>> +; GFX8-NEXT:    v_mad_u32_u24 v2, s8, v8, v2
>>> +; GFX8-NEXT:    v_mad_u32_u24 v2, s9, v9, v2
>>>  ; GFX8-NEXT:    v_mov_b32_e32 v3, s4
>>>  ; GFX8-NEXT:    v_mad_u32_u24 v2, s2, v3, v2
>>>  ; GFX8-NEXT:    v_and_b32_e32 v2, 15, v2
>>> @@ -2757,35 +2556,32 @@ define amdgpu_kernel void @udot8_acc4_ve
>>>  ; GFX9-NEXT:    v_mov_b32_e32 v3, s1
>>>  ; GFX9-NEXT:    s_bfe_u32 s5, s4, 0x40004
>>>  ; GFX9-NEXT:    s_bfe_u32 s6, s4, 0x40008
>>> -; GFX9-NEXT:    s_bfe_u32 s7, s4, 0x4000c
>>>  ; GFX9-NEXT:    v_mov_b32_e32 v4, s5
>>>  ; GFX9-NEXT:    s_bfe_u32 s1, s2, 0x40004
>>> +; GFX9-NEXT:    s_bfe_u32 s7, s4, 0x4000c
>>> +; GFX9-NEXT:    v_mov_b32_e32 v5, s6
>>>  ; GFX9-NEXT:    s_bfe_u32 s5, s2, 0x40008
>>> -; GFX9-NEXT:    s_bfe_u32 s8, s2, 0x4000c
>>> -; GFX9-NEXT:    v_mov_b32_e32 v5, s7
>>> -; GFX9-NEXT:    v_mov_b32_e32 v6, s6
>>> -; GFX9-NEXT:    v_mul_u32_u24_e32 v5, s8, v5
>>> -; GFX9-NEXT:    s_bfe_u32 s9, s4, 0x40010
>>> -; GFX9-NEXT:    v_and_b32_e32 v5, 15, v5
>>> -; GFX9-NEXT:    s_bfe_u32 s11, s4, 0x40014
>>> -; GFX9-NEXT:    s_bfe_u32 s10, s2, 0x40010
>>> -; GFX9-NEXT:    v_mov_b32_e32 v7, s9
>>> -; GFX9-NEXT:    s_bfe_u32 s13, s4, 0x40018
>>> -; GFX9-NEXT:    s_bfe_u32 s12, s2, 0x40014
>>> -; GFX9-NEXT:    v_mov_b32_e32 v8, s11
>>> -; GFX9-NEXT:    s_bfe_u32 s14, s2, 0x40018
>>> +; GFX9-NEXT:    s_bfe_u32 s8, s4, 0x40010
>>> +; GFX9-NEXT:    v_mov_b32_e32 v6, s7
>>> +; GFX9-NEXT:    s_bfe_u32 s6, s2, 0x4000c
>>> +; GFX9-NEXT:    s_bfe_u32 s9, s4, 0x40014
>>> +; GFX9-NEXT:    v_mov_b32_e32 v7, s8
>>> +; GFX9-NEXT:    s_bfe_u32 s7, s2, 0x40010
>>> +; GFX9-NEXT:    s_bfe_u32 s10, s4, 0x40018
>>> +; GFX9-NEXT:    v_mov_b32_e32 v8, s9
>>> +; GFX9-NEXT:    s_bfe_u32 s8, s2, 0x40014
>>> +; GFX9-NEXT:    s_bfe_u32 s9, s2, 0x40018
>>>  ; GFX9-NEXT:    s_lshr_b32 s4, s4, 28
>>> -; GFX9-NEXT:    v_mov_b32_e32 v9, s13
>>> +; GFX9-NEXT:    v_mov_b32_e32 v9, s10
>>>  ; GFX9-NEXT:    s_lshr_b32 s2, s2, 28
>>>  ; GFX9-NEXT:    s_waitcnt vmcnt(0)
>>>  ; GFX9-NEXT:    v_mad_u32_u24 v2, s0, v3, v2
>>>  ; GFX9-NEXT:    v_mad_u32_u24 v2, s1, v4, v2
>>> -; GFX9-NEXT:    v_mad_u32_u24 v2, s5, v6, v2
>>> -; GFX9-NEXT:    v_and_b32_e32 v2, 15, v2
>>> -; GFX9-NEXT:    v_add_u32_e32 v2, v2, v5
>>> -; GFX9-NEXT:    v_mad_u32_u24 v2, s10, v7, v2
>>> -; GFX9-NEXT:    v_mad_u32_u24 v2, s12, v8, v2
>>> -; GFX9-NEXT:    v_mad_u32_u24 v2, s14, v9, v2
>>> +; GFX9-NEXT:    v_mad_u32_u24 v2, s5, v5, v2
>>> +; GFX9-NEXT:    v_mad_u32_u24 v2, s6, v6, v2
>>> +; GFX9-NEXT:    v_mad_u32_u24 v2, s7, v7, v2
>>> +; GFX9-NEXT:    v_mad_u32_u24 v2, s8, v8, v2
>>> +; GFX9-NEXT:    v_mad_u32_u24 v2, s9, v9, v2
>>>  ; GFX9-NEXT:    v_mov_b32_e32 v3, s4
>>>  ; GFX9-NEXT:    v_mad_u32_u24 v2, s2, v3, v2
>>>  ; GFX9-NEXT:    v_and_b32_e32 v2, 15, v2
>>> @@ -2803,86 +2599,27 @@ define amdgpu_kernel void @udot8_acc4_ve
>>>  ; GFX9-DL-NEXT:    v_mov_b32_e32 v1, s1
>>>  ; GFX9-DL-NEXT:    global_load_ubyte v2, v[0:1], off
>>>  ; GFX9-DL-NEXT:    s_waitcnt lgkmcnt(0)
>>> -; GFX9-DL-NEXT:    s_and_b32 s0, s2, 15
>>> -; GFX9-DL-NEXT:    s_and_b32 s1, s4, 15
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v3, s1
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s5, s4, 0x40004
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s6, s4, 0x40008
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s7, s4, 0x4000c
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v4, s5
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s1, s2, 0x40004
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s5, s2, 0x40008
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s8, s2, 0x4000c
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v5, s7
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v6, s6
>>> -; GFX9-DL-NEXT:    v_mul_u32_u24_e32 v5, s8, v5
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s9, s4, 0x40010
>>> -; GFX9-DL-NEXT:    v_and_b32_e32 v5, 15, v5
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s11, s4, 0x40014
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s10, s2, 0x40010
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v7, s9
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s13, s4, 0x40018
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s12, s2, 0x40014
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v8, s11
>>> -; GFX9-DL-NEXT:    s_bfe_u32 s14, s2, 0x40018
>>> -; GFX9-DL-NEXT:    s_lshr_b32 s4, s4, 28
>>> -; GFX9-DL-NEXT:    v_mov_b32_e32 v9, s13
>>> -; GFX9-DL-NEXT:    s_lshr_b32 s2, s2, 28
>>> -; GFX9-DL-NEXT:    s_waitcnt vmcnt(0)
>>> -; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s0, v3, v2
>>> -; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s1, v4, v2
>>> -; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s5, v6, v2
>>> -; GFX9-DL-NEXT:    v_and_b32_e32 v2, 15, v2
>>> -; GFX9-DL-NEXT:    v_add_u32_e32 v2, v2, v5
>>> -; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s10, v7, v2
>>> -; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s12, v8, v2
>>> -; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s14, v9, v2
>>>  ; GFX9-DL-NEXT:    v_mov_b32_e32 v3, s4
>>> -; GFX9-DL-NEXT:    v_mad_u32_u24 v2, s2, v3, v2
>>> +; GFX9-DL-NEXT:    s_waitcnt vmcnt(0)
>>> +; GFX9-DL-NEXT:    v_dot8_u32_u4 v2, s2, v3, v2
>>>  ; GFX9-DL-NEXT:    v_and_b32_e32 v2, 15, v2
>>>  ; GFX9-DL-NEXT:    global_store_byte v[0:1], v2, off
>>>  ; GFX9-DL-NEXT:    s_endpgm
>>>  ;
>>>  ; GFX10-DL-LABEL: udot8_acc4_vecMul:
>>>  ; GFX10-DL:       ; %bb.0: ; %entry
>>> -; GFX10-DL-NEXT:    s_load_dwordx4 s[4:7], s[0:1], 0x24
>>> -; GFX10-DL-NEXT:    s_load_dwordx2 s[0:1], s[0:1], 0x34
>>> +; GFX10-DL-NEXT:    s_load_dwordx2 s[4:5], s[0:1], 0x34
>>>  ; GFX10-DL-NEXT:    ; implicit-def: $vcc_hi
>>>  ; GFX10-DL-NEXT:    s_waitcnt lgkmcnt(0)
>>> -; GFX10-DL-NEXT:    s_load_dword s2, s[4:5], 0x0
>>> -; GFX10-DL-NEXT:    s_load_dword s4, s[6:7], 0x0
>>> -; GFX10-DL-NEXT:    v_mov_b32_e32 v0, s0
>>> -; GFX10-DL-NEXT:    v_mov_b32_e32 v1, s1
>>> +; GFX10-DL-NEXT:    v_mov_b32_e32 v0, s4
>>> +; GFX10-DL-NEXT:    v_mov_b32_e32 v1, s5
>>> +; GFX10-DL-NEXT:    s_load_dwordx4 s[4:7], s[0:1], 0x24
>>>  ; GFX10-DL-NEXT:    global_load_ubyte v2, v[0:1], off
>>>  ; GFX10-DL-NEXT:    s_waitcnt lgkmcnt(0)
>>> -; GFX10-DL-NEXT:    s_and_b32 s0, s2, 15
>>> -; GFX10-DL-NEXT:    s_and_b32 s1, s4, 15
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s5, s2, 0x40004
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s6, s4, 0x40004
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s7, s2, 0x40008
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s8, s2, 0x4000c
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s9, s4, 0x40008
>>> -; GFX10-DL-NEXT:    s_waitcnt vmcnt(0)
>>> -; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s0, s1, v2
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s0, s4, 0x4000c
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s1, s4, 0x40010
>>> -; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s5, s6, v2
>>> -; GFX10-DL-NEXT:    v_mul_u32_u24_e64 v3, s8, s0
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s0, s2, 0x40010
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s5, s2, 0x40014
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s6, s4, 0x40014
>>> -; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s7, s9, v2
>>> -; GFX10-DL-NEXT:    v_and_b32_e32 v3, 15, v3
>>> -; GFX10-DL-NEXT:    v_and_b32_e32 v2, 15, v2
>>> -; GFX10-DL-NEXT:    v_add_nc_u32_e32 v2, v2, v3
>>> -; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s0, s1, v2
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s0, s2, 0x40018
>>> -; GFX10-DL-NEXT:    s_bfe_u32 s1, s4, 0x40018
>>> -; GFX10-DL-NEXT:    s_lshr_b32 s2, s2, 28
>>> -; GFX10-DL-NEXT:    s_lshr_b32 s4, s4, 28
>>> -; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s5, s6, v2
>>> -; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s0, s1, v2
>>> -; GFX10-DL-NEXT:    v_mad_u32_u24 v2, s2, s4, v2
>>> +; GFX10-DL-NEXT:    s_load_dword s0, s[4:5], 0x0
>>> +; GFX10-DL-NEXT:    s_load_dword s1, s[6:7], 0x0
>>> +; GFX10-DL-NEXT:    s_waitcnt vmcnt(0) lgkmcnt(0)
>>> +; GFX10-DL-NEXT:    v_dot8_u32_u4 v2, s0, s1, v2
>>>  ; GFX10-DL-NEXT:    v_and_b32_e32 v2, 15, v2
>>>  ; GFX10-DL-NEXT:    global_store_byte v[0:1], v2, off
>>>  ; GFX10-DL-NEXT:    s_endpgm
>>>
>>> Modified: llvm/trunk/test/CodeGen/AMDGPU/sdiv.ll
>>> URL:
>>> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/AMDGPU/sdiv.ll?rev=366799&r1=366798&r2=366799&view=diff
>>>
>>> ==============================================================================
>>> --- llvm/trunk/test/CodeGen/AMDGPU/sdiv.ll (original)
>>> +++ llvm/trunk/test/CodeGen/AMDGPU/sdiv.ll Tue Jul 23 05:39:08 2019
>>> @@ -1931,19 +1931,19 @@ define amdgpu_kernel void @v_sdiv_i24(i3
>>>  ; GCN-NEXT:    s_waitcnt vmcnt(1)
>>>  ; GCN-NEXT:    v_or_b32_e32 v0, v0, v1
>>>  ; GCN-NEXT:    s_waitcnt vmcnt(0)
>>> -; GCN-NEXT:    v_or_b32_e32 v1, v2, v3
>>> -; GCN-NEXT:    v_xor_b32_e32 v2, v0, v1
>>> +; GCN-NEXT:    v_or_b32_e32 v2, v2, v3
>>> +; GCN-NEXT:    v_xor_b32_e32 v1, v1, v3
>>> +; GCN-NEXT:    v_ashrrev_i32_e32 v1, 30, v1
>>>  ; GCN-NEXT:    v_cvt_f32_i32_e32 v0, v0
>>> -; GCN-NEXT:    v_cvt_f32_i32_e32 v1, v1
>>> -; GCN-NEXT:    v_ashrrev_i32_e32 v2, 30, v2
>>> -; GCN-NEXT:    v_rcp_iflag_f32_e32 v3, v1
>>> -; GCN-NEXT:    v_or_b32_e32 v2, 1, v2
>>> +; GCN-NEXT:    v_cvt_f32_i32_e32 v2, v2
>>> +; GCN-NEXT:    v_or_b32_e32 v1, 1, v1
>>> +; GCN-NEXT:    v_rcp_iflag_f32_e32 v3, v2
>>>  ; GCN-NEXT:    v_mul_f32_e32 v3, v0, v3
>>>  ; GCN-NEXT:    v_trunc_f32_e32 v3, v3
>>> -; GCN-NEXT:    v_mad_f32 v0, -v3, v1, v0
>>> +; GCN-NEXT:    v_mad_f32 v0, -v3, v2, v0
>>>  ; GCN-NEXT:    v_cvt_i32_f32_e32 v3, v3
>>> -; GCN-NEXT:    v_cmp_ge_f32_e64 vcc, |v0|, |v1|
>>> -; GCN-NEXT:    v_cndmask_b32_e32 v0, 0, v2, vcc
>>> +; GCN-NEXT:    v_cmp_ge_f32_e64 vcc, |v0|, |v2|
>>> +; GCN-NEXT:    v_cndmask_b32_e32 v0, 0, v1, vcc
>>>  ; GCN-NEXT:    v_add_i32_e32 v0, vcc, v0, v3
>>>  ; GCN-NEXT:    v_bfe_i32 v0, v0, 0, 24
>>>  ; GCN-NEXT:    buffer_store_dword v0, off, s[4:7], 0
>>> @@ -1970,21 +1970,21 @@ define amdgpu_kernel void @v_sdiv_i24(i3
>>>  ; TONGA-NEXT:    s_waitcnt vmcnt(1)
>>>  ; TONGA-NEXT:    v_lshlrev_b32_e32 v2, 16, v2
>>>  ; TONGA-NEXT:    v_or_b32_e32 v1, v1, v2
>>> -; TONGA-NEXT:    v_cvt_f32_i32_e32 v2, v1
>>> +; TONGA-NEXT:    v_cvt_f32_i32_e32 v1, v1
>>>  ; TONGA-NEXT:    s_waitcnt vmcnt(0)
>>> -; TONGA-NEXT:    v_or_b32_e32 v0, v3, v0
>>> -; TONGA-NEXT:    v_cvt_f32_i32_e32 v3, v0
>>> -; TONGA-NEXT:    v_xor_b32_e32 v0, v0, v1
>>> -; TONGA-NEXT:    v_rcp_iflag_f32_e32 v4, v2
>>> +; TONGA-NEXT:    v_or_b32_e32 v3, v3, v0
>>> +; TONGA-NEXT:    v_cvt_f32_i32_e32 v3, v3
>>> +; TONGA-NEXT:    v_xor_b32_e32 v0, v0, v2
>>> +; TONGA-NEXT:    v_rcp_iflag_f32_e32 v4, v1
>>>  ; TONGA-NEXT:    v_ashrrev_i32_e32 v0, 30, v0
>>>  ; TONGA-NEXT:    v_or_b32_e32 v0, 1, v0
>>> -; TONGA-NEXT:    v_mul_f32_e32 v1, v3, v4
>>> -; TONGA-NEXT:    v_trunc_f32_e32 v1, v1
>>> -; TONGA-NEXT:    v_mad_f32 v3, -v1, v2, v3
>>> -; TONGA-NEXT:    v_cvt_i32_f32_e32 v1, v1
>>> -; TONGA-NEXT:    v_cmp_ge_f32_e64 vcc, |v3|, |v2|
>>> +; TONGA-NEXT:    v_mul_f32_e32 v2, v3, v4
>>> +; TONGA-NEXT:    v_trunc_f32_e32 v2, v2
>>> +; TONGA-NEXT:    v_mad_f32 v3, -v2, v1, v3
>>> +; TONGA-NEXT:    v_cvt_i32_f32_e32 v2, v2
>>> +; TONGA-NEXT:    v_cmp_ge_f32_e64 vcc, |v3|, |v1|
>>>  ; TONGA-NEXT:    v_cndmask_b32_e32 v0, 0, v0, vcc
>>> -; TONGA-NEXT:    v_add_u32_e32 v0, vcc, v0, v1
>>> +; TONGA-NEXT:    v_add_u32_e32 v0, vcc, v0, v2
>>>  ; TONGA-NEXT:    v_bfe_i32 v0, v0, 0, 24
>>>  ; TONGA-NEXT:    buffer_store_dword v0, off, s[0:3], 0
>>>  ; TONGA-NEXT:    s_endpgm
>>> @@ -2011,18 +2011,18 @@ define amdgpu_kernel void @v_sdiv_i24(i3
>>>  ; GFX9-NEXT:    s_waitcnt vmcnt(0)
>>>  ; GFX9-NEXT:    v_lshlrev_b32_e32 v3, 16, v3
>>>  ; GFX9-NEXT:    v_or_b32_e32 v2, v2, v3
>>> -; GFX9-NEXT:    v_cvt_f32_i32_e32 v3, v2
>>> -; GFX9-NEXT:    v_cvt_f32_i32_e32 v1, v0
>>> -; GFX9-NEXT:    v_xor_b32_e32 v0, v0, v2
>>> -; GFX9-NEXT:    v_ashrrev_i32_e32 v0, 30, v0
>>> -; GFX9-NEXT:    v_rcp_iflag_f32_e32 v4, v3
>>> -; GFX9-NEXT:    v_or_b32_e32 v0, 1, v0
>>> -; GFX9-NEXT:    v_mul_f32_e32 v2, v1, v4
>>> -; GFX9-NEXT:    v_trunc_f32_e32 v2, v2
>>> -; GFX9-NEXT:    v_cvt_i32_f32_e32 v4, v2
>>> -; GFX9-NEXT:    v_mad_f32 v1, -v2, v3, v1
>>> -; GFX9-NEXT:    v_cmp_ge_f32_e64 vcc, |v1|, |v3|
>>> -; GFX9-NEXT:    v_cndmask_b32_e32 v0, 0, v0, vcc
>>> +; GFX9-NEXT:    v_cvt_f32_i32_e32 v2, v2
>>> +; GFX9-NEXT:    v_cvt_f32_i32_e32 v0, v0
>>> +; GFX9-NEXT:    v_xor_b32_e32 v1, v1, v3
>>> +; GFX9-NEXT:    v_ashrrev_i32_e32 v1, 30, v1
>>> +; GFX9-NEXT:    v_rcp_iflag_f32_e32 v4, v2
>>> +; GFX9-NEXT:    v_or_b32_e32 v1, 1, v1
>>> +; GFX9-NEXT:    v_mul_f32_e32 v3, v0, v4
>>> +; GFX9-NEXT:    v_trunc_f32_e32 v3, v3
>>> +; GFX9-NEXT:    v_cvt_i32_f32_e32 v4, v3
>>> +; GFX9-NEXT:    v_mad_f32 v0, -v3, v2, v0
>>> +; GFX9-NEXT:    v_cmp_ge_f32_e64 vcc, |v0|, |v2|
>>> +; GFX9-NEXT:    v_cndmask_b32_e32 v0, 0, v1, vcc
>>>  ; GFX9-NEXT:    v_add_u32_e32 v0, v4, v0
>>>  ; GFX9-NEXT:    v_bfe_i32 v0, v0, 0, 24
>>>  ; GFX9-NEXT:    buffer_store_dword v0, off, s[0:3], 0
>>>
>>> Modified: llvm/trunk/test/CodeGen/SystemZ/store_nonbytesized_vecs.ll
>>> URL:
>>> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/SystemZ/store_nonbytesized_vecs.ll?rev=366799&r1=366798&r2=366799&view=diff
>>>
>>> ==============================================================================
>>> --- llvm/trunk/test/CodeGen/SystemZ/store_nonbytesized_vecs.ll (original)
>>> +++ llvm/trunk/test/CodeGen/SystemZ/store_nonbytesized_vecs.ll Tue Jul
>>> 23 05:39:08 2019
>>> @@ -120,15 +120,15 @@ define void @fun2(<8 x i32> %src, <8 x i
>>>  define void @fun3(<3 x i31>* %src, <3 x i31>* %p)
>>>  ; CHECK-LABEL: fun3:
>>>  ; CHECK:       # %bb.0:
>>> -; CHECK-NEXT:    llgf %r1, 0(%r2)
>>>  ; CHECK-NEXT:    llgf %r0, 3(%r2)
>>> -; CHECK-NEXT:    sllg %r4, %r1, 62
>>> +; CHECK-NEXT:    llgf %r1, 6(%r2)
>>> +; CHECK-NEXT:    llgf %r2, 0(%r2)
>>> +; CHECK-NEXT:    rosbg %r1, %r0, 0, 32, 31
>>> +; CHECK-NEXT:    sllg %r4, %r2, 62
>>>  ; CHECK-NEXT:    rosbg %r4, %r0, 0, 32, 31
>>> -; CHECK-NEXT:    llgf %r0, 6(%r2)
>>> -; CHECK-NEXT:    ogr %r0, %r4
>>> -; CHECK-NEXT:    st %r0, 8(%r3)
>>>  ; CHECK-NEXT:    srlg %r0, %r4, 32
>>> -; CHECK-NEXT:    sllg %r1, %r1, 30
>>> +; CHECK-NEXT:    st %r1, 8(%r3)
>>> +; CHECK-NEXT:    sllg %r1, %r2, 30
>>>  ; CHECK-NEXT:    lr %r1, %r0
>>>  ; CHECK-NEXT:    nihh %r1, 8191
>>>  ; CHECK-NEXT:    stg %r1, 0(%r3)
>>>
>>> Modified: llvm/trunk/test/CodeGen/X86/2012-08-07-CmpISelBug.ll
>>> URL:
>>> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/2012-08-07-CmpISelBug.ll?rev=366799&r1=366798&r2=366799&view=diff
>>>
>>> ==============================================================================
>>> --- llvm/trunk/test/CodeGen/X86/2012-08-07-CmpISelBug.ll (original)
>>> +++ llvm/trunk/test/CodeGen/X86/2012-08-07-CmpISelBug.ll Tue Jul 23
>>> 05:39:08 2019
>>> @@ -8,14 +8,13 @@
>>>  define void @foo(i8 %arg4, i32 %arg5, i32* %arg14) nounwind {
>>>  ; CHECK-LABEL: foo:
>>>  ; CHECK:       ## %bb.0: ## %bb
>>> +; CHECK-NEXT:    ## kill: def $edi killed $edi def $rdi
>>>  ; CHECK-NEXT:    andl $32, %edi
>>> -; CHECK-NEXT:    orl $1601159181, %edi ## imm = 0x5F6FC00D
>>> -; CHECK-NEXT:    andl %edi, %esi
>>> -; CHECK-NEXT:    xorb $-14, %dil
>>> -; CHECK-NEXT:    addb $82, %dil
>>> -; CHECK-NEXT:    shrl $5, %esi
>>> -; CHECK-NEXT:    movzbl %dil, %eax
>>> -; CHECK-NEXT:    testb %sil, %sil
>>> +; CHECK-NEXT:    leal 13(%rdi), %eax
>>> +; CHECK-NEXT:    xorb $-14, %al
>>> +; CHECK-NEXT:    addb $82, %al
>>> +; CHECK-NEXT:    movzbl %al, %eax
>>> +; CHECK-NEXT:    testl %esi, %edi
>>>  ; CHECK-NEXT:    movl $1, %ecx
>>>  ; CHECK-NEXT:    cmovnel %eax, %ecx
>>>  ; CHECK-NEXT:    xorb $81, %cl
>>>
>>> Modified: llvm/trunk/test/CodeGen/X86/vector-fshl-128.ll
>>> URL:
>>> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/vector-fshl-128.ll?rev=366799&r1=366798&r2=366799&view=diff
>>>
>>> ==============================================================================
>>> --- llvm/trunk/test/CodeGen/X86/vector-fshl-128.ll (original)
>>> +++ llvm/trunk/test/CodeGen/X86/vector-fshl-128.ll Tue Jul 23 05:39:08
>>> 2019
>>> @@ -498,7 +498,6 @@ define <4 x i32> @var_funnnel_v4i32(<4 x
>>>  define <8 x i16> @var_funnnel_v8i16(<8 x i16> %x, <8 x i16> %y, <8 x
>>> i16> %amt) nounwind {
>>>  ; SSE2-LABEL: var_funnnel_v8i16:
>>>  ; SSE2:       # %bb.0:
>>> -; SSE2-NEXT:    pand {{.*}}(%rip), %xmm2
>>>  ; SSE2-NEXT:    movdqa {{.*#+}} xmm3 = [16,16,16,16,16,16,16,16]
>>>  ; SSE2-NEXT:    psubw %xmm2, %xmm3
>>>  ; SSE2-NEXT:    psllw $12, %xmm3
>>> @@ -531,6 +530,7 @@ define <8 x i16> @var_funnnel_v8i16(<8 x
>>>  ; SSE2-NEXT:    pandn %xmm1, %xmm4
>>>  ; SSE2-NEXT:    psrlw $1, %xmm1
>>>  ; SSE2-NEXT:    pand %xmm3, %xmm1
>>> +; SSE2-NEXT:    pand {{.*}}(%rip), %xmm2
>>>  ; SSE2-NEXT:    pxor %xmm3, %xmm3
>>>  ; SSE2-NEXT:    movdqa %xmm2, %xmm5
>>>  ; SSE2-NEXT:    punpckhwd {{.*#+}} xmm5 =
>>> xmm5[4],xmm3[4],xmm5[5],xmm3[5],xmm5[6],xmm3[6],xmm5[7],xmm3[7]
>>> @@ -768,7 +768,6 @@ define <8 x i16> @var_funnnel_v8i16(<8 x
>>>  ;
>>>  ; X32-SSE-LABEL: var_funnnel_v8i16:
>>>  ; X32-SSE:       # %bb.0:
>>> -; X32-SSE-NEXT:    pand {{\.LCPI.*}}, %xmm2
>>>  ; X32-SSE-NEXT:    movdqa {{.*#+}} xmm3 = [16,16,16,16,16,16,16,16]
>>>  ; X32-SSE-NEXT:    psubw %xmm2, %xmm3
>>>  ; X32-SSE-NEXT:    psllw $12, %xmm3
>>> @@ -801,6 +800,7 @@ define <8 x i16> @var_funnnel_v8i16(<8 x
>>>  ; X32-SSE-NEXT:    pandn %xmm1, %xmm4
>>>  ; X32-SSE-NEXT:    psrlw $1, %xmm1
>>>  ; X32-SSE-NEXT:    pand %xmm3, %xmm1
>>> +; X32-SSE-NEXT:    pand {{\.LCPI.*}}, %xmm2
>>>  ; X32-SSE-NEXT:    pxor %xmm3, %xmm3
>>>  ; X32-SSE-NEXT:    movdqa %xmm2, %xmm5
>>>  ; X32-SSE-NEXT:    punpckhwd {{.*#+}} xmm5 =
>>> xmm5[4],xmm3[4],xmm5[5],xmm3[5],xmm5[6],xmm3[6],xmm5[7],xmm3[7]
>>>
>>> Modified: llvm/trunk/test/CodeGen/X86/vector-reduce-mul-widen.ll
>>> URL:
>>> http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/vector-reduce-mul-widen.ll?rev=366799&r1=366798&r2=366799&view=diff
>>>
>>> ==============================================================================
>>> --- llvm/trunk/test/CodeGen/X86/vector-reduce-mul-widen.ll (original)
>>> +++ llvm/trunk/test/CodeGen/X86/vector-reduce-mul-widen.ll Tue Jul 23
>>> 05:39:08 2019
>>> @@ -1820,17 +1820,17 @@ define i8 @test_v16i8(<16 x i8> %a0) {
>>>  ; AVX2-LABEL: test_v16i8:
>>>  ; AVX2:       # %bb.0:
>>>  ; AVX2-NEXT:    vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
>>> -; AVX2-NEXT:    vpmovzxbw {{.*#+}} xmm0 =
>>> xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
>>> -; AVX2-NEXT:    vpmovzxbw {{.*#+}} xmm1 =
>>> xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero
>>> -; AVX2-NEXT:    vpmullw %xmm1, %xmm0, %xmm0
>>> +; AVX2-NEXT:    vpmovzxbw {{.*#+}} ymm0 =
>>> xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero,xmm0[8],zero,xmm0[9],zero,xmm0[10],zero,xmm0[11],zero,xmm0[12],zero,xmm0[13],zero,xmm0[14],zero,xmm0[15],zero
>>> +; AVX2-NEXT:    vpmovzxbw {{.*#+}} ymm1 =
>>> xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero,xmm1[8],zero,xmm1[9],zero,xmm1[10],zero,xmm1[11],zero,xmm1[12],zero,xmm1[13],zero,xmm1[14],zero,xmm1[15],zero
>>> +; AVX2-NEXT:    vpmullw %ymm1, %ymm0, %ymm0
>>> +; AVX2-NEXT:    vpand {{.*}}(%rip), %ymm0, %ymm0
>>> +; AVX2-NEXT:    vpackuswb %xmm0, %xmm0, %xmm1
>>> +; AVX2-NEXT:    vpshufd {{.*#+}} xmm1 = xmm1[1,1,2,3]
>>> +; AVX2-NEXT:    vpmovzxbw {{.*#+}} ymm1 =
>>> xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero,xmm1[8],zero,xmm1[9],zero,xmm1[10],zero,xmm1[11],zero,xmm1[12],zero,xmm1[13],zero,xmm1[14],zero,xmm1[15],zero
>>> +; AVX2-NEXT:    vpmullw %ymm1, %ymm0, %ymm0
>>>  ; AVX2-NEXT:    vmovdqa {{.*#+}} xmm1 =
>>> [255,255,255,255,255,255,255,255]
>>> -; AVX2-NEXT:    vpand %xmm1, %xmm0, %xmm0
>>> -; AVX2-NEXT:    vpackuswb %xmm0, %xmm0, %xmm2
>>> -; AVX2-NEXT:    vpshufd {{.*#+}} xmm2 = xmm2[1,1,2,3]
>>> -; AVX2-NEXT:    vpmovzxbw {{.*#+}} xmm2 =
>>> xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero,xmm2[4],zero,xmm2[5],zero,xmm2[6],zero,xmm2[7],zero
>>> -; AVX2-NEXT:    vpmullw %xmm2, %xmm0, %xmm0
>>> -; AVX2-NEXT:    vpand %xmm1, %xmm0, %xmm0
>>> -; AVX2-NEXT:    vpackuswb %xmm0, %xmm0, %xmm2
>>> +; AVX2-NEXT:    vpand %xmm1, %xmm0, %xmm2
>>> +; AVX2-NEXT:    vpackuswb %xmm0, %xmm2, %xmm2
>>>  ; AVX2-NEXT:    vpsrld $16, %xmm2, %xmm2
>>>  ; AVX2-NEXT:    vpmovzxbw {{.*#+}} xmm2 =
>>> xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero,xmm2[4],zero,xmm2[5],zero,xmm2[6],zero,xmm2[7],zero
>>>  ; AVX2-NEXT:    vpmullw %xmm2, %xmm0, %xmm0
>>> @@ -1840,6 +1840,7 @@ define i8 @test_v16i8(<16 x i8> %a0) {
>>>  ; AVX2-NEXT:    vpmullw %xmm1, %xmm0, %xmm0
>>>  ; AVX2-NEXT:    vpextrb $0, %xmm0, %eax
>>>  ; AVX2-NEXT:    # kill: def $al killed $al killed $eax
>>> +; AVX2-NEXT:    vzeroupper
>>>  ; AVX2-NEXT:    retq
>>>  ;
>>>  ; AVX512BW-LABEL: test_v16i8:
>>>
>>> Modified: llvm/trunk/test/CodeGen/X86/vector-reduce-mul.ll
>>> URL: http://llvm.org/viewvc/llvm-project/llvm/trunk/test/CodeGen/X86/vector-reduce-mul.ll?rev=366799&r1=366798&r2=366799&view=diff
>>>
>>> ==============================================================================
>>> --- llvm/trunk/test/CodeGen/X86/vector-reduce-mul.ll (original)
>>> +++ llvm/trunk/test/CodeGen/X86/vector-reduce-mul.ll Tue Jul 23 05:39:08 2019
>>> @@ -1792,14 +1792,11 @@ define i8 @test_v16i8(<16 x i8> %a0) {
>>>  ; AVX2-NEXT:    vpunpckhbw {{.*#+}} xmm1 =
>>> xmm0[8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15]
>>>  ; AVX2-NEXT:    vpmovzxbw {{.*#+}} xmm0 =
>>> xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
>>>  ; AVX2-NEXT:    vpmullw %xmm1, %xmm0, %xmm0
>>> -; AVX2-NEXT:    vmovdqa {{.*#+}} xmm1 =
>>> [255,255,255,255,255,255,255,255]
>>> -; AVX2-NEXT:    vpand %xmm1, %xmm0, %xmm0
>>> -; AVX2-NEXT:    vpshufd {{.*#+}} xmm2 = xmm0[2,3,2,3]
>>> -; AVX2-NEXT:    vpmullw %xmm2, %xmm0, %xmm0
>>> -; AVX2-NEXT:    vpand %xmm1, %xmm0, %xmm0
>>> -; AVX2-NEXT:    vpshufd {{.*#+}} xmm2 = xmm0[1,1,2,3]
>>> -; AVX2-NEXT:    vpmullw %xmm2, %xmm0, %xmm0
>>> -; AVX2-NEXT:    vpand %xmm1, %xmm0, %xmm0
>>> +; AVX2-NEXT:    vpand {{.*}}(%rip), %xmm0, %xmm0
>>> +; AVX2-NEXT:    vpshufd {{.*#+}} xmm1 = xmm0[2,3,2,3]
>>> +; AVX2-NEXT:    vpmullw %xmm1, %xmm0, %xmm0
>>> +; AVX2-NEXT:    vpshufb {{.*#+}} xmm1 =
>>> xmm0[4],zero,xmm0[6],zero,xmm0[4],zero,xmm0[6],zero,xmm0[8],zero,xmm0[10],zero,xmm0[12],zero,xmm0[14],zero
>>> +; AVX2-NEXT:    vpmullw %xmm1, %xmm0, %xmm0
>>>  ; AVX2-NEXT:    vpshuflw {{.*#+}} xmm1 = xmm0[1,1,2,3,4,5,6,7]
>>>  ; AVX2-NEXT:    vpmullw %xmm1, %xmm0, %xmm0
>>>  ; AVX2-NEXT:    vpextrb $0, %xmm0, %eax
>>>
>>>
>>

