[llvm] ebf3b18 - [Scheduling] Implement a new way to cluster loads/stores

Owen Anderson via llvm-commits llvm-commits at lists.llvm.org
Fri Oct 23 15:38:37 PDT 2020


Hi Steven,

I was actually able to extract a small bitcode test case that reproduces with just “llc < slow3.bc”.

—Owen



> On Oct 23, 2020, at 4:14 AM, Qing Shan Zhang <qshanz at cn.ibm.com> wrote:
> 
> Hi, Owen,
> 
> Could you please open a bug in Bugzilla and attach the preprocessed file and the command-line options? It is really hard for me to rebuild the whole project and guess your options... Thank you.
> 
> Best regards
> 
> steven.zhang(张青山)
> XL/LLVM on Power Compiler Developer
> 
> IBM China Development Lab, Shanghai 
> Tel: (8621)609-28454 Mobile: +8615900986116
> E-mail: qshanz at cn.ibm.com
> 
> -----Owen Anderson <resistor at mac.com> wrote: -----
> To: Qing Shan Zhang <qshanz at cn.ibm.com>
> From: Owen Anderson <resistor at mac.com>
> Date: 10/23/2020 12:34PM
> Cc: QingShan Zhang <llvmlistbot at llvm.org>, llvm-commits at lists.llvm.org
> Subject: [EXTERNAL] Re: [llvm] ebf3b18 - [Scheduling] Implement a new way to cluster loads/stores
> 
> Hi Steven,
> 
> The test in question is this source file: https://github.com/opencv/opencv/blob/master/modules/calib3d/src/dls.cpp
> 
> I haven’t been able to reduce it yet, but I would guess it’s coming from either dls::fill_coeff or dls::cayley_LS_M.
> 
> —Owen
> 
>> On Oct 22, 2020, at 7:27 PM, Qing Shan Zhang <qshanz at cn.ibm.com> wrote:
>> 
>> Hi, Owen,
>> 
>> Do you have any test/IR for me so that I can measure the compile-time impact? For now, IsReachable() is based on the dynamic DAG, which could be switched to the static DAG at the cost of losing some corner cases. I will give that a try to see if it fixes your problem.
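>> 
>> For what it's worth, here is a minimal standalone sketch of the static-DAG
>> idea (illustrative code, not the LLVM API): precompute one reachability
>> bitset per node over the edges that exist before any mutation runs, so
>> each query is O(1), at the cost of missing edges added later (the corner
>> cases mentioned above). It assumes nodes are numbered in topological
>> order, as SUnits within a region roughly are.
>> 
>> #include <vector>
>> 
>> // Succs[U] lists U's successors; edges are assumed to go from lower to
>> // higher node numbers (topological numbering).
>> std::vector<std::vector<bool>>
>> computeStaticReachability(const std::vector<std::vector<unsigned>> &Succs) {
>>   unsigned N = Succs.size();
>>   std::vector<std::vector<bool>> Reach(N, std::vector<bool>(N, false));
>>   // Visit in reverse topological order so every successor's set is final
>>   // by the time it is merged in.
>>   for (unsigned U = N; U-- > 0;)
>>     for (unsigned V : Succs[U]) {
>>       Reach[U][V] = true;
>>       for (unsigned W = 0; W < N; ++W)
>>         if (Reach[V][W])
>>           Reach[U][W] = true;
>>     }
>>   return Reach;
>> }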
>> 
>> Best regards
>> 
>> steven.zhang(张青山)
>> XL/LLVM on Power Compiler Developer
>> 
>> IBM China Development Lab, Shanghai 
>> Tel: (8621)609-28454 Mobile: +8615900986116
>> E-mail: qshanz at cn.ibm.com
>> 
>> -----Owen Anderson <resistor at mac.com> wrote: -----
>> To: QingShan Zhang <qshanz at cn.ibm.com>, QingShan Zhang <llvmlistbot at llvm.org>
>> From: Owen Anderson <resistor at mac.com>
>> Date: 10/23/2020 02:28AM
>> Cc: llvm-commits at lists.llvm.org
>> Subject: [EXTERNAL] Re: [llvm] ebf3b18 - [Scheduling] Implement a new way to cluster loads/stores
>> 
>> Hi QingShan,
>> 
>> I’m seeing significant compile time problems introduced by this change on AArch64, with a worst case example taking >10 minutes to compile.  The root cause seems to be the IsReachable() calls introduced here - they’re not cheap, and they’re being called a lot.  Is there some way we can back off or limit the number of queries here?
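>> 
>> For concreteness, here is a rough sketch of the kind of back-off I mean
>> (the flag name and numbers are hypothetical, not an existing option):
>> each candidate pair costs two reachability queries, and once a per-region
>> budget is spent we stop clustering rather than go quadratic on
>> pathological regions.
>> 
>> static cl::opt<unsigned> ReachQueryCap(
>>     "memop-cluster-reachability-cap", cl::Hidden, cl::init(1000),
>>     cl::desc("Max IsReachable queries per region when clustering mem ops"));
>> 
>> // In clusterNeighboringMemOps(), with NumQueries at function scope:
>> unsigned NumQueries = 0;
>> for (; NextIdx < End; ++NextIdx) {
>>   if ((NumQueries += 2) > ReachQueryCap)
>>     return; // Budget exhausted for this region; give up on clustering.
>>   if (!SUnit2ClusterInfo.count(MemOpRecords[NextIdx].SU->NodeNum) &&
>>       !DAG->IsReachable(MemOpRecords[NextIdx].SU, MemOpa.SU) &&
>>       !DAG->IsReachable(MemOpa.SU, MemOpRecords[NextIdx].SU))
>>     break;
>> }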
>> 
>> —Owen
>> 
>> > On Aug 26, 2020, at 5:34 AM, QingShan Zhang via llvm-commits <llvm-commits at lists.llvm.org> wrote:
>> > 
>> > 
>> > Author: QingShan Zhang
>> > Date: 2020-08-26T12:33:59Z
>> > New Revision: ebf3b188c6edcce7e90ddcacbe7c51c90d95b0ac
>> > 
>> > URL: https://github.com/llvm/llvm-project/commit/ebf3b188c6edcce7e90ddcacbe7c51c90d95b0ac
>> > DIFF: https://github.com/llvm/llvm-project/commit/ebf3b188c6edcce7e90ddcacbe7c51c90d95b0ac.diff
>> > 
>> > LOG: [Scheduling] Implement a new way to cluster loads/stores
>> > 
>> > Before calling the target hook to determine whether two loads/stores are
>> > clusterable, we put them into different groups to avoid false clustering
>> > caused by dependencies. For now, we put loads/stores into the same group
>> > if they have the same predecessor, on the assumption that two
>> > loads/stores with the same predecessor are unlikely to depend on each
>> > other.
>> > 
>> > However, one SUnit might have several predecessors, and for now we just
>> > pick the first predecessor with a non-data/non-artificial dependency,
>> > which is too arbitrary, and we have been struggling to fix it.
>> > 
>> > So, I am proposing a better implementation:
>> > 1. Collect all the loads/stores that have memory info first, to reduce
>> >  the complexity.
>> > 2. Sort these loads/stores so that the seeking can stop as early as
>> >  possible.
>> > 3. For each load/store, seek the first non-dependent instruction in the
>> >  sorted order, and check whether the pair can be clustered.
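>> > 
>> > A minimal standalone sketch of the scheme (illustrative names, not the
>> > patch itself): sort by (base, offset) so that likely partners end up
>> > adjacent, then scan forward from each op only until the first
>> > independent candidate.
>> > 
>> > #include <algorithm>
>> > #include <cstdint>
>> > #include <tuple>
>> > #include <vector>
>> > 
>> > struct MemOpRec {
>> >   unsigned Base;  // stand-in for the base reg / frame index in MemOpInfo
>> >   int64_t Offset; // stand-in for MemOpInfo::Offset
>> >   unsigned Node;  // stand-in for SUnit::NodeNum
>> >   bool operator<(const MemOpRec &R) const {
>> >     return std::tie(Base, Offset, Node) <
>> >            std::tie(R.Base, R.Offset, R.Node);
>> >   }
>> > };
>> > 
>> > // Depends(A, B) stands in for the two IsReachable checks.
>> > template <typename DependsFn>
>> > void clusterSketch(std::vector<MemOpRec> &Recs, DependsFn Depends) {
>> >   std::sort(Recs.begin(), Recs.end());
>> >   for (std::size_t I = 0; I + 1 < Recs.size(); ++I)
>> >     for (std::size_t J = I + 1; J < Recs.size(); ++J)
>> >       if (!Depends(Recs[I], Recs[J])) {
>> >         // First independent op in sorted order: this is the pair the
>> >         // target hook (shouldClusterMemOps) would be asked about.
>> >         break;
>> >       }
>> > }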
>> > 
>> > Reviewed By: Jay Foad
>> > 
>> > Differential Revision: https://reviews.llvm.org/D85517
>> > 
>> > Added: 
>> > 
>> > 
>> > Modified: 
>> >   llvm/include/llvm/CodeGen/ScheduleDAGInstrs.h
>> >   llvm/lib/CodeGen/MachineScheduler.cpp
>> >   llvm/test/CodeGen/AArch64/aarch64-stp-cluster.ll
>> >   llvm/test/CodeGen/AMDGPU/callee-special-input-vgprs.ll
>> >   llvm/test/CodeGen/AMDGPU/max.i16.ll
>> >   llvm/test/CodeGen/AMDGPU/stack-realign.ll
>> > 
>> > Removed: 
>> > 
>> > 
>> > 
>> > ################################################################################
>> > diff --git a/llvm/include/llvm/CodeGen/ScheduleDAGInstrs.h b/llvm/include/llvm/CodeGen/ScheduleDAGInstrs.h
>> > index 1eb9b9f322ba..d2b95209d7b4 100644
>> > --- a/llvm/include/llvm/CodeGen/ScheduleDAGInstrs.h
>> > +++ b/llvm/include/llvm/CodeGen/ScheduleDAGInstrs.h
>> > @@ -268,6 +268,11 @@ namespace llvm {
>> >      return SU->SchedClass;
>> >    }
>> > 
>> > +    /// IsReachable - Checks if SU is reachable from TargetSU.
>> > +    bool IsReachable(SUnit *SU, SUnit *TargetSU) {
>> > +      return Topo.IsReachable(SU, TargetSU);
>> > +    }
>> > +
>> >    /// Returns an iterator to the top of the current scheduling region.
>> >    MachineBasicBlock::iterator begin() const { return RegionBegin; }
>> > 
>> > 
>> > diff --git a/llvm/lib/CodeGen/MachineScheduler.cpp b/llvm/lib/CodeGen/MachineScheduler.cpp
>> > index a8ccf2643f20..b6d0d9a74ac1 100644
>> > --- a/llvm/lib/CodeGen/MachineScheduler.cpp
>> > +++ b/llvm/lib/CodeGen/MachineScheduler.cpp
>> > @@ -1530,7 +1530,10 @@ class BaseMemOpClusterMutation : public ScheduleDAGMutation {
>> >  void apply(ScheduleDAGInstrs *DAGInstrs) override;
>> > 
>> > protected:
>> > -  void clusterNeighboringMemOps(ArrayRef<SUnit *> MemOps, ScheduleDAGInstrs *DAG);
>> > +  void clusterNeighboringMemOps(ArrayRef<MemOpInfo> MemOps,
>> > +                                ScheduleDAGInstrs *DAG);
>> > +  void collectMemOpRecords(std::vector<SUnit> &SUnits,
>> > +                           SmallVectorImpl<MemOpInfo> &MemOpRecords);
>> > };
>> > 
>> > class StoreClusterMutation : public BaseMemOpClusterMutation {
>> > @@ -1566,63 +1569,53 @@ createStoreClusterDAGMutation(const TargetInstrInfo *TII,
>> > 
>> > } // end namespace llvm
>> > 
>> > +// Sort all the loads/stores first; then, for each load/store, check the
>> > +// following loads/stores one by one until we reach the first
>> > +// non-dependent one, and call the target hook to see if they can cluster.
>> > void BaseMemOpClusterMutation::clusterNeighboringMemOps(
>> > -    ArrayRef<SUnit *> MemOps, ScheduleDAGInstrs *DAG) {
>> > -  SmallVector<MemOpInfo, 32> MemOpRecords;
>> > -  for (SUnit *SU : MemOps) {
>> > -    const MachineInstr &MI = *SU->getInstr();
>> > -    SmallVector<const MachineOperand *, 4> BaseOps;
>> > -    int64_t Offset;
>> > -    bool OffsetIsScalable;
>> > -    unsigned Width;
>> > -    if (TII->getMemOperandsWithOffsetWidth(MI, BaseOps, Offset,
>> > -                                           OffsetIsScalable, Width, TRI)) {
>> > -      MemOpRecords.push_back(MemOpInfo(SU, BaseOps, Offset, Width));
>> > -
>> > -      LLVM_DEBUG(dbgs() << "Num BaseOps: " << BaseOps.size() << ", Offset: "
>> > -                        << Offset << ", OffsetIsScalable: " << OffsetIsScalable
>> > -                        << ", Width: " << Width << "\n");
>> > -    }
>> > -#ifndef NDEBUG
>> > -    for (auto *Op : BaseOps)
>> > -      assert(Op);
>> > -#endif
>> > -  }
>> > -  if (MemOpRecords.size() < 2)
>> > -    return;
>> > -
>> > -  llvm::sort(MemOpRecords);
>> > +    ArrayRef<MemOpInfo> MemOpRecords, ScheduleDAGInstrs *DAG) {
>> > +  // Keep track of the current cluster length and bytes for each SUnit.
>> > +  DenseMap<unsigned, std::pair<unsigned, unsigned>> SUnit2ClusterInfo;
>> > 
>> >  // At this point, `MemOpRecords` array must hold at least two mem ops. Try to
>> >  // cluster mem ops collected within `MemOpRecords` array.
>> > -  unsigned ClusterLength = 1;
>> > -  unsigned CurrentClusterBytes = MemOpRecords[0].Width;
>> >  for (unsigned Idx = 0, End = MemOpRecords.size(); Idx < (End - 1); ++Idx) {
>> >    // Decision to cluster mem ops is taken based on target dependent logic
>> >    auto MemOpa = MemOpRecords[Idx];
>> > -    auto MemOpb = MemOpRecords[Idx + 1];
>> > -    ++ClusterLength;
>> > -    CurrentClusterBytes += MemOpb.Width;
>> > -    if (!TII->shouldClusterMemOps(MemOpa.BaseOps, MemOpb.BaseOps, ClusterLength,
>> > -                                  CurrentClusterBytes)) {
>> > -      // Current mem ops pair could not be clustered, reset cluster length, and
>> > -      // go to next pair
>> > -      ClusterLength = 1;
>> > -      CurrentClusterBytes = MemOpb.Width;
>> > +
>> > +    // Seek for the next load/store to do the cluster.
>> > +    unsigned NextIdx = Idx + 1;
>> > +    for (; NextIdx < End; ++NextIdx)
>> > +      // Skip if MemOpb has been clustered already or has dependency with
>> > +      // MemOpa.
>> > +      if (!SUnit2ClusterInfo.count(MemOpRecords[NextIdx].SU->NodeNum) &&
>> > +          !DAG->IsReachable(MemOpRecords[NextIdx].SU, MemOpa.SU) &&
>> > +          !DAG->IsReachable(MemOpa.SU, MemOpRecords[NextIdx].SU))
>> > +        break;
>> > +    if (NextIdx == End)
>> >      continue;
>> > +
>> > +    auto MemOpb = MemOpRecords[NextIdx];
>> > +    unsigned ClusterLength = 2;
>> > +    unsigned CurrentClusterBytes = MemOpa.Width + MemOpb.Width;
>> > +    if (SUnit2ClusterInfo.count(MemOpa.SU->NodeNum)) {
>> > +      ClusterLength = SUnit2ClusterInfo[MemOpa.SU->NodeNum].first + 1;
>> > +      CurrentClusterBytes =
>> > +          SUnit2ClusterInfo[MemOpa.SU->NodeNum].second + MemOpb.Width;
>> >    }
>> > 
>> > +    if (!TII->shouldClusterMemOps(MemOpa.BaseOps, MemOpb.BaseOps, ClusterLength,
>> > +                                  CurrentClusterBytes))
>> > +      continue;
>> > +
>> >    SUnit *SUa = MemOpa.SU;
>> >    SUnit *SUb = MemOpb.SU;
>> >    if (SUa->NodeNum > SUb->NodeNum)
>> >      std::swap(SUa, SUb);
>> > 
>> >    // FIXME: Is this check really required?
>> > -    if (!DAG->addEdge(SUb, SDep(SUa, SDep::Cluster))) {
>> > -      ClusterLength = 1;
>> > -      CurrentClusterBytes = MemOpb.Width;
>> > +    if (!DAG->addEdge(SUb, SDep(SUa, SDep::Cluster)))
>> >      continue;
>> > -    }
>> > 
>> >    LLVM_DEBUG(dbgs() << "Cluster ld/st SU(" << SUa->NodeNum << ") - SU("
>> >                      << SUb->NodeNum << ")\n");
>> > @@ -1656,42 +1649,57 @@ void BaseMemOpClusterMutation::clusterNeighboringMemOps(
>> >      }
>> >    }
>> > 
>> > +    SUnit2ClusterInfo[MemOpb.SU->NodeNum] = {ClusterLength,
>> > +                                             CurrentClusterBytes};
>> > +
>> >    LLVM_DEBUG(dbgs() << "  Curr cluster length: " << ClusterLength
>> >                      << ", Curr cluster bytes: " << CurrentClusterBytes
>> >                      << "\n");
>> >  }
>> > }
>> > 
>> > -/// Callback from DAG postProcessing to create cluster edges for loads.
>> > -void BaseMemOpClusterMutation::apply(ScheduleDAGInstrs *DAG) {
>> > -  // Map DAG NodeNum to a set of dependent MemOps in store chain.
>> > -  DenseMap<unsigned, SmallVector<SUnit *, 4>> StoreChains;
>> > -  for (SUnit &SU : DAG->SUnits) {
>> > +void BaseMemOpClusterMutation::collectMemOpRecords(
>> > +    std::vector<SUnit> &SUnits, SmallVectorImpl<MemOpInfo> &MemOpRecords) {
>> > +  for (auto &SU : SUnits) {
>> >    if ((IsLoad && !SU.getInstr()->mayLoad()) ||
>> >        (!IsLoad && !SU.getInstr()->mayStore()))
>> >      continue;
>> > 
>> > -    unsigned ChainPredID = DAG->SUnits.size();
>> > -    for (const SDep &Pred : SU.Preds) {
>> > -      // We only want to cluster the mem ops that have the same ctrl(non-data)
>> > -      // pred so that they didn't have ctrl dependency for each other. But for
>> > -      // store instrs, we can still cluster them if the pred is load instr.
>> > -      if ((Pred.isCtrl() &&
>> > -           (IsLoad ||
>> > -            (Pred.getSUnit() && Pred.getSUnit()->getInstr()->mayStore()))) &&
>> > -          !Pred.isArtificial()) {
>> > -        ChainPredID = Pred.getSUnit()->NodeNum;
>> > -        break;
>> > -      }
>> > +    const MachineInstr &MI = *SU.getInstr();
>> > +    SmallVector<const MachineOperand *, 4> BaseOps;
>> > +    int64_t Offset;
>> > +    bool OffsetIsScalable;
>> > +    unsigned Width;
>> > +    if (TII->getMemOperandsWithOffsetWidth(MI, BaseOps, Offset,
>> > +                                           OffsetIsScalable, Width, TRI)) {
>> > +      MemOpRecords.push_back(MemOpInfo(&SU, BaseOps, Offset, Width));
>> > +
>> > +      LLVM_DEBUG(dbgs() << "Num BaseOps: " << BaseOps.size() << ", Offset: "
>> > +                        << Offset << ", OffsetIsScalable: " << OffsetIsScalable
>> > +                        << ", Width: " << Width << "\n");
>> >    }
>> > -    // Insert the SU to corresponding store chain.
>> > -    auto &Chain = StoreChains.FindAndConstruct(ChainPredID).second;
>> > -    Chain.push_back(&SU);
>> > +#ifndef NDEBUG
>> > +    for (auto *Op : BaseOps)
>> > +      assert(Op);
>> > +#endif
>> >  }
>> > +}
>> > +
>> > +/// Callback from DAG postProcessing to create cluster edges for loads/stores.
>> > +void BaseMemOpClusterMutation::apply(ScheduleDAGInstrs *DAG) {
>> > +  // Collect all the clusterable loads/stores
>> > +  SmallVector<MemOpInfo, 32> MemOpRecords;
>> > +  collectMemOpRecords(DAG->SUnits, MemOpRecords);
>> > +
>> > +  if (MemOpRecords.size() < 2)
>> > +    return;
>> > +
>> > +  // Sort the loads/stores so that we can stop clustering as early as
>> > +  // possible.
>> > +  llvm::sort(MemOpRecords);
>> > 
>> > -  // Iterate over the store chains.
>> > -  for (auto &SCD : StoreChains)
>> > -    clusterNeighboringMemOps(SCD.second, DAG);
>> > +  // Try to cluster all the neighboring loads/stores.
>> > +  clusterNeighboringMemOps(MemOpRecords, DAG);
>> > }
>> > 
>> > //===----------------------------------------------------------------------===//
>> > 
>> > diff --git a/llvm/test/CodeGen/AArch64/aarch64-stp-cluster.ll b/llvm/test/CodeGen/AArch64/aarch64-stp-cluster.ll
>> > index b0ed3d0490cc..e95321582def 100644
>> > --- a/llvm/test/CodeGen/AArch64/aarch64-stp-cluster.ll
>> > +++ b/llvm/test/CodeGen/AArch64/aarch64-stp-cluster.ll
>> > @@ -214,11 +214,11 @@ entry:
>> >  ret void
>> > }
>> > 
>> > -; FIXME - The SU(4) and SU(7) can be clustered even with
>> > +; Verify that the SU(4) and SU(7) can be clustered even with
>> > ; different preds
>> > ; CHECK: ********** MI Scheduling **********
>> > ; CHECK-LABEL: cluster_with_different_preds:%bb.0
>> > -; CHECK-NOT:Cluster ld/st SU(4) - SU(7)
>> > +; CHECK:Cluster ld/st SU(4) - SU(7)
>> > ; CHECK:SU(3):   STRWui %2:gpr32, %0:gpr64common, 0 ::
>> > ; CHECK:SU(4):   %3:gpr32 = LDRWui %1:gpr64common, 0 ::
>> > ; CHECK:Predecessors:
>> > 
>> > diff --git a/llvm/test/CodeGen/AMDGPU/callee-special-input-vgprs.ll b/llvm/test/CodeGen/AMDGPU/callee-special-input-vgprs.ll
>> > index 2dc47ca94aa9..d23538cadcbd 100644
>> > --- a/llvm/test/CodeGen/AMDGPU/callee-special-input-vgprs.ll
>> > +++ b/llvm/test/CodeGen/AMDGPU/callee-special-input-vgprs.ll
>> > @@ -624,9 +624,9 @@ define void @too_many_args_use_workitem_id_x_byval(
>> > 
>> > 
>> > ; FIXEDABI: v_mov_b32_e32 [[K0:v[0-9]+]], 0x3e7
>> > +; FIXEDABI: buffer_store_dword [[K0]], off, s[0:3], 0 offset:4{{$}}
>> > ; FIXEDABI: s_movk_i32 s32, 0x400{{$}}
>> > ; FIXEDABI: v_mov_b32_e32 [[K1:v[0-9]+]], 0x140
>> > -; FIXEDABI: buffer_store_dword [[K0]], off, s[0:3], 0 offset:4{{$}}
>> > 
>> > ; FIXEDABI: buffer_store_dword [[K1]], off, s[0:3], s32{{$}}
>> > 
>> > @@ -669,8 +669,8 @@ define amdgpu_kernel void @kern_call_too_many_args_use_workitem_id_x_byval() #1
>> > 
>> > ; FIXED-ABI-NOT: v31
>> > ; FIXEDABI: v_mov_b32_e32 [[K0:v[0-9]+]], 0x3e7{{$}}
>> > -; FIXEDABI: v_mov_b32_e32 [[K1:v[0-9]+]], 0x140{{$}}
>> > ; FIXEDABI: buffer_store_dword [[K0]], off, s[0:3], s33{{$}}
>> > +; FIXEDABI: v_mov_b32_e32 [[K1:v[0-9]+]], 0x140{{$}}
>> > ; FIXEDABI: buffer_store_dword [[K1]], off, s[0:3], s32{{$}}
>> > ; FIXEDABI: buffer_load_dword [[RELOAD_BYVAL:v[0-9]+]], off, s[0:3], s33{{$}}
>> > 
>> > 
>> > diff --git a/llvm/test/CodeGen/AMDGPU/max.i16.ll b/llvm/test/CodeGen/AMDGPU/max.i16.ll
>> > index c64e400e628f..7c4ce5d6c1ff 100644
>> > --- a/llvm/test/CodeGen/AMDGPU/max.i16.ll
>> > +++ b/llvm/test/CodeGen/AMDGPU/max.i16.ll
>> > @@ -145,14 +145,14 @@ define amdgpu_kernel void @v_test_imax_sge_v3i16(<3 x i16> addrspace(1)* %out, <
>> > ; GFX9-NEXT:    v_mov_b32_e32 v1, 0
>> > ; GFX9-NEXT:    v_mov_b32_e32 v2, 0
>> > ; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
>> > -; GFX9-NEXT:    global_load_short_d16 v2, v0, s[6:7] offset:4
>> > ; GFX9-NEXT:    global_load_short_d16 v1, v0, s[0:1] offset:4
>> > -; GFX9-NEXT:    global_load_dword v3, v0, s[6:7]
>> > -; GFX9-NEXT:    global_load_dword v4, v0, s[0:1]
>> > -; GFX9-NEXT:    s_waitcnt vmcnt(2)
>> > +; GFX9-NEXT:    global_load_dword v3, v0, s[0:1]
>> > +; GFX9-NEXT:    global_load_short_d16 v2, v0, s[6:7] offset:4
>> > +; GFX9-NEXT:    global_load_dword v4, v0, s[6:7]
>> > +; GFX9-NEXT:    s_waitcnt vmcnt(1)
>> > ; GFX9-NEXT:    v_pk_max_i16 v1, v2, v1
>> > ; GFX9-NEXT:    s_waitcnt vmcnt(0)
>> > -; GFX9-NEXT:    v_pk_max_i16 v3, v3, v4
>> > +; GFX9-NEXT:    v_pk_max_i16 v3, v4, v3
>> > ; GFX9-NEXT:    global_store_short v0, v1, s[4:5] offset:4
>> > ; GFX9-NEXT:    global_store_dword v0, v3, s[4:5]
>> > ; GFX9-NEXT:    s_endpgm
>> > 
>> > diff --git a/llvm/test/CodeGen/AMDGPU/stack-realign.ll b/llvm/test/CodeGen/AMDGPU/stack-realign.ll
>> > index 74b53802ef5b..e8e3518aed1c 100644
>> > --- a/llvm/test/CodeGen/AMDGPU/stack-realign.ll
>> > +++ b/llvm/test/CodeGen/AMDGPU/stack-realign.ll
>> > @@ -160,16 +160,14 @@ define void @func_call_align1024_bp_gets_vgpr_spill(<32 x i32> %a, i32 %b) #0 {
>> > ; GCN-NEXT: s_mov_b64 exec, s[4:5]
>> > ; GCN-NEXT: v_writelane_b32 [[VGPR_REG]], s33, 2
>> > ; GCN-NEXT: v_writelane_b32 [[VGPR_REG]], s34, 3
>> > -; GCN: s_mov_b32 s34, s32
>> > ; GCN: s_add_u32 [[SCRATCH_REG:s[0-9]+]], s32, 0xffc0
>> > ; GCN: s_and_b32 s33, [[SCRATCH_REG]], 0xffff0000
>> > +; GCN: s_mov_b32 s34, s32
>> > +; GCN: v_mov_b32_e32 v32, 0
>> > +; GCN: buffer_store_dword v32, off, s[0:3], s33 offset:1024
>> > ; GCN-NEXT: buffer_load_dword v{{[0-9]+}}, off, s[0:3], s34
>> > ; GCN-NEXT: s_add_u32 s32, s32, 0x30000
>> > 
>> > -; GCN: v_mov_b32_e32 v33, 0
>> > -
>> > -; GCN: buffer_store_dword v33, off, s[0:3], s33 offset:1024
>> > -
>> > ; GCN: buffer_store_dword v{{[0-9]+}}, off, s[0:3], s32
>> > ; GCN-NEXT: s_swappc_b64 s[30:31], s[4:5]
>> > 
>> > 
>> > 
>> > 
>> > _______________________________________________
>> > llvm-commits mailing list
>> > llvm-commits at lists.llvm.org
>> > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
>> 
>> 
> 
> 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: slow3.bc.xz
Type: application/octet-stream
Size: 205464 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20201023/59ea1e4a/attachment-0001.obj>

