[PATCH] R600: Make sure OQAP defs and uses happen in the same clause

Tue Nov 12 15:01:42 PST 2013

Hi Vincent,

Here is an updated patch where I added a call to
SubstituteKCacheBank() in canClauseLocalKillFitInClause()  This should
prevent OQAP uses and defs from being split because of constant bank
limitations.

Maybe we can leave the ScheduleDAGMutation optimization as a future
TODO.

-Tom

On Sun, Nov 03, 2013 at 10:19:16AM -0800, Vincent Lejeune wrote:
> I have put some comments below but otherwise the patch is
> reviewed-by: Vincent Lejeune <vljn at ovi.com>
> 
> 
> >-------------- next part --------------
> >>From 2eb4673e3184af0e077cbe30a594602441e8d98e Mon Sep 17 00:00:00 2001 >From: Tom Stellard <thomas.stellard at amd.com>
> >Date: Thu, 5 Sep 2013 08:59:32 -0700
> >Subject: [PATCH] R600: Fix scheduling of instructions that use the LDS output
> > queue
> >
> >The LDS output queue is accessed via the OQAP register.  The OQAP
> >register cannot be live across clauses, so if value is written to the
> >output queue, it must be retrieved before the end of the clause.
> >With the machine scheduler, we cannot statisfy this constraint, because
> >it lacks proper alias analysis and it will mark some LDS accesses as
> >having a chain dependency on vertex fetches.  Since vertex fetches
> 
> We can customize the dependency graph before machine scheduling takes place,
> using ScheduleDAGMutation.
> I already wrote some code to break artificial dependencies between vector
> subregister read/write here :
> http://cgit.freedesktop.org/~vlj/llvm/commit/?h=vliw5&id=e91b16a22845d0a80ed348f158ae7ab293e003a8
> While I'm expecting from Matthias Braun's Subregister patches to be upstreamed
> to obsolete most of this patch except tests, it can be reworked so that
> it'll parse all MEM dependency, and remove the ones between instructions
> touching different memory pool (like VTX_FETCH and LDS_READ).
> 
> >require a new clauses, the dependency may end up spiltting OQAP uses and
> >defs so the end up in different clauses.  See the lds-output-queue.ll
> >test for a more detailed explanation.
> >
> >To work around this issue, we now combine the LDS read and the OQAP
> >copy into one instruction and expand it after register allocation.
> >
> >This patch also adds some checks to the EmitClauseMarker pass, so that
> >it doesn't end a clause with a value still in the output queue and
> >removes AR.X and OQAP handling from the scheduler (AR.X uses and defs
> >were already being expanded post-RA, so the scheduler will never see
> >them).
> >---
> > lib/Target/R600/R600EmitClauseMarkers.cpp     | 52 ++++++++++++++
> > lib/Target/R600/R600ExpandSpecialInstrs.cpp   | 17 +++++
> > lib/Target/R600/R600ISelLowering.cpp          | 20 +++---
> > lib/Target/R600/R600InstrInfo.cpp             |  8 +++
> > lib/Target/R600/R600InstrInfo.h               |  2 +
> > lib/Target/R600/R600MachineScheduler.cpp      | 32 ---------
> > lib/Target/R600/R600MachineScheduler.h        |  2 -
> > lib/Target/R600/R600RegisterInfo.cpp          | 13 ++++
> > lib/Target/R600/R600RegisterInfo.h            |  2 +
> > test/CodeGen/R600/lds-output-queue.ll         | 99 +++++++++++++++++++++++++++
> > test/CodeGen/R600/local-memory-two-objects.ll |  8 ++-
> > 11 files changed, 206 insertions(+), 49 deletions(-)
> > create mode 100644 test/CodeGen/R600/lds-output-queue.ll
> >
> >diff --git a/lib/Target/R600/R600EmitClauseMarkers.cpp b/lib/Target/R600/R600EmitClauseMarkers.cpp
> 
> 
> >+  bool canClauseLocalKillFitInClause(
> >+                             unsigned AluInstCount,
> >+                             MachineBasicBlock::iterator Def,
> >+                             MachineBasicBlock::iterator BBEnd) {
> >+    const R600RegisterInfo &TRI = TII->getRegisterInfo();
> >+    for (MachineInstr::const_mop_iterator
> >+           MOI = Def->operands_begin(),
> >+           MOE = Def->operands_end(); MOI != MOE; ++MOI) {
> >+      if (!MOI->isReg() || !MOI->isDef() ||
> >+          TRI.isPhysRegLiveAcrossClauses(MOI->getReg()))
> >+        continue;
> >+
> >+      // Def defines a clause local register, so check that its use will fit
> >+      // in the clause.
> >+      unsigned LastUseCount = 0;
> >+      for (MachineBasicBlock::iterator UseI = Def; UseI != BBEnd; ++UseI) {
> >+        AluInstCount += OccupiedDwords(UseI);
> >+        // We have reached the maximum instruction limit before finding the
> >+        // use that kills this register, so we cannot use this def in the
> >+        // current clause.
> >+        if (AluInstCount >= TII->getMaxAlusPerClause())
> >+          return false;
> >+
> >+        // Register kill flags have been cleared by the time we get to this
> >+        // pass, but it is safe to assume that all uses of this register
> >+        // occur in the same basic block as its definition, because
> >+        // it is illegal for the scheduler to schedule them in
> >+        // different blocks.
> >+        if (UseI->findRegisterUseOperandIdx(MOI->getReg()))
> >+          LastUseCount = AluInstCount;
> >+
> >+        if (UseI != Def && UseI->findRegisterDefOperandIdx(MOI->getReg()) != -1)
> >+          break;
> >+      }
> >+      if (LastUseCount)
> >+        return LastUseCount <= TII->getMaxAlusPerClause();
> >+      llvm_unreachable("Clause local register live at end of clause.");
> >+    }
> >+    return true;
> >+  }
> 
> This function does not check if current clause can hold all constant bank.
> I think it's likely to be rare for a clause to be split because of constant bank limitations,
> but it would be better to have an assertion failure in such case to make debugging easier.
> For instance if the SubstituteKCacheBank return false, you can check that there is no lds
> use still pending.
> 
> 
> ----- Mail original -----
> > De : Tom Stellard <tom at stellard.net>
> > À : Vincent Lejeune <vljn at ovi.com>
> > Cc : "llvm-commits at cs.uiuc.edu" <llvm-commits at cs.uiuc.edu>; "mesa-dev at lists.freedesktop.org" <mesa-dev at lists.freedesktop.org>; Tom Stellard <thomas.stellard at amd.com>
> > Envoyé le : Mercredi 30 octobre 2013 17h46
> > Objet : Re: [PATCH] R600: Make sure OQAP defs and uses happen in the same clause
> > 
> > Hi Vincent,
> > 
> > It turns out that it's not possible to correctly schedule uses and defs
> > of the OQAP register without proper alias analysis in the MachineScheduler.  See
> > the explanation in the lds-output-queue.ll test case.
> > 
> > Here is an updated patch that fixes all the outstanding LDS scheduling
> > bugs that I am aware of.
> > 
> > -Tom
> > 
> > On Fri, Oct 25, 2013 at 05:42:13AM -0700, Vincent Lejeune wrote:
> >>  This patch should work when checking than no OQAP is used before beeing 
> > queued, assuming that a value in OQAP is consumed
> >>  and cannot be read twice. However I'm not sure I cover all LDS 
> > instructions that queues a value, I only use LDS_RET_READ in switch case.
> >> 
> >>  Vincent
> >> 
> >> 
> >> 
> >>  ----- Mail original -----
> >>  > De : Tom Stellard <tom at stellard.net>
> >>  > À : Vincent Lejeune <vljn at ovi.com>
> >>  > Cc : "llvm-commits at cs.uiuc.edu" 
> > <llvm-commits at cs.uiuc.edu>; "mesa-dev at lists.freedesktop.org" 
> > <mesa-dev at lists.freedesktop.org>; Tom Stellard 
> > <thomas.stellard at amd.com>
> >>  > Envoyé le : Mardi 22 octobre 2013 23h20
> >>  > Objet : Re: [PATCH] R600: Make sure OQAP defs and uses happen in the 
> > same clause
> >>  > 
> >>  > Hi Vincent,
> >>  > 
> >>  > Here is an updated patch.  I wasn't sure where to put the 
> > assertion to
> >>  > check that UnscheduledNoLiveOut{Defs,Uses} is empty when 
> > switching to a
> >>  > new clause.  I tried adding it to R600SchedStartegy::schedNode() 
> > behind
> >>  > the if (NextInstKind != CurInstKind) condition, but it always failed.
> >>  > Any suggestions on where I should but it?
> >>  > 
> >>  > -Tom
> >>  > 
> >>  > 
> >>  > On Mon, Oct 21, 2013 at 12:40:28PM -0700, Vincent Lejeune wrote:
> >>  >> 
> >>  >> 
> >>  >> 
> >>  >> 
> >>  >>  ----- Mail original -----
> >>  >>  > De : Tom Stellard <tom at stellard.net>
> >>  >>  > À : llvm-commits at cs.uiuc.edu
> >>  >>  > Cc : mesa-dev at lists.freedesktop.org; Tom Stellard 
> >>  > <thomas.stellard at amd.com>
> >>  >>  > Envoyé le : Vendredi 11 octobre 2013 20h10
> >>  >>  > Objet : [PATCH] R600: Make sure OQAP defs and uses happen in 
> > the same 
> >>  > clause
> >>  >>  > 
> >>  >>  > From: Tom Stellard <thomas.stellard at amd.com>
> >>  >>  > 
> >>  >>  > Reading the special OQAP register pops the top value off the 
> > LDS
> >>  >>  > input queue and returns it to the instruction.  This queue 
> > is
> >>  >>  > invalidated at the end of an ALU clause and leaving values 
> > in the 
> >>  > queue
> >>  >>  > can lead to GPU hangs.  This means that if we load a value 
> > into the 
> >>  > queue,
> >>  >>  > we must use it before the end of the clause.
> >>  >>  > 
> >>  >>  > This fixes some hangs in the OpenCV test suite.
> >>  >>  > ---
> >>  >>  > lib/Target/R600/R600MachineScheduler.cpp | 25 
> >>  > +++++++++++++------------
> >>  >>  > lib/Target/R600/R600MachineScheduler.h   |  4 ++--
> >>  >>  > test/CodeGen/R600/lds-input-queue.ll     | 26 
> >>  > ++++++++++++++++++++++++++
> >>  >>  > 3 files changed, 41 insertions(+), 14 deletions(-)
> >>  >>  > create mode 100644 test/CodeGen/R600/lds-input-queue.ll
> >>  >>  > 
> >>  >>  > diff --git a/lib/Target/R600/R600MachineScheduler.cpp 
> >>  >>  > b/lib/Target/R600/R600MachineScheduler.cpp
> >>  >>  > index 6c26d9e..611b7f4 100644
> >>  >>  > --- a/lib/Target/R600/R600MachineScheduler.cpp
> >>  >>  > +++ b/lib/Target/R600/R600MachineScheduler.cpp
> >>  >>  > @@ -93,11 +93,12 @@ SUnit* R600SchedStrategy::pickNode(bool 
> >>  > &IsTopNode) 
> >>  >>  > {
> >>  >>  >    }
> >>  >>  > 
> >>  >>  > 
> >>  >>  > -  // We want to scheduled AR defs as soon as possible to 
> > make sure 
> >>  > they 
> >>  >>  > aren't
> >>  >>  > -  // put in a different ALU clause from their uses.
> >>  >>  > -  if (!SU && !UnscheduledARDefs.empty()) {
> >>  >>  > -      SU = UnscheduledARDefs[0];
> >>  >>  > -      UnscheduledARDefs.erase(UnscheduledARDefs.begin());
> >>  >>  > +  // We want to scheduled defs that cannot be live outside 
> > of this 
> >>  > clause 
> >>  >>  > +  // as soon as possible to make sure they aren't put 
> > in a 
> >>  > different
> >>  >>  > +  // ALU clause from their uses.
> >>  >>  > +  if (!SU && !UnscheduledNoLiveOutDefs.empty()) 
> > {
> >>  >>  > +      SU = UnscheduledNoLiveOutDefs[0];
> >>  >>  > +      
> >>  > UnscheduledNoLiveOutDefs.erase(UnscheduledNoLiveOutDefs.begin());
> >>  >>  >        NextInstKind = IDAlu;
> >>  >>  >    }
> >>  >>  > 
> >>  >>  > @@ -132,9 +133,9 @@ SUnit* R600SchedStrategy::pickNode(bool 
> >>  > &IsTopNode) 
> >>  >>  > {
> >>  >>  > 
> >>  >>  >    // We want to schedule the AR uses as late as possible to 
> > make sure 
> >>  > that
> >>  >>  >    // the AR defs have been released.
> >>  >>  > -  if (!SU && !UnscheduledARUses.empty()) {
> >>  >>  > -      SU = UnscheduledARUses[0];
> >>  >>  > -      UnscheduledARUses.erase(UnscheduledARUses.begin());
> >>  >>  > +  if (!SU && !UnscheduledNoLiveOutUses.empty()) 
> > {
> >>  >>  > +      SU = UnscheduledNoLiveOutUses[0];
> >>  >>  > +      
> >>  > UnscheduledNoLiveOutUses.erase(UnscheduledNoLiveOutUses.begin());
> >>  >> 
> >>  >>  Can we use std::queue<SUnit*> instead of a std::vector for 
> >>  > UnscheduledNoLiveOutUses ?
> >>  >>  I had to use a vector because I needed to be able to pop non 
> > topmost SUnit 
> >>  > in some case
> >>  >>  (to fit Instruction Group const read limitation) but I would 
> > rather avoid 
> >>  > erase(iterator) call
> >>  >>  when possible.
> >>  >> 
> >>  >> 
> >>  >>  >        NextInstKind = IDAlu;
> >>  >>  >    }
> >>  >>  > 
> >>  >>  > @@ -217,15 +218,15 @@ void 
> > R600SchedStrategy::releaseBottomNode(SUnit 
> >>  > *SU) 
> >>  >>  > {
> >>  >>  > 
> >>  >>  >    int IK = getInstKind(SU);
> >>  >>  > 
> >>  >>  > -  // Check for AR register defines
> >>  >>  > +  // Check for registers that do not live across ALU 
> > clauses.
> >>  >>  >    for (MachineInstr::const_mop_iterator I = 
> >>  >>  > SU->getInstr()->operands_begin(),
> >>  >>  >                                          E = 
> >>  >>  > SU->getInstr()->operands_end();
> >>  >>  >                                          I != E; ++I) {
> >>  >>  > -    if (I->isReg() && I->getReg() == 
> > AMDGPU::AR_X) 
> >>  > {
> >>  >>  > +    if (I->isReg() && (I->getReg() == 
> > AMDGPU::AR_X || 
> >>  >>  > I->getReg() == AMDGPU::OQAP)) {
> >>  >>  >        if (I->isDef()) {
> >>  >>  > -        UnscheduledARDefs.push_back(SU);
> >>  >>  > +        UnscheduledNoLiveOutDefs.push_back(SU);
> >>  >>  >        } else {
> >>  >>  > -        UnscheduledARUses.push_back(SU);
> >>  >>  > +        UnscheduledNoLiveOutUses.push_back(SU);
> >>  >>  >        }
> >>  >>  >        return;
> >>  >>  >      }
> >>  >>  > diff --git a/lib/Target/R600/R600MachineScheduler.h 
> >>  >>  > b/lib/Target/R600/R600MachineScheduler.h
> >>  >>  > index 0a6f120..db2e188 100644
> >>  >>  > --- a/lib/Target/R600/R600MachineScheduler.h
> >>  >>  > +++ b/lib/Target/R600/R600MachineScheduler.h
> >>  >>  > @@ -53,8 +53,8 @@ class R600SchedStrategy : public 
> >>  > MachineSchedStrategy {
> >>  >>  > 
> >>  >>  >    std::vector<SUnit *> Available[IDLast], 
> > Pending[IDLast];
> >>  >>  >    std::vector<SUnit *> AvailableAlus[AluLast];
> >>  >>  > -  std::vector<SUnit *> UnscheduledARDefs;
> >>  >>  > -  std::vector<SUnit *> UnscheduledARUses;
> >>  >>  > +  std::vector<SUnit *> UnscheduledNoLiveOutDefs;
> >>  >>  > +  std::vector<SUnit *> UnscheduledNoLiveOutUses;
> >>  >>  >    std::vector<SUnit *> PhysicalRegCopy;
> >>  >>  > 
> >>  >>  >    InstKind CurInstKind;
> >>  >>  > diff --git a/test/CodeGen/R600/lds-input-queue.ll 
> >>  >>  > b/test/CodeGen/R600/lds-input-queue.ll
> >>  >>  > new file mode 100644
> >>  >>  > index 0000000..548b41c
> >>  >>  > --- /dev/null
> >>  >>  > +++ b/test/CodeGen/R600/lds-input-queue.ll
> >>  >>  > @@ -0,0 +1,26 @@
> >>  >>  > +; RUN: llc < %s -march=r600 -mcpu=redwood | FileCheck %s
> >>  >> 
> >>  >>  Does the test work with -verify-machineinstrs flag set ?
> >>  >> 
> >>  >>  > +;
> >>  >>  > +; This test checks that the lds input queue will is empty 
> > at the end 
> >>  > of
> >>  >>  > +; the ALU clause.
> >>  >>  > +
> >>  >>  > +; CHECK-LABEL: @lds_input_queue
> >>  >>  > +; CHECK: LDS_READ_RET * OQAP
> >>  >>  > +; CHECK-NOT: ALU clause
> >>  >>  > +; CHECK: MOV T{{[0-9]\.[XYZW]}}, OQAP
> >>  >>  > +
> >>  >>  > + at local_mem = internal addrspace(3) unnamed_addr global [2 x 
> > i32] [i32 
> >>  > 1, i32 
> >>  >>  > 2], align 4
> >>  >>  > +
> >>  >>  > +define void @lds_input_queue(i32 addrspace(1)* %out, i32 
> >>  > addrspace(1)* %in, i32 
> >>  >>  > %index) {
> >>  >>  > +entry:
> >>  >>  > +  %0 = getelementptr inbounds [2 x i32] addrspace(3)* 
> > @local_mem, i32 
> >>  > 0, i32 
> >>  >>  > %index
> >>  >>  > +  %1 = load i32 addrspace(3)* %0
> >>  >>  > +  call void @llvm.AMDGPU.barrier.local()
> >>  >>  > +
> >>  >>  > +  ; This will start a new clause for the vertex fetch
> >>  >>  > +  %2 = load i32 addrspace(1)* %in
> >>  >>  > +  %3 = add i32 %1, %2
> >>  >>  > +  store i32 %3, i32 addrspace(1)* %out
> >>  >>  > +  ret void
> >>  >>  > +}
> >>  >>  > +
> >>  >>  > +declare void @llvm.AMDGPU.barrier.local()
> >>  >>  > -- 
> >>  >>  > 1.7.11.4
> >>  >>  > 
> >>  >>  > _______________________________________________
> >>  >>  > llvm-commits mailing list
> >>  >>  > llvm-commits at cs.uiuc.edu
> >>  >>  > http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> >>  >>  >
> >>  > 
> > 
> >>  From 956c9986fcda84c4a4b0a555432319bbdfd98cfc Mon Sep 17 00:00:00 2001
> >>  From: Vincent Lejeune <vljn at ovi.com>
> >>  Date: Fri, 25 Oct 2013 14:32:35 +0200
> >>  Subject: [PATCH] R600: Uses Assert for OQAP
> >> 
> >>  ---
> >>   lib/Target/R600/R600EmitClauseMarkers.cpp | 31 
> > +++++++++++++++++++++++++++++++
> >>   lib/Target/R600/R600ISelLowering.cpp      |  2 ++
> >>   2 files changed, 33 insertions(+)
> >> 
> >>  diff --git a/lib/Target/R600/R600EmitClauseMarkers.cpp 
> > b/lib/Target/R600/R600EmitClauseMarkers.cpp
> >>  index 928c0e3..9951b15 100644
> >>  --- a/lib/Target/R600/R600EmitClauseMarkers.cpp
> >>  +++ b/lib/Target/R600/R600EmitClauseMarkers.cpp
> >>  @@ -33,6 +33,8 @@ private:
> >>     static char ID;
> >>     const R600InstrInfo *TII;
> >>     int Address;
> >>  +  int OQAPLevel;
> >>  +  bool ArDefined;
> >>   
> >>     unsigned OccupiedDwords(MachineInstr *MI) const {
> >>       switch (MI->getOpcode()) {
> >>  @@ -160,8 +162,36 @@ private:
> >>       return true;
> >>     }
> >>   
> >>  +  void AssertARAndOQAPCorrectness(MachineInstr *MI) {
> >>  +    for (MachineInstr::mop_iterator MOp = MI->operands_begin(),
> >>  +        E = MI->operands_end(); MOp != E; MOp++) {
> >>  +      MachineOperand &MO = *MOp;
> >>  +      if (!MO.isReg() || MO.isDef())
> >>  +        continue;
> >>  +      switch (MO.getReg()) {
> >>  +      default: break;
> >>  +      case AMDGPU::OQAP:
> >>  +        OQAPLevel--;
> >>  +        assert (OQAPLevel >= 0 && "OQAP not defined in 
> > this clause !");
> >>  +        break;
> >>  +      }
> >>  +    }
> >>  +    switch (MI->getOpcode()) {
> >>  +    default: break;
> >>  +    case AMDGPU::MOVA_INT_eg:
> >>  +      ArDefined = true;
> >>  +      break;
> >>  +    case AMDGPU::LDS_READ_RET:
> >>  +      OQAPLevel++;
> >>  +      break;
> >>  +    }
> >>  +  }
> >>  +
> >>     MachineBasicBlock::iterator
> >>     MakeALUClause(MachineBasicBlock &MBB, MachineBasicBlock::iterator I) 
> > {
> >>  +    // LDS queue/ AR registers get resetted in new clause.
> >>  +    OQAPLevel = 0;
> >>  +    ArDefined = false;
> >>       MachineBasicBlock::iterator ClauseHead = I;
> >>       std::vector<std::pair<unsigned, unsigned> > KCacheBanks;
> >>       bool PushBeforeModifier = false;
> >>  @@ -205,6 +235,7 @@ private:
> >>             !SubstituteKCacheBank(I, KCacheBanks))
> >>           break;
> >>         AluInstCount += OccupiedDwords(I);
> >>  +      AssertARAndOQAPCorrectness(I);
> >>       }
> >>       unsigned Opcode = PushBeforeModifier ?
> >>           AMDGPU::CF_ALU_PUSH_BEFORE : AMDGPU::CF_ALU;
> >>  diff --git a/lib/Target/R600/R600ISelLowering.cpp 
> > b/lib/Target/R600/R600ISelLowering.cpp
> >>  index 3c2e388..d8cbcae 100644
> >>  --- a/lib/Target/R600/R600ISelLowering.cpp
> >>  +++ b/lib/Target/R600/R600ISelLowering.cpp
> >>  @@ -215,6 +215,7 @@ MachineBasicBlock * 
> > R600TargetLowering::EmitInstrWithCustomInserter(
> >>     }
> >>   
> >>     case AMDGPU::TXD: {
> >>  +    abort();
> >>       unsigned T0 = 
> > MRI.createVirtualRegister(&AMDGPU::R600_Reg128RegClass);
> >>       unsigned T1 = 
> > MRI.createVirtualRegister(&AMDGPU::R600_Reg128RegClass);
> >>       MachineOperand &RID = MI->getOperand(4);
> >>  @@ -316,6 +317,7 @@ MachineBasicBlock * 
> > R600TargetLowering::EmitInstrWithCustomInserter(
> >>     }
> >>   
> >>     case AMDGPU::TXD_SHADOW: {
> >>  +    abort();
> >>       unsigned T0 = 
> > MRI.createVirtualRegister(&AMDGPU::R600_Reg128RegClass);
> >>       unsigned T1 = 
> > MRI.createVirtualRegister(&AMDGPU::R600_Reg128RegClass);
> >>       MachineOperand &RID = MI->getOperand(4);
> >>  -- 
> >>  1.8.3.1
> > 
> >> 
> >
-------------- next part --------------
>From fa5abaa99d923fd429029ed149805063848f3184 Mon Sep 17 00:00:00 2001
From: Tom Stellard <thomas.stellard at amd.com>
Date: Thu, 5 Sep 2013 08:59:32 -0700
Subject: [PATCH] R600: Fix scheduling of instructions that use the LDS output
 queue

The LDS output queue is accessed via the OQAP register.  The OQAP
register cannot be live across clauses, so if value is written to the
output queue, it must be retrieved before the end of the clause.
With the machine scheduler, we cannot statisfy this constraint, because
it lacks proper alias analysis and it will mark some LDS accesses as
having a chain dependency on vertex fetches.  Since vertex fetches
require a new clauses, the dependency may end up spiltting OQAP uses and
defs so the end up in different clauses.  See the lds-output-queue.ll
test for a more detailed explanation.

To work around this issue, we now combine the LDS read and the OQAP
copy into one instruction and expand it after register allocation.

This patch also adds some checks to the EmitClauseMarker pass, so that
it doesn't end a clause with a value still in the output queue and
removes AR.X and OQAP handling from the scheduler (AR.X uses and defs
were already being expanded post-RA, so the scheduler will never see
them).
---
 lib/Target/R600/R600EmitClauseMarkers.cpp     | 65 ++++++++++++++++--
 lib/Target/R600/R600ExpandSpecialInstrs.cpp   | 17 +++++
 lib/Target/R600/R600ISelLowering.cpp          | 20 +++---
 lib/Target/R600/R600InstrInfo.cpp             |  8 +++
 lib/Target/R600/R600InstrInfo.h               |  2 +
 lib/Target/R600/R600MachineScheduler.cpp      | 32 ---------
 lib/Target/R600/R600MachineScheduler.h        |  2 -
 lib/Target/R600/R600RegisterInfo.cpp          | 13 ++++
 lib/Target/R600/R600RegisterInfo.h            |  2 +
 test/CodeGen/R600/lds-output-queue.ll         | 99 +++++++++++++++++++++++++++
 test/CodeGen/R600/local-memory-two-objects.ll |  8 ++-
 11 files changed, 215 insertions(+), 53 deletions(-)
 create mode 100644 test/CodeGen/R600/lds-output-queue.ll

diff --git a/lib/Target/R600/R600EmitClauseMarkers.cpp b/lib/Target/R600/R600EmitClauseMarkers.cpp
index 928c0e3..881b4aa 100644
--- a/lib/Target/R600/R600EmitClauseMarkers.cpp
+++ b/lib/Target/R600/R600EmitClauseMarkers.cpp
@@ -47,6 +47,11 @@ private:
       break;
     }
 
+    // These will be expanded to two ALU instructions in the
+    // ExpandSpecialInstructions pass.
+    if (TII->isLDSRetInstr(MI->getOpcode()))
+      return 2;
+
     if(TII->isVector(*MI) ||
         TII->isCubeOp(MI->getOpcode()) ||
         TII->isReductionOp(MI->getOpcode()))
@@ -108,6 +113,10 @@ private:
   bool SubstituteKCacheBank(MachineInstr *MI,
       std::vector<std::pair<unsigned, unsigned> > &CachedConsts) const {
     std::vector<std::pair<unsigned, unsigned> > UsedKCache;
+
+    if (!TII->isALUInstr(MI->getOpcode()) && MI->getOpcode() != AMDGPU::DOT_4)
+      return true;
+
     const SmallVectorImpl<std::pair<MachineOperand *, int64_t> > &Consts =
         TII->getSrcs(MI);
     assert((TII->isALUInstr(MI->getOpcode()) ||
@@ -160,6 +169,52 @@ private:
     return true;
   }
 
+  bool canClauseLocalKillFitInClause(
+                        unsigned AluInstCount,
+                        std::vector<std::pair<unsigned, unsigned> > KCacheBanks,
+                        MachineBasicBlock::iterator Def,
+                        MachineBasicBlock::iterator BBEnd) {
+    const R600RegisterInfo &TRI = TII->getRegisterInfo();
+    for (MachineInstr::const_mop_iterator
+           MOI = Def->operands_begin(),
+           MOE = Def->operands_end(); MOI != MOE; ++MOI) {
+      if (!MOI->isReg() || !MOI->isDef() ||
+          TRI.isPhysRegLiveAcrossClauses(MOI->getReg()))
+        continue;
+
+      // Def defines a clause local register, so check that its use will fit
+      // in the clause.
+      unsigned LastUseCount = 0;
+      for (MachineBasicBlock::iterator UseI = Def; UseI != BBEnd; ++UseI) {
+        AluInstCount += OccupiedDwords(UseI);
+        // Make sure we won't need to end the clause due to KCache limitations.
+        if (!SubstituteKCacheBank(UseI, KCacheBanks))
+          return false;
+
+        // We have reached the maximum instruction limit before finding the
+        // use that kills this register, so we cannot use this def in the
+        // current clause.
+        if (AluInstCount >= TII->getMaxAlusPerClause())
+          return false;
+
+        // Register kill flags have been cleared by the time we get to this
+        // pass, but it is safe to assume that all uses of this register
+        // occur in the same basic block as its definition, because
+        // it is illegal for the scheduler to schedule them in
+        // different blocks.
+        if (UseI->findRegisterUseOperandIdx(MOI->getReg()))
+          LastUseCount = AluInstCount;
+
+        if (UseI != Def && UseI->findRegisterDefOperandIdx(MOI->getReg()) != -1)
+          break;
+      }
+      if (LastUseCount)
+        return LastUseCount <= TII->getMaxAlusPerClause();
+      llvm_unreachable("Clause local register live at end of clause.");
+    }
+    return true;
+  }
+
   MachineBasicBlock::iterator
   MakeALUClause(MachineBasicBlock &MBB, MachineBasicBlock::iterator I) {
     MachineBasicBlock::iterator ClauseHead = I;
@@ -198,11 +253,13 @@ private:
         I++;
         break;
       }
-      if (TII->isALUInstr(I->getOpcode()) &&
-          !SubstituteKCacheBank(I, KCacheBanks))
+
+      // If this instruction defines a clause local register, make sure
+      // its use can fit in this clause.
+      if (!canClauseLocalKillFitInClause(AluInstCount, KCacheBanks, I, E))
         break;
-      if (I->getOpcode() == AMDGPU::DOT_4 &&
-          !SubstituteKCacheBank(I, KCacheBanks))
+
+      if (!SubstituteKCacheBank(I, KCacheBanks))
         break;
       AluInstCount += OccupiedDwords(I);
     }
diff --git a/lib/Target/R600/R600ExpandSpecialInstrs.cpp b/lib/Target/R600/R600ExpandSpecialInstrs.cpp
index 67b42d7..aeee4aa 100644
--- a/lib/Target/R600/R600ExpandSpecialInstrs.cpp
+++ b/lib/Target/R600/R600ExpandSpecialInstrs.cpp
@@ -68,6 +68,23 @@ bool R600ExpandSpecialInstrsPass::runOnMachineFunction(MachineFunction &MF) {
       MachineInstr &MI = *I;
       I = llvm::next(I);
 
+      // Expand LDS_*_RET instructions
+      if (TII->isLDSRetInstr(MI.getOpcode())) {
+        int DstIdx = TII->getOperandIdx(MI.getOpcode(), AMDGPU::OpName::dst);
+        assert(DstIdx != -1);
+        MachineOperand &DstOp = MI.getOperand(DstIdx);
+        MachineInstr *Mov = TII->buildMovInstr(&MBB, I,
+                                               DstOp.getReg(), AMDGPU::OQAP);
+        DstOp.setReg(AMDGPU::OQAP);
+        int LDSPredSelIdx = TII->getOperandIdx(MI.getOpcode(),
+                                           AMDGPU::OpName::pred_sel);
+        int MovPredSelIdx = TII->getOperandIdx(Mov->getOpcode(),
+                                           AMDGPU::OpName::pred_sel);
+        // Copy the pred_sel bit
+        Mov->getOperand(MovPredSelIdx).setReg(
+            MI.getOperand(LDSPredSelIdx).getReg());
+      }
+
       switch (MI.getOpcode()) {
       default: break;
       // Expand PRED_X to one of the PRED_SET instructions.
diff --git a/lib/Target/R600/R600ISelLowering.cpp b/lib/Target/R600/R600ISelLowering.cpp
index 5bb8129..9845d12 100644
--- a/lib/Target/R600/R600ISelLowering.cpp
+++ b/lib/Target/R600/R600ISelLowering.cpp
@@ -128,21 +128,17 @@ MachineBasicBlock * R600TargetLowering::EmitInstrWithCustomInserter(
 
   switch (MI->getOpcode()) {
   default:
-    if (TII->isLDSInstr(MI->getOpcode()) &&
-        TII->getOperandIdx(MI->getOpcode(), AMDGPU::OpName::dst) != -1) {
+    // Replace LDS_*_RET instruction that don't have any uses with the
+    // equivalent LDS_*_NORET instruction.
+    if (TII->isLDSRetInstr(MI->getOpcode())) {
       int DstIdx = TII->getOperandIdx(MI->getOpcode(), AMDGPU::OpName::dst);
       assert(DstIdx != -1);
       MachineInstrBuilder NewMI;
-      if (!MRI.use_empty(MI->getOperand(DstIdx).getReg())) {
-        NewMI = BuildMI(*BB, I, BB->findDebugLoc(I), TII->get(MI->getOpcode()),
-                        AMDGPU::OQAP);
-        TII->buildDefaultInstruction(*BB, I, AMDGPU::MOV,
-                                     MI->getOperand(0).getReg(),
-                                     AMDGPU::OQAP);
-      } else {
-        NewMI = BuildMI(*BB, I, BB->findDebugLoc(I),
-                        TII->get(AMDGPU::getLDSNoRetOp(MI->getOpcode())));
-      }
+      if (!MRI.use_empty(MI->getOperand(DstIdx).getReg()))
+        return BB;
+
+      NewMI = BuildMI(*BB, I, BB->findDebugLoc(I),
+                      TII->get(AMDGPU::getLDSNoRetOp(MI->getOpcode())));
       for (unsigned i = 1, e = MI->getNumOperands(); i < e; ++i) {
         NewMI.addOperand(MI->getOperand(i));
       }
diff --git a/lib/Target/R600/R600InstrInfo.cpp b/lib/Target/R600/R600InstrInfo.cpp
index a11d54a..0ae4d0b 100644
--- a/lib/Target/R600/R600InstrInfo.cpp
+++ b/lib/Target/R600/R600InstrInfo.cpp
@@ -141,6 +141,14 @@ bool R600InstrInfo::isLDSInstr(unsigned Opcode) const {
           (TargetFlags & R600_InstFlag::LDS_1A2D));
 }
 
+bool R600InstrInfo::isLDSNoRetInstr(unsigned Opcode) const {
+  return isLDSInstr(Opcode) && getOperandIdx(Opcode, AMDGPU::OpName::dst) == -1;
+}
+
+bool R600InstrInfo::isLDSRetInstr(unsigned Opcode) const {
+  return isLDSInstr(Opcode) && getOperandIdx(Opcode, AMDGPU::OpName::dst) != -1;
+}
+
 bool R600InstrInfo::canBeConsideredALU(const MachineInstr *MI) const {
   if (isALUInstr(MI->getOpcode()))
     return true;
diff --git a/lib/Target/R600/R600InstrInfo.h b/lib/Target/R600/R600InstrInfo.h
index d7438ef..d6d1fe8 100644
--- a/lib/Target/R600/R600InstrInfo.h
+++ b/lib/Target/R600/R600InstrInfo.h
@@ -65,6 +65,8 @@ namespace llvm {
   bool isALUInstr(unsigned Opcode) const;
   bool hasInstrModifiers(unsigned Opcode) const;
   bool isLDSInstr(unsigned Opcode) const;
+  bool isLDSNoRetInstr(unsigned Opcode) const;
+  bool isLDSRetInstr(unsigned Opcode) const;
 
   /// \returns true if this \p Opcode represents an ALU instruction or an
   /// instruction that will be lowered in ExpandSpecialInstrs Pass.
diff --git a/lib/Target/R600/R600MachineScheduler.cpp b/lib/Target/R600/R600MachineScheduler.cpp
index 6c26d9e..da2a4d8 100644
--- a/lib/Target/R600/R600MachineScheduler.cpp
+++ b/lib/Target/R600/R600MachineScheduler.cpp
@@ -92,15 +92,6 @@ SUnit* R600SchedStrategy::pickNode(bool &IsTopNode) {
       AllowSwitchFromAlu = true;
   }
 
-
-  // We want to scheduled AR defs as soon as possible to make sure they aren't
-  // put in a different ALU clause from their uses.
-  if (!SU && !UnscheduledARDefs.empty()) {
-      SU = UnscheduledARDefs[0];
-      UnscheduledARDefs.erase(UnscheduledARDefs.begin());
-      NextInstKind = IDAlu;
-  }
-
   if (!SU && ((AllowSwitchToAlu && CurInstKind != IDAlu) ||
       (!AllowSwitchFromAlu && CurInstKind == IDAlu))) {
     // try to pick ALU
@@ -130,15 +121,6 @@ SUnit* R600SchedStrategy::pickNode(bool &IsTopNode) {
       NextInstKind = IDOther;
   }
 
-  // We want to schedule the AR uses as late as possible to make sure that
-  // the AR defs have been released.
-  if (!SU && !UnscheduledARUses.empty()) {
-      SU = UnscheduledARUses[0];
-      UnscheduledARUses.erase(UnscheduledARUses.begin());
-      NextInstKind = IDAlu;
-  }
-
-
   DEBUG(
       if (SU) {
         dbgs() << " ** Pick node **\n";
@@ -217,20 +199,6 @@ void R600SchedStrategy::releaseBottomNode(SUnit *SU) {
 
   int IK = getInstKind(SU);
 
-  // Check for AR register defines
-  for (MachineInstr::const_mop_iterator I = SU->getInstr()->operands_begin(),
-                                        E = SU->getInstr()->operands_end();
-                                        I != E; ++I) {
-    if (I->isReg() && I->getReg() == AMDGPU::AR_X) {
-      if (I->isDef()) {
-        UnscheduledARDefs.push_back(SU);
-      } else {
-        UnscheduledARUses.push_back(SU);
-      }
-      return;
-    }
-  }
-
   // There is no export clause, we can schedule one as soon as its ready
   if (IK == IDOther)
     Available[IDOther].push_back(SU);
diff --git a/lib/Target/R600/R600MachineScheduler.h b/lib/Target/R600/R600MachineScheduler.h
index 0a6f120..97c8cde 100644
--- a/lib/Target/R600/R600MachineScheduler.h
+++ b/lib/Target/R600/R600MachineScheduler.h
@@ -53,8 +53,6 @@ class R600SchedStrategy : public MachineSchedStrategy {
 
   std::vector<SUnit *> Available[IDLast], Pending[IDLast];
   std::vector<SUnit *> AvailableAlus[AluLast];
-  std::vector<SUnit *> UnscheduledARDefs;
-  std::vector<SUnit *> UnscheduledARUses;
   std::vector<SUnit *> PhysicalRegCopy;
 
   InstKind CurInstKind;
diff --git a/lib/Target/R600/R600RegisterInfo.cpp b/lib/Target/R600/R600RegisterInfo.cpp
index dd8f3ef..83fc90b 100644
--- a/lib/Target/R600/R600RegisterInfo.cpp
+++ b/lib/Target/R600/R600RegisterInfo.cpp
@@ -85,3 +85,16 @@ const RegClassWeight &R600RegisterInfo::getRegClassWeight(
   const TargetRegisterClass *RC) const {
   return RCW;
 }
+
+bool R600RegisterInfo::isPhysRegLiveAcrossClauses(unsigned Reg) const {
+  assert(!TargetRegisterInfo::isVirtualRegister(Reg));
+
+  switch (Reg) {
+  case AMDGPU::OQAP:
+  case AMDGPU::OQBP:
+  case AMDGPU::AR_X:
+    return false;
+  default:
+    return true;
+  }
+}
diff --git a/lib/Target/R600/R600RegisterInfo.h b/lib/Target/R600/R600RegisterInfo.h
index d458e55..18f3b3e 100644
--- a/lib/Target/R600/R600RegisterInfo.h
+++ b/lib/Target/R600/R600RegisterInfo.h
@@ -45,6 +45,8 @@ struct R600RegisterInfo : public AMDGPURegisterInfo {
 
   virtual const RegClassWeight &getRegClassWeight(const TargetRegisterClass *RC) const;
 
+  // \returns true if \p Reg can be defined in one ALU caluse and used in another.
+  virtual bool isPhysRegLiveAcrossClauses(unsigned Reg) const;
 };
 
 } // End namespace llvm
diff --git a/test/CodeGen/R600/lds-output-queue.ll b/test/CodeGen/R600/lds-output-queue.ll
new file mode 100644
index 0000000..63a4332
--- /dev/null
+++ b/test/CodeGen/R600/lds-output-queue.ll
@@ -0,0 +1,99 @@
+; RUN: llc < %s -march=r600 -mcpu=redwood -verify-machineinstrs | FileCheck %s
+;
+; This test checks that the lds input queue will is empty at the end of
+; the ALU clause.
+
+; CHECK-LABEL: @lds_input_queue
+; CHECK: LDS_READ_RET * OQAP
+; CHECK-NOT: ALU clause
+; CHECK: MOV * T{{[0-9]\.[XYZW]}}, OQAP
+
+ at local_mem = internal addrspace(3) unnamed_addr global [2 x i32] [i32 1, i32 2], align 4
+
+define void @lds_input_queue(i32 addrspace(1)* %out, i32 addrspace(1)* %in, i32 %index) {
+entry:
+  %0 = getelementptr inbounds [2 x i32] addrspace(3)* @local_mem, i32 0, i32 %index
+  %1 = load i32 addrspace(3)* %0
+  call void @llvm.AMDGPU.barrier.local()
+
+  ; This will start a new clause for the vertex fetch
+  %2 = load i32 addrspace(1)* %in
+  %3 = add i32 %1, %2
+  store i32 %3, i32 addrspace(1)* %out
+  ret void
+}
+
+declare void @llvm.AMDGPU.barrier.local()
+
+; The machine scheduler does not do proper alias analysis and assumes that
+; loads from global values (Note that a global value is different that a
+; value from global memory.  A global value is a value that is declared
+; outside of a function, it can reside in any address space) alias with
+; all other loads.
+;
+; This is a problem for scheduling the reads from the local data share (lds).
+; These reads are implemented using two instructions.  The first copies the
+; data from lds into the lds output queue, and the second moves the data from
+; the input queue into main memory.  These two instructions don't have to be
+; scheduled one after the other, but they do need to be scheduled in the same
+; clause.  The aliasing problem mentioned above causes problems when there is a
+; load from global memory which immediately follows a load from a global value that
+; has been declared in the local memory space:
+;
+;  %0 = getelementptr inbounds [2 x i32] addrspace(3)* @local_mem, i32 0, i32 %index
+;  %1 = load i32 addrspace(3)* %0
+;  %2 = load i32 addrspace(1)* %in
+;
+; The instruction selection phase will generate ISA that looks like this:
+; %OQAP = LDS_READ_RET
+; %vreg0 = MOV %OQAP
+; %vreg1 = VTX_READ_32
+; %vreg2 = ADD_INT %vreg1, %vreg0
+;
+; The bottom scheduler will schedule the two ALU instructions first:
+;
+; UNSCHEDULED:
+; %OQAP = LDS_READ_RET
+; %vreg1 = VTX_READ_32
+;
+; SCHEDULED:
+;
+; vreg0 = MOV %OQAP
+; vreg2 = ADD_INT %vreg1, %vreg2
+;
+; The lack of proper aliasing results in the local memory read (LDS_READ_RET)
+; to consider the global memory read (VTX_READ_32) has a chain dependency, so
+; the global memory read will always be scheduled first.  This will give us a
+; final program which looks like this:
+;
+; Alu clause:
+; %OQAP = LDS_READ_RET
+; VTX clause:
+; %vreg1 = VTX_READ_32
+; Alu clause:
+; vreg0 = MOV %OQAP
+; vreg2 = ADD_INT %vreg1, %vreg2
+;
+; This is an illegal program because the OQAP def and use know occur in
+; different ALU clauses.
+;
+; This test checks this scenario and makes sure it doesn't result in an
+; illegal program.  For now, we have fixed this issue by merging the
+; LDS_READ_RET and MOV together during instruction selection and then
+; expanding them after scheduling.  Once the scheduler has better alias
+; analysis, we should be able to keep these instructions sparate before
+; scheduling.
+;
+; CHECK-LABEL: @local_global_alias
+; CHECK: LDS_READ_RET
+; CHECK-NOT: ALU clause
+; CHECK MOV * T{{[0-9]\.[XYZW]}}, OQAP
+define void @local_global_alias(i32 addrspace(1)* %out, i32 addrspace(1)* %in) {
+entry:
+  %0 = getelementptr inbounds [2 x i32] addrspace(3)* @local_mem, i32 0, i32 0
+  %1 = load i32 addrspace(3)* %0
+  %2 = load i32 addrspace(1)* %in
+  %3 = add i32 %2, %1
+  store i32 %3, i32 addrspace(1)* %out
+  ret void
+}
diff --git a/test/CodeGen/R600/local-memory-two-objects.ll b/test/CodeGen/R600/local-memory-two-objects.ll
index b413fe3..e2d8406 100644
--- a/test/CodeGen/R600/local-memory-two-objects.ll
+++ b/test/CodeGen/R600/local-memory-two-objects.ll
@@ -12,9 +12,11 @@
 ; SI-CHECK: .long 47180
 ; SI-CHECK-NEXT: .long 32768
 
-; Make sure the lds writes are using different addresses.
-; EG-CHECK: LDS_WRITE {{[*]*}} {{PV|T}}[[ADDRW:[0-9]*\.[XYZW]]]
-; EG-CHECK-NOT: LDS_WRITE {{[*]*}} T[[ADDRW]]
+; We would like to check the the lds writes are using different
+; addresses, but due to variations in the scheduler, we can't do
+; this consistently on evergreen GPUs.
+; EG-CHECK: LDS_WRITE
+; EG-CHECK: LDS_WRITE
 ; SI-CHECK: DS_WRITE_B32 0, {{v[0-9]*}}, v[[ADDRW:[0-9]*]]
 ; SI-CHECK-NOT: DS_WRITE_B32 0, {{v[0-9]*}}, v[[ADDRW]]
 
-- 
1.8.1.4