PATCHES: R600/SI: Enable machine scheduler

Tom Stellard tom at stellard.net
Wed Dec 3 12:44:16 PST 2014


On Tue, Dec 02, 2014 at 05:52:32PM -0500, Matt Arsenault wrote:
> 
> >0003-R600-SI-Set-MayStore-0-on-MUBUF-loads.patch
> >
> >
> > From c0688d069d3de09a75241ec2f0f0cc605320f3f8 Mon Sep 17 00:00:00 2001
> >From: Tom Stellard <thomas.stellard at amd.com>
> >Date: Mon, 24 Nov 2014 20:56:28 +0000
> >Subject: [PATCH 3/6] R600/SI: Set MayStore = 0 on MUBUF loads
> >
> >---
> >  lib/Target/R600/SIInstrInfo.td | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> >diff --git a/lib/Target/R600/SIInstrInfo.td b/lib/Target/R600/SIInstrInfo.td
> >index 392c272..8cc84c9 100644
> >--- a/lib/Target/R600/SIInstrInfo.td
> >+++ b/lib/Target/R600/SIInstrInfo.td
> >@@ -1187,7 +1187,7 @@ multiclass MUBUF_Load_Helper <bits<7> op, string asm, RegisterClass regClass,
> >                                ValueType load_vt = i32,
> >                                SDPatternOperator ld = null_frag> {
> >-  let lds = 0, mayLoad = 1 in {
> >+  let lds = 0, mayLoad = 1, mayStore = 0 in {
> >      let addr64 = 0 in {
> >-- 2.0.4
> LGTM. Shouldn't this be getting set in the base MUBUF instruction
> class? Why do these need to be set again here?
> 
> 
> >
> >0004-R600-SI-Spill-VGPRs-to-scratch-space-for-compute-sha.patch
> >
> >
> > From 2688eabde67974f5720c9abc99777a97923c5e8e Mon Sep 17 00:00:00 2001
> >From: Tom Stellard <thomas.stellard at amd.com>
> >Date: Thu, 27 Nov 2014 21:13:48 +0000
> >Subject: [PATCH 4/6] R600/SI: Spill VGPRs to scratch space for compute shaders
> >
> >---
> >  lib/Target/R600/AMDGPU.h                  |   1 +
> >  lib/Target/R600/AMDGPUTargetMachine.cpp   |   1 +
> >  lib/Target/R600/CMakeLists.txt            |   1 +
> >  lib/Target/R600/SIInstrInfo.cpp           |  33 ++++-
> >  lib/Target/R600/SIInstrInfo.td            |   2 +
> >  lib/Target/R600/SIInstructions.td         |  10 +-
> >  lib/Target/R600/SIMachineFunctionInfo.cpp |   3 +-
> >  lib/Target/R600/SIMachineFunctionInfo.h   |   3 +
> >  lib/Target/R600/SIPrepareScratchRegs.cpp  | 198 ++++++++++++++++++++++++++++++
> >  lib/Target/R600/SIRegisterInfo.cpp        | 156 +++++++++++++----------
> >  lib/Target/R600/SIRegisterInfo.h          |   9 +-
> >  11 files changed, 344 insertions(+), 73 deletions(-)
> >  create mode 100644 lib/Target/R600/SIPrepareScratchRegs.cpp
> >
> >diff --git a/lib/Target/R600/AMDGPU.h b/lib/Target/R600/AMDGPU.h
> >index 13379e7..6819808 100644
> >--- a/lib/Target/R600/AMDGPU.h
> >+++ b/lib/Target/R600/AMDGPU.h
> >@@ -47,6 +47,7 @@ FunctionPass *createSIFixSGPRCopiesPass(TargetMachine &tm);
> >  FunctionPass *createSIFixSGPRLiveRangesPass();
> >  FunctionPass *createSICodeEmitterPass(formatted_raw_ostream &OS);
> >  FunctionPass *createSIInsertWaits(TargetMachine &tm);
> >+FunctionPass *createSIPrepareScratchRegs();
> >  void initializeSIFoldOperandsPass(PassRegistry &);
> >  extern char &SIFoldOperandsID;
> >diff --git a/lib/Target/R600/AMDGPUTargetMachine.cpp b/lib/Target/R600/AMDGPUTargetMachine.cpp
> >index d4ee738..c035af0 100644
> >--- a/lib/Target/R600/AMDGPUTargetMachine.cpp
> >+++ b/lib/Target/R600/AMDGPUTargetMachine.cpp
> >@@ -190,6 +190,7 @@ bool AMDGPUPassConfig::addPostRegAlloc() {
> >    const AMDGPUSubtarget &ST = TM->getSubtarget<AMDGPUSubtarget>();
> >    if (ST.getGeneration() > AMDGPUSubtarget::NORTHERN_ISLANDS) {
> >+    addPass(createSIPrepareScratchRegs());
> >      addPass(createSIShrinkInstructionsPass());
> >    }
> >    return false;
> >diff --git a/lib/Target/R600/CMakeLists.txt b/lib/Target/R600/CMakeLists.txt
> >index 3b703e7..5a4bae2 100644
> >--- a/lib/Target/R600/CMakeLists.txt
> >+++ b/lib/Target/R600/CMakeLists.txt
> >@@ -51,6 +51,7 @@ add_llvm_target(R600CodeGen
> >    SILowerControlFlow.cpp
> >    SILowerI1Copies.cpp
> >    SIMachineFunctionInfo.cpp
> >+  SIPrepareScratchRegs.cpp
> >    SIRegisterInfo.cpp
> >    SIShrinkInstructions.cpp
> >    SITypeRewriter.cpp
> >diff --git a/lib/Target/R600/SIInstrInfo.cpp b/lib/Target/R600/SIInstrInfo.cpp
> >index 1a0010c..acdb0fa 100644
> >--- a/lib/Target/R600/SIInstrInfo.cpp
> >+++ b/lib/Target/R600/SIInstrInfo.cpp
> >@@ -426,8 +426,7 @@ static bool shouldTryToSpillVGPRs(MachineFunction *MF) {
> >    // FIXME: Even though it can cause problems, we need to enable
> >    // spilling at -O0, since the fast register allocator always
> >    // spills registers that are live at the end of blocks.
> >-  return MFI->getShaderType() == ShaderType::COMPUTE &&
> >-         TM.getOptLevel() == CodeGenOpt::None;
> >+  return MFI->getShaderType() == ShaderType::COMPUTE;
> I still don't think conditionally enabling spilling makes any sense.
> It just changes how it breaks for non-compute shaders.

It's better to disable it in the compiler since then you get an error message instead
of a GPU hang.  The driver setup for spilling is different for other shader types,
and I still need to implement it in Mesa.

> 
> >  }
> >@@ -438,7 +437,9 @@ void SIInstrInfo::storeRegToStackSlot(MachineBasicBlock &MBB,
> >                                        const TargetRegisterClass *RC,
> >                                        const TargetRegisterInfo *TRI) const {
> >    MachineFunction *MF = MBB.getParent();
> >+  SIMachineFunctionInfo *MFI = MF->getInfo<SIMachineFunctionInfo>();
> >    MachineFrameInfo *FrameInfo = MF->getFrameInfo();
> >+  MachineRegisterInfo &MRI = MBB.getParent()->getRegInfo();
> >    DebugLoc DL = MBB.findDebugLoc(MI);
> >    int Opcode = -1;
> >@@ -454,6 +455,19 @@ void SIInstrInfo::storeRegToStackSlot(MachineBasicBlock &MBB,
> >        case 512: Opcode = AMDGPU::SI_SPILL_S512_SAVE; break;
> >      }
> >    } else if(shouldTryToSpillVGPRs(MF) && RI.hasVGPRs(RC)) {
> >+    MFI->setHasSpilledVGPRs();
> >+#if 0
> >+    unsigned ScratchPtr =
> >+        RI.getPreloadedValue(*MF, SIRegisterInfo::SCRATCH_PTR);
> >+    unsigned ScratchOffset =
> >+        RI.getPreloadedValue(*MF, SIRegisterInfo::SCRATCH_WAVE_OFFSET);
> >+    if (!MRI.isLiveIn(ScratchPtr))
> >+      MRI.addLiveIn(ScratchPtr);
> >+
> >+    if (!MRI.isLiveIn(ScratchOffset))
> >+      MRI.addLiveIn(ScratchOffset);
> >+#endif
> >+
> Dead code
> 
> >      switch(RC->getSize() * 8) {
> >        case 32: Opcode = AMDGPU::SI_SPILL_V32_SAVE; break;
> >        case 64: Opcode = AMDGPU::SI_SPILL_V64_SAVE; break;
> >@@ -468,7 +482,11 @@ void SIInstrInfo::storeRegToStackSlot(MachineBasicBlock &MBB,
> >      FrameInfo->setObjectAlignment(FrameIndex, 4);
> >      BuildMI(MBB, MI, DL, get(Opcode))
> >              .addReg(SrcReg)
> >-            .addFrameIndex(FrameIndex);
> >+            .addFrameIndex(FrameIndex)
> >+            // Place-holder registers, these will be filled in by
> >+            // SIPrepareScratchRegs.
> Why do you need to reserve physical registers for this? Can you define
> virtual registers with IMPLICIT_DEF, or virtual registers with a spill
> register class for the pseudo instructions?

My experience has been that things break if you create virtual registers during
spilling.  I don't think the register allocator is set up to handle this.

> >+            .addReg(AMDGPU::SGPR0_SGPR1, RegState::Undef)
> >+            .addReg(AMDGPU::SGPR0, RegState::Undef);
> >    } else {
> >      LLVMContext &Ctx = MF->getFunction()->getContext();
> >      Ctx.emitError("SIInstrInfo::storeRegToStackSlot - Do not know how to"
> >@@ -510,7 +528,12 @@ void SIInstrInfo::loadRegFromStackSlot(MachineBasicBlock &MBB,
> >    if (Opcode != -1) {
> >      FrameInfo->setObjectAlignment(FrameIndex, 4);
> >      BuildMI(MBB, MI, DL, get(Opcode), DestReg)
> >-            .addFrameIndex(FrameIndex);
> >+            .addFrameIndex(FrameIndex)
> >+            // Place-holder registers, these will be filled in by
> >+            // SIPrepareScratchRegs.
> >+            .addReg(AMDGPU::SGPR0_SGPR1, RegState::Undef)
> >+            .addReg(AMDGPU::SGPR0, RegState::Undef);
> >+
> >    } else {
> >      LLVMContext &Ctx = MF->getFunction()->getContext();
> >      Ctx.emitError("SIInstrInfo::loadRegFromStackSlot - Do not know how to"
> >@@ -541,7 +564,7 @@ unsigned SIInstrInfo::calculateLDSSpillAddress(MachineBasicBlock &MBB,
> >      MachineBasicBlock::iterator Insert = Entry.front();
> >      DebugLoc DL = Insert->getDebugLoc();
> >-    TIDReg = RI.findUnusedVGPR(MF->getRegInfo());
> >+    TIDReg = RI.findUnusedRegister(MF->getRegInfo(), &AMDGPU::VGPR_32RegClass);
> >      if (TIDReg == AMDGPU::NoRegister)
> >        return TIDReg;
> >diff --git a/lib/Target/R600/SIInstrInfo.td b/lib/Target/R600/SIInstrInfo.td
> >index 8cc84c9..afdc200 100644
> >--- a/lib/Target/R600/SIInstrInfo.td
> >+++ b/lib/Target/R600/SIInstrInfo.td
> >@@ -1240,6 +1240,7 @@ multiclass MUBUF_Load_Helper <bits<7> op, string asm, RegisterClass regClass,
> >  multiclass MUBUF_Store_Helper <bits<7> op, string name, RegisterClass vdataClass,
> >                            ValueType store_vt, SDPatternOperator st> {
> >+  let mayLoad = 0, mayStore = 1 in {
> >    let addr64 = 0, lds = 0 in {
> >      def "" : MUBUF <
> >@@ -1298,6 +1299,7 @@ multiclass MUBUF_Store_Helper <bits<7> op, string name, RegisterClass vdataClass
> >        let tfe = 0;
> >        let soffset = 128; // ZERO
> >     }
> >+   } // End mayLoad = 0, mayStore = 1
> >  }
> >  class FLAT_Load_Helper <bits<7> op, string asm, RegisterClass regClass> :
> >diff --git a/lib/Target/R600/SIInstructions.td b/lib/Target/R600/SIInstructions.td
> >index 00ce9bf..3a969e7 100644
> >--- a/lib/Target/R600/SIInstructions.td
> >+++ b/lib/Target/R600/SIInstructions.td
> >@@ -1856,13 +1856,14 @@ multiclass SI_SPILL_SGPR <RegisterClass sgpr_class> {
> >    def _SAVE : InstSI <
> >      (outs),
> >-    (ins sgpr_class:$src, i32imm:$frame_idx),
> >+    (ins sgpr_class:$src, i32imm:$frame_idx, SReg_64:$scratch_ptr,
> >+         SReg_32:$scratch_offset),
> >      "", []
> >    >;
> >    def _RESTORE : InstSI <
> >      (outs sgpr_class:$dst),
> >-    (ins i32imm:$frame_idx),
> >+    (ins i32imm:$frame_idx, SReg_64:$scratch_ptr, SReg_32:$scratch_offset),
> >      "", []
> >    >;
> >@@ -1877,13 +1878,14 @@ defm SI_SPILL_S512 : SI_SPILL_SGPR <SReg_512>;
> >  multiclass SI_SPILL_VGPR <RegisterClass vgpr_class> {
> >    def _SAVE : InstSI <
> >      (outs),
> >-    (ins vgpr_class:$src, i32imm:$frame_idx),
> >+    (ins vgpr_class:$src, i32imm:$frame_idx, SReg_64:$scratch_ptr,
> >+         SReg_32:$scratch_offset),
> >      "", []
> >    >;
> >    def _RESTORE : InstSI <
> >      (outs vgpr_class:$dst),
> >-    (ins i32imm:$frame_idx),
> >+    (ins i32imm:$frame_idx, SReg_64:$scratch_ptr, SReg_32:$scratch_offset),
> >      "", []
> >    >;
> >  }
> >diff --git a/lib/Target/R600/SIMachineFunctionInfo.cpp b/lib/Target/R600/SIMachineFunctionInfo.cpp
> >index d58f31d..198dd56 100644
> >--- a/lib/Target/R600/SIMachineFunctionInfo.cpp
> >+++ b/lib/Target/R600/SIMachineFunctionInfo.cpp
> >@@ -29,6 +29,7 @@ void SIMachineFunctionInfo::anchor() {}
> >  SIMachineFunctionInfo::SIMachineFunctionInfo(const MachineFunction &MF)
> >    : AMDGPUMachineFunction(MF),
> >      TIDReg(AMDGPU::NoRegister),
> >+    HasSpilledVGPRs(false),
> >      PSInputAddr(0),
> >      NumUserSGPRs(0),
> >      LDSWaveSpillSize(0) { }
> >@@ -50,7 +51,7 @@ SIMachineFunctionInfo::SpilledReg SIMachineFunctionInfo::getSpilledReg(
> >    struct SpilledReg Spill;
> >    if (!LaneVGPRs.count(LaneVGPRIdx)) {
> >-    unsigned LaneVGPR = TRI->findUnusedVGPR(MRI);
> >+    unsigned LaneVGPR = TRI->findUnusedRegister(MRI, &AMDGPU::VGPR_32RegClass);
> >      LaneVGPRs[LaneVGPRIdx] = LaneVGPR;
> >      MRI.setPhysRegUsed(LaneVGPR);
> >diff --git a/lib/Target/R600/SIMachineFunctionInfo.h b/lib/Target/R600/SIMachineFunctionInfo.h
> >index 6bb8f9d..7185271 100644
> >--- a/lib/Target/R600/SIMachineFunctionInfo.h
> >+++ b/lib/Target/R600/SIMachineFunctionInfo.h
> >@@ -29,6 +29,7 @@ class SIMachineFunctionInfo : public AMDGPUMachineFunction {
> >    void anchor() override;
> >    unsigned TIDReg;
> >+  bool HasSpilledVGPRs;
> >  public:
> >@@ -52,6 +53,8 @@ public:
> >    bool hasCalculatedTID() const { return TIDReg != AMDGPU::NoRegister; };
> >    unsigned getTIDReg() const { return TIDReg; };
> >    void setTIDReg(unsigned Reg) { TIDReg = Reg; }
> >+  bool hasSpilledVGPRs() const { return HasSpilledVGPRs; }
> >+  void setHasSpilledVGPRs(bool Spill = true) { HasSpilledVGPRs = Spill; }
> >    unsigned getMaximumWorkGroupSize(const MachineFunction &MF) const;
> >  };
> >diff --git a/lib/Target/R600/SIPrepareScratchRegs.cpp b/lib/Target/R600/SIPrepareScratchRegs.cpp
> >new file mode 100644
> >index 0000000..32010f0
> >--- /dev/null
> >+++ b/lib/Target/R600/SIPrepareScratchRegs.cpp
> >@@ -0,0 +1,198 @@
> >+//===-- SIPrepareScratchRegs.cpp - Use predicates for control flow --------===//
> >+//
> >+//                     The LLVM Compiler Infrastructure
> >+//
> >+// This file is distributed under the University of Illinois Open Source
> >+// License. See LICENSE.TXT for details.
> >+//
> >+//===----------------------------------------------------------------------===//
> >+//
> >+/// \file
> >+///
> >+/// This pass loads scratch pointer and scratch offset into a register or a
> >+/// frame index which can be used anywhere in the program.  These values will
> >+/// be used for spilling VGPRs.
> >+///
> >+//===----------------------------------------------------------------------===//
> >+
> >+#include "AMDGPU.h"
> >+#include "AMDGPUSubtarget.h"
> >+#include "SIDefines.h"
> >+#include "SIInstrInfo.h"
> >+#include "SIMachineFunctionInfo.h"
> >+#include "llvm/CodeGen/MachineFrameInfo.h"
> >+#include "llvm/CodeGen/MachineFunction.h"
> >+#include "llvm/CodeGen/MachineFunctionPass.h"
> >+#include "llvm/CodeGen/MachineInstrBuilder.h"
> >+#include "llvm/CodeGen/MachineRegisterInfo.h"
> >+#include "llvm/CodeGen/RegisterScavenging.h"
> >+#include "llvm/IR/Function.h"
> >+#include "llvm/IR/LLVMContext.h"
> >+
> >+using namespace llvm;
> >+
> >+namespace {
> >+
> >+class SIPrepareScratchRegs : public MachineFunctionPass {
> >+
> >+private:
> >+  static char ID;
> >+
> >+public:
> >+  SIPrepareScratchRegs() : MachineFunctionPass(ID) { }
> >+
> >+  bool runOnMachineFunction(MachineFunction &MF) override;
> >+
> >+  const char *getPassName() const override {
> >+    return "SI prepare scratch registers";
> >+  }
> >+
> >+};
> >+
> >+} // End anonymous namespace
> >+
> >+char SIPrepareScratchRegs::ID = 0;
> >+
> >+FunctionPass *llvm::createSIPrepareScratchRegs() {
> >+  return new SIPrepareScratchRegs();
> >+}
> >+
> >+// FIXME: Insert waits listed in Table 4.2 "Required User-Inserted Wait States"
> >+// around other non-memory instructions.
> Comment copied from other pass?
> 
> >+bool SIPrepareScratchRegs::runOnMachineFunction(MachineFunction &MF) {
> >+  SIMachineFunctionInfo *MFI = MF.getInfo<SIMachineFunctionInfo>();
> >+  const SIInstrInfo *TII =
> >+      static_cast<const SIInstrInfo *>(MF.getSubtarget().getInstrInfo());
> >+  const SIRegisterInfo *TRI = &TII->getRegisterInfo();
> >+  MachineRegisterInfo &MRI = MF.getRegInfo();
> >+  MachineFrameInfo *FrameInfo = MF.getFrameInfo();
> >+  MachineBasicBlock *Entry = MF.begin();
> >+  MachineBasicBlock::iterator I = Entry->begin();
> >+  DebugLoc DL = I->getDebugLoc();
> >+
> >+  // FIXME: If we don't have enough VGPRs for SGPR spilling we will need to do
> >+  // run this pass.
> Grammar: "we will need to do"
> >+  if (!MFI->hasSpilledVGPRs())
> >+    return false;
> >+
> >+  unsigned ScratchPtrPreloadReg =
> >+      TRI->getPreloadedValue(MF, SIRegisterInfo::SCRATCH_PTR);
> >+  unsigned ScratchOffsetPreloadReg =
> >+      TRI->getPreloadedValue(MF, SIRegisterInfo::SCRATCH_WAVE_OFFSET);
> >+
> >+  if (!Entry->isLiveIn(ScratchPtrPreloadReg))
> >+    Entry->addLiveIn(ScratchPtrPreloadReg);
> >+
> >+  if (!Entry->isLiveIn(ScratchOffsetPreloadReg))
> >+    Entry->addLiveIn(ScratchOffsetPreloadReg);
> >+
> >+  // Load the scratch pointer
> >+  unsigned ScratchPtrReg =
> >+      TRI->findUnusedRegister(MRI, &AMDGPU::SGPR_64RegClass);
> >+  int ScratchPtrFI = ~0;
> Initialize to -1 since it's signed
> >+
> >+  if (ScratchPtrReg != AMDGPU::NoRegister) {
> >+    // Found a SGPR to use.
> Grammar: an SGPR
> >+    MRI.setPhysRegUsed(ScratchPtrReg);
> >+    BuildMI(*Entry, I, DL, TII->get(AMDGPU::S_MOV_B64), ScratchPtrReg)
> >+            .addReg(ScratchPtrPreloadReg);
> >+  } else {
> >+    // No SGPR is available, we must spill.
> >+    ScratchPtrFI = FrameInfo->CreateSpillStackObject(8, 4);
> >+    BuildMI(*Entry, I, DL, TII->get(AMDGPU::SI_SPILL_S64_SAVE))
> >+            .addReg(ScratchPtrPreloadReg)
> >+            .addFrameIndex(ScratchPtrFI);
> >+  }
> >+
> >+  // load the scratch offset
> Capitalize / period comment
> >+  unsigned ScratchOffsetReg =
> >+      TRI->findUnusedRegister(MRI, &AMDGPU::SGPR_32RegClass);
> >+  int ScratchOffsetFI = ~0;
> >+
> >+  if (ScratchOffsetReg != AMDGPU::NoRegister) {
> >+    // Found an SGPR to use
> >+    MRI.setPhysRegUsed(ScratchOffsetReg);
> >+    BuildMI(*Entry, I, DL, TII->get(AMDGPU::S_MOV_B32), ScratchOffsetReg)
> >+            .addReg(ScratchOffsetPreloadReg);
> >+  } else {
> >+    // No SGPR is available, we must spill.
> >+    ScratchOffsetFI = FrameInfo->CreateSpillStackObject(4,4);
> >+    BuildMI(*Entry, I, DL, TII->get(AMDGPU::SI_SPILL_S32_SAVE))
> >+            .addReg(ScratchOffsetPreloadReg)
> >+            .addFrameIndex(ScratchOffsetFI);
> >+  }
> >+
> >+
> >+  // Now that we have the scratch pointer and offset values, we need to
> >+  // add them to all the SI_SPILL_V* instructions.
> >+
> >+  RegScavenger RS;
> >+  bool UseRegScavenger =
> >+      (ScratchPtrReg == AMDGPU::NoRegister ||
> >+      ScratchOffsetReg == AMDGPU::NoRegister);
> >+  for (MachineFunction::iterator BI = MF.begin(), BE = MF.end();
> >+       BI != BE; ++BI) {
> >+
> >+    MachineBasicBlock &MBB = *BI;
> >+    if (UseRegScavenger)
> >+      RS.enterBasicBlock(&MBB);
> >+
> >+    for (MachineBasicBlock::iterator I = MBB.begin(), E = MBB.end();
> >+         I != E; ++I) {
> >+      MachineInstr &MI = *I;
> >+      DebugLoc DL = MI.getDebugLoc();
> >+      switch(MI.getOpcode()) {
> >+        default: break;;
> >+        case AMDGPU::SI_SPILL_V512_SAVE:
> >+        case AMDGPU::SI_SPILL_V256_SAVE:
> >+        case AMDGPU::SI_SPILL_V128_SAVE:
> >+        case AMDGPU::SI_SPILL_V96_SAVE:
> >+        case AMDGPU::SI_SPILL_V64_SAVE:
> >+        case AMDGPU::SI_SPILL_V32_SAVE:
> >+        case AMDGPU::SI_SPILL_V32_RESTORE:
> >+        case AMDGPU::SI_SPILL_V64_RESTORE:
> >+        case AMDGPU::SI_SPILL_V128_RESTORE:
> >+        case AMDGPU::SI_SPILL_V256_RESTORE:
> >+        case AMDGPU::SI_SPILL_V512_RESTORE:
> >+
> >+          // Scratch Pointer
> >+          if (ScratchPtrReg == AMDGPU::NoRegister) {
> >+            ScratchPtrReg = RS.scavengeRegister(&AMDGPU::SGPR_64RegClass, 0);
> >+            BuildMI(MBB, I, DL, TII->get(AMDGPU::SI_SPILL_S64_RESTORE),
> >+                    ScratchPtrReg)
> >+                    .addFrameIndex(ScratchPtrFI)
> >+                    .addReg(AMDGPU::NoRegister)
> >+                    .addReg(AMDGPU::NoRegister);
> >+          } else if (!MBB.isLiveIn(ScratchPtrReg)) {
> >+            MBB.addLiveIn(ScratchPtrReg);
> >+          }
> >+
> >+          if (ScratchOffsetReg == AMDGPU::NoRegister) {
> >+            ScratchOffsetReg = RS.scavengeRegister(&AMDGPU::SGPR_32RegClass, 0);
> >+            BuildMI(MBB, I, DL, TII->get(AMDGPU::SI_SPILL_S32_RESTORE),
> >+                    ScratchOffsetReg)
> >+                    .addFrameIndex(ScratchOffsetFI)
> >+                    .addReg(AMDGPU::NoRegister)
> >+                    .addReg(AMDGPU::NoRegister);
> >+          } else if (!MBB.isLiveIn(ScratchOffsetReg)) {
> >+            MBB.addLiveIn(ScratchOffsetReg);
> >+          }
> >+
> >+          if (ScratchPtrReg == AMDGPU::NoRegister ||
> >+              ScratchOffsetReg == AMDGPU::NoRegister) {
> >+            LLVMContext &Ctx = MF.getFunction()->getContext();
> >+            Ctx.emitError("Ran out of SGPRs for spilling VGPRs");
> >+            ScratchPtrReg = AMDGPU::SGPR0;
> >+            ScratchOffsetReg = AMDGPU::SGPR0;
> >+          }
> >+          MI.getOperand(2).setReg(ScratchPtrReg);
> >+          MI.getOperand(3).setReg(ScratchOffsetReg);
> >+
> >+          break;
> >+      }
> >+      if (UseRegScavenger)
> >+        RS.forward();
> >+    }
> >+  }
> >+  return true;
> >+}
> >diff --git a/lib/Target/R600/SIRegisterInfo.cpp b/lib/Target/R600/SIRegisterInfo.cpp
> >index cffea12..27abe9a 100644
> >--- a/lib/Target/R600/SIRegisterInfo.cpp
> >+++ b/lib/Target/R600/SIRegisterInfo.cpp
> >@@ -23,6 +23,7 @@
> >  #include "llvm/IR/Function.h"
> >  #include "llvm/IR/LLVMContext.h"
> >+#include "llvm/Support/Debug.h"
> >  using namespace llvm;
> >  SIRegisterInfo::SIRegisterInfo(const AMDGPUSubtarget &st)
> >@@ -92,6 +93,84 @@ static unsigned getNumSubRegsForSpillOp(unsigned Op) {
> >    }
> >  }
> >+void SIRegisterInfo::buildScratchLoadStore(MachineBasicBlock::iterator MI,
> >+                                           unsigned LoadStoreOp,
> >+                                           unsigned Value,
> >+                                           unsigned ScratchPtr,
> >+                                           unsigned ScratchOffset,
> >+                                           int64_t Offset,
> >+                                           RegScavenger *RS) const {
> >+
> >+  const SIInstrInfo*TII = static_cast<const SIInstrInfo*>(ST.getInstrInfo());
> >+  MachineBasicBlock *MBB = MI->getParent();
> >+  const MachineFunction *MF = MI->getParent()->getParent();
> >+  LLVMContext &Ctx = MF->getFunction()->getContext();
> >+  DebugLoc DL = MI->getDebugLoc();
> >+  bool IsLoad = TII->get(LoadStoreOp).mayLoad();
> >+
> >+  bool RanOutOfSGPRs = false;
> >+  unsigned SOffset = ScratchOffset;
> >+
> >+  unsigned RsrcReg = RS->scavengeRegister(&AMDGPU::SReg_128RegClass, MI, 0);
> >+  if (RsrcReg == AMDGPU::NoRegister) {
> >+    RanOutOfSGPRs = true;
> >+    RsrcReg = AMDGPU::SGPR0_SGPR1_SGPR2_SGPR3;
> >+  }
> >+
> >+  unsigned NumSubRegs = getNumSubRegsForSpillOp(MI->getOpcode());
> >+  unsigned Size = NumSubRegs * 4;
> >+
> >+  uint64_t Rsrc = AMDGPU::RSRC_DATA_FORMAT | AMDGPU::RSRC_TID_ENABLE |
> >+                  0xffffffff; // Size
> >+
> >+  BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_MOV_B64),
> >+          getSubReg(RsrcReg, AMDGPU::sub0_sub1))
> >+          .addReg(ScratchPtr)
> >+          .addReg(RsrcReg, RegState::ImplicitDefine);
> >+
> >+  BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_MOV_B32),
> >+          getSubReg(RsrcReg, AMDGPU::sub2))
> >+          .addImm(Rsrc & 0xffffffff)
> >+          .addReg(RsrcReg, RegState::ImplicitDefine);
> >+
> >+  BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_MOV_B32),
> >+          getSubReg(RsrcReg, AMDGPU::sub3))
> >+          .addImm(Rsrc >> 32)
> >+          .addReg(RsrcReg, RegState::ImplicitDefine);
> >+
> >+  if (!isUInt<12>(Offset + Size)) {
> >+    SOffset = RS->scavengeRegister(&AMDGPU::SGPR_32RegClass, MI, 0);
> >+    if (SOffset == AMDGPU::NoRegister) {
> >+      RanOutOfSGPRs = true;
> >+      SOffset = AMDGPU::SGPR0;
> >+    }
> >+    BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_ADD_U32), SOffset)
> >+            .addReg(ScratchOffset)
> >+            .addImm(Offset);
> >+    Offset = 0;
> >+  }
> >+
> >+  if (RanOutOfSGPRs)
> >+    Ctx.emitError("Ran out of SGPRs for spilling VGPRS");
> Errors should be lowercased
> >+
> >+  for (unsigned i = 0, e = NumSubRegs; i != e; ++i, Offset += 4) {
> >+    unsigned SubReg = NumSubRegs > 1 ?
> >+        getPhysRegSubReg(Value, &AMDGPU::VGPR_32RegClass, i) :
> >+        Value;
> >+    bool IsKill = (i == e - 1);
> >+
> >+    BuildMI(*MBB, MI, DL, TII->get(LoadStoreOp))
> >+            .addReg(SubReg, getDefRegState(IsLoad))
> >+            .addReg(RsrcReg, getKillRegState(IsKill))
> >+            .addImm(Offset)
> >+            .addReg(SOffset, getKillRegState(IsKill))
> >+            .addImm(0) // glc
> >+            .addImm(0) // slc
> >+            .addImm(0) // tfe
> >+            .addReg(Value, RegState::Implicit | getDefRegState(IsLoad));
> >+  }
> >+}
> >+
> >  void SIRegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator MI,
> >                                          int SPAdj, unsigned FIOperandNum,
> >                                          RegScavenger *RS) const {
> >@@ -160,7 +239,8 @@ void SIRegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator MI,
> >          BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_READLANE_B32), SubReg)
> >                  .addReg(Spill.VGPR)
> >-                .addImm(Spill.Lane);
> >+                .addImm(Spill.Lane)
> >+                .addReg(MI->getOperand(0).getReg(), RegState::ImplicitDefine);
> >          if (isM0) {
> >            BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_MOV_B32), AMDGPU::M0)
> >                    .addReg(SubReg);
> >@@ -177,71 +257,24 @@ void SIRegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator MI,
> >      case AMDGPU::SI_SPILL_V128_SAVE:
> >      case AMDGPU::SI_SPILL_V96_SAVE:
> >      case AMDGPU::SI_SPILL_V64_SAVE:
> >-    case AMDGPU::SI_SPILL_V32_SAVE: {
> >-      unsigned NumSubRegs = getNumSubRegsForSpillOp(MI->getOpcode());
> >-      unsigned SrcReg = MI->getOperand(0).getReg();
> >-      int64_t Offset = FrameInfo->getObjectOffset(Index);
> >-      unsigned Size = NumSubRegs * 4;
> >-      unsigned TmpReg = RS->scavengeRegister(&AMDGPU::VGPR_32RegClass, MI, 0);
> >-
> >-      for (unsigned i = 0, e = NumSubRegs; i != e; ++i) {
> >-        unsigned SubReg = NumSubRegs > 1 ?
> >-            getPhysRegSubReg(SrcReg, &AMDGPU::VGPR_32RegClass, i) :
> >-            SrcReg;
> >-        Offset += (i * 4);
> >-        MFI->LDSWaveSpillSize = std::max((unsigned)Offset + 4, (unsigned)MFI->LDSWaveSpillSize);
> >-
> >-        unsigned AddrReg = TII->calculateLDSSpillAddress(*MBB, MI, RS, TmpReg,
> >-                                                         Offset, Size);
> >-
> >-        if (AddrReg == AMDGPU::NoRegister) {
> >-           LLVMContext &Ctx = MF->getFunction()->getContext();
> >-           Ctx.emitError("Ran out of VGPRs for spilling VGPRS");
> >-           AddrReg = AMDGPU::VGPR0;
> >-        }
> >-
> >-        // Store the value in LDS
> >-        BuildMI(*MBB, MI, DL, TII->get(AMDGPU::DS_WRITE_B32))
> >-                .addImm(0) // gds
> >-                .addReg(AddrReg, RegState::Kill) // addr
> >-                .addReg(SubReg) // data0
> >-                .addImm(0); // offset
> >-      }
> >-
> >+    case AMDGPU::SI_SPILL_V32_SAVE:
> >+      buildScratchLoadStore(MI, AMDGPU::BUFFER_STORE_DWORD_OFFSET,
> >+                            MI->getOperand(0).getReg(),
> >+                            MI->getOperand(2).getReg(),
> >+                            MI->getOperand(3).getReg(),
> >+                            FrameInfo->getObjectOffset(Index), RS);
> >        MI->eraseFromParent();
> >        break;
> >-    }
> >      case AMDGPU::SI_SPILL_V32_RESTORE:
> >      case AMDGPU::SI_SPILL_V64_RESTORE:
> >      case AMDGPU::SI_SPILL_V128_RESTORE:
> >      case AMDGPU::SI_SPILL_V256_RESTORE:
> >      case AMDGPU::SI_SPILL_V512_RESTORE: {
> >-      unsigned NumSubRegs = getNumSubRegsForSpillOp(MI->getOpcode());
> >-      unsigned DstReg = MI->getOperand(0).getReg();
> >-      int64_t Offset = FrameInfo->getObjectOffset(Index);
> >-      unsigned Size = NumSubRegs * 4;
> >-      unsigned TmpReg = RS->scavengeRegister(&AMDGPU::VGPR_32RegClass, MI, 0);
> >-
> >-      // FIXME: We could use DS_READ_B64 here to optimize for larger registers.
> >-      for (unsigned i = 0, e = NumSubRegs; i != e; ++i) {
> >-        unsigned SubReg = NumSubRegs > 1 ?
> >-            getPhysRegSubReg(DstReg, &AMDGPU::VGPR_32RegClass, i) :
> >-            DstReg;
> >-
> >-        Offset += (i * 4);
> >-        unsigned AddrReg = TII->calculateLDSSpillAddress(*MBB, MI, RS, TmpReg,
> >-                                                          Offset, Size);
> >-        if (AddrReg == AMDGPU::NoRegister) {
> >-           LLVMContext &Ctx = MF->getFunction()->getContext();
> >-           Ctx.emitError("Ran out of VGPRs for spilling VGPRs");
> >-           AddrReg = AMDGPU::VGPR0;
> >-        }
> >-
> >-        BuildMI(*MBB, MI, DL, TII->get(AMDGPU::DS_READ_B32), SubReg)
> >-                .addImm(0) // gds
> >-                .addReg(AddrReg, RegState::Kill) // addr
> >-                .addImm(0); //offset
> >-      }
> >+      buildScratchLoadStore(MI, AMDGPU::BUFFER_LOAD_DWORD_OFFSET,
> >+                            MI->getOperand(0).getReg(),
> >+                            MI->getOperand(2).getReg(),
> >+                            MI->getOperand(3).getReg(),
> >+                            FrameInfo->getObjectOffset(Index), RS);
> Use named operands?
> >        MI->eraseFromParent();
> >        break;
> >      }
> >@@ -452,9 +485,8 @@ unsigned SIRegisterInfo::getPreloadedValue(const MachineFunction &MF,
> >  /// \brief Returns a register that is not used at any point in the function.
> >  ///        If all registers are used, then this function will return
> >  //         AMDGPU::NoRegister.
> >-unsigned SIRegisterInfo::findUnusedVGPR(const MachineRegisterInfo &MRI) const {
> >-
> >-  const TargetRegisterClass *RC = &AMDGPU::VGPR_32RegClass;
> >+unsigned SIRegisterInfo::findUnusedRegister(const MachineRegisterInfo &MRI,
> >+                                           const TargetRegisterClass *RC) const {
> >    for (TargetRegisterClass::iterator I = RC->begin(), E = RC->end();
> >         I != E; ++I) {
> >diff --git a/lib/Target/R600/SIRegisterInfo.h b/lib/Target/R600/SIRegisterInfo.h
> >index c7e54db..f1d78b4 100644
> >--- a/lib/Target/R600/SIRegisterInfo.h
> >+++ b/lib/Target/R600/SIRegisterInfo.h
> >@@ -113,7 +113,14 @@ struct SIRegisterInfo : public AMDGPURegisterInfo {
> >    unsigned getPreloadedValue(const MachineFunction &MF,
> >                               enum PreloadedValue Value) const;
> >-  unsigned findUnusedVGPR(const MachineRegisterInfo &MRI) const;
> >+  unsigned findUnusedRegister(const MachineRegisterInfo &MRI,
> >+                              const TargetRegisterClass *RC) const;
> >+
> >+private:
> >+  void buildScratchLoadStore(MachineBasicBlock::iterator MI,
> >+                             unsigned LoadStoreOp, unsigned Value,
> >+                             unsigned ScratchPtr, unsigned ScratchOffset,
> >+                             int64_t Offset, RegScavenger *RS) const;
> >  };
> >  } // End namespace llvm
> >-- 2.0.4
> >
> >0005-MISched-Fix-moving-stores-across-barriers.patch
> >
> >
> > From c151bc73cdf7e1f40ec90c2d8dbc93ae34673890 Mon Sep 17 00:00:00 2001
> >From: Tom Stellard <thomas.stellard at amd.com>
> >Date: Tue, 2 Dec 2014 17:26:08 +0000
> >Subject: [PATCH 5/6] MISched: Fix moving stores across barriers
> >
> >This fixes an issue with ScheduleDAGInstrs::buildSchedGraph
> >where stores without an underlying object would not be added
> >as a predecessor to the current BarrierChain.
> >---
> >  lib/CodeGen/ScheduleDAGInstrs.cpp  |  9 +++++--
> >  test/CodeGen/R600/store-barrier.ll | 52 ++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 59 insertions(+), 2 deletions(-)
> >  create mode 100644 test/CodeGen/R600/store-barrier.ll
> >
> >diff --git a/lib/CodeGen/ScheduleDAGInstrs.cpp b/lib/CodeGen/ScheduleDAGInstrs.cpp
> >index d8d8422..ee8b5c2 100644
> >--- a/lib/CodeGen/ScheduleDAGInstrs.cpp
> >+++ b/lib/CodeGen/ScheduleDAGInstrs.cpp
> >@@ -794,6 +794,7 @@ void ScheduleDAGInstrs::buildSchedGraph(AliasAnalysis *AA,
> >    for (MachineBasicBlock::iterator MII = RegionEnd, MIE = RegionBegin;
> >         MII != MIE; --MII) {
> >      MachineInstr *MI = std::prev(MII);
> >+
> >      if (MI && DbgMI) {
> >        DbgValues.push_back(std::make_pair(DbgMI, MI));
> >        DbgMI = nullptr;
> >@@ -920,6 +921,12 @@ void ScheduleDAGInstrs::buildSchedGraph(AliasAnalysis *AA,
> >        AliasMemDefs.clear();
> >        AliasMemUses.clear();
> >      } else if (MI->mayStore()) {
> >+      // Add dependence on barrier chain, if needed.
> >+      // There is no point to check aliasing on barrier event. Even if
> >+      // SU and barrier_could_  be reordered, they should not. In addition,
> >+      // we have lost all RejectMemNodes below barrier.
> >+      if (BarrierChain)
> >+        BarrierChain->addPred(SDep(SU, SDep::Barrier));
> >        UnderlyingObjectsVector Objs;
> >        getUnderlyingObjectsForInstr(MI, MFI, Objs);
> >@@ -993,8 +1000,6 @@ void ScheduleDAGInstrs::buildSchedGraph(AliasAnalysis *AA,
> >        // There is no point to check aliasing on barrier event. Even if
> >        // SU and barrier_could_  be reordered, they should not. In addition,
> >        // we have lost all RejectMemNodes below barrier.
> >-      if (BarrierChain)
> >-        BarrierChain->addPred(SDep(SU, SDep::Barrier));
> >      } else if (MI->mayLoad()) {
> >        bool MayAlias = true;
> >        if (MI->isInvariantLoad(AA)) {
> >diff --git a/test/CodeGen/R600/store-barrier.ll b/test/CodeGen/R600/store-barrier.ll
> >new file mode 100644
> >index 0000000..229cd8f
> >--- /dev/null
> >+++ b/test/CodeGen/R600/store-barrier.ll
> >@@ -0,0 +1,52 @@
> >+; RUN: llc -march=r600 -mcpu=SI -verify-machineinstrs -mattr=+load-store-opt -enable-misched < %s | FileCheck  --check-prefix=CHECK %s
> >+; RUN: llc -march=r600 -mcpu=bonaire -verify-machineinstrs -mattr=+load-store-opt -enable-misched < %s | FileCheck  --check-prefix=CHECK %s
> >+
> >+; This test is for a bug in the machine scheduler where stores without
> >+; an underlying object would be moved across the barrier.  In this
> >+; test, the <2 x i8> store will be split into two i8 stores, so they
> >+; won't have an underlying object.
> >+
> >+; CHECK-LABEL: {{^}}test:
> >+; CHECK: ds_write_b8
> >+; CHECK: ds_write_b8
> >+; CHECK: s_barrier
> Should probably check something after the barrier also
> 
> >+; Function Attrs: nounwind
> >+define void @test(<2 x i8> addrspace(3)* nocapture %arg, <2 x i8> addrspace(1)* nocapture readonly %arg1, i32 addrspace(1)* nocapture readonly %arg2, <2 x i8> addrspace(1)* nocapture %arg3, i32 %arg4, i64 %tmp9) #0 {
> >+bb:
> >+  %tmp10 = getelementptr inbounds i32 addrspace(1)* %arg2, i64 %tmp9
> >+  %tmp13 = load i32 addrspace(1)* %tmp10, align 2
> >+  %tmp14 = getelementptr inbounds <2 x i8> addrspace(3)* %arg, i32 %tmp13
> >+  %tmp15 = load <2 x i8> addrspace(3)* %tmp14, align 2
> >+  %tmp16 = add i32 %tmp13, 1
> >+  %tmp17 = getelementptr inbounds <2 x i8> addrspace(3)* %arg, i32 %tmp16
> >+  store <2 x i8> %tmp15, <2 x i8> addrspace(3)* %tmp17, align 2
> >+  tail call void @llvm.AMDGPU.barrier.local() #2
> >+  %tmp25 = load i32 addrspace(1)* %tmp10, align 4
> >+  %tmp26 = sext i32 %tmp25 to i64
> >+  %tmp27 = sext i32 %arg4 to i64
> >+  %tmp28 = getelementptr inbounds <2 x i8> addrspace(3)* %arg, i32 %tmp25, i32 %arg4
> >+  %tmp29 = load i8 addrspace(3)* %tmp28, align 1
> >+  %tmp30 = getelementptr inbounds <2 x i8> addrspace(1)* %arg3, i64 %tmp26, i64 %tmp27
> >+  store i8 %tmp29, i8 addrspace(1)* %tmp30, align 1
> >+  %tmp32 = getelementptr inbounds <2 x i8> addrspace(3)* %arg, i32 %tmp25, i32 0
> >+  %tmp33 = load i8 addrspace(3)* %tmp32, align 1
> >+  %tmp35 = getelementptr inbounds <2 x i8> addrspace(1)* %arg3, i64 %tmp26, i64 0
> >+  store i8 %tmp33, i8 addrspace(1)* %tmp35, align 1
> >+  ret void
> >+}
> >+
> >+; Function Attrs: noduplicate nounwind
> >+declare void @llvm.AMDGPU.barrier.local() #2
> >+
> >+attributes #0 = { nounwind "less-precise-fpmad"="false" "no-frame-pointer-elim"="false" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "no-realign-stack" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }
> >+attributes #1 = { nounwind readnone }
> >+attributes #2 = { noduplicate nounwind }
> >+
> >+!opencl.kernels = !{!0}
> >+
> >+!0 = metadata !{void (<2 x i8> addrspace(3)*, <2 x i8> addrspace(1)*, i32 addrspace(1)*, <2 x i8> addrspace(1)*, i32, i64)* @test}
> >+!3 = metadata !{metadata !4, metadata !4, i64 0}
> >+!4 = metadata !{metadata !"int", metadata !5, i64 0}
> >+!5 = metadata !{metadata !"omnipotent char", metadata !6, i64 0}
> >+!6 = metadata !{metadata !"Simple C/C++ TBAA"}
> >+!7 = metadata !{metadata !5, metadata !5, i64 0}
> >-- 2.0.4
> >
> >0006-R600-SI-Define-a-schedule-model-and-enable-the-gener.patch
> >
> >
> > From db8ef9632e44dc76668758fd83981057e3bcfac1 Mon Sep 17 00:00:00 2001
> >From: Tom Stellard <thomas.stellard at amd.com>
> >Date: Fri, 19 Jul 2013 11:50:00 -0700
> >Subject: [PATCH 6/6] R600/SI: Define a schedule model and enable the generic
> >  machine scheduler
> >
> >The schedule model is not complete yet, and could be improved.
> >---
> >  lib/Target/R600/AMDGPUSubtarget.cpp              | 14 ++++-
> >  lib/Target/R600/AMDGPUSubtarget.h                |  6 +-
> >  lib/Target/R600/Processors.td                    | 24 ++++----
> >  lib/Target/R600/SIInstrFormats.td                | 10 +++-
> >  lib/Target/R600/SIInstructions.td                | 48 ++++++++++++++-
> >  lib/Target/R600/SIRegisterInfo.cpp               | 54 ++++++++++++++++-
> >  lib/Target/R600/SIRegisterInfo.h                 | 12 +++-
> >  lib/Target/R600/SISchedule.td                    | 76 +++++++++++++++++++++++-
> >  test/CodeGen/R600/atomic_cmp_swap_local.ll       |  6 +-
> >  test/CodeGen/R600/ctpop.ll                       |  4 +-
> >  test/CodeGen/R600/ds_read2st64.ll                |  4 +-
> >  test/CodeGen/R600/fceil64.ll                     |  4 +-
> >  test/CodeGen/R600/ffloor.ll                      |  4 +-
> >  test/CodeGen/R600/fmax3.ll                       |  6 +-
> >  test/CodeGen/R600/fmin3.ll                       |  6 +-
> >  test/CodeGen/R600/fneg-fabs.f64.ll               |  2 +-
> >  test/CodeGen/R600/ftrunc.f64.ll                  |  4 +-
> >  test/CodeGen/R600/llvm.memcpy.ll                 | 34 +++++------
> >  test/CodeGen/R600/local-atomics.ll               |  4 +-
> >  test/CodeGen/R600/local-atomics64.ll             |  2 +-
> >  test/CodeGen/R600/local-memory-two-objects.ll    |  4 +-
> >  test/CodeGen/R600/si-triv-disjoint-mem-access.ll |  2 +-
> >  test/CodeGen/R600/smrd.ll                        | 10 ++--
> >  test/CodeGen/R600/wait.ll                        |  5 +-
> >  test/CodeGen/R600/zero_extend.ll                 |  2 +-
> >  25 files changed, 271 insertions(+), 76 deletions(-)
> >
> >diff --git a/lib/Target/R600/AMDGPUSubtarget.cpp b/lib/Target/R600/AMDGPUSubtarget.cpp
> >index 9d09a19..5a3785f 100644
> >--- a/lib/Target/R600/AMDGPUSubtarget.cpp
> >+++ b/lib/Target/R600/AMDGPUSubtarget.cpp
> >@@ -19,8 +19,7 @@
> >  #include "SIInstrInfo.h"
> >  #include "SIISelLowering.h"
> >  #include "llvm/ADT/SmallString.h"
> >-
> >-#include "llvm/ADT/SmallString.h"
> >+#include "llvm/CodeGen/MachineScheduler.h"
> >  using namespace llvm;
> >@@ -107,3 +106,14 @@ unsigned AMDGPUSubtarget::getStackEntrySize() const {
> >      llvm_unreachable("Illegal wavefront size.");
> >    }
> >  }
> >+
> >+void AMDGPUSubtarget::overrideSchedPolicy(MachineSchedPolicy &Policy,
> >+                                          MachineInstr *begin,
> >+                                          MachineInstr *end,
> >+                                          unsigned NumRegionInstrs) const {
> >+  if (getGeneration() >= SOUTHERN_ISLANDS) {
> >+    Policy.ShouldTrackPressure = true;;
> Double ;
> >+    Policy.OnlyTopDown = false;
> >+    Policy.OnlyBottomUp = false;
> Is there a reason for selecting this? There should be a comment
> explaining why the policy is what it is.
> 
> >+  }
> >+}
> >diff --git a/lib/Target/R600/AMDGPUSubtarget.h b/lib/Target/R600/AMDGPUSubtarget.h
> >index f71d80a..3e44c66 100644
> >--- a/lib/Target/R600/AMDGPUSubtarget.h
> >+++ b/lib/Target/R600/AMDGPUSubtarget.h
> >@@ -199,9 +199,13 @@ public:
> >    }
> >    bool enableMachineScheduler() const override {
> >-    return getGeneration() <= NORTHERN_ISLANDS;
> >+    return true;
> >    }
> >+  void overrideSchedPolicy(MachineSchedPolicy &Policy,
> >+                           MachineInstr *begin, MachineInstr *end,
> >+                           unsigned NumRegionInstrs) const override;
> >+
> >    // Helper functions to simplify if statements
> >    bool isTargetELF() const {
> >      return false;
> >diff --git a/lib/Target/R600/Processors.td b/lib/Target/R600/Processors.td
> >index ce17d7c..17422f9 100644
> >--- a/lib/Target/R600/Processors.td
> >+++ b/lib/Target/R600/Processors.td
> >@@ -83,28 +83,30 @@ def : Proc<"cayman",     R600_VLIW4_Itin,
> >  // Southern Islands
> >  //===----------------------------------------------------------------------===//
> >-def : Proc<"SI",         SI_Itin, [FeatureSouthernIslands]>;
> >+// FIXME: Which of these should use the half speed?
> I believe this can be different for different versions of tahiti
> also. This should probably be a subtarget feature settable by the
> driver.
> 
> >-def : Proc<"tahiti",     SI_Itin, [FeatureSouthernIslands]>;
> >+def : ProcessorModel<"SI",         SIFullSpeedModel, [FeatureSouthernIslands]>;
> >-def : Proc<"pitcairn",   SI_Itin, [FeatureSouthernIslands]>;
> >+def : ProcessorModel<"tahiti",     SIFullSpeedModel, [FeatureSouthernIslands]>;
> >-def : Proc<"verde",      SI_Itin, [FeatureSouthernIslands]>;
> >+def : ProcessorModel<"pitcairn",   SIFullSpeedModel, [FeatureSouthernIslands]>;
> verde, bonaire and pitcairn are all 1/16th FP64
> >-def : Proc<"oland",      SI_Itin, [FeatureSouthernIslands]>;
> >+def : ProcessorModel<"verde",      SIFullSpeedModel, [FeatureSouthernIslands]>;
> >-def : Proc<"hainan",     SI_Itin, [FeatureSouthernIslands]>;
> >+def : ProcessorModel<"oland",      SIFullSpeedModel, [FeatureSouthernIslands]>;
> >+
> >+def : ProcessorModel<"hainan",     SIFullSpeedModel, [FeatureSouthernIslands]>;
> >  //===----------------------------------------------------------------------===//
> >  // Sea Islands
> >  //===----------------------------------------------------------------------===//
> >-def : Proc<"bonaire",    SI_Itin, [FeatureSeaIslands]>;
> >+def : ProcessorModel<"bonaire",    SIFullSpeedModel, [FeatureSeaIslands]>;
> >-def : Proc<"kabini",     SI_Itin, [FeatureSeaIslands]>;
> >+def : ProcessorModel<"kabini",     SIFullSpeedModel, [FeatureSeaIslands]>;
> >-def : Proc<"kaveri",     SI_Itin, [FeatureSeaIslands]>;
> >+def : ProcessorModel<"kaveri",     SIFullSpeedModel, [FeatureSeaIslands]>;
> >-def : Proc<"hawaii",     SI_Itin, [FeatureSeaIslands]>;
> >+def : ProcessorModel<"hawaii",     SIFullSpeedModel, [FeatureSeaIslands]>;
> Hawaii is 1/8th FP64, but 1/4th on the workstation versions
> >-def : Proc<"mullins",    SI_Itin, [FeatureSeaIslands]>;
> >+def : ProcessorModel<"mullins",    SIFullSpeedModel, [FeatureSeaIslands]>;
> >diff --git a/lib/Target/R600/SIInstrFormats.td b/lib/Target/R600/SIInstrFormats.td
> >index ee1a52b..4b688e0 100644
> >--- a/lib/Target/R600/SIInstrFormats.td
> >+++ b/lib/Target/R600/SIInstrFormats.td
> >@@ -46,6 +46,7 @@ class InstSI <dag outs, dag ins, string asm, list<dag> pattern> :
> >    // Most instructions require adjustments after selection to satisfy
> >    // operand requirements.
> >    let hasPostISelHook = 1;
> >+  let SchedRW = [Write32Bit];
> >  }
> >  class Enc32 {
> >@@ -161,6 +162,8 @@ class SMRDe <bits<5> op, bits<1> imm> : Enc32 {
> >    let Inst{31-27} = 0x18; //encoding
> >  }
> >+let SchedRW = [WriteSALU] in {
> >+
> >  class SOP1 <bits<8> op, dag outs, dag ins, string asm, list<dag> pattern> :
> >      InstSI<outs, ins, asm, pattern>, SOP1e <op> {
> >@@ -216,6 +219,8 @@ class SOPP <bits<7> op, dag ins, string asm, list<dag> pattern = []> :
> >    let UseNamedOperandTable = 1;
> >  }
> >+} // let SchedRW = [WriteSALU]
> >+
> >  class SMRD <dag outs, dag ins, string asm, list<dag> pattern> :
> >      InstSI<outs, ins, asm, pattern> {
> >@@ -225,6 +230,7 @@ class SMRD <dag outs, dag ins, string asm, list<dag> pattern> :
> >    let mayLoad = 1;
> >    let hasSideEffects = 0;
> >    let UseNamedOperandTable = 1;
> >+  let SchedRW = [WriteSMEM];
> >  }
> >  //===----------------------------------------------------------------------===//
> >@@ -547,6 +553,7 @@ class DS <bits<8> op, dag outs, dag ins, string asm, list<dag> pattern> :
> >    let LGKM_CNT = 1;
> >    let UseNamedOperandTable = 1;
> >    let DisableEncoding = "$m0";
> >+  let SchedRW = [WriteLDS];
> >  }
> >  class MUBUF <bits<7> op, dag outs, dag ins, string asm, list<dag> pattern> :
> >@@ -558,6 +565,7 @@ class MUBUF <bits<7> op, dag outs, dag ins, string asm, list<dag> pattern> :
> >    let hasSideEffects = 0;
> >    let UseNamedOperandTable = 1;
> >+  let SchedRW = [WriteVMEM];
> >  }
> >  class MTBUF <dag outs, dag ins, string asm, list<dag> pattern> :
> >@@ -569,6 +577,7 @@ class MTBUF <dag outs, dag ins, string asm, list<dag> pattern> :
> >    let neverHasSideEffects = 1;
> >    let UseNamedOperandTable = 1;
> >+  let SchedRW = [WriteVMEM];
> >  }
> >  class FLAT <bits<7> op, dag outs, dag ins, string asm, list<dag> pattern> :
> >@@ -597,5 +606,4 @@ class MIMG <bits<7> op, dag outs, dag ins, string asm, list<dag> pattern> :
> >  }
> >-
> >  } // End Uses = [EXEC]
> >diff --git a/lib/Target/R600/SIInstructions.td b/lib/Target/R600/SIInstructions.td
> >index 3a969e7..fd860e5 100644
> >--- a/lib/Target/R600/SIInstructions.td
> >+++ b/lib/Target/R600/SIInstructions.td
> >@@ -1160,6 +1160,8 @@ defm V_MOV_B32 : VOP1Inst <vop1<0x1>, "v_mov_b32", VOP_I32_I32>;
> >  let Uses = [EXEC] in {
> >+// FIXME: Specify SchedRW for READFIRSTLANE+B32
> Typo in the instruction name in the comment ('+' instead of '_')
> >+
> >  def V_READFIRSTLANE_B32 : VOP1 <
> >    0x00000002,
> >    (outs SReg_32:$vdst),
> >@@ -1170,6 +1172,8 @@ def V_READFIRSTLANE_B32 : VOP1 <
> >  }
> >+let SchedRW = [WriteConversion] in {
> >+
> >  defm V_CVT_I32_F64 : VOP1Inst <vop1<0x3>, "v_cvt_i32_f64",
> >    VOP_I32_F64, fp_to_sint
> >  >;
> >@@ -1223,6 +1227,8 @@ defm V_CVT_F64_U32 : VOP1Inst <vop1<0x16>, "v_cvt_f64_u32",
> >    VOP_F64_I32, uint_to_fp
> >  >;
> >+} // let SchedRW = [WriteConversion]
> >+
> >  defm V_FRACT_F32 : VOP1Inst <vop1<0x20>, "v_fract_f32",
> >    VOP_F32_F32, AMDGPUfract
> >  >;
> >@@ -1241,6 +1247,9 @@ defm V_FLOOR_F32 : VOP1Inst <vop1<0x24>, "v_floor_f32",
> >  defm V_EXP_F32 : VOP1Inst <vop1<0x25>, "v_exp_f32",
> >    VOP_F32_F32, fexp2
> >  >;
> >+
> >+let SchedRW = [WriteFloatTrans] in {
> I don't think WriteFloatTrans or some of these others are useful
> ways to characterize the instructions. Are these standard
> classifications the generic scheduler uses? More useful would be
> something like QuarterRate32

I kept these separate because I wasn't sure the latencies were consistent
across all speeds.  For example, on single-speed devices FMA and double
operations both take 16 cycles, but on full-speed devices FMA is 1 cycle and
doubles are 4.  I thought that in the future I might find that other classes,
like WriteFloatTrans, have the same kind of differences.

I don't mind merging WriteIntMul, WriteFloatFMA, and WriteConversion together.
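
For reference, the reason keeping the classes separate matters is that each
machine model can bind its own latency to the same SchedWrite.  A rough
TableGen sketch of that idea follows; only SIFullSpeedModel and the FMA/double
latencies mentioned above come from this series, while the second model's name
and the other details are illustrative:

  // Two write types that happen to share a latency on single-speed parts
  // but differ on full-speed parts.
  def WriteFloatFMA : SchedWrite;
  def WriteDouble   : SchedWrite;   // illustrative name

  def SIFullSpeedModel   : SchedMachineModel;
  def SISingleSpeedModel : SchedMachineModel;  // illustrative name

  // Full-speed devices: FMA is full rate, double ops are quarter rate.
  let SchedModel = SIFullSpeedModel in {
    def : WriteRes<WriteFloatFMA, []> { let Latency = 1; }
    def : WriteRes<WriteDouble,   []> { let Latency = 4; }
  }

  // Single-speed devices: both take 16 cycles, so merging the two classes
  // would only be safe for this model.
  let SchedModel = SISingleSpeedModel in {
    def : WriteRes<WriteFloatFMA, []> { let Latency = 16; }
    def : WriteRes<WriteDouble,   []> { let Latency = 16; }
  }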

-Tom


