<html>

  <head>

    <meta content="text/html; charset=windows-1252"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <br>

    <div class="moz-cite-prefix">On 12/02/2014 04:47 PM, Tom Stellard

      wrote:<br>

    </div>

    <blockquote cite="mid:20141202214751.GA4494@freedesktop.org"

      type="cite">

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">Hi,

The attached patches enable the machine scheduler on SI subtargets as well as

enable register spilling for compute.  Enabling the scheduler improves performance

on the Unigine demos by 10% and Luxmark by 20%.

There is still a lot of room for improvement in the scheduler definitions and

the associated callbacks.  These patches just enable the scheduler and implement

a simple machine model.

-Tom

</pre>

      </div>

      <br>

      <fieldset class="mimeAttachmentHeader"><legend

          class="mimeAttachmentHeaderName">0001-R600-SI-Don-t-run-SI-passes-on-R600-subtargets.patch</legend></fieldset>

      <br>

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">From d30579a680c62e5dcb2e2c9e02ee203cf4d7bba3 Mon Sep 17 00:00:00 2001

From: Tom Stellard <a moz-do-not-send="true" class="moz-txt-link-rfc2396E" href="mailto:thomas.stellard@amd.com"><thomas.stellard@amd.com></a>

Date: Fri, 28 Nov 2014 14:38:48 +0000

Subject: [PATCH 1/6] R600/SI: Don't run SI passes on R600 subtargets

---

 lib/Target/R600/AMDGPUTargetMachine.cpp | 6 +++---

 1 file changed, 3 insertions(+), 3 deletions(-)</pre>

      </div>

    </blockquote>

    <br>

    LGTM<br>

    <blockquote cite="mid:20141202214751.GA4494@freedesktop.org"

      type="cite">

      <fieldset class="mimeAttachmentHeader"><legend

          class="mimeAttachmentHeaderName">0002-R600-SI-Move-SIInsertWaits-into-AMDGPUPassConfig-add.patch</legend></fieldset>

      <br>

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">From a25b877f29865facce0aff53295356e336bd2b0a Mon Sep 17 00:00:00 2001

From: Tom Stellard <a moz-do-not-send="true" class="moz-txt-link-rfc2396E" href="mailto:thomas.stellard@amd.com"><thomas.stellard@amd.com></a>

Date: Fri, 28 Nov 2014 14:34:24 +0000

Subject: [PATCH 2/6] R600/SI: Move SIInsertWaits into

 AMDGPUPassConfig::addPreSched2()

This pass needs to be run after PrologEpilogInserter, because

that pass may inserter spill code which reads or writes memory.

---

 lib/Target/R600/AMDGPUTargetMachine.cpp | 4 +++-

 1 file changed, 3 insertions(+), 1 deletion(-)</pre>

      </div>

    </blockquote>

    <br>

    LGTM


    <blockquote cite="mid:20141202214751.GA4494@freedesktop.org"

      type="cite"><br>

      <fieldset class="mimeAttachmentHeader"><legend

          class="mimeAttachmentHeaderName">0003-R600-SI-Set-MayStore-0-on-MUBUF-loads.patch</legend></fieldset>

      <br>

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">From c0688d069d3de09a75241ec2f0f0cc605320f3f8 Mon Sep 17 00:00:00 2001

From: Tom Stellard <a moz-do-not-send="true" class="moz-txt-link-rfc2396E" href="mailto:thomas.stellard@amd.com"><thomas.stellard@amd.com></a>

Date: Mon, 24 Nov 2014 20:56:28 +0000

Subject: [PATCH 3/6] R600/SI: Set MayStore = 0 on MUBUF loads

---

 lib/Target/R600/SIInstrInfo.td | 2 +-

 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/Target/R600/SIInstrInfo.td b/lib/Target/R600/SIInstrInfo.td

index 392c272..8cc84c9 100644

--- a/lib/Target/R600/SIInstrInfo.td

+++ b/lib/Target/R600/SIInstrInfo.td

@@ -1187,7 +1187,7 @@ multiclass MUBUF_Load_Helper <bits<7> op, string asm, RegisterClass regClass,

                               ValueType load_vt = i32,

                               SDPatternOperator ld = null_frag> {

-  let lds = 0, mayLoad = 1 in {

+  let lds = 0, mayLoad = 1, mayStore = 0 in {

     let addr64 = 0 in {

<div class="moz-txt-sig">-- 

2.0.4

</div></pre>

      </div>

    </blockquote>

    LGTM. Shouldn't this be getting set in the base MUBUF instruction

    class? Why do these need to be set again here?<br>

    <br>

    <br>

    <blockquote cite="mid:20141202214751.GA4494@freedesktop.org"

      type="cite">


      <br>

      <fieldset class="mimeAttachmentHeader"><legend

          class="mimeAttachmentHeaderName">0004-R600-SI-Spill-VGPRs-to-scratch-space-for-compute-sha.patch</legend></fieldset>

      <br>

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">From 2688eabde67974f5720c9abc99777a97923c5e8e Mon Sep 17 00:00:00 2001

From: Tom Stellard <a moz-do-not-send="true" class="moz-txt-link-rfc2396E" href="mailto:thomas.stellard@amd.com"><thomas.stellard@amd.com></a>

Date: Thu, 27 Nov 2014 21:13:48 +0000

Subject: [PATCH 4/6] R600/SI: Spill VGPRs to scratch space for compute shaders

---

 lib/Target/R600/AMDGPU.h                  |   1 +

 lib/Target/R600/AMDGPUTargetMachine.cpp   |   1 +

 lib/Target/R600/CMakeLists.txt            |   1 +

 lib/Target/R600/SIInstrInfo.cpp           |  33 ++++-

 lib/Target/R600/SIInstrInfo.td            |   2 +

 lib/Target/R600/SIInstructions.td         |  10 +-

 lib/Target/R600/SIMachineFunctionInfo.cpp |   3 +-

 lib/Target/R600/SIMachineFunctionInfo.h   |   3 +

 lib/Target/R600/SIPrepareScratchRegs.cpp  | 198 ++++++++++++++++++++++++++++++

 lib/Target/R600/SIRegisterInfo.cpp        | 156 +++++++++++++----------

 lib/Target/R600/SIRegisterInfo.h          |   9 +-

 11 files changed, 344 insertions(+), 73 deletions(-)

 create mode 100644 lib/Target/R600/SIPrepareScratchRegs.cpp

diff --git a/lib/Target/R600/AMDGPU.h b/lib/Target/R600/AMDGPU.h

index 13379e7..6819808 100644

--- a/lib/Target/R600/AMDGPU.h

+++ b/lib/Target/R600/AMDGPU.h

@@ -47,6 +47,7 @@ FunctionPass *createSIFixSGPRCopiesPass(TargetMachine &tm);

 FunctionPass *createSIFixSGPRLiveRangesPass();

 FunctionPass *createSICodeEmitterPass(formatted_raw_ostream &OS);

 FunctionPass *createSIInsertWaits(TargetMachine &tm);

+FunctionPass *createSIPrepareScratchRegs();

 void initializeSIFoldOperandsPass(PassRegistry &);

 extern char &SIFoldOperandsID;

diff --git a/lib/Target/R600/AMDGPUTargetMachine.cpp b/lib/Target/R600/AMDGPUTargetMachine.cpp

index d4ee738..c035af0 100644

--- a/lib/Target/R600/AMDGPUTargetMachine.cpp

+++ b/lib/Target/R600/AMDGPUTargetMachine.cpp

@@ -190,6 +190,7 @@ bool AMDGPUPassConfig::addPostRegAlloc() {

   const AMDGPUSubtarget &ST = TM->getSubtarget<AMDGPUSubtarget>();

   if (ST.getGeneration() > AMDGPUSubtarget::NORTHERN_ISLANDS) {

+    addPass(createSIPrepareScratchRegs());

     addPass(createSIShrinkInstructionsPass());

   }

   return false;

diff --git a/lib/Target/R600/CMakeLists.txt b/lib/Target/R600/CMakeLists.txt

index 3b703e7..5a4bae2 100644

--- a/lib/Target/R600/CMakeLists.txt

+++ b/lib/Target/R600/CMakeLists.txt

@@ -51,6 +51,7 @@ add_llvm_target(R600CodeGen

   SILowerControlFlow.cpp

   SILowerI1Copies.cpp

   SIMachineFunctionInfo.cpp

+  SIPrepareScratchRegs.cpp

   SIRegisterInfo.cpp

   SIShrinkInstructions.cpp

   SITypeRewriter.cpp

diff --git a/lib/Target/R600/SIInstrInfo.cpp b/lib/Target/R600/SIInstrInfo.cpp

index 1a0010c..acdb0fa 100644

--- a/lib/Target/R600/SIInstrInfo.cpp

+++ b/lib/Target/R600/SIInstrInfo.cpp

@@ -426,8 +426,7 @@ static bool shouldTryToSpillVGPRs(MachineFunction *MF) {

   // FIXME: Even though it can cause problems, we need to enable

   // spilling at -O0, since the fast register allocator always

   // spills registers that are live at the end of blocks.

-  return MFI->getShaderType() == ShaderType::COMPUTE &&

-         TM.getOptLevel() == CodeGenOpt::None;

+  return MFI->getShaderType() == ShaderType::COMPUTE;

 </pre>

      </div>

    </blockquote>

    I still don't think conditionally enabling spilling makes any sense.

    It just changes how it breaks for non-compute shaders.<br>

    <br>

    <blockquote cite="mid:20141202214751.GA4494@freedesktop.org"

      type="cite">

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">

 }

@@ -438,7 +437,9 @@ void SIInstrInfo::storeRegToStackSlot(MachineBasicBlock &MBB,

                                       const TargetRegisterClass *RC,

                                       const TargetRegisterInfo *TRI) const {

   MachineFunction *MF = MBB.getParent();

+  SIMachineFunctionInfo *MFI = MF->getInfo<SIMachineFunctionInfo>();

   MachineFrameInfo *FrameInfo = MF->getFrameInfo();

+  MachineRegisterInfo &MRI = MBB.getParent()->getRegInfo();

   DebugLoc DL = MBB.findDebugLoc(MI);

   int Opcode = -1;

@@ -454,6 +455,19 @@ void SIInstrInfo::storeRegToStackSlot(MachineBasicBlock &MBB,

       case 512: Opcode = AMDGPU::SI_SPILL_S512_SAVE; break;

     }

   } else if(shouldTryToSpillVGPRs(MF) && RI.hasVGPRs(RC)) {

+    MFI->setHasSpilledVGPRs();

+#if 0

+    unsigned ScratchPtr =

+        RI.getPreloadedValue(*MF, SIRegisterInfo::SCRATCH_PTR);

+    unsigned ScratchOffset =

+        RI.getPreloadedValue(*MF, SIRegisterInfo::SCRATCH_WAVE_OFFSET);

+    if (!MRI.isLiveIn(ScratchPtr))

+      MRI.addLiveIn(ScratchPtr);

+

+    if (!MRI.isLiveIn(ScratchOffset))

+      MRI.addLiveIn(ScratchOffset);

+#endif

+</pre>

      </div>

    </blockquote>

    The #if 0 block is dead code; please remove it.<br>

    <br>

    <blockquote cite="mid:20141202214751.GA4494@freedesktop.org"

      type="cite">

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">

     switch(RC->getSize() * 8) {

       case 32: Opcode = AMDGPU::SI_SPILL_V32_SAVE; break;

       case 64: Opcode = AMDGPU::SI_SPILL_V64_SAVE; break;

@@ -468,7 +482,11 @@ void SIInstrInfo::storeRegToStackSlot(MachineBasicBlock &MBB,

     FrameInfo->setObjectAlignment(FrameIndex, 4);

     BuildMI(MBB, MI, DL, get(Opcode))

             .addReg(SrcReg)

-            .addFrameIndex(FrameIndex);

+            .addFrameIndex(FrameIndex)

+            // Place-holder registers, these will be filled in by

+            // SIPrepareScratchRegs.</pre>

      </div>

    </blockquote>

    Why do you need to reserve physical registers for this? Can you

    define virtual registers with IMPLICIT_DEF, or virtual registers

    with a spill register class for the pseudo instructions?<br>

    <blockquote cite="mid:20141202214751.GA4494@freedesktop.org"

      type="cite">

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">

+            .addReg(AMDGPU::SGPR0_SGPR1, RegState::Undef)

+            .addReg(AMDGPU::SGPR0, RegState::Undef);

   } else {

     LLVMContext &Ctx = MF->getFunction()->getContext();

     Ctx.emitError("SIInstrInfo::storeRegToStackSlot - Do not know how to"

@@ -510,7 +528,12 @@ void SIInstrInfo::loadRegFromStackSlot(MachineBasicBlock &MBB,

   if (Opcode != -1) {

     FrameInfo->setObjectAlignment(FrameIndex, 4);

     BuildMI(MBB, MI, DL, get(Opcode), DestReg)

-            .addFrameIndex(FrameIndex);

+            .addFrameIndex(FrameIndex)

+            // Place-holder registers, these will be filled in by

+            // SIPrepareScratchRegs.

+            .addReg(AMDGPU::SGPR0_SGPR1, RegState::Undef)

+            .addReg(AMDGPU::SGPR0, RegState::Undef);

+

   } else {

     LLVMContext &Ctx = MF->getFunction()->getContext();

     Ctx.emitError("SIInstrInfo::loadRegFromStackSlot - Do not know how to"

@@ -541,7 +564,7 @@ unsigned SIInstrInfo::calculateLDSSpillAddress(MachineBasicBlock &MBB,

     MachineBasicBlock::iterator Insert = Entry.front();

     DebugLoc DL = Insert->getDebugLoc();

-    TIDReg = RI.findUnusedVGPR(MF->getRegInfo());

+    TIDReg = RI.findUnusedRegister(MF->getRegInfo(), &AMDGPU::VGPR_32RegClass);

     if (TIDReg == AMDGPU::NoRegister)

       return TIDReg;

diff --git a/lib/Target/R600/SIInstrInfo.td b/lib/Target/R600/SIInstrInfo.td

index 8cc84c9..afdc200 100644

--- a/lib/Target/R600/SIInstrInfo.td

+++ b/lib/Target/R600/SIInstrInfo.td

@@ -1240,6 +1240,7 @@ multiclass MUBUF_Load_Helper <bits<7> op, string asm, RegisterClass regClass,

 multiclass MUBUF_Store_Helper <bits<7> op, string name, RegisterClass vdataClass,

                           ValueType store_vt, SDPatternOperator st> {

+  let mayLoad = 0, mayStore = 1 in {

   let addr64 = 0, lds = 0 in {

     def "" : MUBUF <

@@ -1298,6 +1299,7 @@ multiclass MUBUF_Store_Helper <bits<7> op, string name, RegisterClass vdataClass

       let tfe = 0;

       let soffset = 128; // ZERO

    }

+   } // End mayLoad = 0, mayStore = 1

 }

 class FLAT_Load_Helper <bits<7> op, string asm, RegisterClass regClass> :

diff --git a/lib/Target/R600/SIInstructions.td b/lib/Target/R600/SIInstructions.td

index 00ce9bf..3a969e7 100644

--- a/lib/Target/R600/SIInstructions.td

+++ b/lib/Target/R600/SIInstructions.td

@@ -1856,13 +1856,14 @@ multiclass SI_SPILL_SGPR <RegisterClass sgpr_class> {

   def _SAVE : InstSI <

     (outs),

-    (ins sgpr_class:$src, i32imm:$frame_idx),

+    (ins sgpr_class:$src, i32imm:$frame_idx, SReg_64:$scratch_ptr,

+         SReg_32:$scratch_offset),

     "", []

   >;

   def _RESTORE : InstSI <

     (outs sgpr_class:$dst),

-    (ins i32imm:$frame_idx),

+    (ins i32imm:$frame_idx, SReg_64:$scratch_ptr, SReg_32:$scratch_offset),

     "", []

   >;

@@ -1877,13 +1878,14 @@ defm SI_SPILL_S512 : SI_SPILL_SGPR <SReg_512>;

 multiclass SI_SPILL_VGPR <RegisterClass vgpr_class> {

   def _SAVE : InstSI <

     (outs),

-    (ins vgpr_class:$src, i32imm:$frame_idx),

+    (ins vgpr_class:$src, i32imm:$frame_idx, SReg_64:$scratch_ptr,

+         SReg_32:$scratch_offset),

     "", []

   >;

   def _RESTORE : InstSI <

     (outs vgpr_class:$dst),

-    (ins i32imm:$frame_idx),

+    (ins i32imm:$frame_idx, SReg_64:$scratch_ptr, SReg_32:$scratch_offset),

     "", []

   >;

 }

diff --git a/lib/Target/R600/SIMachineFunctionInfo.cpp b/lib/Target/R600/SIMachineFunctionInfo.cpp

index d58f31d..198dd56 100644

--- a/lib/Target/R600/SIMachineFunctionInfo.cpp

+++ b/lib/Target/R600/SIMachineFunctionInfo.cpp

@@ -29,6 +29,7 @@ void SIMachineFunctionInfo::anchor() {}

 SIMachineFunctionInfo::SIMachineFunctionInfo(const MachineFunction &MF)

   : AMDGPUMachineFunction(MF),

     TIDReg(AMDGPU::NoRegister),

+    HasSpilledVGPRs(false),

     PSInputAddr(0),

     NumUserSGPRs(0),

     LDSWaveSpillSize(0) { }

@@ -50,7 +51,7 @@ SIMachineFunctionInfo::SpilledReg SIMachineFunctionInfo::getSpilledReg(

   struct SpilledReg Spill;

   if (!LaneVGPRs.count(LaneVGPRIdx)) {

-    unsigned LaneVGPR = TRI->findUnusedVGPR(MRI);

+    unsigned LaneVGPR = TRI->findUnusedRegister(MRI, &AMDGPU::VGPR_32RegClass);

     LaneVGPRs[LaneVGPRIdx] = LaneVGPR;

     MRI.setPhysRegUsed(LaneVGPR);

diff --git a/lib/Target/R600/SIMachineFunctionInfo.h b/lib/Target/R600/SIMachineFunctionInfo.h

index 6bb8f9d..7185271 100644

--- a/lib/Target/R600/SIMachineFunctionInfo.h

+++ b/lib/Target/R600/SIMachineFunctionInfo.h

@@ -29,6 +29,7 @@ class SIMachineFunctionInfo : public AMDGPUMachineFunction {

   void anchor() override;

   unsigned TIDReg;

+  bool HasSpilledVGPRs;

 public:

@@ -52,6 +53,8 @@ public:

   bool hasCalculatedTID() const { return TIDReg != AMDGPU::NoRegister; };

   unsigned getTIDReg() const { return TIDReg; };

   void setTIDReg(unsigned Reg) { TIDReg = Reg; }

+  bool hasSpilledVGPRs() const { return HasSpilledVGPRs; }

+  void setHasSpilledVGPRs(bool Spill = true) { HasSpilledVGPRs = Spill; }

   unsigned getMaximumWorkGroupSize(const MachineFunction &MF) const;

 };

diff --git a/lib/Target/R600/SIPrepareScratchRegs.cpp b/lib/Target/R600/SIPrepareScratchRegs.cpp

new file mode 100644

index 0000000..32010f0

--- /dev/null

+++ b/lib/Target/R600/SIPrepareScratchRegs.cpp

@@ -0,0 +1,198 @@

+//===-- SIPrepareScratchRegs.cpp - Use predicates for control flow --------===//

+//

+//                     The LLVM Compiler Infrastructure

+//

+// This file is distributed under the University of Illinois Open Source

+// License. See LICENSE.TXT for details.

+//

+//===----------------------------------------------------------------------===//

+//

+/// \file

+/// 

+/// This pass loads scratch pointer and scratch offset into a register or a

+/// frame index which can be used anywhere in the program.  These values will

+/// be used for spilling VGPRs.

+///

+//===----------------------------------------------------------------------===//

+

+#include "AMDGPU.h"

+#include "AMDGPUSubtarget.h"

+#include "SIDefines.h"

+#include "SIInstrInfo.h"

+#include "SIMachineFunctionInfo.h"

+#include "llvm/CodeGen/MachineFrameInfo.h"

+#include "llvm/CodeGen/MachineFunction.h"

+#include "llvm/CodeGen/MachineFunctionPass.h"

+#include "llvm/CodeGen/MachineInstrBuilder.h"

+#include "llvm/CodeGen/MachineRegisterInfo.h"

+#include "llvm/CodeGen/RegisterScavenging.h"

+#include "llvm/IR/Function.h"

+#include "llvm/IR/LLVMContext.h"

+

+using namespace llvm;

+

+namespace {

+

+class SIPrepareScratchRegs : public MachineFunctionPass {

+

+private:

+  static char ID;

+

+public:

+  SIPrepareScratchRegs() : MachineFunctionPass(ID) { }

+

+  bool runOnMachineFunction(MachineFunction &MF) override;

+

+  const char *getPassName() const override {

+    return "SI prepare scratch registers";

+  }

+

+};

+

+} // End anonymous namespace

+

+char SIPrepareScratchRegs::ID = 0;

+

+FunctionPass *llvm::createSIPrepareScratchRegs() {

+  return new SIPrepareScratchRegs();

+}

+

+// FIXME: Insert waits listed in Table 4.2 "Required User-Inserted Wait States"

+// around other non-memory instructions.</pre>

      </div>

    </blockquote>

    Was this comment copied from another pass? It describes inserting

    wait states, not preparing scratch registers.<br>

    <br>

    <blockquote cite="mid:20141202214751.GA4494@freedesktop.org"

      type="cite">

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">

+bool SIPrepareScratchRegs::runOnMachineFunction(MachineFunction &MF) {

+  SIMachineFunctionInfo *MFI = MF.getInfo<SIMachineFunctionInfo>();

+  const SIInstrInfo *TII =

+      static_cast<const SIInstrInfo *>(MF.getSubtarget().getInstrInfo());

+  const SIRegisterInfo *TRI = &TII->getRegisterInfo();

+  MachineRegisterInfo &MRI = MF.getRegInfo();

+  MachineFrameInfo *FrameInfo = MF.getFrameInfo();

+  MachineBasicBlock *Entry = MF.begin();

+  MachineBasicBlock::iterator I = Entry->begin();

+  DebugLoc DL = I->getDebugLoc();

+

+  // FIXME: If we don't have enough VGPRs for SGPR spilling we will need to do

+  // run this pass.</pre>

      </div>

    </blockquote>

    Grammar: "we will need to do"<br>

    <blockquote cite="mid:20141202214751.GA4494@freedesktop.org"

      type="cite">

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">

+  if (!MFI->hasSpilledVGPRs())

+    return false;

+

+  unsigned ScratchPtrPreloadReg =

+      TRI->getPreloadedValue(MF, SIRegisterInfo::SCRATCH_PTR);

+  unsigned ScratchOffsetPreloadReg =

+      TRI->getPreloadedValue(MF, SIRegisterInfo::SCRATCH_WAVE_OFFSET);

+

+  if (!Entry->isLiveIn(ScratchPtrPreloadReg))

+    Entry->addLiveIn(ScratchPtrPreloadReg);

+

+  if (!Entry->isLiveIn(ScratchOffsetPreloadReg))

+    Entry->addLiveIn(ScratchOffsetPreloadReg);

+

+  // Load the scratch pointer

+  unsigned ScratchPtrReg =

+      TRI->findUnusedRegister(MRI, &AMDGPU::SGPR_64RegClass);

+  int ScratchPtrFI = ~0;</pre>

      </div>

    </blockquote>

    Initialize to -1 since it's signed<br>
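    A minimal standalone sketch of the point, not code from the patch: for a signed int, ~0 and -1 are the same value on two's-complement targets, so the suggestion is about readability rather than behavior. The names kNoFrameIndex and hasFrameIndex are hypothetical and exist only for illustration.<br>

```cpp
#include <cassert>
#include <limits>

// Hypothetical sentinel (not from the patch): "no frame index assigned yet".
constexpr int kNoFrameIndex = -1;

// On two's-complement targets ~0 and -1 are the same int value, so the
// patch's initializer is not a bug, only a less direct spelling.
static_assert(~0 == -1, "expects two's-complement int");

// For an unsigned type the all-ones pattern is the maximum value, which is
// why ~0 tends to read as an unsigned idiom.
static_assert(~0u == std::numeric_limits<unsigned>::max(), "all bits set");

// Hypothetical helper: true once a real frame index has replaced the sentinel.
constexpr bool hasFrameIndex(int FI) { return FI != kNoFrameIndex; }
```

    Spelling the sentinel as -1 keeps the later comparison against it obvious to a reader who knows the variable is signed.<br>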

    <blockquote cite="mid:20141202214751.GA4494@freedesktop.org"

      type="cite">

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">

+

+  if (ScratchPtrReg != AMDGPU::NoRegister) {

+    // Found a SGPR to use.</pre>

      </div>

    </blockquote>

    Grammar: an SGPR<br>

    <blockquote cite="mid:20141202214751.GA4494@freedesktop.org"

      type="cite">

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">

+    MRI.setPhysRegUsed(ScratchPtrReg);

+    BuildMI(*Entry, I, DL, TII->get(AMDGPU::S_MOV_B64), ScratchPtrReg)

+            .addReg(ScratchPtrPreloadReg);

+  } else {

+    // No SGPR is available, we must spill.

+    ScratchPtrFI = FrameInfo->CreateSpillStackObject(8, 4);

+    BuildMI(*Entry, I, DL, TII->get(AMDGPU::SI_SPILL_S64_SAVE))

+            .addReg(ScratchPtrPreloadReg)

+            .addFrameIndex(ScratchPtrFI);

+  }

+

+  // load the scratch offset</pre>

      </div>

    </blockquote>

    Capitalize the comment and end it with a period.<br>

    <blockquote cite="mid:20141202214751.GA4494@freedesktop.org"

      type="cite">

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">

+  unsigned ScratchOffsetReg =

+      TRI->findUnusedRegister(MRI, &AMDGPU::SGPR_32RegClass);

+  int ScratchOffsetFI = ~0;

+

+  if (ScratchOffsetReg != AMDGPU::NoRegister) {

+    // Found an SGPR to use

+    MRI.setPhysRegUsed(ScratchOffsetReg);

+    BuildMI(*Entry, I, DL, TII->get(AMDGPU::S_MOV_B32), ScratchOffsetReg)

+            .addReg(ScratchOffsetPreloadReg);

+  } else {

+    // No SGPR is available, we must spill.

+    ScratchOffsetFI = FrameInfo->CreateSpillStackObject(4,4);

+    BuildMI(*Entry, I, DL, TII->get(AMDGPU::SI_SPILL_S32_SAVE))

+            .addReg(ScratchOffsetPreloadReg)

+            .addFrameIndex(ScratchOffsetFI);

+  }

+

+

+  // Now that we have the scratch pointer and offset values, we need to

+  // add them to all the SI_SPILL_V* instructions.

+

+  RegScavenger RS;

+  bool UseRegScavenger =

+      (ScratchPtrReg == AMDGPU::NoRegister ||

+      ScratchOffsetReg == AMDGPU::NoRegister);

+  for (MachineFunction::iterator BI = MF.begin(), BE = MF.end();

+       BI != BE; ++BI) {

+

+    MachineBasicBlock &MBB = *BI;

+    if (UseRegScavenger)

+      RS.enterBasicBlock(&MBB);

+

+    for (MachineBasicBlock::iterator I = MBB.begin(), E = MBB.end();

+         I != E; ++I) {

+      MachineInstr &MI = *I;

+      DebugLoc DL = MI.getDebugLoc();

+      switch(MI.getOpcode()) {

+        default: break;;

+        case AMDGPU::SI_SPILL_V512_SAVE:

+        case AMDGPU::SI_SPILL_V256_SAVE:

+        case AMDGPU::SI_SPILL_V128_SAVE:

+        case AMDGPU::SI_SPILL_V96_SAVE:

+        case AMDGPU::SI_SPILL_V64_SAVE:

+        case AMDGPU::SI_SPILL_V32_SAVE:

+        case AMDGPU::SI_SPILL_V32_RESTORE:

+        case AMDGPU::SI_SPILL_V64_RESTORE:

+        case AMDGPU::SI_SPILL_V128_RESTORE:

+        case AMDGPU::SI_SPILL_V256_RESTORE:

+        case AMDGPU::SI_SPILL_V512_RESTORE:

+

+          // Scratch Pointer

+          if (ScratchPtrReg == AMDGPU::NoRegister) {

+            ScratchPtrReg = RS.scavengeRegister(&AMDGPU::SGPR_64RegClass, 0);

+            BuildMI(MBB, I, DL, TII->get(AMDGPU::SI_SPILL_S64_RESTORE),

+                    ScratchPtrReg)

+                    .addFrameIndex(ScratchPtrFI)

+                    .addReg(AMDGPU::NoRegister)

+                    .addReg(AMDGPU::NoRegister);

+          } else if (!MBB.isLiveIn(ScratchPtrReg)) {

+            MBB.addLiveIn(ScratchPtrReg);

+          }

+

+          if (ScratchOffsetReg == AMDGPU::NoRegister) {

+            ScratchOffsetReg = RS.scavengeRegister(&AMDGPU::SGPR_32RegClass, 0);

+            BuildMI(MBB, I, DL, TII->get(AMDGPU::SI_SPILL_S32_RESTORE),

+                    ScratchOffsetReg)

+                    .addFrameIndex(ScratchOffsetFI)

+                    .addReg(AMDGPU::NoRegister)

+                    .addReg(AMDGPU::NoRegister);

+          } else if (!MBB.isLiveIn(ScratchOffsetReg)) {

+            MBB.addLiveIn(ScratchOffsetReg);

+          }

+

+          if (ScratchPtrReg == AMDGPU::NoRegister ||

+              ScratchOffsetReg == AMDGPU::NoRegister) {

+            LLVMContext &Ctx = MF.getFunction()->getContext();

+            Ctx.emitError("Ran out of SGPRs for spilling VGPRs");

+            ScratchPtrReg = AMDGPU::SGPR0;

+            ScratchOffsetReg = AMDGPU::SGPR0;

+          }

+          MI.getOperand(2).setReg(ScratchPtrReg);

+          MI.getOperand(3).setReg(ScratchOffsetReg);

+

+          break;

+      }

+      if (UseRegScavenger)

+        RS.forward();

+    }

+  }

+  return true;

+}

diff --git a/lib/Target/R600/SIRegisterInfo.cpp b/lib/Target/R600/SIRegisterInfo.cpp

index cffea12..27abe9a 100644

--- a/lib/Target/R600/SIRegisterInfo.cpp

+++ b/lib/Target/R600/SIRegisterInfo.cpp

@@ -23,6 +23,7 @@

 #include "llvm/IR/Function.h"

 #include "llvm/IR/LLVMContext.h"

+#include "llvm/Support/Debug.h"

 using namespace llvm;

 SIRegisterInfo::SIRegisterInfo(const AMDGPUSubtarget &st)

@@ -92,6 +93,84 @@ static unsigned getNumSubRegsForSpillOp(unsigned Op) {

   }

 }

+void SIRegisterInfo::buildScratchLoadStore(MachineBasicBlock::iterator MI,

+                                           unsigned LoadStoreOp,

+                                           unsigned Value,

+                                           unsigned ScratchPtr,

+                                           unsigned ScratchOffset,

+                                           int64_t Offset,

+                                           RegScavenger *RS) const {

+

+  const SIInstrInfo *TII = static_cast<const SIInstrInfo*>(ST.getInstrInfo());

+  MachineBasicBlock *MBB = MI->getParent();

+  const MachineFunction *MF = MI->getParent()->getParent();

+  LLVMContext &Ctx = MF->getFunction()->getContext();

+  DebugLoc DL = MI->getDebugLoc();

+  bool IsLoad = TII->get(LoadStoreOp).mayLoad();

+

+  bool RanOutOfSGPRs = false;

+  unsigned SOffset = ScratchOffset;

+

+  unsigned RsrcReg = RS->scavengeRegister(&AMDGPU::SReg_128RegClass, MI, 0);

+  if (RsrcReg == AMDGPU::NoRegister) {

+    RanOutOfSGPRs = true;

+    RsrcReg = AMDGPU::SGPR0_SGPR1_SGPR2_SGPR3;

+  }

+

+  unsigned NumSubRegs = getNumSubRegsForSpillOp(MI->getOpcode());

+  unsigned Size = NumSubRegs * 4;

+

+  uint64_t Rsrc = AMDGPU::RSRC_DATA_FORMAT | AMDGPU::RSRC_TID_ENABLE |

+                  0xffffffff; // Size

+

+  BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_MOV_B64),

+          getSubReg(RsrcReg, AMDGPU::sub0_sub1))

+          .addReg(ScratchPtr)

+          .addReg(RsrcReg, RegState::ImplicitDefine);

+

+  BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_MOV_B32),

+          getSubReg(RsrcReg, AMDGPU::sub2))

+          .addImm(Rsrc & 0xffffffff)

+          .addReg(RsrcReg, RegState::ImplicitDefine);

+

+  BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_MOV_B32),

+          getSubReg(RsrcReg, AMDGPU::sub3))

+          .addImm(Rsrc >> 32)

+          .addReg(RsrcReg, RegState::ImplicitDefine);

+

+  if (!isUInt<12>(Offset + Size)) {

+    SOffset = RS->scavengeRegister(&AMDGPU::SGPR_32RegClass, MI, 0);

+    if (SOffset == AMDGPU::NoRegister) {

+      RanOutOfSGPRs = true;

+      SOffset = AMDGPU::SGPR0;

+    }

+    BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_ADD_U32), SOffset)

+            .addReg(ScratchOffset)

+            .addImm(Offset);

+    Offset = 0;

+  }

+

+  if (RanOutOfSGPRs)

+    Ctx.emitError("Ran out of SGPRs for spilling VGPRS");</pre>

      </div>

    </blockquote>

    Error messages should start with a lowercase letter.<br>

    <blockquote cite="mid:20141202214751.GA4494@freedesktop.org"

      type="cite">

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">

+

+  for (unsigned i = 0, e = NumSubRegs; i != e; ++i, Offset += 4) {

+    unsigned SubReg = NumSubRegs > 1 ?

+        getPhysRegSubReg(Value, &AMDGPU::VGPR_32RegClass, i) :

+        Value;

+    bool IsKill = (i == e - 1);

+

+    BuildMI(*MBB, MI, DL, TII->get(LoadStoreOp))

+            .addReg(SubReg, getDefRegState(IsLoad))

+            .addReg(RsrcReg, getKillRegState(IsKill))

+            .addImm(Offset)

+            .addReg(SOffset, getKillRegState(IsKill))

+            .addImm(0) // glc

+            .addImm(0) // slc

+            .addImm(0) // tfe

+            .addReg(Value, RegState::Implicit | getDefRegState(IsLoad));

+  }

+}

+

 void SIRegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator MI,

                                         int SPAdj, unsigned FIOperandNum,

                                         RegScavenger *RS) const {

@@ -160,7 +239,8 @@ void SIRegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator MI,

         BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_READLANE_B32), SubReg)

                 .addReg(Spill.VGPR)

-                .addImm(Spill.Lane);

+                .addImm(Spill.Lane)

+                .addReg(MI->getOperand(0).getReg(), RegState::ImplicitDefine);

         if (isM0) {

           BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_MOV_B32), AMDGPU::M0)

                   .addReg(SubReg);

@@ -177,71 +257,24 @@ void SIRegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator MI,

     case AMDGPU::SI_SPILL_V128_SAVE:

     case AMDGPU::SI_SPILL_V96_SAVE:

     case AMDGPU::SI_SPILL_V64_SAVE:

-    case AMDGPU::SI_SPILL_V32_SAVE: {

-      unsigned NumSubRegs = getNumSubRegsForSpillOp(MI->getOpcode());

-      unsigned SrcReg = MI->getOperand(0).getReg();

-      int64_t Offset = FrameInfo->getObjectOffset(Index);

-      unsigned Size = NumSubRegs * 4;

-      unsigned TmpReg = RS->scavengeRegister(&AMDGPU::VGPR_32RegClass, MI, 0);

-

-      for (unsigned i = 0, e = NumSubRegs; i != e; ++i) {

-        unsigned SubReg = NumSubRegs > 1 ?

-            getPhysRegSubReg(SrcReg, &AMDGPU::VGPR_32RegClass, i) :

-            SrcReg;

-        Offset += (i * 4);

-        MFI->LDSWaveSpillSize = std::max((unsigned)Offset + 4, (unsigned)MFI->LDSWaveSpillSize);

-

-        unsigned AddrReg = TII->calculateLDSSpillAddress(*MBB, MI, RS, TmpReg,

-                                                         Offset, Size);

-

-        if (AddrReg == AMDGPU::NoRegister) {

-           LLVMContext &Ctx = MF->getFunction()->getContext();

-           Ctx.emitError("Ran out of VGPRs for spilling VGPRS");

-           AddrReg = AMDGPU::VGPR0;

-        }

-

-        // Store the value in LDS

-        BuildMI(*MBB, MI, DL, TII->get(AMDGPU::DS_WRITE_B32))

-                .addImm(0) // gds

-                .addReg(AddrReg, RegState::Kill) // addr

-                .addReg(SubReg) // data0

-                .addImm(0); // offset

-      }

-

+    case AMDGPU::SI_SPILL_V32_SAVE:

+      buildScratchLoadStore(MI, AMDGPU::BUFFER_STORE_DWORD_OFFSET,

+                            MI->getOperand(0).getReg(),

+                            MI->getOperand(2).getReg(),

+                            MI->getOperand(3).getReg(),

+                            FrameInfo->getObjectOffset(Index), RS);

       MI->eraseFromParent();

       break;

-    }

     case AMDGPU::SI_SPILL_V32_RESTORE:

     case AMDGPU::SI_SPILL_V64_RESTORE:

     case AMDGPU::SI_SPILL_V128_RESTORE:

     case AMDGPU::SI_SPILL_V256_RESTORE:

     case AMDGPU::SI_SPILL_V512_RESTORE: {

-      unsigned NumSubRegs = getNumSubRegsForSpillOp(MI->getOpcode());

-      unsigned DstReg = MI->getOperand(0).getReg();

-      int64_t Offset = FrameInfo->getObjectOffset(Index);

-      unsigned Size = NumSubRegs * 4;

-      unsigned TmpReg = RS->scavengeRegister(&AMDGPU::VGPR_32RegClass, MI, 0);

-

-      // FIXME: We could use DS_READ_B64 here to optimize for larger registers.

-      for (unsigned i = 0, e = NumSubRegs; i != e; ++i) {

-        unsigned SubReg = NumSubRegs > 1 ?

-            getPhysRegSubReg(DstReg, &AMDGPU::VGPR_32RegClass, i) :

-            DstReg;

-

-        Offset += (i * 4);

-        unsigned AddrReg = TII->calculateLDSSpillAddress(*MBB, MI, RS, TmpReg,

-                                                          Offset, Size);

-        if (AddrReg == AMDGPU::NoRegister) {

-           LLVMContext &Ctx = MF->getFunction()->getContext();

-           Ctx.emitError("Ran out of VGPRs for spilling VGPRs");

-           AddrReg = AMDGPU::VGPR0;

-        }

-

-        BuildMI(*MBB, MI, DL, TII->get(AMDGPU::DS_READ_B32), SubReg)

-                .addImm(0) // gds

-                .addReg(AddrReg, RegState::Kill) // addr

-                .addImm(0); //offset

-      }

+      buildScratchLoadStore(MI, AMDGPU::BUFFER_LOAD_DWORD_OFFSET,

+                            MI->getOperand(0).getReg(),

+                            MI->getOperand(2).getReg(),

+                            MI->getOperand(3).getReg(),

+                            FrameInfo->getObjectOffset(Index), RS);</pre>

      </div>

    </blockquote>

    Use named operands here instead of hard-coded operand indices?<br>
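
    A minimal, self-contained sketch of the idea, not the in-tree API: the enum, the struct and both helpers below are hypothetical stand-ins, and the operand order (source register, frame index, scratch pointer, scratch offset) is only assumed from the indices 0, 2 and 3 used in the quoted call. The point is that a call site names the operand it wants rather than repeating a literal index.<br>

    <pre wrap="">
```cpp
#include <cassert>

// Hypothetical operand names for a spill pseudo-instruction. In a real
// backend these would be derived from the instruction definition.
enum class OpName { Src, FrameIdx, ScratchPtr, ScratchOffset };

// Hypothetical stand-in for a machine instruction with four register or
// immediate operands, stored as plain integers.
struct SpillPseudo {
  unsigned Operands[4];
};

// Maps an operand name to its position; returns -1 for an unknown name.
// The order used here is an assumption made for this sketch.
int getNamedOperandIdx(OpName Name) {
  switch (Name) {
  case OpName::Src:           return 0;
  case OpName::FrameIdx:      return 1;
  case OpName::ScratchPtr:    return 2;
  case OpName::ScratchOffset: return 3;
  }
  return -1;
}

// Fetches an operand by name so that call sites carry no literal indices.
unsigned getNamedOperand(const SpillPseudo &MI, OpName Name) {
  int Idx = getNamedOperandIdx(Name);
  assert(Idx >= 0 && "instruction has no such operand");
  return MI.Operands[Idx];
}
```
</pre>

    With a lookup like this, a change in operand order only has to be reflected in one place.<br>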

    <blockquote cite="mid:20141202214751.GA4494@freedesktop.org"

      type="cite">

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">

       MI->eraseFromParent();

       break;

     }

@@ -452,9 +485,8 @@ unsigned SIRegisterInfo::getPreloadedValue(const MachineFunction &MF,

 /// \brief Returns a register that is not used at any point in the function.

 ///        If all registers are used, then this function will return

 //         AMDGPU::NoRegister.

-unsigned SIRegisterInfo::findUnusedVGPR(const MachineRegisterInfo &MRI) const {

-

-  const TargetRegisterClass *RC = &AMDGPU::VGPR_32RegClass;

+unsigned SIRegisterInfo::findUnusedRegister(const MachineRegisterInfo &MRI,

+                                           const TargetRegisterClass *RC) const {

   for (TargetRegisterClass::iterator I = RC->begin(), E = RC->end();

        I != E; ++I) {

diff --git a/lib/Target/R600/SIRegisterInfo.h b/lib/Target/R600/SIRegisterInfo.h

index c7e54db..f1d78b4 100644

--- a/lib/Target/R600/SIRegisterInfo.h

+++ b/lib/Target/R600/SIRegisterInfo.h

@@ -113,7 +113,14 @@ struct SIRegisterInfo : public AMDGPURegisterInfo {

   unsigned getPreloadedValue(const MachineFunction &MF,

                              enum PreloadedValue Value) const;

-  unsigned findUnusedVGPR(const MachineRegisterInfo &MRI) const;

+  unsigned findUnusedRegister(const MachineRegisterInfo &MRI,

+                              const TargetRegisterClass *RC) const;

+

+private:

+  void buildScratchLoadStore(MachineBasicBlock::iterator MI,

+                             unsigned LoadStoreOp, unsigned Value,

+                             unsigned ScratchPtr, unsigned ScratchOffset,

+                             int64_t Offset, RegScavenger *RS) const;

 };

 } // End namespace llvm

<div class="moz-txt-sig">-- 

2.0.4

</div></pre>

      </div>

      <br>

      <fieldset class="mimeAttachmentHeader"><legend

          class="mimeAttachmentHeaderName">0005-MISched-Fix-moving-stores-across-barriers.patch</legend></fieldset>

      <br>

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">From c151bc73cdf7e1f40ec90c2d8dbc93ae34673890 Mon Sep 17 00:00:00 2001

From: Tom Stellard <a moz-do-not-send="true" class="moz-txt-link-rfc2396E" href="mailto:thomas.stellard@amd.com"><thomas.stellard@amd.com></a>

Date: Tue, 2 Dec 2014 17:26:08 +0000

Subject: [PATCH 5/6] MISched: Fix moving stores across barriers

This fixes an issue with ScheduleDAGInstrs::buildSchedGraph

where stores without an underlying object would not be added

as a predecessor to the current BarrierChain.

---

 lib/CodeGen/ScheduleDAGInstrs.cpp  |  9 +++++--

 test/CodeGen/R600/store-barrier.ll | 52 ++++++++++++++++++++++++++++++++++++++

 2 files changed, 59 insertions(+), 2 deletions(-)

 create mode 100644 test/CodeGen/R600/store-barrier.ll

diff --git a/lib/CodeGen/ScheduleDAGInstrs.cpp b/lib/CodeGen/ScheduleDAGInstrs.cpp

index d8d8422..ee8b5c2 100644

--- a/lib/CodeGen/ScheduleDAGInstrs.cpp

+++ b/lib/CodeGen/ScheduleDAGInstrs.cpp

@@ -794,6 +794,7 @@ void ScheduleDAGInstrs::buildSchedGraph(AliasAnalysis *AA,

   for (MachineBasicBlock::iterator MII = RegionEnd, MIE = RegionBegin;

        MII != MIE; --MII) {

     MachineInstr *MI = std::prev(MII);

+

     if (MI && DbgMI) {

       DbgValues.push_back(std::make_pair(DbgMI, MI));

       DbgMI = nullptr;

@@ -920,6 +921,12 @@ void ScheduleDAGInstrs::buildSchedGraph(AliasAnalysis *AA,

       AliasMemDefs.clear();

       AliasMemUses.clear();

     } else if (MI->mayStore()) {

+      // Add dependence on barrier chain, if needed.

+      // There is no point to check aliasing on barrier event. Even if

+      // SU and barrier _could_ be reordered, they should not. In addition,

+      // we have lost all RejectMemNodes below barrier.

+      if (BarrierChain)

+        BarrierChain->addPred(SDep(SU, SDep::Barrier));

       UnderlyingObjectsVector Objs;

       getUnderlyingObjectsForInstr(MI, MFI, Objs);

@@ -993,8 +1000,6 @@ void ScheduleDAGInstrs::buildSchedGraph(AliasAnalysis *AA,

       // There is no point to check aliasing on barrier event. Even if

       // SU and barrier _could_ be reordered, they should not. In addition,

       // we have lost all RejectMemNodes below barrier.

-      if (BarrierChain)

-        BarrierChain->addPred(SDep(SU, SDep::Barrier));

     } else if (MI->mayLoad()) {

       bool MayAlias = true;

       if (MI->isInvariantLoad(AA)) {

diff --git a/test/CodeGen/R600/store-barrier.ll b/test/CodeGen/R600/store-barrier.ll

new file mode 100644

index 0000000..229cd8f

--- /dev/null

+++ b/test/CodeGen/R600/store-barrier.ll

@@ -0,0 +1,52 @@

+; RUN: llc -march=r600 -mcpu=SI -verify-machineinstrs -mattr=+load-store-opt -enable-misched < %s | FileCheck  --check-prefix=CHECK %s

+; RUN: llc -march=r600 -mcpu=bonaire -verify-machineinstrs -mattr=+load-store-opt -enable-misched < %s | FileCheck  --check-prefix=CHECK %s

+

+; This test is for a bug in the machine scheduler where stores without

+; an underlying object would be moved across the barrier.  In this

+; test, the <2 x i8> store will be split into two i8 stores, so they

+; won't have an underlying object.

+

+; CHECK-LABEL: {{^}}test:

+; CHECK: ds_write_b8

+; CHECK: ds_write_b8

+; CHECK: s_barrier</pre>

      </div>

    </blockquote>

    The test should probably also check something after the barrier.<br>

    <br>

    <blockquote cite="mid:20141202214751.GA4494@freedesktop.org"

      type="cite">

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">

+; Function Attrs: nounwind

+define void @test(<2 x i8> addrspace(3)* nocapture %arg, <2 x i8> addrspace(1)* nocapture readonly %arg1, i32 addrspace(1)* nocapture readonly %arg2, <2 x i8> addrspace(1)* nocapture %arg3, i32 %arg4, i64 %tmp9) #0 {

+bb:

+  %tmp10 = getelementptr inbounds i32 addrspace(1)* %arg2, i64 %tmp9

+  %tmp13 = load i32 addrspace(1)* %tmp10, align 2

+  %tmp14 = getelementptr inbounds <2 x i8> addrspace(3)* %arg, i32 %tmp13

+  %tmp15 = load <2 x i8> addrspace(3)* %tmp14, align 2

+  %tmp16 = add i32 %tmp13, 1

+  %tmp17 = getelementptr inbounds <2 x i8> addrspace(3)* %arg, i32 %tmp16

+  store <2 x i8> %tmp15, <2 x i8> addrspace(3)* %tmp17, align 2

+  tail call void @llvm.AMDGPU.barrier.local() #2

+  %tmp25 = load i32 addrspace(1)* %tmp10, align 4

+  %tmp26 = sext i32 %tmp25 to i64

+  %tmp27 = sext i32 %arg4 to i64

+  %tmp28 = getelementptr inbounds <2 x i8> addrspace(3)* %arg, i32 %tmp25, i32 %arg4

+  %tmp29 = load i8 addrspace(3)* %tmp28, align 1

+  %tmp30 = getelementptr inbounds <2 x i8> addrspace(1)* %arg3, i64 %tmp26, i64 %tmp27

+  store i8 %tmp29, i8 addrspace(1)* %tmp30, align 1

+  %tmp32 = getelementptr inbounds <2 x i8> addrspace(3)* %arg, i32 %tmp25, i32 0

+  %tmp33 = load i8 addrspace(3)* %tmp32, align 1

+  %tmp35 = getelementptr inbounds <2 x i8> addrspace(1)* %arg3, i64 %tmp26, i64 0

+  store i8 %tmp33, i8 addrspace(1)* %tmp35, align 1

+  ret void

+}

+

+; Function Attrs: noduplicate nounwind

+declare void @llvm.AMDGPU.barrier.local() #2

+

+attributes #0 = { nounwind "less-precise-fpmad"="false" "no-frame-pointer-elim"="false" "no-infs-fp-math"="false" "no-nans-fp-math"="false" "no-realign-stack" "stack-protector-buffer-size"="8" "unsafe-fp-math"="false" "use-soft-float"="false" }

+attributes #1 = { nounwind readnone }

+attributes #2 = { noduplicate nounwind }

+

+!opencl.kernels = !{!0}

+

+!0 = metadata !{void (<2 x i8> addrspace(3)*, <2 x i8> addrspace(1)*, i32 addrspace(1)*, <2 x i8> addrspace(1)*, i32, i64)* @test}

+!3 = metadata !{metadata !4, metadata !4, i64 0}

+!4 = metadata !{metadata !"int", metadata !5, i64 0}

+!5 = metadata !{metadata !"omnipotent char", metadata !6, i64 0}

+!6 = metadata !{metadata !"Simple C/C++ TBAA"}

+!7 = metadata !{metadata !5, metadata !5, i64 0}

<div class="moz-txt-sig">-- 

2.0.4

</div></pre>

      </div>

      <br>

      <fieldset class="mimeAttachmentHeader"><legend

          class="mimeAttachmentHeaderName">0006-R600-SI-Define-a-schedule-model-and-enable-the-gener.patch</legend></fieldset>

      <br>

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">From db8ef9632e44dc76668758fd83981057e3bcfac1 Mon Sep 17 00:00:00 2001

From: Tom Stellard <a moz-do-not-send="true" class="moz-txt-link-rfc2396E" href="mailto:thomas.stellard@amd.com"><thomas.stellard@amd.com></a>

Date: Fri, 19 Jul 2013 11:50:00 -0700

Subject: [PATCH 6/6] R600/SI: Define a schedule model and enable the generic

 machine scheduler

The schedule model is not complete yet, and could be improved.

---

 lib/Target/R600/AMDGPUSubtarget.cpp              | 14 ++++-

 lib/Target/R600/AMDGPUSubtarget.h                |  6 +-

 lib/Target/R600/Processors.td                    | 24 ++++----

 lib/Target/R600/SIInstrFormats.td                | 10 +++-

 lib/Target/R600/SIInstructions.td                | 48 ++++++++++++++-

 lib/Target/R600/SIRegisterInfo.cpp               | 54 ++++++++++++++++-

 lib/Target/R600/SIRegisterInfo.h                 | 12 +++-

 lib/Target/R600/SISchedule.td                    | 76 +++++++++++++++++++++++-

 test/CodeGen/R600/atomic_cmp_swap_local.ll       |  6 +-

 test/CodeGen/R600/ctpop.ll                       |  4 +-

 test/CodeGen/R600/ds_read2st64.ll                |  4 +-

 test/CodeGen/R600/fceil64.ll                     |  4 +-

 test/CodeGen/R600/ffloor.ll                      |  4 +-

 test/CodeGen/R600/fmax3.ll                       |  6 +-

 test/CodeGen/R600/fmin3.ll                       |  6 +-

 test/CodeGen/R600/fneg-fabs.f64.ll               |  2 +-

 test/CodeGen/R600/ftrunc.f64.ll                  |  4 +-

 test/CodeGen/R600/llvm.memcpy.ll                 | 34 +++++------

 test/CodeGen/R600/local-atomics.ll               |  4 +-

 test/CodeGen/R600/local-atomics64.ll             |  2 +-

 test/CodeGen/R600/local-memory-two-objects.ll    |  4 +-

 test/CodeGen/R600/si-triv-disjoint-mem-access.ll |  2 +-

 test/CodeGen/R600/smrd.ll                        | 10 ++--

 test/CodeGen/R600/wait.ll                        |  5 +-

 test/CodeGen/R600/zero_extend.ll                 |  2 +-

 25 files changed, 271 insertions(+), 76 deletions(-)

diff --git a/lib/Target/R600/AMDGPUSubtarget.cpp b/lib/Target/R600/AMDGPUSubtarget.cpp

index 9d09a19..5a3785f 100644

--- a/lib/Target/R600/AMDGPUSubtarget.cpp

+++ b/lib/Target/R600/AMDGPUSubtarget.cpp

@@ -19,8 +19,7 @@

 #include "SIInstrInfo.h"

 #include "SIISelLowering.h"

 #include "llvm/ADT/SmallString.h"

-

-#include "llvm/ADT/SmallString.h"

+#include "llvm/CodeGen/MachineScheduler.h"

 using namespace llvm;

@@ -107,3 +106,14 @@ unsigned AMDGPUSubtarget::getStackEntrySize() const {

     llvm_unreachable("Illegal wavefront size.");

   }

 }

+

+void AMDGPUSubtarget::overrideSchedPolicy(MachineSchedPolicy &Policy,

+                                          MachineInstr *begin,

+                                          MachineInstr *end,

+                                          unsigned NumRegionInstrs) const {

+  if (getGeneration() >= SOUTHERN_ISLANDS) {

+    Policy.ShouldTrackPressure = true;;</pre>

      </div>

    </blockquote>

    Stray double semicolon (;;).<br>

    <blockquote cite="mid:20141202214751.GA4494@freedesktop.org"

      type="cite">

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">

+    Policy.OnlyTopDown = false;

+    Policy.OnlyBottomUp = false;</pre>

      </div>

    </blockquote>

    Is there a reason for selecting this policy? There should be a

    comment explaining why it is set this way.<br>

    <br>

    <blockquote cite="mid:20141202214751.GA4494@freedesktop.org"

      type="cite">

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">

+  }

+}

diff --git a/lib/Target/R600/AMDGPUSubtarget.h b/lib/Target/R600/AMDGPUSubtarget.h

index f71d80a..3e44c66 100644

--- a/lib/Target/R600/AMDGPUSubtarget.h

+++ b/lib/Target/R600/AMDGPUSubtarget.h

@@ -199,9 +199,13 @@ public:

   }

   bool enableMachineScheduler() const override {

-    return getGeneration() <= NORTHERN_ISLANDS;

+    return true;

   }

+  void overrideSchedPolicy(MachineSchedPolicy &Policy,

+                           MachineInstr *begin, MachineInstr *end,

+                           unsigned NumRegionInstrs) const override;

+

   // Helper functions to simplify if statements

   bool isTargetELF() const {

     return false;

diff --git a/lib/Target/R600/Processors.td b/lib/Target/R600/Processors.td

index ce17d7c..17422f9 100644

--- a/lib/Target/R600/Processors.td

+++ b/lib/Target/R600/Processors.td

@@ -83,28 +83,30 @@ def : Proc<"cayman",     R600_VLIW4_Itin,

 // Southern Islands

 //===----------------------------------------------------------------------===//

-def : Proc<"SI",         SI_Itin, [FeatureSouthernIslands]>;

+// FIXME: Which of these should use the half speed?</pre>

      </div>

    </blockquote>

    I believe this can also differ between versions of Tahiti. It

    should probably be a subtarget feature that the driver can

    set.<br>

    <br>

    <blockquote cite="mid:20141202214751.GA4494@freedesktop.org"

      type="cite">

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">

-def : Proc<"tahiti",     SI_Itin, [FeatureSouthernIslands]>;

+def : ProcessorModel<"SI",         SIFullSpeedModel, [FeatureSouthernIslands]>;

-def : Proc<"pitcairn",   SI_Itin, [FeatureSouthernIslands]>;

+def : ProcessorModel<"tahiti",     SIFullSpeedModel, [FeatureSouthernIslands]>;

-def : Proc<"verde",      SI_Itin, [FeatureSouthernIslands]>;

+def : ProcessorModel<"pitcairn",   SIFullSpeedModel, [FeatureSouthernIslands]>;</pre>

      </div>

    </blockquote>

    verde, bonaire and pitcairn are all 1/16th FP64<br>

    <blockquote cite="mid:20141202214751.GA4494@freedesktop.org"

      type="cite">

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">

-def : Proc<"oland",      SI_Itin, [FeatureSouthernIslands]>;

+def : ProcessorModel<"verde",      SIFullSpeedModel, [FeatureSouthernIslands]>;

-def : Proc<"hainan",     SI_Itin, [FeatureSouthernIslands]>;

+def : ProcessorModel<"oland",      SIFullSpeedModel, [FeatureSouthernIslands]>;

+

+def : ProcessorModel<"hainan",     SIFullSpeedModel, [FeatureSouthernIslands]>;

 //===----------------------------------------------------------------------===//

 // Sea Islands

 //===----------------------------------------------------------------------===//

-def : Proc<"bonaire",    SI_Itin, [FeatureSeaIslands]>;

+def : ProcessorModel<"bonaire",    SIFullSpeedModel, [FeatureSeaIslands]>;

-def : Proc<"kabini",     SI_Itin, [FeatureSeaIslands]>;

+def : ProcessorModel<"kabini",     SIFullSpeedModel, [FeatureSeaIslands]>;

-def : Proc<"kaveri",     SI_Itin, [FeatureSeaIslands]>;

+def : ProcessorModel<"kaveri",     SIFullSpeedModel, [FeatureSeaIslands]>;

-def : Proc<"hawaii",     SI_Itin, [FeatureSeaIslands]>;

+def : ProcessorModel<"hawaii",     SIFullSpeedModel, [FeatureSeaIslands]>;</pre>

      </div>

    </blockquote>

    Hawaii is 1/8th FP64, but 1/4th on the workstation versions<br>

    <blockquote cite="mid:20141202214751.GA4494@freedesktop.org"

      type="cite">

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">

-def : Proc<"mullins",    SI_Itin, [FeatureSeaIslands]>;

+def : ProcessorModel<"mullins",    SIFullSpeedModel, [FeatureSeaIslands]>;

diff --git a/lib/Target/R600/SIInstrFormats.td b/lib/Target/R600/SIInstrFormats.td

index ee1a52b..4b688e0 100644

--- a/lib/Target/R600/SIInstrFormats.td

+++ b/lib/Target/R600/SIInstrFormats.td

@@ -46,6 +46,7 @@ class InstSI <dag outs, dag ins, string asm, list<dag> pattern> :

   // Most instructions require adjustments after selection to satisfy

   // operand requirements.

   let hasPostISelHook = 1;

+  let SchedRW = [Write32Bit];

 }

 class Enc32 {

@@ -161,6 +162,8 @@ class SMRDe <bits<5> op, bits<1> imm> : Enc32 {

   let Inst{31-27} = 0x18; //encoding

 }

+let SchedRW = [WriteSALU] in {

+

 class SOP1 <bits<8> op, dag outs, dag ins, string asm, list<dag> pattern> :

     InstSI<outs, ins, asm, pattern>, SOP1e <op> {

@@ -216,6 +219,8 @@ class SOPP <bits<7> op, dag ins, string asm, list<dag> pattern = []> :

   let UseNamedOperandTable = 1;

 }

+} // let SchedRW = [WriteSALU]

+

 class SMRD <dag outs, dag ins, string asm, list<dag> pattern> :

     InstSI<outs, ins, asm, pattern> {

@@ -225,6 +230,7 @@ class SMRD <dag outs, dag ins, string asm, list<dag> pattern> :

   let mayLoad = 1;

   let hasSideEffects = 0;

   let UseNamedOperandTable = 1;

+  let SchedRW = [WriteSMEM];

 }

 //===----------------------------------------------------------------------===//

@@ -547,6 +553,7 @@ class DS <bits<8> op, dag outs, dag ins, string asm, list<dag> pattern> :

   let LGKM_CNT = 1;

   let UseNamedOperandTable = 1;

   let DisableEncoding = "$m0";

+  let SchedRW = [WriteLDS];

 }

 class MUBUF <bits<7> op, dag outs, dag ins, string asm, list<dag> pattern> :

@@ -558,6 +565,7 @@ class MUBUF <bits<7> op, dag outs, dag ins, string asm, list<dag> pattern> :

   let hasSideEffects = 0;

   let UseNamedOperandTable = 1;

+  let SchedRW = [WriteVMEM];

 }

 class MTBUF <dag outs, dag ins, string asm, list<dag> pattern> :

@@ -569,6 +577,7 @@ class MTBUF <dag outs, dag ins, string asm, list<dag> pattern> :

   let neverHasSideEffects = 1;

   let UseNamedOperandTable = 1;

+  let SchedRW = [WriteVMEM];

 }

 class FLAT <bits<7> op, dag outs, dag ins, string asm, list<dag> pattern> :

@@ -597,5 +606,4 @@ class MIMG <bits<7> op, dag outs, dag ins, string asm, list<dag> pattern> :

 }

-

 } // End Uses = [EXEC]

diff --git a/lib/Target/R600/SIInstructions.td b/lib/Target/R600/SIInstructions.td

index 3a969e7..fd860e5 100644

--- a/lib/Target/R600/SIInstructions.td

+++ b/lib/Target/R600/SIInstructions.td

@@ -1160,6 +1160,8 @@ defm V_MOV_B32 : VOP1Inst <vop1<0x1>, "v_mov_b32", VOP_I32_I32>;

 let Uses = [EXEC] in {

+// FIXME: Specify SchedRW for READFIRSTLANE+B32</pre>

      </div>

    </blockquote>

    Typo in the comment: the instruction name has a + where it should have an _ (READFIRSTLANE_B32).<br>

    <blockquote cite="mid:20141202214751.GA4494@freedesktop.org"

      type="cite">

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">

+

 def V_READFIRSTLANE_B32 : VOP1 <

   0x00000002,

   (outs SReg_32:$vdst),

@@ -1170,6 +1172,8 @@ def V_READFIRSTLANE_B32 : VOP1 <

 }

+let SchedRW = [WriteConversion] in {

+

 defm V_CVT_I32_F64 : VOP1Inst <vop1<0x3>, "v_cvt_i32_f64",

   VOP_I32_F64, fp_to_sint

 >;

@@ -1223,6 +1227,8 @@ defm V_CVT_F64_U32 : VOP1Inst <vop1<0x16>, "v_cvt_f64_u32",

   VOP_F64_I32, uint_to_fp

 >;

+} // let SchedRW = [WriteConversion]

+

 defm V_FRACT_F32 : VOP1Inst <vop1<0x20>, "v_fract_f32",

   VOP_F32_F32, AMDGPUfract

 >;

@@ -1241,6 +1247,9 @@ defm V_FLOOR_F32 : VOP1Inst <vop1<0x24>, "v_floor_f32",

 defm V_EXP_F32 : VOP1Inst <vop1<0x25>, "v_exp_f32",

   VOP_F32_F32, fexp2

 >;

+

+let SchedRW = [WriteFloatTrans] in {</pre>

      </div>

    </blockquote>

    I don't think WriteFloatTrans or some of these others are useful

    ways to characterize the instructions. Are these standard

    classifications that the generic scheduler uses? Something like

    QuarterRate32 would be more useful.<br>

    <blockquote cite="mid:20141202214751.GA4494@freedesktop.org"

      type="cite">

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">

+

 defm V_LOG_CLAMP_F32 : VOP1Inst <vop1<0x26>, "v_log_clamp_f32", VOP_F32_F32>;

 defm V_LOG_F32 : VOP1Inst <vop1<0x27>, "v_log_f32",

   VOP_F32_F32, flog2

@@ -1261,6 +1270,11 @@ defm V_RSQ_LEGACY_F32 : VOP1Inst <vop1<0x2d>, "v_rsq_legacy_f32",

 defm V_RSQ_F32 : VOP1Inst <vop1<0x2e>, "v_rsq_f32",

   VOP_F32_F32, AMDGPUrsq

 >;

+

+} //let SchedRW = [WriteFloatTrans]

+

+let SchedRW = [WriteDouble] in {

+

 defm V_RCP_F64 : VOP1Inst <vop1<0x2f>, "v_rcp_f64",

   VOP_F64_F64, AMDGPUrcp

 >;

@@ -1271,12 +1285,21 @@ defm V_RSQ_F64 : VOP1Inst <vop1<0x31>, "v_rsq_f64",

 defm V_RSQ_CLAMP_F64 : VOP1Inst <vop1<0x32>, "v_rsq_clamp_f64",

   VOP_F64_F64, AMDGPUrsq_clamped

 >;

+

+} // let SchedRW = [WriteDouble];

+

 defm V_SQRT_F32 : VOP1Inst <vop1<0x33>, "v_sqrt_f32",

   VOP_F32_F32, fsqrt

 >;

+

+let SchedRW = [WriteDouble] in {

+

 defm V_SQRT_F64 : VOP1Inst <vop1<0x34>, "v_sqrt_f64",

   VOP_F64_F64, fsqrt

 >;

+

+} // let SchedRW = [WriteDouble]

+

 defm V_SIN_F32 : VOP1Inst <vop1<0x35>, "v_sin_f32",

   VOP_F32_F32, AMDGPUsin

 >;

@@ -1303,6 +1326,8 @@ defm V_MOVRELSD_B32 : VOP1Inst <vop1<0x44>, "v_movrelsd_b32", VOP_I32_I32>;

 // VINTRP Instructions

 //===----------------------------------------------------------------------===//

+// FIXME: Specify SchedRW for VINTRP instructions.

+

 def V_INTERP_P1_F32 : VINTRP <

   0x00000000,

   (outs VReg_32:$dst),

@@ -1337,6 +1362,8 @@ def V_INTERP_MOV_F32 : VINTRP <

 // VOP2 Instructions

 //===----------------------------------------------------------------------===//

+// FIXME: Specify SchedRW for V_CNDMASK and V_*LANE_B32

+

 def V_CNDMASK_B32_e32 : VOP2 <0x00000000, (outs VReg_32:$dst),

   (ins VSrc_32:$src0, VReg_32:$src1, VCCReg:$vcc),

   "v_cndmask_b32_e32 $dst, $src0, $src1, [$vcc]",

@@ -1405,7 +1432,6 @@ defm V_MUL_U32_U24 : VOP2Inst <vop2<0xb>, "v_mul_u32_u24",

 >;

 //defm V_MUL_HI_U32_U24 : VOP2_32 <0x0000000c, "v_mul_hi_u32_u24", []>;

-

 defm V_MIN_LEGACY_F32 : VOP2Inst <vop2<0xd>, "v_min_legacy_f32",

   VOP_F32_F32_F32, AMDGPUfmin_legacy

 >;

@@ -1608,10 +1634,15 @@ defm V_SAD_U32 : VOP3Inst <vop3<0x15d>, "v_sad_u32",

 defm V_DIV_FIXUP_F32 : VOP3Inst <

   vop3<0x15f>, "v_div_fixup_f32", VOP_F32_F32_F32_F32, AMDGPUdiv_fixup

 >;

+

+let SchedRW = [WriteDouble] in {

+

 defm V_DIV_FIXUP_F64 : VOP3Inst <

   vop3<0x160>, "v_div_fixup_f64", VOP_F64_F64_F64_F64, AMDGPUdiv_fixup

 >;

+} // let SchedRW = [WriteDouble]

+

 defm V_LSHL_B64 : VOP3Inst <vop3<0x161>, "v_lshl_b64",

   VOP_I64_I64_I32, shl

 >;

@@ -1622,6 +1653,7 @@ defm V_ASHR_I64 : VOP3Inst <vop3<0x163>, "v_ashr_i64",

   VOP_I64_I64_I32, sra

 >;

+let SchedRW = [WriteDouble] in {

 let isCommutable = 1 in {

 defm V_ADD_F64 : VOP3Inst <vop3<0x164>, "v_add_f64",

@@ -1644,7 +1676,9 @@ defm V_LDEXP_F64 : VOP3Inst <vop3<0x168>, "v_ldexp_f64",

   VOP_F64_F64_I32, AMDGPUldexp

 >;

-let isCommutable = 1 in {

+} // let SchedRW = [WriteDouble]

+

+let isCommutable = 1, SchedRW = [WriteIntMUL] in {

 defm V_MUL_LO_U32 : VOP3Inst <vop3<0x169>, "v_mul_lo_u32",

   VOP_I32_I32_I32

@@ -1659,30 +1693,38 @@ defm V_MUL_HI_I32 : VOP3Inst <vop3<0x16c>, "v_mul_hi_i32",

   VOP_I32_I32_I32

 >;

-} // isCommutable = 1

+} // isCommutable = 1, SchedRW = [WriteIntMUL]

 defm V_DIV_SCALE_F32 : VOP3b_32 <vop3<0x16d>, "v_div_scale_f32", []>;

+let SchedRW = [WriteDouble] in {

 // Double precision division pre-scale.

 defm V_DIV_SCALE_F64 : VOP3b_64 <vop3<0x16e>, "v_div_scale_f64", []>;

+} // let SchedRW = [WriteDouble]

 let isCommutable = 1 in {

 defm V_DIV_FMAS_F32 : VOP3Inst <vop3<0x16f>, "v_div_fmas_f32",

   VOP_F32_F32_F32_F32, AMDGPUdiv_fmas

 >;

+

+let SchedRW = [WriteDouble] in {

 defm V_DIV_FMAS_F64 : VOP3Inst <vop3<0x170>, "v_div_fmas_f64",

   VOP_F64_F64_F64_F64, AMDGPUdiv_fmas

 >;

+} // End SchedRW = [WriteDouble]

 } // End isCommutable = 1

 //def V_MSAD_U8 : VOP3_U8 <0x00000171, "v_msad_u8", []>;

 //def V_QSAD_U8 : VOP3_U8 <0x00000172, "v_qsad_u8", []>;

 //def V_MQSAD_U8 : VOP3_U8 <0x00000173, "v_mqsad_u8", []>;

+let SchedRW = [WriteDouble] in {

 defm V_TRIG_PREOP_F64 : VOP3Inst <

   vop3<0x174>, "v_trig_preop_f64", VOP_F64_F64_I32, AMDGPUtrig_preop

 >;

+} // let SchedRW = [WriteDouble]

+

 //===----------------------------------------------------------------------===//

 // Pseudo Instructions

 //===----------------------------------------------------------------------===//

diff --git a/lib/Target/R600/SIRegisterInfo.cpp b/lib/Target/R600/SIRegisterInfo.cpp

index 27abe9a..a05ce7b 100644

--- a/lib/Target/R600/SIRegisterInfo.cpp

+++ b/lib/Target/R600/SIRegisterInfo.cpp

@@ -49,9 +49,31 @@ BitVector SIRegisterInfo::getReservedRegs(const MachineFunction &MF) const {

   return Reserved;

 }

-unsigned SIRegisterInfo::getRegPressureLimit(const TargetRegisterClass *RC,

-                                             MachineFunction &MF) const {

-  return RC->getNumRegs();

+unsigned SIRegisterInfo::getRegPressureSetLimit(unsigned Idx) const {

+

+  unsigned SGPRLimit = getNumSGPRsAllowed(10);

+  unsigned VGPRLimit = getNumVGPRsAllowed(10);</pre>

      </div>

    </blockquote>

    Magic number 10. This should be a named constant or a subtarget

    feature giving the maximum number of waves.<br>

    There should probably also be a TODO about the other known

    constraints on the number of waves, such as the amount of LDS used.<br>
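
    A rough standalone sketch of what that could look like. MaxWavesPerSIMD and getVGPRPressureLimit are hypothetical names introduced here; the table is copied from getNumVGPRsAllowed() in this patch and written as a free function so the sketch compiles on its own. Whether the constant should live in the register info or be a subtarget query is left open.<br>

    <pre wrap="">
```cpp
#include <cassert>

// Hypothetical named constant replacing the literal 10. It could equally
// be exposed through the subtarget; that choice is not made here.
constexpr unsigned MaxWavesPerSIMD = 10;

// Same wave-count-to-VGPR table as getNumVGPRsAllowed() in the patch.
unsigned getNumVGPRsAllowed(unsigned WaveCount) {
  switch (WaveCount) {
  case 10: return 24;
  case 9:  return 28;
  case 8:  return 32;
  case 7:  return 36;
  case 6:  return 40;
  case 5:  return 48;
  case 4:  return 64;
  case 3:  return 84;
  case 2:  return 128;
  default: return 256;
  }
}

// TODO: The wave count is also bounded by other resources, such as the
// amount of LDS the kernel uses; this sketch ignores those constraints.
unsigned getVGPRPressureLimit() {
  return getNumVGPRsAllowed(MaxWavesPerSIMD);
}
```
</pre>

    The same constant could then be passed to getNumSGPRsAllowed() so that both limits are derived from one place.<br>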

    <blockquote cite="mid:20141202214751.GA4494@freedesktop.org"

      type="cite">

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">

+

+  for (regclass_iterator I = regclass_begin(), E = regclass_end();

+       I != E; ++I) {

+

+    unsigned NumSubRegs = std::max((int)(*I)->getSize() / 4, 1);

+    unsigned Limit;

+

+    if (isSGPRClass(*I)) {

+      Limit = SGPRLimit / NumSubRegs;

+    } else {

+      Limit = VGPRLimit / NumSubRegs;

+    }

+

+    const int *Sets = getRegClassPressureSets(*I);

+    assert(Sets);

+    for (unsigned i = 0; Sets[i] != -1; ++i) {

+           if (Sets[i] == (int)Idx)

+        return Limit;</pre>

      </div>

    </blockquote>

    Weird indentation on the if line.<br>

    <blockquote cite="mid:20141202214751.GA4494@freedesktop.org"

      type="cite">

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">

+    }

+  }

+  return 256;

 }

 bool SIRegisterInfo::requiresRegisterScavenging(const MachineFunction &Fn) const {

@@ -496,3 +518,29 @@ unsigned SIRegisterInfo::findUnusedRegister(const MachineRegisterInfo &MRI,

   return AMDGPU::NoRegister;

 }

+unsigned SIRegisterInfo::getNumVGPRsAllowed(unsigned WaveCount) const {

+  switch(WaveCount) {</pre>

      </div>

    </blockquote>

    Add a space between switch and (.<br>

    <blockquote cite="mid:20141202214751.GA4494@freedesktop.org"

      type="cite">

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">

+    case 10: return 24;

+    case 9:  return 28;

+    case 8:  return 32;

+    case 7:  return 36;

+    case 6:  return 40;

+    case 5:  return 48;

+    case 4:  return 64;

+    case 3:  return 84;

+    case 2:  return 128;

+    default: return 256;

+  }

+}

+

+unsigned SIRegisterInfo::getNumSGPRsAllowed(unsigned WaveCount) const {

+  switch(WaveCount) {

+    case 10: return 48;

+    case 9:  return 56;

+    case 8:  return 64;

+    case 7:  return 72;

+    case 6:  return 80;

+    case 5:  return 96;

+    default: return 103;

+  }

+}

diff --git a/lib/Target/R600/SIRegisterInfo.h b/lib/Target/R600/SIRegisterInfo.h

index f1d78b4..40acfa1 100644

--- a/lib/Target/R600/SIRegisterInfo.h

+++ b/lib/Target/R600/SIRegisterInfo.h

@@ -17,6 +17,7 @@

 #define LLVM_LIB_TARGET_R600_SIREGISTERINFO_H

 #include "AMDGPURegisterInfo.h"

+#include "llvm/Support/Debug.h"

 namespace llvm {

@@ -26,8 +27,7 @@ struct SIRegisterInfo : public AMDGPURegisterInfo {

   BitVector getReservedRegs(const MachineFunction &MF) const override;

-  unsigned getRegPressureLimit(const TargetRegisterClass *RC,

-                               MachineFunction &MF) const override;

+  unsigned getRegPressureSetLimit(unsigned Idx) const override;

   bool requiresRegisterScavenging(const MachineFunction &Fn) const override;

@@ -113,6 +113,14 @@ struct SIRegisterInfo : public AMDGPURegisterInfo {

   unsigned getPreloadedValue(const MachineFunction &MF,

                              enum PreloadedValue Value) const;

+  /// \brief Give the maximum number of VGPRs that can be used by \p WaveCount

+  ///        concurrent waves.

+  unsigned getNumVGPRsAllowed(unsigned WaveCount) const;

+

+  /// \brief Give the maximum number of SGPRs that can be used by \p WaveCount

+  ///        concurrent waves.

+  unsigned getNumSGPRsAllowed(unsigned WaveCount) const;

+

   unsigned findUnusedRegister(const MachineRegisterInfo &MRI,

                               const TargetRegisterClass *RC) const;

diff --git a/lib/Target/R600/SISchedule.td b/lib/Target/R600/SISchedule.td

index 28b65b8..5a1ae29 100644

--- a/lib/Target/R600/SISchedule.td

+++ b/lib/Target/R600/SISchedule.td

@@ -7,9 +7,81 @@

 //

 //===----------------------------------------------------------------------===//

 //

-// TODO: This is just a place holder for now.

+// MachineModel definitions for Southern Islands (SI)

 //

 //===----------------------------------------------------------------------===//

-

 def SI_Itin : ProcessorItineraries <[], [], []>;

+

+

+def WriteBranch : SchedWrite;

+def WriteExport : SchedWrite;

+def WriteLDS    : SchedWrite;

+def WriteSALU   : SchedWrite;

+def WriteSMEM   : SchedWrite;

+def WriteVMEM   : SchedWrite;

+

+// Vector ALU instructions

+def Write32Bit      : SchedWrite;

+def WriteIntMUL     : SchedWrite;

+

+def WriteConversion : SchedWrite;

+

+def WriteFloatFMA   : SchedWrite;

+def WriteFloatTrans : SchedWrite;

+

+def WriteDouble     : SchedWrite;

+def WriteDoubleAdd  : SchedWrite;

+

+def SIFullSpeedModel : SchedMachineModel;

+

+// BufferSize = 0 means the processors are in-order.

+let BufferSize = 0 in {

+

+// XXX: Are the resource counts correct?

+def HWBranch : ProcResource<1>;

+def HWExport : ProcResource<7>;   // Taken from S_WAITCNT

+def HWLGKM   : ProcResource<31>;  // Taken from S_WAITCNT

+def HWSALU   : ProcResource<1>;

+def HWVMEM   : ProcResource<15>;  // Taken from S_WAITCNT

+def HWVALU   : ProcResource<1>;

+

+}

+

+let SchedModel = SIFullSpeedModel in {

+

+class HWWriteRes<SchedWrite write, list<ProcResourceKind> resources,

+                 int latency> : WriteRes<write, resources> {

+  let Latency = latency;

+}

+

+class HWVALUWriteRes<SchedWrite write, int latency> :

+  HWWriteRes<write, [HWVALU], latency>;

+

+// The latency numbers are taken from AMD Accelerated Parallel Processing

+// guide.  They may not be accurate.

+

+def : HWWriteRes<WriteBranch,  [HWBranch], 100>; // XXX: Guessed ???

+def : HWWriteRes<WriteExport,  [HWExport], 100>; // XXX: Guessed ???</pre>

      </div>

    </blockquote>

    I would assume export is the same as for VMEM?<br>
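    A side note on the getNumVGPRsAllowed / getNumSGPRsAllowed tables above: the values look consistent with splitting a 256-entry VGPR budget (rounded down to a multiple of 4) and a 512-entry SGPR budget (rounded down to a multiple of 8, capped at 103) among the waves. That reading is an assumption inferred from the numbers, not something the patch states. A minimal C++14 sketch, with hypothetical helper names, that reproduces both tables for 1 to 10 waves and inverts the VGPR one:<br>

```cpp
// Hypothetical helpers, not part of the patch. Assumed budgets: 256 VGPRs
// (granule 4) and 512 SGPRs (granule 8, capped at 103) shared by the waves.
constexpr unsigned MaxWaves = 10;

// VGPRs available to each of WaveCount concurrent waves (valid for 1..10).
constexpr unsigned vgprsAllowed(unsigned WaveCount) {
  return (256u / WaveCount) & ~3u;  // round down to a multiple of 4
}

// SGPRs available to each of WaveCount concurrent waves (valid for 1..10).
constexpr unsigned sgprsAllowed(unsigned WaveCount) {
  unsigned Share = (512u / WaveCount) & ~7u;  // round down to a multiple of 8
  return Share < 103u ? Share : 103u;         // cap taken from the table
}

// Largest wave count whose VGPR limit still fits NumVGPRs; 0 if none does.
constexpr unsigned maxWavesForVGPRs(unsigned NumVGPRs) {
  for (unsigned W = MaxWaves; W >= 1; --W)
    if (vgprsAllowed(W) >= NumVGPRs)
      return W;
  return 0;
}

static_assert(vgprsAllowed(10) == 24 && vgprsAllowed(3) == 84,
              "agrees with the VGPR table in the patch");
static_assert(sgprsAllowed(10) == 48 && sgprsAllowed(4) == 103,
              "agrees with the SGPR table in the patch");
```

    Under these assumptions a kernel using 25 VGPRs would drop from 10 waves to 9, which is the kind of step a pressure limit derived from these functions would be guarding against.<br>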

    <blockquote cite="mid:20141202214751.GA4494@freedesktop.org"

      type="cite">

      <div class="moz-text-plain" wrap="true" graphical-quote="true"

        style="font-family: -moz-fixed; font-size: 12px;"

        lang="x-western">

        <pre wrap="">

+def : HWWriteRes<WriteLDS,     [HWLGKM],    32>; // 2 - 64

+def : HWWriteRes<WriteSALU,    [HWSALU],     1>;

+def : HWWriteRes<WriteSMEM,    [HWLGKM],    10>; // XXX: Guessed ???

+def : HWWriteRes<WriteVMEM,    [HWVMEM],   450>; // 300 - 600

+

+// XXX: These definitions assume full double-precision speed, some devices are

+// slower.  These are also taken from the AMD Accelerated Parallel Processing

+// guide and may not be accurate.

+

+// The latency values are 1 / (operations / cycle) / 4.

+def : HWVALUWriteRes<Write32Bit,      1>;

+def : HWVALUWriteRes<WriteIntMUL,     4>;

+

+def : HWVALUWriteRes<WriteConversion, 4>;

+

+def : HWVALUWriteRes<WriteFloatFMA,   1>;  // 16 for single speed

+def : HWVALUWriteRes<WriteFloatTrans, 4>;

+

+def : HWVALUWriteRes<WriteDouble,     4>; // 16 for single speed

+def : HWVALUWriteRes<WriteDoubleAdd,  2>; //  8 for single speed

+

+} // End SchedModel = SIFullSpeedModel

diff --git a/test/CodeGen/R600/atomic_cmp_swap_local.ll b/test/CodeGen/R600/atomic_cmp_swap_local.ll

index 223f4d3..35e8ade 100644

--- a/test/CodeGen/R600/atomic_cmp_swap_local.ll

+++ b/test/CodeGen/R600/atomic_cmp_swap_local.ll

@@ -2,9 +2,9 @@

 ; RUN: llc -march=r600 -mcpu=bonaire -verify-machineinstrs < %s | FileCheck -strict-whitespace -check-prefix=CI -check-prefix=FUNC %s

 ; FUNC-LABEL: {{^}}lds_atomic_cmpxchg_ret_i32_offset:

+; SI: v_mov_b32_e32 [[VCMP:v[0-9]+]], 7

 ; SI: s_load_dword [[PTR:s[0-9]+]], s{{\[[0-9]+:[0-9]+\]}}, 0xb

 ; SI: s_load_dword [[SWAP:s[0-9]+]], s{{\[[0-9]+:[0-9]+\]}}, 0xc

-; SI-DAG: v_mov_b32_e32 [[VCMP:v[0-9]+]], 7

 ; SI-DAG: v_mov_b32_e32 [[VPTR:v[0-9]+]], [[PTR]]

 ; SI-DAG: v_mov_b32_e32 [[VSWAP:v[0-9]+]], [[SWAP]]

 ; SI: ds_cmpst_rtn_b32 [[RESULT:v[0-9]+]], [[VPTR]], [[VCMP]], [[VSWAP]] offset:16 [M0]

@@ -18,11 +18,11 @@ define void @lds_atomic_cmpxchg_ret_i32_offset(i32 addrspace(1)* %out, i32 addrs

 }

 ; FUNC-LABEL: {{^}}lds_atomic_cmpxchg_ret_i64_offset:

-; SI: s_load_dword [[PTR:s[0-9]+]], s{{\[[0-9]+:[0-9]+\]}}, 0xb

-; SI: s_load_dwordx2 s{{\[}}[[LOSWAP:[0-9]+]]:[[HISWAP:[0-9]+]]{{\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0xd

 ; SI: s_mov_b64  s{{\[}}[[LOSCMP:[0-9]+]]:[[HISCMP:[0-9]+]]{{\]}}, 7

 ; SI-DAG: v_mov_b32_e32 v[[LOVCMP:[0-9]+]], s[[LOSCMP]]

 ; SI-DAG: v_mov_b32_e32 v[[HIVCMP:[0-9]+]], s[[HISCMP]]

+; SI: s_load_dword [[PTR:s[0-9]+]], s{{\[[0-9]+:[0-9]+\]}}, 0xb

+; SI: s_load_dwordx2 s{{\[}}[[LOSWAP:[0-9]+]]:[[HISWAP:[0-9]+]]{{\]}}, s{{\[[0-9]+:[0-9]+\]}}, 0xd

 ; SI-DAG: v_mov_b32_e32 [[VPTR:v[0-9]+]], [[PTR]]

 ; SI-DAG: v_mov_b32_e32 v[[LOSWAPV:[0-9]+]], s[[LOSWAP]]

 ; SI-DAG: v_mov_b32_e32 v[[HISWAPV:[0-9]+]], s[[HISWAP]]

diff --git a/test/CodeGen/R600/ctpop.ll b/test/CodeGen/R600/ctpop.ll

index 5cfdaef..eba2e21 100644

--- a/test/CodeGen/R600/ctpop.ll

+++ b/test/CodeGen/R600/ctpop.ll

@@ -38,11 +38,11 @@ define void @v_ctpop_i32(i32 addrspace(1)* noalias %out, i32 addrspace(1)* noali

 }

 ; FUNC-LABEL: {{^}}v_ctpop_add_chain_i32:

-; SI: buffer_load_dword [[VAL0:v[0-9]+]],

 ; SI: buffer_load_dword [[VAL1:v[0-9]+]],

+; SI: buffer_load_dword [[VAL0:v[0-9]+]],

 ; SI: v_mov_b32_e32 [[VZERO:v[0-9]+]], 0

 ; SI: v_bcnt_u32_b32_e32 [[MIDRESULT:v[0-9]+]], [[VAL1]], [[VZERO]]

-; SI-NEXT: v_bcnt_u32_b32_e32 [[RESULT:v[0-9]+]], [[VAL0]], [[MIDRESULT]]

+; SI: v_bcnt_u32_b32_e32 [[RESULT:v[0-9]+]], [[VAL0]], [[MIDRESULT]]

 ; SI: buffer_store_dword [[RESULT]],

 ; SI: s_endpgm

diff --git a/test/CodeGen/R600/ds_read2st64.ll b/test/CodeGen/R600/ds_read2st64.ll

index 3e98e59..5e2fa3f 100644

--- a/test/CodeGen/R600/ds_read2st64.ll

+++ b/test/CodeGen/R600/ds_read2st64.ll

@@ -65,8 +65,8 @@ define void @simple_read2st64_f32_max_offset(float addrspace(1)* %out, float add

 ; SI-LABEL: @simple_read2st64_f32_over_max_offset

 ; SI-NOT: ds_read2st64_b32

-; SI: ds_read_b32 {{v[0-9]+}}, {{v[0-9]+}} offset:256

 ; SI: v_add_i32_e32 [[BIGADD:v[0-9]+]], 0x10000, {{v[0-9]+}}

+; SI: ds_read_b32 {{v[0-9]+}}, {{v[0-9]+}} offset:256

 ; SI: ds_read_b32 {{v[0-9]+}}, [[BIGADD]]

 ; SI: s_endpgm

 define void @simple_read2st64_f32_over_max_offset(float addrspace(1)* %out, float addrspace(3)* %lds) #0 {

@@ -197,8 +197,8 @@ define void @simple_read2st64_f64_max_offset(double addrspace(1)* %out, double a

 ; SI-LABEL: @simple_read2st64_f64_over_max_offset

 ; SI-NOT: ds_read2st64_b64

-; SI: ds_read_b64 {{v\[[0-9]+:[0-9]+\]}}, {{v[0-9]+}} offset:512

 ; SI: v_add_i32_e32 [[BIGADD:v[0-9]+]], 0x10000, {{v[0-9]+}}

+; SI: ds_read_b64 {{v\[[0-9]+:[0-9]+\]}}, {{v[0-9]+}} offset:512

 ; SI: ds_read_b64 {{v\[[0-9]+:[0-9]+\]}}, [[BIGADD]]

 ; SI: s_endpgm

 define void @simple_read2st64_f64_over_max_offset(double addrspace(1)* %out, double addrspace(3)* %lds) #0 {

diff --git a/test/CodeGen/R600/fceil64.ll b/test/CodeGen/R600/fceil64.ll

index 029f41d..7fcaec2 100644

--- a/test/CodeGen/R600/fceil64.ll

+++ b/test/CodeGen/R600/fceil64.ll

@@ -11,12 +11,12 @@ declare <16 x double> @llvm.ceil.v16f64(<16 x double>) nounwind readnone

 ; FUNC-LABEL: {{^}}fceil_f64:

 ; CI: v_ceil_f64_e32

 ; SI: s_bfe_u32 [[SEXP:s[0-9]+]], {{s[0-9]+}}, 0xb0014

+; SI: s_and_b32 s{{[0-9]+}}, s{{[0-9]+}}, 0x80000000

 ; SI: s_add_i32 s{{[0-9]+}}, [[SEXP]], 0xfffffc01

 ; SI: s_lshr_b64

 ; SI: s_not_b64

 ; SI: s_and_b64

-; SI-DAG: s_and_b32 s{{[0-9]+}}, s{{[0-9]+}}, 0x80000000

-; SI-DAG: cmp_lt_i32

+; SI: cmp_lt_i32

 ; SI: cndmask_b32

 ; SI: cndmask_b32

 ; SI: cmp_gt_i32

diff --git a/test/CodeGen/R600/ffloor.ll b/test/CodeGen/R600/ffloor.ll

index 166f705..2ca428e 100644

--- a/test/CodeGen/R600/ffloor.ll

+++ b/test/CodeGen/R600/ffloor.ll

@@ -12,12 +12,12 @@ declare <16 x double> @llvm.floor.v16f64(<16 x double>) nounwind readnone

 ; CI: v_floor_f64_e32

 ; SI: s_bfe_u32 [[SEXP:s[0-9]+]], {{s[0-9]+}}, 0xb0014

+; SI: s_and_b32 s{{[0-9]+}}, s{{[0-9]+}}, 0x80000000

 ; SI: s_add_i32 s{{[0-9]+}}, [[SEXP]], 0xfffffc01

 ; SI: s_lshr_b64

 ; SI: s_not_b64

 ; SI: s_and_b64

-; SI-DAG: s_and_b32 s{{[0-9]+}}, s{{[0-9]+}}, 0x80000000

-; SI-DAG: cmp_lt_i32

+; SI: cmp_lt_i32

 ; SI: cndmask_b32

 ; SI: cndmask_b32

 ; SI: cmp_gt_i32

diff --git a/test/CodeGen/R600/fmax3.ll b/test/CodeGen/R600/fmax3.ll

index cf371b3..baa38d0 100644

--- a/test/CodeGen/R600/fmax3.ll

+++ b/test/CodeGen/R600/fmax3.ll

@@ -3,9 +3,9 @@

 declare float @llvm.maxnum.f32(float, float) nounwind readnone

 ; SI-LABEL: {{^}}test_fmax3_olt_0:

-; SI: buffer_load_dword [[REGA:v[0-9]+]]

-; SI: buffer_load_dword [[REGB:v[0-9]+]]

 ; SI: buffer_load_dword [[REGC:v[0-9]+]]

+; SI: buffer_load_dword [[REGB:v[0-9]+]]

+; SI: buffer_load_dword [[REGA:v[0-9]+]]

 ; SI: v_max3_f32 [[RESULT:v[0-9]+]], [[REGC]], [[REGB]], [[REGA]]

 ; SI: buffer_store_dword [[RESULT]],

 ; SI: s_endpgm

@@ -21,8 +21,8 @@ define void @test_fmax3_olt_0(float addrspace(1)* %out, float addrspace(1)* %apt

 ; Commute operand of second fmax

 ; SI-LABEL: {{^}}test_fmax3_olt_1:

-; SI: buffer_load_dword [[REGA:v[0-9]+]]

 ; SI: buffer_load_dword [[REGB:v[0-9]+]]

+; SI: buffer_load_dword [[REGA:v[0-9]+]]

 ; SI: buffer_load_dword [[REGC:v[0-9]+]]

 ; SI: v_max3_f32 [[RESULT:v[0-9]+]], [[REGC]], [[REGB]], [[REGA]]

 ; SI: buffer_store_dword [[RESULT]],

diff --git a/test/CodeGen/R600/fmin3.ll b/test/CodeGen/R600/fmin3.ll

index 7420368..78d04b5 100644

--- a/test/CodeGen/R600/fmin3.ll

+++ b/test/CodeGen/R600/fmin3.ll

@@ -3,9 +3,9 @@

 declare float @llvm.minnum.f32(float, float) nounwind readnone

 ; SI-LABEL: {{^}}test_fmin3_olt_0:

-; SI: buffer_load_dword [[REGA:v[0-9]+]]

-; SI: buffer_load_dword [[REGB:v[0-9]+]]

 ; SI: buffer_load_dword [[REGC:v[0-9]+]]

+; SI: buffer_load_dword [[REGB:v[0-9]+]]

+; SI: buffer_load_dword [[REGA:v[0-9]+]]

 ; SI: v_min3_f32 [[RESULT:v[0-9]+]], [[REGC]], [[REGB]], [[REGA]]

 ; SI: buffer_store_dword [[RESULT]],

 ; SI: s_endpgm

@@ -21,8 +21,8 @@ define void @test_fmin3_olt_0(float addrspace(1)* %out, float addrspace(1)* %apt

 ; Commute operand of second fmin

 ; SI-LABEL: {{^}}test_fmin3_olt_1:

-; SI: buffer_load_dword [[REGA:v[0-9]+]]

 ; SI: buffer_load_dword [[REGB:v[0-9]+]]

+; SI: buffer_load_dword [[REGA:v[0-9]+]]

 ; SI: buffer_load_dword [[REGC:v[0-9]+]]

 ; SI: v_min3_f32 [[RESULT:v[0-9]+]], [[REGC]], [[REGB]], [[REGA]]

 ; SI: buffer_store_dword [[RESULT]],

diff --git a/test/CodeGen/R600/fneg-fabs.f64.ll b/test/CodeGen/R600/fneg-fabs.f64.ll

index 555f4cc..60209a8 100644

--- a/test/CodeGen/R600/fneg-fabs.f64.ll

+++ b/test/CodeGen/R600/fneg-fabs.f64.ll

@@ -56,8 +56,8 @@ define void @fneg_fabs_fn_free_f64(double addrspace(1)* %out, i64 %in) {

 }

 ; FUNC-LABEL: {{^}}fneg_fabs_f64:

-; SI: s_load_dwordx2

 ; SI: s_load_dwordx2 s{{\[}}[[LO_X:[0-9]+]]:[[HI_X:[0-9]+]]{{\]}}

+; SI: s_load_dwordx2

 ; SI: v_mov_b32_e32 [[IMMREG:v[0-9]+]], 0x80000000

 ; SI-DAG: v_or_b32_e32 v[[HI_V:[0-9]+]], s[[HI_X]], [[IMMREG]]

 ; SI-DAG: v_mov_b32_e32 v[[LO_V:[0-9]+]], s[[LO_X]]

diff --git a/test/CodeGen/R600/ftrunc.f64.ll b/test/CodeGen/R600/ftrunc.f64.ll

index fba6154..5547c2f 100644

--- a/test/CodeGen/R600/ftrunc.f64.ll

+++ b/test/CodeGen/R600/ftrunc.f64.ll

@@ -23,12 +23,12 @@ define void @v_ftrunc_f64(double addrspace(1)* %out, double addrspace(1)* %in) {

 ; CI: v_trunc_f64_e32

 ; SI: s_bfe_u32 [[SEXP:s[0-9]+]], {{s[0-9]+}}, 0xb0014

+; SI: s_and_b32 s{{[0-9]+}}, s{{[0-9]+}}, 0x80000000

 ; SI: s_add_i32 s{{[0-9]+}}, [[SEXP]], 0xfffffc01

 ; SI: s_lshr_b64

+; SI: cmp_lt_i32

 ; SI: s_not_b64

 ; SI: s_and_b64

-; SI: s_and_b32 s{{[0-9]+}}, s{{[0-9]+}}, 0x80000000

-; SI: cmp_lt_i32

 ; SI: cndmask_b32

 ; SI: cndmask_b32

 ; SI: cmp_gt_i32

diff --git a/test/CodeGen/R600/llvm.memcpy.ll b/test/CodeGen/R600/llvm.memcpy.ll

index 5f2710a..ba4085c 100644

--- a/test/CodeGen/R600/llvm.memcpy.ll

+++ b/test/CodeGen/R600/llvm.memcpy.ll

@@ -6,39 +6,23 @@ declare void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* nocapture, i8 addrspace

 ; FUNC-LABEL: {{^}}test_small_memcpy_i64_lds_to_lds_align1:

 ; SI: ds_read_u8

-; SI: ds_write_b8

 ; SI: ds_read_u8

-; SI: ds_write_b8

 ; SI: ds_read_u8

-; SI: ds_write_b8

 ; SI: ds_read_u8

-; SI: ds_write_b8

 ; SI: ds_read_u8

-; SI: ds_write_b8

-

 ; SI: ds_read_u8

-; SI: ds_write_b8

 ; SI: ds_read_u8

-; SI: ds_write_b8

 ; SI: ds_read_u8

-; SI: ds_write_b8

+

 ; SI: ds_read_u8

-; SI: ds_write_b8

 ; SI: ds_read_u8

-; SI: ds_write_b8

-

 ; SI: ds_read_u8

-; SI: ds_write_b8

 ; SI: ds_read_u8

-; SI: ds_write_b8

 ; SI: ds_read_u8

-; SI: ds_write_b8

 ; SI: ds_read_u8

-; SI: ds_write_b8

 ; SI: ds_read_u8

 ; SI: ds_read_u8

-

 ; SI: ds_read_u8

 ; SI: ds_read_u8

 ; SI: ds_read_u8

@@ -65,6 +49,14 @@ declare void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* nocapture, i8 addrspace

 ; SI: ds_write_b8

 ; SI: ds_write_b8

 ; SI: ds_write_b8

+

+; SI: ds_write_b8

+; SI: ds_write_b8

+; SI: ds_write_b8

+; SI: ds_write_b8

+; SI: ds_write_b8

+; SI: ds_write_b8

+; SI: ds_write_b8

 ; SI: ds_write_b8

 ; SI: ds_write_b8

@@ -75,6 +67,14 @@ declare void @llvm.memcpy.p1i8.p1i8.i64(i8 addrspace(1)* nocapture, i8 addrspace

 ; SI: ds_write_b8

 ; SI: ds_write_b8

 ; SI: ds_write_b8

+

+; SI: ds_write_b8

+; SI: ds_write_b8

+; SI: ds_write_b8

+; SI: ds_write_b8

+; SI: ds_write_b8

+; SI: ds_write_b8

+; SI: ds_write_b8

 ; SI: ds_write_b8

 ; SI: s_endpgm

diff --git a/test/CodeGen/R600/local-atomics.ll b/test/CodeGen/R600/local-atomics.ll

index e9baa08..cbcae60 100644

--- a/test/CodeGen/R600/local-atomics.ll

+++ b/test/CodeGen/R600/local-atomics.ll

@@ -4,8 +4,8 @@

 ; FUNC-LABEL: {{^}}lds_atomic_xchg_ret_i32:

 ; EG: LDS_WRXCHG_RET *

-; SI: s_load_dword [[SPTR:s[0-9]+]],

 ; SI: v_mov_b32_e32 [[DATA:v[0-9]+]], 4

+; SI: s_load_dword [[SPTR:s[0-9]+]],

 ; SI: v_mov_b32_e32 [[VPTR:v[0-9]+]], [[SPTR]]

 ; SI: ds_wrxchg_rtn_b32 [[RESULT:v[0-9]+]], [[VPTR]], [[DATA]] [M0]

 ; SI: buffer_store_dword [[RESULT]],

@@ -30,8 +30,8 @@ define void @lds_atomic_xchg_ret_i32_offset(i32 addrspace(1)* %out, i32 addrspac

 ; XXX - Is it really necessary to load 4 into VGPR?

 ; FUNC-LABEL: {{^}}lds_atomic_add_ret_i32:

 ; EG: LDS_ADD_RET *

-; SI: s_load_dword [[SPTR:s[0-9]+]],

 ; SI: v_mov_b32_e32 [[DATA:v[0-9]+]], 4

+; SI: s_load_dword [[SPTR:s[0-9]+]],

 ; SI: v_mov_b32_e32 [[VPTR:v[0-9]+]], [[SPTR]]

 ; SI: ds_add_rtn_u32 [[RESULT:v[0-9]+]], [[VPTR]], [[DATA]] [M0]

 ; SI: buffer_store_dword [[RESULT]],

diff --git a/test/CodeGen/R600/local-atomics64.ll b/test/CodeGen/R600/local-atomics64.ll

index ce0cf59..8e6d5c1 100644

--- a/test/CodeGen/R600/local-atomics64.ll

+++ b/test/CodeGen/R600/local-atomics64.ll

@@ -29,10 +29,10 @@ define void @lds_atomic_add_ret_i64(i64 addrspace(1)* %out, i64 addrspace(3)* %p

 }

 ; FUNC-LABEL: {{^}}lds_atomic_add_ret_i64_offset:

-; SI: s_load_dword [[PTR:s[0-9]+]], s{{\[[0-9]+:[0-9]+\]}}, 0xb

 ; SI: s_mov_b64 s{{\[}}[[LOSDATA:[0-9]+]]:[[HISDATA:[0-9]+]]{{\]}}, 9

 ; SI-DAG: v_mov_b32_e32 v[[LOVDATA:[0-9]+]], s[[LOSDATA]]

 ; SI-DAG: v_mov_b32_e32 v[[HIVDATA:[0-9]+]], s[[HISDATA]]

+; SI: s_load_dword [[PTR:s[0-9]+]], s{{\[[0-9]+:[0-9]+\]}}, 0xb

 ; SI-DAG: v_mov_b32_e32 [[VPTR:v[0-9]+]], [[PTR]]

 ; SI: ds_add_rtn_u64 [[RESULT:v\[[0-9]+:[0-9]+\]]], [[VPTR]], v{{\[}}[[LOVDATA]]:[[HIVDATA]]{{\]}} offset:32 [M0]

 ; SI: buffer_store_dwordx2 [[RESULT]],

diff --git a/test/CodeGen/R600/local-memory-two-objects.ll b/test/CodeGen/R600/local-memory-two-objects.ll

index 88ef05d..83ab70d 100644

--- a/test/CodeGen/R600/local-memory-two-objects.ll

+++ b/test/CodeGen/R600/local-memory-two-objects.ll

@@ -31,8 +31,8 @@

 ; EG-CHECK-NOT: LDS_READ_RET {{[*]*}} OQAP, T[[ADDRR]]

 ; SI: v_add_i32_e32 [[SIPTR:v[0-9]+]], 16, v{{[0-9]+}}

 ; SI: ds_read_b32 {{v[0-9]+}}, [[SIPTR]] [M0]

-; CI: ds_read_b32 {{v[0-9]+}}, [[ADDRR:v[0-9]+]] offset:16 [M0]

-; CI: ds_read_b32 {{v[0-9]+}}, [[ADDRR]] [M0]

+; CI: ds_read_b32 {{v[0-9]+}}, [[ADDRR:v[0-9]+]] [M0]

+; CI: ds_read_b32 {{v[0-9]+}}, [[ADDRR]] offset:16 [M0]

 define void @local_memory_two_objects(i32 addrspace(1)* %out) {

 entry:

diff --git a/test/CodeGen/R600/si-triv-disjoint-mem-access.ll b/test/CodeGen/R600/si-triv-disjoint-mem-access.ll

index 2c146eb..2e2d551 100644

--- a/test/CodeGen/R600/si-triv-disjoint-mem-access.ll

+++ b/test/CodeGen/R600/si-triv-disjoint-mem-access.ll

@@ -51,8 +51,8 @@ define void @no_reorder_local_load_volatile_global_store_local_load(i32 addrspac

 ; FUNC-LABEL: @no_reorder_barrier_local_load_global_store_local_load

 ; CI: ds_read_b32 {{v[0-9]+}}, {{v[0-9]+}} offset:4

-; CI: buffer_store_dword

 ; CI: ds_read_b32 {{v[0-9]+}}, {{v[0-9]+}} offset:8

+; CI: buffer_store_dword

 define void @no_reorder_barrier_local_load_global_store_local_load(i32 addrspace(1)* %out, i32 addrspace(1)* %gptr) #0 {

   %ptr0 = load i32 addrspace(3)* addrspace(3)* @stored_lds_ptr, align 4

diff --git a/test/CodeGen/R600/smrd.ll b/test/CodeGen/R600/smrd.ll

index 1c7df16..858e3f8 100644

--- a/test/CodeGen/R600/smrd.ll

+++ b/test/CodeGen/R600/smrd.ll

@@ -37,10 +37,12 @@ entry:

 ; SMRD load with a 64-bit offset

 ; CHECK-LABEL: {{^}}smrd3:

-; CHECK-DAG: s_mov_b32 s[[SHI:[0-9]+]], 4

-; CHECK-DAG: s_mov_b32 s[[SLO:[0-9]+]], 0 ;

-; FIXME: We don't need to copy these values to VGPRs

-; CHECK-DAG: v_mov_b32_e32 v[[VLO:[0-9]+]], s[[SLO]]

+; FIXME: There are too many copies here because we don't fold immediates

+;        through REG_SEQUENCE

+; CHECK: s_mov_b32 s[[SLO:[0-9]+]], 0 ;

+; CHECK: s_mov_b32 s[[SHI:[0-9]+]], 4

+; CHECK: s_mov_b32 s[[SSLO:[0-9]+]], s[[SLO]]

+; CHECK-DAG: v_mov_b32_e32 v[[VLO:[0-9]+]], s[[SSLO]]

 ; CHECK-DAG: v_mov_b32_e32 v[[VHI:[0-9]+]], s[[SHI]]

 ; FIXME: We should be able to use s_load_dword here

 ; CHECK: buffer_load_dword v{{[0-9]+}}, v{{\[}}[[VLO]]:[[VHI]]{{\]}}, s[{{[0-9]+:[0-9]+}}], 0 addr64

diff --git a/test/CodeGen/R600/wait.ll b/test/CodeGen/R600/wait.ll

index 735eabd..cfb446f 100644

--- a/test/CodeGen/R600/wait.ll

+++ b/test/CodeGen/R600/wait.ll

@@ -3,9 +3,8 @@

 ; CHECK-LABEL: {{^}}main:

 ; CHECK: s_load_dwordx4

 ; CHECK: s_load_dwordx4

-; CHECK: s_waitcnt lgkmcnt(0){{$}}

-; CHECK: s_waitcnt vmcnt(0){{$}}

-; CHECK: s_waitcnt expcnt(0) lgkmcnt(0){{$}}

+; CHECK: s_waitcnt vmcnt(0) lgkmcnt(0){{$}}

+; CHECK: s_endpgm

 define void @main(<16 x i8> addrspace(2)* inreg %arg, <16 x i8> addrspace(2)* inreg %arg1, <32 x i8> addrspace(2)* inreg %arg2, <16 x i8> addrspace(2)* inreg %arg3, <16 x i8> addrspace(2)* inreg %arg4, i32 inreg %arg5, i32 %arg6, i32 %arg7, i32 %arg8, i32 %arg9, float addrspace(2)* inreg %constptr) #0 {

 main_body:

   %tmp = getelementptr <16 x i8> addrspace(2)* %arg3, i32 0

diff --git a/test/CodeGen/R600/zero_extend.ll b/test/CodeGen/R600/zero_extend.ll

index 0fe1f15..1c746b0 100644

--- a/test/CodeGen/R600/zero_extend.ll

+++ b/test/CodeGen/R600/zero_extend.ll

@@ -29,9 +29,9 @@ entry:

 }

 ; SI-CHECK-LABEL: {{^}}zext_i1_to_i64:

+; SI-CHECK: s_mov_b32 s{{[0-9]+}}, 0

 ; SI-CHECK: v_cmp_eq_i32

 ; SI-CHECK: v_cndmask_b32

-; SI-CHECK: s_mov_b32 s{{[0-9]+}}, 0

 define void @zext_i1_to_i64(i64 addrspace(1)* %out, i32 %a, i32 %b) nounwind {

   %cmp = icmp eq i32 %a, %b

   %ext = zext i1 %cmp to i64

<div class="moz-txt-sig">-- 

2.0.4

</div></pre>

      </div>

      <br>


    </blockquote>

    <br>
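    A note on the VALU latency comment in SISchedule.td ("1 / (operations / cycle) / 4"): worked through with rates of 1/4, 1/16 and 1/8 operations per cycle, the formula yields the 1, 4 and 2 that appear in the table. Those rates are picked here only to illustrate the arithmetic; they are not quoted from the AMD guide. A small C++ sketch (the helper name is hypothetical):<br>

```cpp
// Hypothetical helper: latency = 1 / (operations / cycle) / 4, written in
// terms of cycles per operation (the reciprocal rate) so it stays integral.
constexpr unsigned valuLatency(unsigned CyclesPerOp) {
  return CyclesPerOp / 4;
}

static_assert(valuLatency(4) == 1, "1/4 op per cycle gives latency 1");
static_assert(valuLatency(16) == 4, "1/16 op per cycle gives latency 4");
static_assert(valuLatency(8) == 2, "1/8 op per cycle gives latency 2");
```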

  </body>

</html>