PATCH: R600/SI: Bug fixes for mi-sched

Matt Arsenault Matthew.Arsenault at amd.com
Thu Nov 13 13:58:58 PST 2014


On 11/13/2014 07:50 AM, Tom Stellard wrote:
> Hi,
>
> The attached patches fix some bugs uncovered while implementing the mi
> scheduler for SI.  Most of these fixes involve changing how we store
> values in the m0 register, in order to avoid incorrectly clobbering values in
> that register.
>
> -Tom
>
> 0001-R600-SI-Fix-spilling-of-m0-register.patch
>
>
>  From 121a8a47373ea629c2742595b63b6dfb8d597148 Mon Sep 17 00:00:00 2001
> From: Tom Stellard<thomas.stellard at amd.com>
> Date: Thu, 6 Nov 2014 14:02:41 -0500
> Subject: [PATCH 1/4] R600/SI: Fix spilling of m0 register
>
> If we have spilled the value of the m0 register, then we need to restore
> it with v_readlane_b32 to a regular sgpr, because v_readlane_b32 can't
> write to m0.
>
> ---
>   lib/Target/R600/SIRegisterInfo.cpp | 10 +++++++++-
>   1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/lib/Target/R600/SIRegisterInfo.cpp b/lib/Target/R600/SIRegisterInfo.cpp
> index 3715e6a..6f79a97 100644
> --- a/lib/Target/R600/SIRegisterInfo.cpp
> +++ b/lib/Target/R600/SIRegisterInfo.cpp
> @@ -145,6 +145,7 @@ void SIRegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator MI,
>         for (unsigned i = 0, e = NumSubRegs; i < e; ++i) {
>           unsigned SubReg = getPhysRegSubReg(MI->getOperand(0).getReg(),
>                                              &AMDGPU::SGPR_32RegClass, i);
> +        bool isM0 = SubReg == AMDGPU::M0;
>           struct SIMachineFunctionInfo::SpilledReg Spill =
>               MFI->getSpilledReg(MF, Index, i);
>   
> @@ -153,10 +154,17 @@ void SIRegisterInfo::eliminateFrameIndex(MachineBasicBlock::iterator MI,
>              Ctx.emitError("Ran out of VGPRs for spilling SGPR");
>           }
>   
> +        if (isM0) {
> +          SubReg = RS->scavengeRegister(&AMDGPU::SGPR_32RegClass, MI, 0);
> +        }
> +
>           BuildMI(*MBB, MI, DL, TII->get(AMDGPU::V_READLANE_B32), SubReg)
>                   .addReg(Spill.VGPR)
>                   .addImm(Spill.Lane);
> -
> +        if (isM0) {
> +          BuildMI(*MBB, MI, DL, TII->get(AMDGPU::S_MOV_B32), AMDGPU::M0)
> +                  .addReg(SubReg);
> +        }
>         }
>         TII->insertNOPs(MI, 3);
>         MI->eraseFromParent();
> -- 1.8.5.5

LGTM. Can you add a test for this?
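For anyone reading along, the restore sequence the patch emits for a spilled m0 would look roughly like this (a sketch; the scavenged SGPR, VGPR, and lane numbers are illustrative):

```asm
; Restoring a spilled m0: v_readlane_b32 cannot write m0 directly,
; so read the spilled lane into a scavenged SGPR first, then move it to m0.
v_readlane_b32 s4, v0, 1   ; s4 = scavenged SGPR; v0 lane 1 holds the spilled value
s_mov_b32 m0, s4           ; copy the restored value into m0
```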
>
> 0002-R600-SI-Mark-s_movk_i32-as-rematerializable.patch
>
>
>  From b2d846dea6678b541d192da3aa5dce23aeb06cf2 Mon Sep 17 00:00:00 2001
> From: Tom Stellard<thomas.stellard at amd.com>
> Date: Thu, 6 Nov 2014 14:01:30 -0500
> Subject: [PATCH 2/4] R600/SI: Mark s_movk_i32 as rematerializable
>
> ---
>   lib/Target/R600/SIInstructions.td | 2 ++
>   1 file changed, 2 insertions(+)
>
> diff --git a/lib/Target/R600/SIInstructions.td b/lib/Target/R600/SIInstructions.td
> index af8f1b2..22d30c2 100644
> --- a/lib/Target/R600/SIInstructions.td
> +++ b/lib/Target/R600/SIInstructions.td
> @@ -333,7 +333,9 @@ def S_CMP_LE_U32 : SOPC_32 <0x0000000b, "s_cmp_le_u32">;
>   // SOPK Instructions
>   //===----------------------------------------------------------------------===//
>   
> +let isReMaterializable = 1 in {
>   def S_MOVK_I32 : SOPK_32 <0x00000000, "s_movk_i32", []>;
> +} // End isReMaterializable = 1
>   def S_CMOVK_I32 : SOPK_32 <0x00000002, "s_cmovk_i32", []>;
>   
>   /*
> -- 1.8.5.5

LGTM
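For context on why this matters for the scheduler work: with isReMaterializable set, the register allocator can re-emit the constant at a use instead of spilling and restoring the register, e.g. (illustrative sketch, constant chosen arbitrarily):

```asm
s_movk_i32 s0, 0x40      ; original def
; ... region of high register pressure where s0 would otherwise be spilled ...
s_movk_i32 s0, 0x40      ; rematerialized at the use; no spill/restore needed
```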

>
> 0003-R600-SI-Emit-s_movk_i32-m0-1-before-every-DS-instruc.patch
>
>
>  From adf2999e856aaa4267196c1b3d30a941a6e7fe0c Mon Sep 17 00:00:00 2001
> From: Tom Stellard<thomas.stellard at amd.com>
> Date: Fri, 31 Oct 2014 13:10:22 -0400
> Subject: [PATCH 3/4] R600/SI: Emit s_movk_i32 m0, -1 before every DS
>   instruction
>
> The s_movk_i32 will write to a virtual register from the M0Reg
> class and all the ds instructions now take an extra M0Reg explicit
> argument.
>
> This change is necessary to prevent issues with the scheduler
> mixing together instructions that expect different values in the m0
> registers.
I don't really understand the problem. Does the scheduler not handle 
implicit uses correctly, or something? Are the extra initializations
added to every selected DS instruction optimized into only one at the
beginning of the kernel? This also seems like it might complicate using
other features that require manipulating M0, like the direct-to-LDS
buffer loads.
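If I'm reading the patch right, each selected DS instruction now carries the m0 value as an explicit virtual-register operand, roughly (pseudo-MIR sketch, names illustrative):

```
; each DS instruction gets an explicit operand from the M0Reg class
%m0init = S_MOVK_I32 -1                          ; init the LDS address clamp
DS_READ_B32 %vdst, %addr, 0 /*offset*/, %m0init  ; explicit use, visible to the scheduler
```

so the scheduler sees a real data dependence instead of relying on an implicit physreg use of m0.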


> ---
>   lib/Target/R600/SIISelLowering.cpp       |  2 +-
>   lib/Target/R600/SIInstrFormats.td        |  1 +
>   lib/Target/R600/SIInstrInfo.td           | 17 +++++++++--------
>   lib/Target/R600/SIInstructions.td        | 15 ++++++++-------
>   lib/Target/R600/SILoadStoreOptimizer.cpp | 11 ++++++++++-
>   lib/Target/R600/SILowerControlFlow.cpp   | 23 -----------------------
>   test/CodeGen/R600/ds_read2.ll            | 30 +++++++++++++++---------------
>   test/CodeGen/R600/shl_add_ptr.ll         |  1 +
>   8 files changed, 45 insertions(+), 55 deletions(-)
>
> diff --git a/lib/Target/R600/SIISelLowering.cpp b/lib/Target/R600/SIISelLowering.cpp
> index 207539f..10d823b 100644
> --- a/lib/Target/R600/SIISelLowering.cpp
> +++ b/lib/Target/R600/SIISelLowering.cpp
> @@ -1933,6 +1933,7 @@ void SITargetLowering::AdjustInstrPostInstrSelection(MachineInstr *MI,
>     const SIInstrInfo *TII = static_cast<const SIInstrInfo *>(
>         getTargetMachine().getSubtargetImpl()->getInstrInfo());
>   
> +  MachineRegisterInfo &MRI = MI->getParent()->getParent()->getRegInfo();
>     TII->legalizeOperands(MI);
>   
>     if (TII->isMIMG(MI->getOpcode())) {
> @@ -1952,7 +1953,6 @@ void SITargetLowering::AdjustInstrPostInstrSelection(MachineInstr *MI,
>   
>       unsigned NewOpcode = TII->getMaskedMIMGOp(MI->getOpcode(), BitsSet);
>       MI->setDesc(TII->get(NewOpcode));
> -    MachineRegisterInfo &MRI = MI->getParent()->getParent()->getRegInfo();
>       MRI.setRegClass(VReg, RC);
>       return;
>     }
> diff --git a/lib/Target/R600/SIInstrFormats.td b/lib/Target/R600/SIInstrFormats.td
> index 18c1345..d524baf 100644
> --- a/lib/Target/R600/SIInstrFormats.td
> +++ b/lib/Target/R600/SIInstrFormats.td
> @@ -545,6 +545,7 @@ class DS <bits<8> op, dag outs, dag ins, string asm, list<dag> pattern> :
>   
>     let LGKM_CNT = 1;
>     let UseNamedOperandTable = 1;
> +  let DisableEncoding = "$m0";
>   }
>   
>   class MUBUF <bits<7> op, dag outs, dag ins, string asm, list<dag> pattern> :
> diff --git a/lib/Target/R600/SIInstrInfo.td b/lib/Target/R600/SIInstrInfo.td
> index ec4ac9a..35dd609 100644
> --- a/lib/Target/R600/SIInstrInfo.td
> +++ b/lib/Target/R600/SIInstrInfo.td
> @@ -935,7 +935,7 @@ class DS_1A <bits<8> op, dag outs, dag ins, string asm, list<dag> pat> :
>   class DS_Load_Helper <bits<8> op, string asm, RegisterClass regClass> : DS_1A <
>     op,
>     (outs regClass:$vdst),
> -  (ins i1imm:$gds, VReg_32:$addr, ds_offset:$offset),
> +  (ins i1imm:$gds, VReg_32:$addr, ds_offset:$offset, M0Reg:$m0),
>     asm#" $vdst, $addr"#"$offset"#" [M0]",
>     []> {
>     let data0 = 0;
> @@ -947,7 +947,8 @@ class DS_Load_Helper <bits<8> op, string asm, RegisterClass regClass> : DS_1A <
>   class DS_Load2_Helper <bits<8> op, string asm, RegisterClass regClass> : DS <
>     op,
>     (outs regClass:$vdst),
> -  (ins i1imm:$gds, VReg_32:$addr, ds_offset0:$offset0, ds_offset1:$offset1),
> +  (ins i1imm:$gds, VReg_32:$addr, ds_offset0:$offset0, ds_offset1:$offset1,
> +        M0Reg:$m0),
>     asm#" $vdst, $addr"#"$offset0"#"$offset1 [M0]",
>     []> {
>     let data0 = 0;
> @@ -959,7 +960,7 @@ class DS_Load2_Helper <bits<8> op, string asm, RegisterClass regClass> : DS <
>   class DS_Store_Helper <bits<8> op, string asm, RegisterClass regClass> : DS_1A <
>     op,
>     (outs),
> -  (ins i1imm:$gds, VReg_32:$addr, regClass:$data0, ds_offset:$offset),
> +  (ins i1imm:$gds, VReg_32:$addr, regClass:$data0, ds_offset:$offset, M0Reg:$m0),
>     asm#" $addr, $data0"#"$offset"#" [M0]",
>     []> {
>     let data1 = 0;
> @@ -972,7 +973,7 @@ class DS_Store2_Helper <bits<8> op, string asm, RegisterClass regClass> : DS <
>     op,
>     (outs),
>     (ins i1imm:$gds, VReg_32:$addr, regClass:$data0, regClass:$data1,
> -       ds_offset0:$offset0, ds_offset1:$offset1),
> +       ds_offset0:$offset0, ds_offset1:$offset1, M0Reg:$m0),
>     asm#" $addr, $data0, $data1"#"$offset0"#"$offset1 [M0]",
>     []> {
>     let mayStore = 1;
> @@ -984,7 +985,7 @@ class DS_Store2_Helper <bits<8> op, string asm, RegisterClass regClass> : DS <
>   class DS_1A1D_RET <bits<8> op, string asm, RegisterClass rc, string noRetOp = ""> : DS_1A <
>     op,
>     (outs rc:$vdst),
> -  (ins i1imm:$gds, VReg_32:$addr, rc:$data0, ds_offset:$offset),
> +  (ins i1imm:$gds, VReg_32:$addr, rc:$data0, ds_offset:$offset, M0Reg:$m0),
>     asm#" $vdst, $addr, $data0"#"$offset"#" [M0]", []>,
>     AtomicNoRet<noRetOp, 1> {
>   
> @@ -999,7 +1000,7 @@ class DS_1A1D_RET <bits<8> op, string asm, RegisterClass rc, string noRetOp = ""
>   class DS_1A2D_RET <bits<8> op, string asm, RegisterClass rc, string noRetOp = ""> : DS_1A <
>     op,
>     (outs rc:$vdst),
> -  (ins i1imm:$gds, VReg_32:$addr, rc:$data0, rc:$data1, ds_offset:$offset),
> +  (ins i1imm:$gds, VReg_32:$addr, rc:$data0, rc:$data1, ds_offset:$offset, M0Reg:$m0),
>     asm#" $vdst, $addr, $data0, $data1"#"$offset"#" [M0]",
>     []>,
>     AtomicNoRet<noRetOp, 1> {
> @@ -1013,7 +1014,7 @@ class DS_1A2D_RET <bits<8> op, string asm, RegisterClass rc, string noRetOp = ""
>   class DS_1A2D_NORET <bits<8> op, string asm, RegisterClass rc, string noRetOp = asm> : DS_1A <
>     op,
>     (outs),
> -  (ins i1imm:$gds, VReg_32:$addr, rc:$data0, rc:$data1, ds_offset:$offset),
> +  (ins i1imm:$gds, VReg_32:$addr, rc:$data0, rc:$data1, ds_offset:$offset, M0Reg:$m0),
>     asm#" $addr, $data0, $data1"#"$offset"#" [M0]",
>     []>,
>     AtomicNoRet<noRetOp, 0> {
> @@ -1025,7 +1026,7 @@ class DS_1A2D_NORET <bits<8> op, string asm, RegisterClass rc, string noRetOp =
>   class DS_1A1D_NORET <bits<8> op, string asm, RegisterClass rc, string noRetOp = asm> : DS_1A <
>     op,
>     (outs),
> -  (ins i1imm:$gds, VReg_32:$addr, rc:$data0, ds_offset:$offset),
> +  (ins i1imm:$gds, VReg_32:$addr, rc:$data0, ds_offset:$offset, M0Reg:$m0),
>     asm#" $addr, $data0"#"$offset"#" [M0]",
>     []>,
>     AtomicNoRet<noRetOp, 0> {
> diff --git a/lib/Target/R600/SIInstructions.td b/lib/Target/R600/SIInstructions.td
> index 22d30c2..66468af 100644
> --- a/lib/Target/R600/SIInstructions.td
> +++ b/lib/Target/R600/SIInstructions.td
> @@ -2599,7 +2599,7 @@ def : ROTRPattern <V_ALIGNBIT_B32>;
>   
>   class DSReadPat <DS inst, ValueType vt, PatFrag frag> : Pat <
>     (vt (frag (DS1Addr1Offset i32:$ptr, i32:$offset))),
> -  (inst (i1 0), $ptr, (as_i16imm $offset))
> +  (inst (i1 0), $ptr, (as_i16imm $offset), (S_MOVK_I32 -1))
>   >;
>   
>   def : DSReadPat <DS_READ_I8,  i32, sextloadi8_local>;
> @@ -2617,12 +2617,12 @@ def : DSReadPat <DS_READ_B64, v2i32, local_load_aligned8bytes>;
>   def : Pat <
>     (v2i32 (local_load (DS64Bit4ByteAligned i32:$ptr, i8:$offset0,
>                                                       i8:$offset1))),
> -  (DS_READ2_B32 (i1 0), $ptr, $offset0, $offset1)
> +  (DS_READ2_B32 (i1 0), $ptr, $offset0, $offset1, (S_MOVK_I32 -1))
>   >;
This should use a regular S_MOV_B32. Since -1 is a special inline 
immediate, you aren't getting the size benefit from s_movk_i32 here, and 
we should prefer using the normal mov instructions.
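To spell out the encoding point: both forms are a single 32-bit instruction here, because -1 is one of the inline immediates, so s_movk_i32 buys nothing (sizes below are my understanding of the SI encodings):

```asm
s_movk_i32 m0, 0xffff   ; SOPK: 16-bit immediate sign-extended, 4 bytes total
s_mov_b32  m0, -1       ; SOP1: -1 is an inline constant, also 4 bytes, no literal dword
```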

>   
>   class DSWritePat <DS inst, ValueType vt, PatFrag frag> : Pat <
>     (frag vt:$value, (DS1Addr1Offset i32:$ptr, i32:$offset)),
> -  (inst (i1 0), $ptr, $value, (as_i16imm $offset))
> +  (inst (i1 0), $ptr, $value, (as_i16imm $offset), (S_MOVK_I32 -1))
>   >;
>   
>   def : DSWritePat <DS_WRITE_B8, i32, truncstorei8_local>;
> @@ -2638,12 +2638,13 @@ def : Pat <
>     (local_store v2i32:$value, (DS64Bit4ByteAligned i32:$ptr, i8:$offset0,
>                                                               i8:$offset1)),
>     (DS_WRITE2_B32 (i1 0), $ptr, (EXTRACT_SUBREG $value, sub0),
> -                        (EXTRACT_SUBREG $value, sub1), $offset0, $offset1)
> +                        (EXTRACT_SUBREG $value, sub1), $offset0, $offset1,
> +                        (S_MOVK_I32 -1))
>   >;
>   
>   class DSAtomicRetPat<DS inst, ValueType vt, PatFrag frag> : Pat <
>     (frag (DS1Addr1Offset i32:$ptr, i32:$offset), vt:$value),
> -  (inst (i1 0), $ptr, $value, (as_i16imm $offset))
> +  (inst (i1 0), $ptr, $value, (as_i16imm $offset), (S_MOVK_I32 -1))
>   >;
>   
>   // Special case of DSAtomicRetPat for add / sub 1 -> inc / dec
> @@ -2659,13 +2660,13 @@ class DSAtomicRetPat<DS inst, ValueType vt, PatFrag frag> : Pat <
>   class DSAtomicIncRetPat<DS inst, ValueType vt,
>                           Instruction LoadImm, PatFrag frag> : Pat <
>     (frag (DS1Addr1Offset i32:$ptr, i32:$offset), (vt 1)),
> -  (inst (i1 0), $ptr, (LoadImm (vt -1)), (as_i16imm $offset))
> +  (inst (i1 0), $ptr, (LoadImm (vt -1)), (as_i16imm $offset), (S_MOVK_I32 -1))
>   >;
>   
>   
>   class DSAtomicCmpXChg <DS inst, ValueType vt, PatFrag frag> : Pat <
>     (frag (DS1Addr1Offset i32:$ptr, i32:$offset), vt:$cmp, vt:$swap),
> -  (inst (i1 0), $ptr, $cmp, $swap, (as_i16imm $offset))
> +  (inst (i1 0), $ptr, $cmp, $swap, (as_i16imm $offset), (S_MOVK_I32 -1))
>   >;
>   
>   
> diff --git a/lib/Target/R600/SILoadStoreOptimizer.cpp b/lib/Target/R600/SILoadStoreOptimizer.cpp
> index 4140196..2536cca 100644
> --- a/lib/Target/R600/SILoadStoreOptimizer.cpp
> +++ b/lib/Target/R600/SILoadStoreOptimizer.cpp
> @@ -222,6 +222,7 @@ MachineBasicBlock::iterator  SILoadStoreOptimizer::mergeRead2Pair(
>     // Be careful, since the addresses could be subregisters themselves in weird
>     // cases, like vectors of pointers.
>     const MachineOperand *AddrReg = TII->getNamedOperand(*I, AMDGPU::OpName::addr);
> +  const MachineOperand *M0Reg = TII->getNamedOperand(*I, AMDGPU::OpName::m0);
>   
>     unsigned DestReg0 = TII->getNamedOperand(*I, AMDGPU::OpName::vdst)->getReg();
>     unsigned DestReg1
> @@ -262,6 +263,7 @@ MachineBasicBlock::iterator  SILoadStoreOptimizer::mergeRead2Pair(
>       .addOperand(*AddrReg) // addr
>       .addImm(NewOffset0) // offset0
>       .addImm(NewOffset1) // offset1
> +    .addOperand(*M0Reg) // M0
>       .addMemOperand(*I->memoperands_begin())
>       .addMemOperand(*Paired->memoperands_begin());
>   
> @@ -280,6 +282,10 @@ MachineBasicBlock::iterator  SILoadStoreOptimizer::mergeRead2Pair(
>     LiveInterval &AddrRegLI = LIS->getInterval(AddrReg->getReg());
>     LIS->shrinkToUses(&AddrRegLI);
>   
> +  LiveInterval &M0RegLI = LIS->getInterval(M0Reg->getReg());
> +  LIS->shrinkToUses(&M0RegLI);
> +
> +
Extra blank line

>     LIS->getInterval(DestReg); // Create new LI
>   
>     DEBUG(dbgs() << "Inserted read2: " << *Read2 << '\n');
> @@ -295,6 +301,7 @@ MachineBasicBlock::iterator SILoadStoreOptimizer::mergeWrite2Pair(
>     // Be sure to use .addOperand(), and not .addReg() with these. We want to be
>     // sure we preserve the subregister index and any register flags set on them.
>     const MachineOperand *Addr = TII->getNamedOperand(*I, AMDGPU::OpName::addr);
> +  const MachineOperand *M0Reg = TII->getNamedOperand(*I, AMDGPU::OpName::m0);
>     const MachineOperand *Data0 = TII->getNamedOperand(*I, AMDGPU::OpName::data0);
>     const MachineOperand *Data1
>       = TII->getNamedOperand(*Paired, AMDGPU::OpName::data0);
> @@ -333,11 +340,13 @@ MachineBasicBlock::iterator SILoadStoreOptimizer::mergeWrite2Pair(
>       .addOperand(*Data1) // data1
>       .addImm(NewOffset0) // offset0
>       .addImm(NewOffset1) // offset1
> +    .addOperand(*M0Reg)  // m0
>       .addMemOperand(*I->memoperands_begin())
>       .addMemOperand(*Paired->memoperands_begin());
>   
>     // XXX - How do we express subregisters here?
> -  unsigned OrigRegs[] = { Data0->getReg(), Data1->getReg(), Addr->getReg() };
> +  unsigned OrigRegs[] = { Data0->getReg(), Data1->getReg(), Addr->getReg(),
> +                          M0Reg->getReg()};
>   
>     LIS->RemoveMachineInstrFromMaps(I);
>     LIS->RemoveMachineInstrFromMaps(Paired);
> diff --git a/lib/Target/R600/SILowerControlFlow.cpp b/lib/Target/R600/SILowerControlFlow.cpp
> index 59270ee..ff44b28 100644
> --- a/lib/Target/R600/SILowerControlFlow.cpp
> +++ b/lib/Target/R600/SILowerControlFlow.cpp
> @@ -88,7 +88,6 @@ private:
>     void Kill(MachineInstr &MI);
>     void Branch(MachineInstr &MI);
>   
> -  void InitM0ForLDS(MachineBasicBlock::iterator MI);
>     void LoadM0(MachineInstr &MI, MachineInstr *MovRel);
>     void IndirectSrc(MachineInstr &MI);
>     void IndirectDst(MachineInstr &MI);
> @@ -325,14 +324,6 @@ void SILowerControlFlowPass::Kill(MachineInstr &MI) {
>     MI.eraseFromParent();
>   }
>   
> -/// The m0 register stores the maximum allowable address for LDS reads and
> -/// writes.  Its value must be at least the size in bytes of LDS allocated by
> -/// the shader.  For simplicity, we set it to the maximum possible value.
> -void SILowerControlFlowPass::InitM0ForLDS(MachineBasicBlock::iterator MI) {
> -    BuildMI(*MI->getParent(), MI, MI->getDebugLoc(),  TII->get(AMDGPU::S_MOV_B32),
> -            AMDGPU::M0).addImm(0xffffffff);
> -}
> -
>   void SILowerControlFlowPass::LoadM0(MachineInstr &MI, MachineInstr *MovRel) {
>   
>     MachineBasicBlock &MBB = *MI.getParent();
> @@ -391,12 +382,6 @@ void SILowerControlFlowPass::LoadM0(MachineInstr &MI, MachineInstr *MovRel) {
>               .addReg(Save);
>   
>     }
> -  // FIXME: Are there any values other than the LDS address clamp that need to
> -  // be stored in the m0 register and may be live for more than a few
> -  // instructions?  If so, we should save the m0 register at the beginning
> -  // of this function and restore it here.
> -  // FIXME: Add support for LDS direct loads.
> -  InitM0ForLDS(&MI);
>     MI.eraseFromParent();
>   }
>   
> @@ -465,7 +450,6 @@ bool SILowerControlFlowPass::runOnMachineFunction(MachineFunction &MF) {
>   
>         MachineInstr &MI = *I;
>         if (TII->isDS(MI.getOpcode())) {
> -        NeedM0 = true;
>           NeedWQM = true;
>         }
>   
> @@ -544,13 +528,6 @@ bool SILowerControlFlowPass::runOnMachineFunction(MachineFunction &MF) {
>       }
>     }
>   
> -  if (NeedM0) {
> -    MachineBasicBlock &MBB = MF.front();
> -    // Initialize M0 to a value that won't cause LDS access to be discarded
> -    // due to offset clamping
> -    InitM0ForLDS(MBB.getFirstNonPHI());
> -  }
> -
>     if (NeedWQM && MFI->getShaderType() == ShaderType::PIXEL) {
>       MachineBasicBlock &MBB = MF.front();
>       BuildMI(MBB, MBB.getFirstNonPHI(), DebugLoc(), TII->get(AMDGPU::S_WQM_B64),
> diff --git a/test/CodeGen/R600/ds_read2.ll b/test/CodeGen/R600/ds_read2.ll
> index b431b5f..75d1999 100644
> --- a/test/CodeGen/R600/ds_read2.ll
> +++ b/test/CodeGen/R600/ds_read2.ll
> @@ -26,9 +26,9 @@ define void @simple_read2_f32(float addrspace(1)* %out) #0 {
>   }
>   
>   ; SI-LABEL: @simple_read2_f32_max_offset
> -; SI: ds_read2_b32 v{{\[}}[[LO_VREG:[0-9]+]]:[[HI_VREG:[0-9]+]]{{\]}}, v{{[0-9]+}} offset0:0 offset1:255
> +; SI: ds_read2_b32 v{{\[}}[[LO_VREG:[0-9]+]]:[[HI_VREG:[0-9]+]]{{\]}}, v{{[0-9]+}} offset0:255 offset1:0
Why do the offsets swap? There are a few asserts that the second 
offset is always greater than the first, and some uses assume that's 
true. I don't think it makes a functional difference, but I would 
prefer to keep only ever emitting offset1 > offset0, to simplify the 
other checks which inspect the offsets, if possible.
>   ; SI: s_waitcnt lgkmcnt(0)
> -; SI: v_add_f32_e32 [[RESULT:v[0-9]+]], v[[HI_VREG]], v[[LO_VREG]]
> +; SI: v_add_f32_e32 [[RESULT:v[0-9]+]], v[[LO_VREG]], v[[HI_VREG]]
>   ; SI: buffer_store_dword [[RESULT]]
>   ; SI: s_endpgm
>   define void @simple_read2_f32_max_offset(float addrspace(1)* %out) #0 {
> @@ -46,8 +46,8 @@ define void @simple_read2_f32_max_offset(float addrspace(1)* %out) #0 {
>   
>   ; SI-LABEL: @simple_read2_f32_too_far
>   ; SI-NOT ds_read2_b32
> -; SI: ds_read_b32 v{{[0-9]+}}, v{{[0-9]+}}
> -; SI: ds_read_b32 v{{[0-9]+}}, v{{[0-9]+}} offset:1028
> +; SI-DAG: ds_read_b32 v{{[0-9]+}}, v{{[0-9]+}}
> +; SI-DAG: ds_read_b32 v{{[0-9]+}}, v{{[0-9]+}} offset:1028
>   ; SI: s_endpgm
>   define void @simple_read2_f32_too_far(float addrspace(1)* %out) #0 {
>     %x.i = tail call i32 @llvm.r600.read.tidig.x() #1
> @@ -63,8 +63,8 @@ define void @simple_read2_f32_too_far(float addrspace(1)* %out) #0 {
>   }
>   
>   ; SI-LABEL: @simple_read2_f32_x2
> -; SI: ds_read2_b32 v{{\[[0-9]+:[0-9]+\]}}, [[BASEADDR:v[0-9]+]] offset0:0 offset1:8
> -; SI: ds_read2_b32 v{{\[[0-9]+:[0-9]+\]}}, [[BASEADDR]] offset0:11 offset1:27
> +; SI: ds_read2_b32 v{{\[[0-9]+:[0-9]+\]}}, [[BASEADDR:v[0-9]+]] offset0:27 offset1:0
> +; SI: ds_read2_b32 v{{\[[0-9]+:[0-9]+\]}}, [[BASEADDR]] offset0:8 offset1:11
>   ; SI: s_endpgm
>   define void @simple_read2_f32_x2(float addrspace(1)* %out) #0 {
>     %tid.x = tail call i32 @llvm.r600.read.tidig.x() #1
> @@ -96,7 +96,7 @@ define void @simple_read2_f32_x2(float addrspace(1)* %out) #0 {
>   ; SI-LABEL: @simple_read2_f32_x2_barrier
>   ; SI: ds_read2_b32 v{{\[[0-9]+:[0-9]+\]}}, [[BASEADDR:v[0-9]+]] offset0:0 offset1:8
>   ; SI: s_barrier
> -; SI: ds_read2_b32 v{{\[[0-9]+:[0-9]+\]}}, [[BASEADDR]] offset0:11 offset1:27
> +; SI: ds_read2_b32 v{{\[[0-9]+:[0-9]+\]}}, [[BASEADDR]] offset0:27 offset1:11
>   ; SI: s_endpgm
>   define void @simple_read2_f32_x2_barrier(float addrspace(1)* %out) #0 {
>     %tid.x = tail call i32 @llvm.r600.read.tidig.x() #1
> @@ -130,8 +130,8 @@ define void @simple_read2_f32_x2_barrier(float addrspace(1)* %out) #0 {
>   ; element results in only folding the inner pair.
>   
>   ; SI-LABEL: @simple_read2_f32_x2_nonzero_base
> -; SI: ds_read2_b32 v{{\[[0-9]+:[0-9]+\]}}, [[BASEADDR:v[0-9]+]] offset0:2 offset1:8
> -; SI: ds_read2_b32 v{{\[[0-9]+:[0-9]+\]}}, [[BASEADDR]] offset0:11 offset1:27
> +; SI: ds_read2_b32 v{{\[[0-9]+:[0-9]+\]}}, [[BASEADDR:v[0-9]+]] offset0:27 offset1:2
> +; SI: ds_read2_b32 v{{\[[0-9]+:[0-9]+\]}}, [[BASEADDR]] offset0:8 offset1:11
>   ; SI: s_endpgm
>   define void @simple_read2_f32_x2_nonzero_base(float addrspace(1)* %out) #0 {
>     %tid.x = tail call i32 @llvm.r600.read.tidig.x() #1
> @@ -313,8 +313,8 @@ define void @misaligned_2_simple_read2_f32(float addrspace(1)* %out, float addrs
>   
>   ; SI-LABEL: @simple_read2_f64
>   ; SI: v_lshlrev_b32_e32 [[VPTR:v[0-9]+]], 3, {{v[0-9]+}}
> -; SI: ds_read2_b64 v{{\[}}[[LO_VREG:[0-9]+]]:[[HI_VREG:[0-9]+]]{{\]}}, [[VPTR]] offset0:0 offset1:8
> -; SI: v_add_f64 [[RESULT:v\[[0-9]+:[0-9]+\]]], v{{\[}}[[LO_VREG]]:{{[0-9]+\]}}, v{{\[[0-9]+}}:[[HI_VREG]]{{\]}}
> +; SI: ds_read2_b64 v{{\[}}[[LO_VREG:[0-9]+]]:[[HI_VREG:[0-9]+]]{{\]}}, [[VPTR]] offset0:8 offset1:0
> +; SI: v_add_f64 [[RESULT:v\[[0-9]+:[0-9]+\]]], v{{\[[0-9]+}}:[[HI_VREG]]{{\]}}, v{{\[}}[[LO_VREG]]:{{[0-9]+\]}}
>   ; SI: buffer_store_dwordx2 [[RESULT]]
>   ; SI: s_endpgm
>   define void @simple_read2_f64(double addrspace(1)* %out) #0 {
> @@ -331,7 +331,7 @@ define void @simple_read2_f64(double addrspace(1)* %out) #0 {
>   }
>   
>   ; SI-LABEL: @simple_read2_f64_max_offset
> -; SI: ds_read2_b64 {{v\[[0-9]+:[0-9]+\]}}, v{{[0-9]+}} offset0:0 offset1:255
> +; SI: ds_read2_b64 {{v\[[0-9]+:[0-9]+\]}}, v{{[0-9]+}} offset0:255 offset1:0
>   ; SI: s_endpgm
>   define void @simple_read2_f64_max_offset(double addrspace(1)* %out) #0 {
>     %x.i = tail call i32 @llvm.r600.read.tidig.x() #1
> @@ -348,9 +348,9 @@ define void @simple_read2_f64_max_offset(double addrspace(1)* %out) #0 {
>   
>   ; SI-LABEL: @simple_read2_f64_too_far
>   ; SI-NOT ds_read2_b64
> -; SI: ds_read_b64 {{v\[[0-9]+:[0-9]+\]}}, v{{[0-9]+}}
> -; SI: ds_read_b64 {{v\[[0-9]+:[0-9]+\]}}, v{{[0-9]+}} offset:2056
> -; SI: s_endpgm
> +; SI-DAG: ds_read_b64 {{v\[[0-9]+:[0-9]+\]}}, v{{[0-9]+}}
> +; SI-DAG: ds_read_b64 {{v\[[0-9]+:[0-9]+\]}}, v{{[0-9]+}} offset:2056
> +; SI: S_ENDPGM
>   define void @simple_read2_f64_too_far(double addrspace(1)* %out) #0 {
>     %x.i = tail call i32 @llvm.r600.read.tidig.x() #1
>     %arrayidx0 = getelementptr inbounds [512 x double] addrspace(3)* @lds.f64, i32 0, i32 %x.i
> diff --git a/test/CodeGen/R600/shl_add_ptr.ll b/test/CodeGen/R600/shl_add_ptr.ll
> index e6dfc58..0f2dac3 100644
> --- a/test/CodeGen/R600/shl_add_ptr.ll
> +++ b/test/CodeGen/R600/shl_add_ptr.ll
> @@ -69,6 +69,7 @@ define void @load_shl_base_lds_max_offset(i8 addrspace(1)* %out, i8 addrspace(3)
>   
>   ; SI-LABEL: {{^}}load_shl_base_lds_2:
>   ; SI: v_lshlrev_b32_e32 [[PTR:v[0-9]+]], 2, {{v[0-9]+}}
> +; SI-NEXT: s_movk_i32 m0, -1
>   ; SI-NEXT: ds_read2st64_b32 {{v\[[0-9]+:[0-9]+\]}}, [[PTR]] offset0:1 offset1:9 [M0]
>   ; SI: s_endpgm
>   define void @load_shl_base_lds_2(float addrspace(1)* %out) #0 {
> -- 1.8.5.5
>
> 0004-R600-SI-Add-an-s_mov_b32-to-patterns-which-use-the-M.patch
>
>
>  From 45660e464b02bae827d170d8b94862ac23058105 Mon Sep 17 00:00:00 2001
> From: Tom Stellard<thomas.stellard at amd.com>
> Date: Fri, 31 Oct 2014 13:15:24 -0400
> Subject: [PATCH 4/4] R600/SI: Add an s_mov_b32 to patterns which use the
>   M0RegClass
>
> Without an s_mov_b32 instruction in the pattern, the InstrEmitter will
> emit a COPY instruction, which CSE ignores.  We were previously trying
> to CSE these copies in SIInstrInfo::copyPhysReg(), but that was
> happening after register allocation, so if the scheduler grouped a
> whole bunch of s_mov_b32 m0, ... instructions together, we would
> end up spilling SGPRs.
LGTM. Does this fix the problem of the extra move from an SGPR for 
every legalized operand?
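My reading of why this helps, sketched as pseudo-MIR (instruction names from the patch, the rest illustrative):

```
; before: the InstrEmitter produces COPYs, which MachineCSE ignores
%1 = COPY $params
%2 = COPY $params        ; redundant copy survives until after regalloc
; after: the pattern emits a real move, so CSE can merge the duplicates
%1 = S_MOV_B32 $params   ; single def feeding all the interp instructions
```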


> ---
>   lib/Target/R600/SIInstrInfo.cpp   | 20 --------------------
>   lib/Target/R600/SIInstructions.td | 12 ++++++++----
>   2 files changed, 8 insertions(+), 24 deletions(-)
>
> diff --git a/lib/Target/R600/SIInstrInfo.cpp b/lib/Target/R600/SIInstrInfo.cpp
> index 05b4041..c653f43 100644
> --- a/lib/Target/R600/SIInstrInfo.cpp
> +++ b/lib/Target/R600/SIInstrInfo.cpp
> @@ -325,26 +325,6 @@ SIInstrInfo::copyPhysReg(MachineBasicBlock &MBB,
>     unsigned Opcode;
>     const int16_t *SubIndices;
>   
> -  if (AMDGPU::M0 == DestReg) {
> -    // Check if M0 isn't already set to this value
> -    for (MachineBasicBlock::reverse_iterator E = MBB.rend(),
> -      I = MachineBasicBlock::reverse_iterator(MI); I != E; ++I) {
> -
> -      if (!I->definesRegister(AMDGPU::M0))
> -        continue;
> -
> -      unsigned Opc = I->getOpcode();
> -      if (Opc != TargetOpcode::COPY && Opc != AMDGPU::S_MOV_B32)
> -        break;
> -
> -      if (!I->readsRegister(SrcReg))
> -        break;
> -
> -      // The copy isn't necessary
> -      return;
> -    }
> -  }
> -
I was wondering what this was

>     if (AMDGPU::SReg_32RegClass.contains(DestReg)) {
>       assert(AMDGPU::SReg_32RegClass.contains(SrcReg));
>       BuildMI(MBB, MI, DL, get(AMDGPU::S_MOV_B32), DestReg)
> diff --git a/lib/Target/R600/SIInstructions.td b/lib/Target/R600/SIInstructions.td
> index 66468af..3f1f4f1 100644
> --- a/lib/Target/R600/SIInstructions.td
> +++ b/lib/Target/R600/SIInstructions.td
> @@ -2484,17 +2484,21 @@ def : Pat <
>   /********** Interpolation Paterns **********/
>   /********** ===================== **********/
>   
> +// The value of $params is constant throughout the entire kernel.
> +// We need to use S_MOV_B32 $params, because CSE ignores copies, so
> +// without it we end up with a lot of redundant moves.
> +
>   def : Pat <
>     (int_SI_fs_constant imm:$attr_chan, imm:$attr, i32:$params),
> -  (V_INTERP_MOV_F32 INTERP.P0, imm:$attr_chan, imm:$attr, $params)
> +  (V_INTERP_MOV_F32 INTERP.P0, imm:$attr_chan, imm:$attr, (S_MOV_B32 $params))
>   >;
>   
>   def : Pat <
> -  (int_SI_fs_interp imm:$attr_chan, imm:$attr, M0Reg:$params, v2i32:$ij),
> +  (int_SI_fs_interp imm:$attr_chan, imm:$attr, i32:$params, v2i32:$ij),
>     (V_INTERP_P2_F32 (V_INTERP_P1_F32 (EXTRACT_SUBREG v2i32:$ij, sub0),
> -                                    imm:$attr_chan, imm:$attr, i32:$params),
> +                                    imm:$attr_chan, imm:$attr, (S_MOV_B32 $params)),
>                      (EXTRACT_SUBREG $ij, sub1),
> -                   imm:$attr_chan, imm:$attr, $params)
> +                   imm:$attr_chan, imm:$attr, (S_MOV_B32 $params))
>   >;
>   
>   /********** ================== **********/
> -- 1.8.5.5
>
>
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits


