PATCHES: R600/SI: Codesize improvements

Tom Stellard tom at stellard.net
Thu Apr 23 15:14:34 PDT 2015


On Thu, Apr 23, 2015 at 01:49:36PM -0700, Matt Arsenault wrote:
> On 04/23/2015 12:58 PM, Tom Stellard wrote:
> > Hi,
> >
> > The attached patches reduce overall shader code size by preferring
> > v_mac_f32 over v_mad_f32 and also using the 32-bit encoding for
> > v_cndmask when src2 is vcc.
> >
> > -Tom
> 
> >
> > 0002-R600-SI-Fix-crash-on-physical-registers-in-SIInstrIn.patch
> >
> >
> >  From 762f0757d4f237c474026c949e66a0f36d0d4ae7 Mon Sep 17 00:00:00 2001
> > From: Tom Stellard <thomas.stellard at amd.com>
> > Date: Mon, 20 Apr 2015 18:16:23 +0000
> > Subject: [PATCH 2/5] R600/SI: Fix crash on physical registers in
> >   SIInstrInfo::isOperandLegal()
> >
> > No test case for this.  I ran into it while working on some improvements
> > to SIShrinkInstructions.cpp.
> > ---
> >   lib/Target/R600/SIInstrInfo.cpp | 5 ++++-
> >   1 file changed, 4 insertions(+), 1 deletion(-)
> >
> LGTM
> 
> >
> >
> > 0003-R600-SI-The-SIShrinkInstructions-pass-should-only-fo.patch
> >
> >
> >  From 426618d94a397e42d2c966fe192cc0642983d7f1 Mon Sep 17 00:00:00 2001
> > From: Tom Stellard <thomas.stellard at amd.com>
> > Date: Tue, 21 Apr 2015 20:31:37 +0000
> > Subject: [PATCH 3/5] R600/SI: The SIShrinkInstructions pass should only fold
> >   immediates with one use
> >
> > This is covered by existing testcases and will be exposed by a future
> > commit.
> > ---
> >   lib/Target/R600/SIShrinkInstructions.cpp | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> >
> LGTM
> 
> >
> > 0004-R600-SI-Select-mad-patterns-to-v_mac_f32.patch
> >
> >
> >  From 5ca6edd9e5a1e79608720601088442be611bb8ab Mon Sep 17 00:00:00 2001
> > From: Tom Stellard <thomas.stellard at amd.com>
> > Date: Mon, 20 Apr 2015 18:18:54 +0000
> > Subject: [PATCH 4/5] R600/SI: Select mad patterns to v_mac_f32
> >
> > The two-address instruction pass will convert these back to v_mad_f32
> > if necessary.
> >
> > shader-db stats:
> >
> > 979 shaders
> > Totals:
> > SGPRS: 34792 -> 35048 (0.74 %)
> > VGPRS: 20740 -> 20560 (-0.87 %)
> > Code Size: 747712 -> 657436 (-12.07 %) bytes
> > LDS: 11 -> 11 (0.00 %) blocks
> > Scratch: 12288 -> 18432 (50.00 %) bytes per wave
> >
> > Totals from affected shaders:
> > SGPRS: 31272 -> 31488 (0.69 %)
> > VGPRS: 18788 -> 18608 (-0.96 %)
> > Code Size: 728328 -> 638092 (-12.39 %) bytes
> > LDS: 11 -> 11 (0.00 %) blocks
> > Scratch: 12288 -> 18432 (50.00 %) bytes per wave
> >
> > Increases:
> > SGPRS: 36 (0.04 %)
> > VGPRS: 13 (0.01 %)
> > Code Size: 0 (0.00 %)
> > LDS: 0 (0.00 %)
> > Scratch: 1 (0.00 %)
> >
> > Decreases:
> > SGPRS: 12 (0.01 %)
> > VGPRS: 48 (0.05 %)
> > Code Size: 779 (0.80 %)
> > LDS: 0 (0.00 %)
> > Scratch: 0 (0.00 %)
> > ---
> >   lib/Target/R600/AMDGPUISelDAGToDAG.cpp   |  19 ++++
> >   lib/Target/R600/SIFoldOperands.cpp       |  31 +++++++
> >   lib/Target/R600/SIInstrInfo.cpp          |  56 ++++++++++-
> >   lib/Target/R600/SIInstrInfo.h            |   4 +
> >   lib/Target/R600/SIInstrInfo.td           |   9 ++
> >   lib/Target/R600/SIInstructions.td        |  14 ++-
> >   lib/Target/R600/SIShrinkInstructions.cpp |  16 +++-
> >   test/CodeGen/R600/fmuladd.ll             |  30 +++---
> >   test/CodeGen/R600/llvm.amdgpu.lrp.ll     |   2 +-
> >   test/CodeGen/R600/mad-combine.ll         |  25 +++--
> >   test/CodeGen/R600/mad-sub.ll             |   6 +-
> >   test/CodeGen/R600/madak.ll               |  12 +--
> >   test/CodeGen/R600/madmk.ll               |  10 +-
> >   test/CodeGen/R600/v_mac.ll               | 155 +++++++++++++++++++++++++++++++
> >   14 files changed, 341 insertions(+), 48 deletions(-)
> >   create mode 100644 test/CodeGen/R600/v_mac.ll
> >
> > diff --git a/lib/Target/R600/AMDGPUISelDAGToDAG.cpp b/lib/Target/R600/AMDGPUISelDAGToDAG.cpp
> > index def252a..85cdf62 100644
> > --- a/lib/Target/R600/AMDGPUISelDAGToDAG.cpp
> > +++ b/lib/Target/R600/AMDGPUISelDAGToDAG.cpp
> > @@ -109,8 +109,11 @@ private:
> >                            SDValue &Offset, SDValue &GLC) const;
> >     SDNode *SelectAddrSpaceCast(SDNode *N);
> >     bool SelectVOP3Mods(SDValue In, SDValue &Src, SDValue &SrcMods) const;
> > +  bool SelectVOP3NoMods(SDValue In, SDValue &Src, SDValue &SrcMods) const;
> >     bool SelectVOP3Mods0(SDValue In, SDValue &Src, SDValue &SrcMods,
> >                          SDValue &Clamp, SDValue &Omod) const;
> > +  bool SelectVOP3NoMods0(SDValue In, SDValue &Src, SDValue &SrcMods,
> > +                         SDValue &Clamp, SDValue &Omod) const;
> >   
> >     bool SelectVOP3Mods0Clamp(SDValue In, SDValue &Src, SDValue &SrcMods,
> >                               SDValue &Omod) const;
> > @@ -1264,6 +1267,12 @@ bool AMDGPUDAGToDAGISel::SelectVOP3Mods(SDValue In, SDValue &Src,
> >     return true;
> >   }
> >   
> > +bool AMDGPUDAGToDAGISel::SelectVOP3NoMods(SDValue In, SDValue &Src,
> > +                                         SDValue &SrcMods) const {
> > +  bool Res = SelectVOP3Mods(In, Src, SrcMods);
> > +  return Res && cast<ConstantSDNode>(SrcMods)->isNullValue();
> > +}
> > +
> >   bool AMDGPUDAGToDAGISel::SelectVOP3Mods0(SDValue In, SDValue &Src,
> >                                            SDValue &SrcMods, SDValue &Clamp,
> >                                            SDValue &Omod) const {
> > @@ -1274,6 +1283,16 @@ bool AMDGPUDAGToDAGISel::SelectVOP3Mods0(SDValue In, SDValue &Src,
> >     return SelectVOP3Mods(In, Src, SrcMods);
> >   }
> >   
> > +bool AMDGPUDAGToDAGISel::SelectVOP3NoMods0(SDValue In, SDValue &Src,
> > +                                           SDValue &SrcMods, SDValue &Clamp,
> > +                                           SDValue &Omod) const {
> > +  bool Res = SelectVOP3Mods0(In, Src, SrcMods, Clamp, Omod);
> > +
> > +  return Res && cast<ConstantSDNode>(SrcMods)->isNullValue() &&
> > +                cast<ConstantSDNode>(Clamp)->isNullValue() &&
> > +                cast<ConstantSDNode>(Omod)->isNullValue();
> > +}
> > +
> >   bool AMDGPUDAGToDAGISel::SelectVOP3Mods0Clamp(SDValue In, SDValue &Src,
> >                                                 SDValue &SrcMods,
> >                                                 SDValue &Omod) const {
> > diff --git a/lib/Target/R600/SIFoldOperands.cpp b/lib/Target/R600/SIFoldOperands.cpp
> > index 7ba5a6d..c4de645 100644
> > --- a/lib/Target/R600/SIFoldOperands.cpp
> > +++ b/lib/Target/R600/SIFoldOperands.cpp
> > @@ -126,11 +126,42 @@ static bool updateOperand(FoldCandidate &Fold,
> >     return false;
> >   }
> >   
> > +static bool isUseMIInFoldList(const std::vector<FoldCandidate> &FoldList,
> > +                              const MachineInstr *MI) {
> > +  for (auto Candidate : FoldList) {
> > +    if (Candidate.UseMI == MI)
> > +      return true;
> > +  }
> > +  return false;
> > +}
> > +
> >   static bool tryAddToFoldList(std::vector<FoldCandidate> &FoldList,
> >                                MachineInstr *MI, unsigned OpNo,
> >                                MachineOperand *OpToFold,
> >                                const SIInstrInfo *TII) {
> >     if (!TII->isOperandLegal(MI, OpNo, OpToFold)) {
> > +
> > +    // Special case for v_mac_f32_e64 if we are trying to fold into src2
> > +    unsigned Opc = MI->getOpcode();
> > +    if (Opc == AMDGPU::V_MAC_F32_e64 &&
> > +        (int)OpNo == AMDGPU::getNamedOperandIdx(Opc, AMDGPU::OpName::src2)) {
> > +      // Check if changing this to a v_mad_f32 instruction will allow us to
> > +      // fold the operand.
> > +      MI->setDesc(TII->get(AMDGPU::V_MAD_F32));
> > +      bool FoldAsMAD = tryAddToFoldList(FoldList, MI, OpNo, OpToFold, TII);
> > +      if (FoldAsMAD) {
> > +        MI->untieRegOperand(OpNo);
> > +        return true;
> > +      }
> > +      MI->setDesc(TII->get(Opc));
> > +    }
> > +
> > +    // If we are already folding into another operand of MI, then
> > +    // we can't commute the instruction, otherwise we risk making the
> > +    // other fold illegal.
> > +    if (isUseMIInFoldList(FoldList, MI))
> > +      return false;
> > +
> >       // Operand is not legal, so try to commute the instruction to
> >       // see if this makes it possible to fold.
> >       unsigned CommuteIdx0;
> > diff --git a/lib/Target/R600/SIInstrInfo.cpp b/lib/Target/R600/SIInstrInfo.cpp
> > index 931e984..223f0bf 100644
> > --- a/lib/Target/R600/SIInstrInfo.cpp
> > +++ b/lib/Target/R600/SIInstrInfo.cpp
> > @@ -906,7 +906,7 @@ bool SIInstrInfo::FoldImmediate(MachineInstr *UseMI, MachineInstr *DefMI,
> >       return false;
> >   
> >     unsigned Opc = UseMI->getOpcode();
> > -  if (Opc == AMDGPU::V_MAD_F32) {
> > +  if (Opc == AMDGPU::V_MAD_F32 || Opc == AMDGPU::V_MAC_F32_e64) {
> >       // Don't fold if we are using source modifiers. The new VOP2 instructions
> >       // don't have them.
> >       if (hasModifiersSet(*UseMI, AMDGPU::OpName::src0_modifiers) ||
> > @@ -945,9 +945,9 @@ bool SIInstrInfo::FoldImmediate(MachineInstr *UseMI, MachineInstr *DefMI,
> >         // instead of having to modify in place.
> >   
> >         // Remove these first since they are at the end.
> > -      UseMI->RemoveOperand(AMDGPU::getNamedOperandIdx(AMDGPU::V_MAD_F32,
> > +      UseMI->RemoveOperand(AMDGPU::getNamedOperandIdx(Opc,
> >                                                         AMDGPU::OpName::omod));
> > -      UseMI->RemoveOperand(AMDGPU::getNamedOperandIdx(AMDGPU::V_MAD_F32,
> > +      UseMI->RemoveOperand(AMDGPU::getNamedOperandIdx(Opc,
> >                                                         AMDGPU::OpName::clamp));
> >   
> >         unsigned Src1Reg = Src1->getReg();
> > @@ -959,6 +959,14 @@ bool SIInstrInfo::FoldImmediate(MachineInstr *UseMI, MachineInstr *DefMI,
> >         Src1->setReg(Src2Reg);
> >         Src1->setSubReg(Src2SubReg);
> >   
> > +      if (Opc == AMDGPU::V_MAC_F32_e64) {
> > +        UseMI->untieRegOperand(
> > +          AMDGPU::getNamedOperandIdx(Opc, AMDGPU::OpName::src2));
> > +      }
> > +
> > +      UseMI->RemoveOperand(AMDGPU::getNamedOperandIdx(Opc,
> > +                                                      AMDGPU::OpName::src2));
> > +      // ChangingToImmediate adds Src2 back to the instruction.
> >         Src2->ChangeToImmediate(Imm);
> >   
> >         removeModOperands(*UseMI);
> > @@ -989,11 +997,17 @@ bool SIInstrInfo::FoldImmediate(MachineInstr *UseMI, MachineInstr *DefMI,
> >         // instead of having to modify in place.
> >   
> >         // Remove these first since they are at the end.
> > -      UseMI->RemoveOperand(AMDGPU::getNamedOperandIdx(AMDGPU::V_MAD_F32,
> > +      UseMI->RemoveOperand(AMDGPU::getNamedOperandIdx(Opc,
> >                                                         AMDGPU::OpName::omod));
> > -      UseMI->RemoveOperand(AMDGPU::getNamedOperandIdx(AMDGPU::V_MAD_F32,
> > +      UseMI->RemoveOperand(AMDGPU::getNamedOperandIdx(Opc,
> >                                                         AMDGPU::OpName::clamp));
> >   
> > +      if (Opc == AMDGPU::V_MAC_F32_e64) {
> > +        UseMI->untieRegOperand(
> > +          AMDGPU::getNamedOperandIdx(Opc, AMDGPU::OpName::src2));
> > +      }
> > +
> > +      // ChangingToImmediate adds Src2 back to the instruction.
> >         Src2->ChangeToImmediate(Imm);
> >   
> >         // These come before src2.
> > @@ -1105,6 +1119,38 @@ bool SIInstrInfo::areMemAccessesTriviallyDisjoint(MachineInstr *MIa,
> >     return false;
> >   }
> >   
> > +MachineInstr *SIInstrInfo::convertToThreeAddress(MachineFunction::iterator &MBB,
> > +                                                MachineBasicBlock::iterator &MI,
> > +                                                LiveVariables *LV) const {
> > +
> > +  switch (MI->getOpcode()) {
> > +    default: return nullptr;
> > +    case AMDGPU::V_MAC_F32_e64: break;
> > +    case AMDGPU::V_MAC_F32_e32: {
> > +      const MachineOperand *Src0 = getNamedOperand(*MI, AMDGPU::OpName::src0);
> > +      if (Src0->isImm() && !isInlineConstant(*Src0, 4))
> > +        return nullptr;
> > +      break;
> > +    }
> > +  }
> > +
> > +  const MachineOperand *Dst = getNamedOperand(*MI, AMDGPU::OpName::dst);
> > +  const MachineOperand *Src0 = getNamedOperand(*MI, AMDGPU::OpName::src0);
> > +  const MachineOperand *Src1 = getNamedOperand(*MI, AMDGPU::OpName::src1);
> > +  const MachineOperand *Src2 = getNamedOperand(*MI, AMDGPU::OpName::src2);
> > +
> > +  return BuildMI(*MBB, MI, MI->getDebugLoc(), get(AMDGPU::V_MAD_F32))
> > +                 .addOperand(*Dst)
> > +                 .addImm(0) // Src0 mods
> > +                 .addOperand(*Src0)
> > +                 .addImm(0) // Src1 mods
> > +                 .addOperand(*Src1)
> > +                 .addImm(0) // Src2 mods
> > +                 .addOperand(*Src2)
> > +                 .addImm(0)  // clamp
> > +                 .addImm(0); // omod
> > +}
> > +
> >   bool SIInstrInfo::isInlineConstant(const APInt &Imm) const {
> >     int64_t SVal = Imm.getSExtValue();
> >     if (SVal >= -16 && SVal <= 64)
> > diff --git a/lib/Target/R600/SIInstrInfo.h b/lib/Target/R600/SIInstrInfo.h
> > index a9aa99f..45a1dec 100644
> > --- a/lib/Target/R600/SIInstrInfo.h
> > +++ b/lib/Target/R600/SIInstrInfo.h
> > @@ -139,6 +139,10 @@ public:
> >     bool FoldImmediate(MachineInstr *UseMI, MachineInstr *DefMI,
> >                        unsigned Reg, MachineRegisterInfo *MRI) const final;
> >   
> > +  MachineInstr *convertToThreeAddress(MachineFunction::iterator &MBB,
> > +                                      MachineBasicBlock::iterator &MI,
> > +                                      LiveVariables *LV) const override;
> > +
> >     bool isSALU(uint16_t Opcode) const {
> >       return get(Opcode).TSFlags & SIInstrFlags::SALU;
> >     }
> > diff --git a/lib/Target/R600/SIInstrInfo.td b/lib/Target/R600/SIInstrInfo.td
> > index 076a0ce..6310e1f 100644
> > --- a/lib/Target/R600/SIInstrInfo.td
> > +++ b/lib/Target/R600/SIInstrInfo.td
> > @@ -404,9 +404,11 @@ def MUBUFOffset : ComplexPattern<i64, 6, "SelectMUBUFOffset">;
> >   def MUBUFOffsetAtomic : ComplexPattern<i64, 4, "SelectMUBUFOffset">;
> >   
> >   def VOP3Mods0 : ComplexPattern<untyped, 4, "SelectVOP3Mods0">;
> > +def VOP3NoMods0 : ComplexPattern<untyped, 4, "SelectVOP3NoMods0">;
> >   def VOP3Mods0Clamp : ComplexPattern<untyped, 3, "SelectVOP3Mods0Clamp">;
> >   def VOP3Mods0Clamp0OMod : ComplexPattern<untyped, 4, "SelectVOP3Mods0Clamp0OMod">;
> >   def VOP3Mods  : ComplexPattern<untyped, 2, "SelectVOP3Mods">;
> > +def VOP3NoMods : ComplexPattern<untyped, 2, "SelectVOP3NoMods">;
> >   
> >   //===----------------------------------------------------------------------===//
> >   // SI assembler operands
> > @@ -978,6 +980,13 @@ def VOP_MADK : VOPProfile <[f32, f32, f32, f32]> {
> >     field dag Ins = (ins VCSrc_32:$src0, VGPR_32:$vsrc1, u32imm:$src2);
> >     field string Asm = "$dst, $src0, $vsrc1, $src2";
> >   }
> > +def VOP_MAC : VOPProfile <[f32, f32, f32, f32]> {
> > +  let Ins32 = (ins Src0RC32:$src0, Src1RC32:$src1, VGPR_32:$src2);
> > +  let Ins64 = getIns64<Src0RC64, Src1RC64, RegisterOperand<VGPR_32>, 3,
> > +                             HasModifiers>.ret;
> > +  let Asm32 = getAsm32<2>.ret;
> > +  let Asm64 = getAsm64<2, HasModifiers>.ret;
> > +}
> >   def VOP_F64_F64_F64_F64 : VOPProfile <[f64, f64, f64, f64]>;
> >   def VOP_I32_I32_I32_I32 : VOPProfile <[i32, i32, i32, i32]>;
> >   def VOP_I64_I32_I32_I64 : VOPProfile <[i64, i32, i32, i64]>;
> > diff --git a/lib/Target/R600/SIInstructions.td b/lib/Target/R600/SIInstructions.td
> > index 91e8c8c..9ddd3e7 100644
> > --- a/lib/Target/R600/SIInstructions.td
> > +++ b/lib/Target/R600/SIInstructions.td
> > @@ -1539,7 +1539,10 @@ defm V_AND_B32 : VOP2Inst <vop2<0x1b, 0x13>, "v_and_b32", VOP_I32_I32_I32>;
> >   defm V_OR_B32 : VOP2Inst <vop2<0x1c, 0x14>, "v_or_b32", VOP_I32_I32_I32>;
> >   defm V_XOR_B32 : VOP2Inst <vop2<0x1d, 0x15>, "v_xor_b32", VOP_I32_I32_I32>;
> >   
> > -defm V_MAC_F32 : VOP2Inst <vop2<0x1f, 0x16>, "v_mac_f32", VOP_F32_F32_F32>;
> > +let Constraints = "$dst = $src2", DisableEncoding="$src2",
> > +    isConvertibleToThreeAddress = 1 in {
> > +defm V_MAC_F32 : VOP2Inst <vop2<0x1f, 0x16>, "v_mac_f32", VOP_MAC>;
> > +}
> >   } // End isCommutable = 1
> >   
> >   defm V_MADMK_F32 : VOP2MADK <vop2<0x20, 0x17>, "v_madmk_f32">;
> > @@ -2251,6 +2254,15 @@ def : Pat <
> >     (V_CNDMASK_B32_e64 $src2, $src1, $src0)
> >   >;
> >   
> > +// Pattern for V_MAC_F32
> > +def : Pat <
> > +  (fmad  (VOP3NoMods0 f32:$src0, i32:$src0_modifiers, i1:$clamp, i32:$omod),
> > +         (VOP3NoMods f32:$src1, i32:$src1_modifiers),
> > +         (VOP3NoMods f32:$src2, i32:$src2_modifiers)),
> > +  (V_MAC_F32_e64 $src0_modifiers, $src0, $src1_modifiers, $src1,
> > +                 $src2_modifiers, $src2, $clamp, $omod)
> > +>;
> 
> If there are modifiers, I assume this will still select v_mad_f32?
> 

Yes, that's correct.
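
To give a made-up example: an fmad whose first operand is an fabs still has
src0_modifiers set, so the NoMods patterns reject it and the existing
patterns still produce the VOP3 form with the modifier, roughly:

  v_mad_f32 v1, |v0|, v2, v3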

> Is there any reason to use v_mac_f32_e64 if the modifiers are never 
> going to be used?
> 

I was trying to follow the convention of always selecting the _e64 form, but
_e64 can also match fmad (sgpr, inline, vgpr), which _e32 can't.
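
Rough sketch of that case (operand details from memory, so treat it as a
sketch): the _e64 form can take the SGPR and the inline constant directly,

  v_mac_f32_e64 v1, s0, 2.0

while with _e32 src1 has to be a VGPR, so the constant would first need a
v_mov_b32 into a VGPR.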

> It probably isn't that helpful, but using the modifiers with v_mac_f32 
> might slightly reduce register pressure in some cases vs. v_mad_f32 + 
> modifiers
> 

I thought about that, but it seemed like using v_mad_f32 gave the
register allocator more freedom: it could allocate v_mad_f32 with
$dst = $src2, which would be equivalent to v_mac_f32, or it could give
$dst and $src2 different registers if it determined that was better.
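
A contrived sketch of the difference (register numbers made up): if the
accumulator value is still live after the multiply-add, the tied form needs
a copy first,

  v_mov_b32 v3, v2
  v_mac_f32_e32 v3, v0, v1   ; v3 = v0 * v1 + v3, with v2 still intact

while v_mad_f32 can simply write a fresh destination:

  v_mad_f32 v3, v0, v1, v2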

-Tom

> > 0005-R600-SI-Add-support-for-shrinking-v_cndmask_b32_e32-.patch
> >
> >
> >  From 1ab9290ffa4835062bd563496fa71ca27b7ed8cd Mon Sep 17 00:00:00 2001
> > From: Tom Stellard <thomas.stellard at amd.com>
> > Date: Tue, 21 Apr 2015 23:24:40 +0000
> > Subject: [PATCH 5/5] R600/SI: Add support for shrinking v_cndmask_b32_e32
> >   instructions
> >
> > shader-db stats:
> >
> > 979 shaders
> > Totals:
> > SGPRS: 35048 -> 35176 (0.37 %)
> > VGPRS: 20560 -> 20560 (0.00 %)
> > Code Size: 657436 -> 651536 (-0.90 %) bytes
> > LDS: 11 -> 11 (0.00 %) blocks
> > Scratch: 18432 -> 18432 (0.00 %) bytes per wave
> >
> > Totals from affected shaders:
> > SGPRS: 5504 -> 5632 (2.33 %)
> > VGPRS: 3456 -> 3456 (0.00 %)
> > Code Size: 242948 -> 237048 (-2.43 %) bytes
> > LDS: 1 -> 1 (0.00 %) blocks
> > Scratch: 8192 -> 8192 (0.00 %) bytes per wave
> >
> > Increases:
> > SGPRS: 16 (0.02 %)
> > VGPRS: 0 (0.00 %)
> > Code Size: 0 (0.00 %)
> > LDS: 0 (0.00 %)
> > Scratch: 0 (0.00 %)
> >
> > Decreases:
> > SGPRS: 0 (0.00 %)
> > VGPRS: 0 (0.00 %)
> > Code Size: 104 (0.11 %)
> > LDS: 0 (0.00 %)
> > Scratch: 0 (0.00 %)
> > ---
> >   lib/Target/R600/SIShrinkInstructions.cpp |  29 ++++++--
> >   test/CodeGen/R600/llvm.round.ll          |   4 +-
> >   test/CodeGen/R600/select-vectors.ll      | 116 +++++++++++++++----------------
> >   test/CodeGen/R600/select64.ll            |   4 +-
> >   test/CodeGen/R600/sint_to_fp.f64.ll      |   8 +--
> >   test/CodeGen/R600/uint_to_fp.f64.ll      |  10 +--
> >   test/CodeGen/R600/vselect.ll             |  34 ++++-----
> >   test/CodeGen/R600/xor.ll                 |   4 +-
> >   8 files changed, 115 insertions(+), 94 deletions(-)
> >
> > diff --git a/lib/Target/R600/SIShrinkInstructions.cpp b/lib/Target/R600/SIShrinkInstructions.cpp
> > index e7511e6..0f181d3 100644
> > --- a/lib/Target/R600/SIShrinkInstructions.cpp
> > +++ b/lib/Target/R600/SIShrinkInstructions.cpp
> > @@ -95,13 +95,19 @@ static bool canShrink(MachineInstr &MI, const SIInstrInfo *TII,
> >     // a register allocation hint pre-regalloc and then do the shrining
> >     // post-regalloc.
> >     if (Src2) {
> > -    if (MI.getOpcode() != AMDGPU::V_MAC_F32_e64)
> > -      return false;
> > -
> >       const MachineOperand *Src2Mod =
> >           TII->getNamedOperand(MI, AMDGPU::OpName::src2_modifiers);
> > -    if (!isVGPR(Src2, TRI, MRI) || (Src2Mod && Src2Mod->getImm() != 0))
> > -      return false;
> > +    switch (MI.getOpcode()) {
> > +      default: return false;
> > +
> > +      case AMDGPU::V_MAC_F32_e64:
> > +        if (!isVGPR(Src2, TRI, MRI) || (Src2Mod && Src2Mod->getImm() != 0))
> > +          return false;
> > +        break;
> You can simplify this Src2Mod check slightly with 
> SIInstrInfo::hasModifiersSet
> > +
> > +      case AMDGPU::V_CNDMASK_B32_e64:
> > +        break;
> > +    }
> >     }
> >   
> >     const MachineOperand *Src1 = TII->getNamedOperand(MI, AMDGPU::OpName::src1);
> > @@ -250,6 +256,19 @@ bool SIShrinkInstructions::runOnMachineFunction(MachineFunction &MF) {
> >             continue;
> >         }
> >   
> > +      if (Op32 == AMDGPU::V_CNDMASK_B32_e32) {
> > +        // We shrink V_CNDMASK_B32_e64 using regalloc hints like we do for VOPC
> > +        // instructions.
> > +        unsigned SReg =
> > +            TII->getNamedOperand(MI, AMDGPU::OpName::src2)->getReg();
> I think it might be possible though unlikely that a V_CNDMASK_B32_e64 
> could be emitted with an immediate / non-register value for src2
> > +        if (TargetRegisterInfo::isVirtualRegister(SReg)) {
> > +          MRI.setRegAllocationHint(SReg, 0, AMDGPU::VCC);
> > +          continue;
> > +        }
> > +        if (SReg != AMDGPU::VCC)
> > +          continue;
> > +      }
> > +
> >         // We can shrink this instruction
> >         DEBUG(dbgs() << "Shrinking "; MI.dump(); dbgs() << '\n';);
> >   
> >
> 



