[Mesa-dev] [PATCH] R600: Relax some vector constraints on Dot4.

Fri Mar 15 08:09:26 PDT 2013

Ok that makes more sense, thx for the explanation.

I was just wondering why all this stuff is needed, and as I said, I'm 
not so deeply into the R600 part of the backend.

And I strongly agree that we somehow need to better teach the backend 
about our vector slots and all those limitations (Sometimes I'm really 
happy to work on SI instead).

Christian.

Am 15.03.2013 15:52, schrieb Vincent Lejeune:
> Hi Christian,
>
> LLVM does indeed coalesce registers for R600 targets, I was however thinking of copies between vectors.
> For instance, let's say you have 4 vectors coming from instructions that only emit vectors (like TEX_SAMPLE iirc) :
>
> If the shader wants to "mix" them before doing dp4, you end with something like :
>
> T0_XYZW = TEX_SAMPLE
> T1_XYZW = TEX_SAMPLE
> T2_XYZW = TEX_SAMPLE
> T3_XYZW = TEX_SAMPLE
> T0_W = COPY T4_W
> T1_Z =  COPY T3_Z
> DOT4 T0_XYZW, T1_XYZW
>
>  From hw point of view, the 2 copies are not necessary because DOT4 instructions does not require that its operands belong to the same 128 bits register.
> It's perfectly legal to have a bundle like this one :
>
> Dot4_eg_real T0_X T1_X
> Dot4_eg_real T0_Y T1_Y
> Dot4_eg_real T0_Z T3_Z
> Dot4_eg_real T4_W T1_W
>
> (In fact it is even possible to remove the R600_TReg32_* constraints on the inputs but then you have to ensure the bundle does not read more than
> 3 gprs from a channel which need much more work)
>
> The previous case may seem not so frequent but it still occurs in Lightmark and Unigine Heaven.
>
> We represent dot4 inputs as vectors but using 8 scalar inputs is closer from hw capabilities, that's why I wrote this patch. Besides, scalar values usually
> have shorter live interval, lowering register pressure. Shaders that have a dp4 instructions often end up consuming less registers with this patch.
>
> Vincent
>
>
>
>
> ----- Mail original -----
>> De : Christian König <deathsimple at vodafone.de>
>> À : Vincent Lejeune <vljn at ovi.com>
>> Cc : llvm-commits at cs.uiuc.edu; mesa-dev at lists.freedesktop.org
>> Envoyé le : Vendredi 15 mars 2013 11h18
>> Objet : Re: [Mesa-dev] [PATCH] R600: Relax some vector constraints on Dot4.
>>
>> Hi Vincent,
>>
>> while I really appreciate your work, I think you're development is going
>> into the wrong direction here. Those copies you're trying to avoid (not only
>> with this patch, but also with the previous REG_SEQUENCE patches), shouldn't
>> happen in the first place. I'm not so deeply into the R600 part of our LLVM
>> backend that I can say that I'm 100% sure, but to me that just looks like
>> workarounds to an incorrect defined register space.
>>
>> Here is an simple example from SI, that should show how things are intended to
>> work. It's a simple 2D texture fetch, the coordinates of that this fetch are
>> usually provided in an two element vector build of VGPRs (I use a 2D fetch just
>> for simplicity, a 3D fetch with explicit LOD would work the same way and would
>> use a four element vector).
>>
>> After ISel the assembler code starts with something like this (simplified):
>> ...
>> %vreg13<def,tied1> = V_INTERP_P2_F32 ...
>> ...
>> %vreg17<def,tied1> = V_INTERP_P2_F32 ...
>> ...
>> %vreg22<def> = IMPLICIT_DEF; VReg_64:%vreg22
>> %vreg21<def,tied1> = INSERT_SUBREG %vreg22<tied0>,
>> %vreg13<kill>, sub0; VReg_64:%vreg21,%vreg22 VReg_32:%vreg13
>> %vreg23<def,tied1> = INSERT_SUBREG %vreg21<tied0>,
>> %vreg17<kill>, sub1; VReg_64:%vreg23,%vreg21 VReg_32:%vreg17
>> %vreg24<def> = IMAGE_SAMPLE 15, 0, 0, 0, 0, 0, 0, 0, %vreg23<kill>,
>> ....
>>
>> As you can see the sub components of the vectors are inserted/extracted just
>> like it happens on R600, but the registerallocater is capable of handling that
>> much better than on R600 and so avoiding the (sometimes quite expensive) COPY
>> operations in the first place. The resulting code looks like this:
>>
>> ...
>> %vreg23:sub0<def,tied1> = V_INTERP_P2_F32 ...
>> ...
>> %vreg23:sub1<def,tied1> = V_INTERP_P2_F32 ...
>> ....
>> %vreg24<def> = IMAGE_SAMPLE 15, 0, 0, 0, 0, 0, 0, 0, %vreg23, ...
>>
>> So INSERT_SUBREG isn't replaced with a COPY like on R600, but instead the
>> V_INTERP_P2_F32 instructions can write directly to the appropriate sub register
>> component.
>>
>> I'm not 100% sure why this doesn't work the same way on R600, but I
>> think it might be a good idea figuring that out.
>>
>> Cheers,
>> Christian.
>>
>> Am 14.03.2013 21:51, schrieb Vincent Lejeune:
>>>   Dot4 now uses 8 scalar operands instead of 2 vectors one which allows
>> register
>>>   coalescer to remove some unneeded COPY.
>>>   This patch also defines some structures/functions that can be used to
>> handle
>>>   every vector instructions (CUBE, Cayman special instructions...) in a
>> similar
>>>   fashion.
>>>   ---
>>>     lib/Target/R600/AMDGPUISelLowering.h        |  1 +
>>>     lib/Target/R600/R600Defines.h               | 74 ++++++++++++++++++++++++
>>>     lib/Target/R600/R600ExpandSpecialInstrs.cpp | 25 ++++++++
>>>     lib/Target/R600/R600ISelLowering.cpp        | 21 +++++++
>>>     lib/Target/R600/R600InstrInfo.cpp           | 88
>> +++++++++++++++++++++++++++++
>>>     lib/Target/R600/R600InstrInfo.h             |  5 ++
>>>     lib/Target/R600/R600Instructions.td         | 51 ++++++++++++++++-
>>>     lib/Target/R600/R600MachineScheduler.cpp    |  2 +
>>>     8 files changed, 266 insertions(+), 1 deletion(-)
>>>
>>>   diff --git a/lib/Target/R600/AMDGPUISelLowering.h
>> b/lib/Target/R600/AMDGPUISelLowering.h
>>>   index f31b646..f9f5a60 100644
>>>   --- a/lib/Target/R600/AMDGPUISelLowering.h
>>>   +++ b/lib/Target/R600/AMDGPUISelLowering.h
>>>   @@ -125,6 +125,7 @@ enum {
>>>       SMIN,
>>>       UMIN,
>>>       URECIP,
>>>   +  DOT4,
>>>       EXPORT,
>>>       CONST_ADDRESS,
>>>       REGISTER_LOAD,
>>>   diff --git a/lib/Target/R600/R600Defines.h b/lib/Target/R600/R600Defines.h
>>>   index 16cfcf5..72d83b0 100644
>>>   --- a/lib/Target/R600/R600Defines.h
>>>   +++ b/lib/Target/R600/R600Defines.h
>>>   @@ -92,6 +92,80 @@ namespace R600Operands {
>>>         {0,-1,-1,-1,-1, 1, 2, 3, 4, 5,-1, 6, 7, 8,
>> 9,-1,10,11,12,13,14,15,16,17}
>>>       };
>>>     +  enum VecOps {
>>>   +    UPDATE_EXEC_MASK_X,
>>>   +    UPDATE_PREDICATE_X,
>>>   +    WRITE_X,
>>>   +    OMOD_X,
>>>   +    DST_REL_X,
>>>   +    CLAMP_X,
>>>   +    SRC0_X,
>>>   +    SRC0_NEG_X,
>>>   +    SRC0_REL_X,
>>>   +    SRC0_ABS_X,
>>>   +    SRC0_SEL_X,
>>>   +    SRC1_X,
>>>   +    SRC1_NEG_X,
>>>   +    SRC1_REL_X,
>>>   +    SRC1_ABS_X,
>>>   +    SRC1_SEL_X,
>>>   +    PRED_SEL_X,
>>>   +    UPDATE_EXEC_MASK_Y,
>>>   +    UPDATE_PREDICATE_Y,
>>>   +    WRITE_Y,
>>>   +    OMOD_Y,
>>>   +    DST_REL_Y,
>>>   +    CLAMP_Y,
>>>   +    SRC0_Y,
>>>   +    SRC0_NEG_Y,
>>>   +    SRC0_REL_Y,
>>>   +    SRC0_ABS_Y,
>>>   +    SRC0_SEL_Y,
>>>   +    SRC1_Y,
>>>   +    SRC1_NEG_Y,
>>>   +    SRC1_REL_Y,
>>>   +    SRC1_ABS_Y,
>>>   +    SRC1_SEL_Y,
>>>   +    PRED_SEL_Y,
>>>   +    UPDATE_EXEC_MASK_Z,
>>>   +    UPDATE_PREDICATE_Z,
>>>   +    WRITE_Z,
>>>   +    OMOD_Z,
>>>   +    DST_REL_Z,
>>>   +    CLAMP_Z,
>>>   +    SRC0_Z,
>>>   +    SRC0_NEG_Z,
>>>   +    SRC0_REL_Z,
>>>   +    SRC0_ABS_Z,
>>>   +    SRC0_SEL_Z,
>>>   +    SRC1_Z,
>>>   +    SRC1_NEG_Z,
>>>   +    SRC1_REL_Z,
>>>   +    SRC1_ABS_Z,
>>>   +    SRC1_SEL_Z,
>>>   +    PRED_SEL_Z,
>>>   +    UPDATE_EXEC_MASK_W,
>>>   +    UPDATE_PREDICATE_W,
>>>   +    WRITE_W,
>>>   +    OMOD_W,
>>>   +    DST_REL_W,
>>>   +    CLAMP_W,
>>>   +    SRC0_W,
>>>   +    SRC0_NEG_W,
>>>   +    SRC0_REL_W,
>>>   +    SRC0_ABS_W,
>>>   +    SRC0_SEL_W,
>>>   +    SRC1_W,
>>>   +    SRC1_NEG_W,
>>>   +    SRC1_REL_W,
>>>   +    SRC1_ABS_W,
>>>   +    SRC1_SEL_W,
>>>   +    PRED_SEL_W,
>>>   +    IMM_0,
>>>   +    IMM_1,
>>>   +    VEC_COUNT
>>>   + };
>>>   +
>>>     }
>>>       #endif // R600DEFINES_H_
>>>   diff --git a/lib/Target/R600/R600ExpandSpecialInstrs.cpp
>> b/lib/Target/R600/R600ExpandSpecialInstrs.cpp
>>>   index f8c900f..993bdad 100644
>>>   --- a/lib/Target/R600/R600ExpandSpecialInstrs.cpp
>>>   +++ b/lib/Target/R600/R600ExpandSpecialInstrs.cpp
>>>   @@ -182,6 +182,31 @@ bool
>> R600ExpandSpecialInstrsPass::runOnMachineFunction(MachineFunction &MF) {
>>>             MI.eraseFromParent();
>>>             continue;
>>>             }
>>>   +      case AMDGPU::DOT_4: {
>>>   +
>>>   +        const R600RegisterInfo &TRI = TII->getRegisterInfo();
>>>   +
>>>   +        unsigned DstReg = MI.getOperand(0).getReg();
>>>   +        unsigned DstBase = TRI.getEncodingValue(DstReg) & HW_REG_MASK;
>>>   +
>>>   +        for (unsigned Chan = 0; Chan < 4; ++Chan) {
>>>   +          bool Mask = (Chan != TRI.getHWRegChan(DstReg));
>>>   +          unsigned SubDstReg =
>>>   +              AMDGPU::R600_TReg32RegClass.getRegister((DstBase * 4) +
>> Chan);
>>>   +          MachineInstr *BMI =
>>>   +              TII->buildSlotOfVectorInstruction(MBB, &MI, Chan,
>> SubDstReg);
>>>   +          if (Chan > 0) {
>>>   +            BMI->bundleWithPred();
>>>   +          }
>>>   +          if (Mask) {
>>>   +            TII->addFlag(BMI, 0, MO_FLAG_MASK);
>>>   +          }
>>>   +          if (Chan != 3)
>>>   +            TII->addFlag(BMI, 0, MO_FLAG_NOT_LAST);
>>>   +        }
>>>   +        MI.eraseFromParent();
>>>   +        continue;
>>>   +      }
>>>           }
>>>             bool IsReduction = TII->isReductionOp(MI.getOpcode());
>>>   diff --git a/lib/Target/R600/R600ISelLowering.cpp
>> b/lib/Target/R600/R600ISelLowering.cpp
>>>   index a73691d..4868dc7 100644
>>>   --- a/lib/Target/R600/R600ISelLowering.cpp
>>>   +++ b/lib/Target/R600/R600ISelLowering.cpp
>>>   @@ -394,6 +394,27 @@ SDValue R600TargetLowering::LowerOperation(SDValue Op,
>> SelectionDAG &DAG) const
>>>             return SDValue(interp, slot % 2);
>>>         }
>>>   +    case AMDGPUIntrinsic::AMDGPU_dp4: {
>>>   +      SDValue Args[8] = {
>>>   +      DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::f32, Op.getOperand(1),
>>>   +          DAG.getConstant(0, MVT::i32)),
>>>   +      DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::f32, Op.getOperand(2),
>>>   +          DAG.getConstant(0, MVT::i32)),
>>>   +      DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::f32, Op.getOperand(1),
>>>   +          DAG.getConstant(1, MVT::i32)),
>>>   +      DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::f32, Op.getOperand(2),
>>>   +          DAG.getConstant(1, MVT::i32)),
>>>   +      DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::f32, Op.getOperand(1),
>>>   +          DAG.getConstant(2, MVT::i32)),
>>>   +      DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::f32, Op.getOperand(2),
>>>   +          DAG.getConstant(2, MVT::i32)),
>>>   +      DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::f32, Op.getOperand(1),
>>>   +          DAG.getConstant(3, MVT::i32)),
>>>   +      DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::f32, Op.getOperand(2),
>>>   +          DAG.getConstant(3, MVT::i32))
>>>   +      };
>>>   +      return DAG.getNode(AMDGPUISD::DOT4, DL, MVT::f32, Args, 8);
>>>   +    }
>>>           case r600_read_ngroups_x:
>>>           return LowerImplicitParameter(DAG, VT, DL, 0);
>>>   diff --git a/lib/Target/R600/R600InstrInfo.cpp
>> b/lib/Target/R600/R600InstrInfo.cpp
>>>   index 0865098..f686c5c 100644
>>>   --- a/lib/Target/R600/R600InstrInfo.cpp
>>>   +++ b/lib/Target/R600/R600InstrInfo.cpp
>>>   @@ -686,6 +686,94 @@ MachineInstrBuilder
>> R600InstrInfo::buildDefaultInstruction(MachineBasicBlock &MB
>>>       return MIB;
>>>     }
>>>     +#define OPERAND_CASE(Label) \
>>>   +  case Label: { \
>>>   +    static const R600Operands::VecOps Ops[] = \
>>>   +    { \
>>>   +      Label##_X, \
>>>   +      Label##_Y, \
>>>   +      Label##_Z, \
>>>   +      Label##_W \
>>>   +    }; \
>>>   +    return Ops[Slot]; \
>>>   +  }
>>>   +
>>>   +static R600Operands::VecOps
>>>   +getSlotedOps(R600Operands::Ops Op, unsigned Slot) {
>>>   +  switch (Op) {
>>>   +  OPERAND_CASE(R600Operands::UPDATE_EXEC_MASK)
>>>   +  OPERAND_CASE(R600Operands::UPDATE_PREDICATE)
>>>   +  OPERAND_CASE(R600Operands::WRITE)
>>>   +  OPERAND_CASE(R600Operands::OMOD)
>>>   +  OPERAND_CASE(R600Operands::DST_REL)
>>>   +  OPERAND_CASE(R600Operands::CLAMP)
>>>   +  OPERAND_CASE(R600Operands::SRC0)
>>>   +  OPERAND_CASE(R600Operands::SRC0_NEG)
>>>   +  OPERAND_CASE(R600Operands::SRC0_REL)
>>>   +  OPERAND_CASE(R600Operands::SRC0_ABS)
>>>   +  OPERAND_CASE(R600Operands::SRC0_SEL)
>>>   +  OPERAND_CASE(R600Operands::SRC1)
>>>   +  OPERAND_CASE(R600Operands::SRC1_NEG)
>>>   +  OPERAND_CASE(R600Operands::SRC1_REL)
>>>   +  OPERAND_CASE(R600Operands::SRC1_ABS)
>>>   +  OPERAND_CASE(R600Operands::SRC1_SEL)
>>>   +  OPERAND_CASE(R600Operands::PRED_SEL)
>>>   +  default:
>>>   +    llvm_unreachable("Wrong Operand");
>>>   +  }
>>>   +}
>>>   +
>>>   +#undef OPERAND_CASE
>>>   +
>>>   +static int
>>>   +getVecOperandIdx(R600Operands::VecOps Op) {
>>>   +  return 1 + Op;
>>>   +}
>>>   +
>>>   +
>>>   +MachineInstr *R600InstrInfo::buildSlotOfVectorInstruction(
>>>   +    MachineBasicBlock &MBB, MachineInstr *MI, unsigned Slot, unsigned
>> DstReg)
>>>   +    const {
>>>   +  assert (MI->getOpcode() == AMDGPU::DOT_4 && "Not
>> Implemented");
>>>   +  unsigned Opcode;
>>>   +  const AMDGPUSubtarget &ST =
>> TM.getSubtarget<AMDGPUSubtarget>();
>>>   +  if (ST.device()->getGeneration() <= AMDGPUDeviceInfo::HD4XXX)
>>>   +    Opcode = AMDGPU::DOT4_r600_real;
>>>   +  else
>>>   +    Opcode = AMDGPU::DOT4_eg_real;
>>>   +  MachineBasicBlock::iterator I = MI;
>>>   +  MachineOperand &Src0 = MI->getOperand(
>>>   +      getVecOperandIdx(getSlotedOps(R600Operands::SRC0, Slot)));
>>>   +  MachineOperand &Src1 = MI->getOperand(
>>>   +      getVecOperandIdx(getSlotedOps(R600Operands::SRC1, Slot)));
>>>   +  MachineInstr *MIB = buildDefaultInstruction(
>>>   +      MBB, I, Opcode, DstReg, Src0.getReg(), Src1.getReg());
>>>   +  static const R600Operands::Ops Operands[14] = {
>>>   +    R600Operands::UPDATE_EXEC_MASK,
>>>   +    R600Operands::UPDATE_PREDICATE,
>>>   +    R600Operands::WRITE,
>>>   +    R600Operands::OMOD,
>>>   +    R600Operands::DST_REL,
>>>   +    R600Operands::CLAMP,
>>>   +    R600Operands::SRC0_NEG,
>>>   +    R600Operands::SRC0_REL,
>>>   +    R600Operands::SRC0_ABS,
>>>   +    R600Operands::SRC0_SEL,
>>>   +    R600Operands::SRC1_NEG,
>>>   +    R600Operands::SRC1_REL,
>>>   +    R600Operands::SRC1_ABS,
>>>   +    R600Operands::SRC1_SEL,
>>>   +  };
>>>   +
>>>   +  for (unsigned i = 0; i < 14; i++) {
>>>   +    MachineOperand &MO = MI->getOperand(
>>>   +        getVecOperandIdx(getSlotedOps(Operands[i], Slot)));
>>>   +    assert (MO.isImm());
>>>   +    setImmOperand(MIB, Operands[i], MO.getImm());
>>>   +  }
>>>   +  return MIB;
>>>   +}
>>>   +
>>>     MachineInstr *R600InstrInfo::buildMovImm(MachineBasicBlock &BB,
>>>                                              MachineBasicBlock::iterator I,
>>>                                              unsigned DstReg,
>>>   diff --git a/lib/Target/R600/R600InstrInfo.h
>> b/lib/Target/R600/R600InstrInfo.h
>>>   index bf9569e..e38ed00 100644
>>>   --- a/lib/Target/R600/R600InstrInfo.h
>>>   +++ b/lib/Target/R600/R600InstrInfo.h
>>>   @@ -160,6 +160,11 @@ namespace llvm {
>>>                                                   unsigned Src0Reg,
>>>                                                   unsigned Src1Reg = 0)
>> const;
>>>     +  MachineInstr *buildSlotOfVectorInstruction(MachineBasicBlock &MBB,
>>>   +                                             MachineInstr *MI,
>>>   +                                             unsigned Slot,
>>>   +                                             unsigned DstReg) const;
>>>   +
>>>       MachineInstr *buildMovImm(MachineBasicBlock &BB,
>>>                                       MachineBasicBlock::iterator I,
>>>                                       unsigned DstReg,
>>>   diff --git a/lib/Target/R600/R600Instructions.td
>> b/lib/Target/R600/R600Instructions.td
>>>   index c5fa334..95a99a6 100644
>>>   --- a/lib/Target/R600/R600Instructions.td
>>>   +++ b/lib/Target/R600/R600Instructions.td
>>>   @@ -516,6 +516,13 @@ def CONST_ADDRESS:
>> SDNode<"AMDGPUISD::CONST_ADDRESS",
>>>       [SDNPVariadic]
>>>     >;
>>>     +def DOT4 : SDNode<"AMDGPUISD::DOT4",
>>>   +  SDTypeProfile<1, 8, [SDTCisFP<0>, SDTCisVT<1, f32>,
>> SDTCisVT<2, f32>,
>>>   +      SDTCisVT<3, f32>, SDTCisVT<4, f32>, SDTCisVT<5,
>> f32>,
>>>   +      SDTCisVT<6, f32>, SDTCisVT<7, f32>, SDTCisVT<8,
>> f32>]>,
>>>   +  []
>>>   +>;
>>>   +
>>>    
>> //===----------------------------------------------------------------------===//
>>>     // Interpolation Instructions
>>>    
>> //===----------------------------------------------------------------------===//
>>>   @@ -982,12 +989,54 @@ class CNDGE_Common <bits<5> inst> :
>> R600_3OP <
>>>            COND_GE))]
>>>     >;
>>>     +
>>>   +let isCodeGenOnly = 1, isPseudo = 1, Namespace = "AMDGPU"  in {
>>>   +class R600_VEC2OP<list<dag> pattern> : InstR600 <0, (outs
>> R600_Reg32:$dst), (ins
>>>   +// Slot X
>>>   +   UEM:$update_exec_mask_X, UP:$update_pred_X, WRITE:$write_X,
>>>   +   OMOD:$omod_X, REL:$dst_rel_X, CLAMP:$clamp_X,
>>>   +   R600_TReg32_X:$src0_X, NEG:$src0_neg_X, REL:$src0_rel_X,
>> ABS:$src0_abs_X, SEL:$src0_sel_X,
>>>   +   R600_TReg32_X:$src1_X, NEG:$src1_neg_X, REL:$src1_rel_X,
>> ABS:$src1_abs_X, SEL:$src1_sel_X,
>>>   +   R600_Pred:$pred_sel_X,
>>>   +// Slot Y
>>>   +   UEM:$update_exec_mask_Y, UP:$update_pred_Y, WRITE:$write_Y,
>>>   +   OMOD:$omod_Y, REL:$dst_rel_Y, CLAMP:$clamp_Y,
>>>   +   R600_TReg32_Y:$src0_Y, NEG:$src0_neg_Y, REL:$src0_rel_Y,
>> ABS:$src0_abs_Y, SEL:$src0_sel_Y,
>>>   +   R600_TReg32_Y:$src1_Y, NEG:$src1_neg_Y, REL:$src1_rel_Y,
>> ABS:$src1_abs_Y, SEL:$src1_sel_Y,
>>>   +   R600_Pred:$pred_sel_Y,
>>>   +// Slot Z
>>>   +   UEM:$update_exec_mask_Z, UP:$update_pred_Z, WRITE:$write_Z,
>>>   +   OMOD:$omod_Z, REL:$dst_rel_Z, CLAMP:$clamp_Z,
>>>   +   R600_TReg32_Z:$src0_Z, NEG:$src0_neg_Z, REL:$src0_rel_Z,
>> ABS:$src0_abs_Z, SEL:$src0_sel_Z,
>>>   +   R600_TReg32_Z:$src1_Z, NEG:$src1_neg_Z, REL:$src1_rel_Z,
>> ABS:$src1_abs_Z, SEL:$src1_sel_Z,
>>>   +   R600_Pred:$pred_sel_Z,
>>>   +// Slot W
>>>   +   UEM:$update_exec_mask_W, UP:$update_pred_W, WRITE:$write_W,
>>>   +   OMOD:$omod_W, REL:$dst_rel_W, CLAMP:$clamp_W,
>>>   +   R600_TReg32_W:$src0_W, NEG:$src0_neg_W, REL:$src0_rel_W,
>> ABS:$src0_abs_W, SEL:$src0_sel_W,
>>>   +   R600_TReg32_W:$src1_W, NEG:$src1_neg_W, REL:$src1_rel_W,
>> ABS:$src1_abs_W, SEL:$src1_sel_W,
>>>   +   R600_Pred:$pred_sel_W,
>>>   +   LITERAL:$literal0, LITERAL:$literal1),
>>>   +  "",
>>>   +  pattern,
>>>   +  AnyALU> {}
>>>   +}
>>>   +
>>>   +def DOT_4 : R600_VEC2OP<[(set R600_Reg32:$dst, (DOT4
>>>   +  R600_TReg32_X:$src0_X, R600_TReg32_X:$src1_X,
>>>   +  R600_TReg32_Y:$src0_Y, R600_TReg32_Y:$src1_Y,
>>>   +  R600_TReg32_Z:$src0_Z, R600_TReg32_Z:$src1_Z,
>>>   +  R600_TReg32_W:$src0_W, R600_TReg32_W:$src1_W))]>;
>>>   +
>>>   +
>>>   +
>>>   +
>>>     multiclass DOT4_Common <bits<11> inst> {
>>>         def _pseudo : R600_REDUCTION <inst,
>>>         (ins R600_Reg128:$src0, R600_Reg128:$src1),
>>>         "DOT4 $dst $src0, $src1",
>>>   -    [(set R600_Reg32:$dst, (int_AMDGPU_dp4 R600_Reg128:$src0,
>> R600_Reg128:$src1))]
>>>   +    []
>>>       >;
>>>         def _real : R600_2OP <inst, "DOT4", []>;
>>>   diff --git a/lib/Target/R600/R600MachineScheduler.cpp
>> b/lib/Target/R600/R600MachineScheduler.cpp
>>>   index e515d3e..5a18de9 100644
>>>   --- a/lib/Target/R600/R600MachineScheduler.cpp
>>>   +++ b/lib/Target/R600/R600MachineScheduler.cpp
>>>   @@ -407,6 +407,7 @@ R600SchedStrategy::AluKind
>> R600SchedStrategy::getAluKind(SUnit *SU) const {
>>>         case AMDGPU::INTERP_PAIR_XY:
>>>         case AMDGPU::INTERP_PAIR_ZW:
>>>         case AMDGPU::INTERP_VEC_LOAD:
>>>   +    case AMDGPU::DOT_4:
>>>           return AluT_XYZW;
>>>         case AMDGPU::COPY:
>>>           if (MI->getOperand(1).isUndef()) {
>>>   @@ -471,6 +472,7 @@ int R600SchedStrategy::getInstKind(SUnit* SU) {
>>>       case AMDGPU::INTERP_VEC_LOAD:
>>>       case AMDGPU::DOT4_eg_pseudo:
>>>       case AMDGPU::DOT4_r600_pseudo:
>>>   +  case AMDGPU::DOT_4:
>>>         return IDAlu;
>>>       case AMDGPU::TEX_VTX_CONSTBUF:
>>>       case AMDGPU::TEX_VTX_TEXBUF: