R600/SI Patchset: Initial support for compute shaders

Mon Mar 11 09:24:36 PDT 2013

Am 11.03.2013 14:59, schrieb Tom Stellard:
> On Mon, Mar 11, 2013 at 10:03:53AM +0100, Christian K?nig wrote:
>> Am 08.03.2013 22:29, schrieb Tom Stellard:
>>> Hi,
>>>
>>> The attached patches add support for the BUFFER_STORE instruction and
>>> make it possible to run simple compute shaders on Southern Islands
>>> hardware with the open source radeonsi GPU driver in
>>> Mesa (http://www.mesa3d.org).
>>>
>>> -Tom
>>>
>>> 0001-R600-SI-Avoid-generating-S_MOVs-with-64-bit-immediat.patch
>>>
>>>
>>> >From aa3e075e2f0df66e2fe50018704aee28904155f0 Mon Sep 17 00:00:00 2001
>>> From: Tom Stellard<thomas.stellard at amd.com>
>>> Date: Wed, 6 Mar 2013 17:28:55 +0000
>>> Subject: [PATCH 1/5] R600/SI: Avoid generating S_MOVs with 64-bit immediates
>>>
>>> SITargetLowering::analyzeImmediate() was converting the 64-bit values
>>> to 32-bit and then checking if they were an inline immediate.  Some
>>> of these conversions caused this check to succeed and produced
>>> S_MOV instructions with 64-bit immediates, which are illegal.
>>> ---
>>>    lib/Target/R600/SIISelLowering.cpp |   11 +++++++++--
>>>    test/CodeGen/R600/imm.ll           |   34 ++++++++++++++++++++++++++++++++++
>>>    2 files changed, 43 insertions(+), 2 deletions(-)
>>>    create mode 100644 test/CodeGen/R600/imm.ll
>>>
>>> diff --git a/lib/Target/R600/SIISelLowering.cpp b/lib/Target/R600/SIISelLowering.cpp
>>> index fead115..74b1dc5 100644
>>> --- a/lib/Target/R600/SIISelLowering.cpp
>>> +++ b/lib/Target/R600/SIISelLowering.cpp
>>> @@ -426,9 +426,16 @@ int32_t SITargetLowering::analyzeImmediate(const SDNode *N) const {
>>>        float F;
>>>      } Imm;
>>>    
>>> -  if (const ConstantSDNode *Node = dyn_cast<ConstantSDNode>(N))
>>> +  if (const ConstantSDNode *Node = dyn_cast<ConstantSDNode>(N)) {
>>> +    if (Node->getValueType(0) == MVT::i64) {
>>> +      int64_t Imm64 = Node->getZExtValue();
>>> +      if (Imm64 >> 32) {
>>> +        return -1;
>>> +      }
>>> +      Imm.I = Imm64;
>>> +    }
>>>        Imm.I = Node->getSExtValue();
>> I wouldn't write it like this, setting Imm.I twice is a bit superfluous.
>> Also I don't think we should look at the ValueType, maybe something like
>> this should be more appropriate:
>>
>> if (const ConstantSDNode *Node = dyn_cast<ConstantSDNode>(N)) {
>>     int64_t Imm64 = Node->getSExtValue();
>>     if (Imm64 >> 32)
>>       return -1;
>>     Imm.I = Imm64;
>> } else...
>>
> This make sense, but what if this function were used on a 32-bit
> immediate?  Wouldn't if(Imm64 >> 32) be true for all negative integers?
> If so, then we would miss some opportunities for using inline
> immediates.

Ups, yeah your right. But checking for i64 type isn't 100% correct 
either, cause the inline immediates are definately sign extended.

>> That should also work if anybody ever has the stupid idea to put an i64 immediate onto a *_F64 operation.
>>
>>> [SNIP]
>> Oh no, please not! Do we really need that? For the indirect stuff I put
>> quite allot of effort into keeping the resource descriptors out of the
>> LLVM backend.
>>
>> My main reason was that even if we today only use v1i32 or v2i32 (or
>> v2f32, etc...) buffers that stuff really belongs into the driver and not
>> the compile, cause the driver is really setting up the buffers
>> (including handling all the restrictions for tilling, alignment, padding
>> etc...).
>>
> I'm not really too familiar with tiling, which resource fields are used
> for tiling?
>
> The reasons I wanted to construct the resource constant in the compiler
> are:
>
> 1. The ability to use the same virtual address for different typed loads.  For
> example if I have 5 i32 loads from a buffer and LLVM optimizes them to
> 1 v4i32 load and 1 i32 load there is really no good way to tell the
> driver to send different resource descriptors for each loaded type.

For this case I would suggest to just use BUFFER_LOAD_DWORD (X2, X4) 
instead, should save us allot of headache with the resource descriptors 
data and number format.

>
> 2. Allows us to pass 64-bit pointer values to the shader and avoid
> having to deal with arithmetic on 128-bit pointers.  This might not be
> too hard to solve, but it would require some additional work.

Why not use the ADDR64 bit in the MUBUF for this instead? If I'm not 
completely wrong the resource destribtor can be just a "S_MOV_B64 0" in 
this case.

>
> -Tom
>
>>> diff --git a/lib/Target/R600/SIISelLowering.cpp b/lib/Target/R600/SIISelLowering.cpp
>>> index 74b1dc5..3f0e27d 100644
>>> --- a/lib/Target/R600/SIISelLowering.cpp
>>> +++ b/lib/Target/R600/SIISelLowering.cpp
>>> @@ -16,6 +16,7 @@
>>>    #include "AMDIL.h"
>>>    #include "AMDGPU.h"
>>>    #include "AMDILIntrinsicInfo.h"
>>> +#include "SIDefines.h"
>>>    #include "SIInstrInfo.h"
>>>    #include "SIMachineFunctionInfo.h"
>>>    #include "SIRegisterInfo.h"
>>> @@ -49,6 +50,7 @@ SITargetLowering::SITargetLowering(TargetMachine &TM) :
>>>    
>>>      addRegisterClass(MVT::v4i32, &AMDGPU::VReg_128RegClass);
>>>      addRegisterClass(MVT::v4f32, &AMDGPU::VReg_128RegClass);
>>> +  addRegisterClass(MVT::i128, &AMDGPU::SReg_128RegClass);
>>>    
>>>      addRegisterClass(MVT::v8i32, &AMDGPU::VReg_256RegClass);
>>>      addRegisterClass(MVT::v8f32, &AMDGPU::VReg_256RegClass);
>>> @@ -68,6 +70,9 @@ SITargetLowering::SITargetLowering(TargetMachine &TM) :
>>>      setTargetDAGCombine(ISD::SELECT_CC);
>>>    
>>>      setTargetDAGCombine(ISD::SETCC);
>>> +
>>> +  setOperationAction(ISD::STORE, MVT::i32, Custom);
>>> +  setOperationAction(ISD::STORE, MVT::i64, Custom);
>>>    }
>>>    
>>>    SDValue SITargetLowering::LowerFormalArguments(
>>> @@ -236,6 +241,7 @@ SDValue SITargetLowering::LowerOperation(SDValue Op, SelectionDAG &DAG) const {
>>>      default: return AMDGPUTargetLowering::LowerOperation(Op, DAG);
>>>      case ISD::BRCOND: return LowerBRCOND(Op, DAG);
>>>      case ISD::SELECT_CC: return LowerSELECT_CC(Op, DAG);
>>> +  case ISD::STORE: return LowerSTORE(Op, DAG);
>>>      }
>>>      return SDValue();
>>>    }
>>> @@ -334,6 +340,63 @@ SDValue SITargetLowering::LowerBRCOND(SDValue BRCOND,
>>>      return Chain;
>>>    }
>>>    
>>> +SDValue SITargetLowering::LowerSTORE(SDValue Op, SelectionDAG &DAG) const {
>>> +  StoreSDNode *StoreNode = cast<StoreSDNode>(Op);
>>> +  SDValue Chain = Op.getOperand(0);
>>> +  SDValue Value = Op.getOperand(1);
>>> +  SDValue VirtualAddress = Op.getOperand(2);
>>> +  EVT VT = Value.getValueType();
>>> +  DebugLoc DL = Op.getDebugLoc();
>>> +
>>> +  if (StoreNode->getAddressSpace() != AMDGPUAS::GLOBAL_ADDRESS) {
>>> +    return SDValue();
>>> +  }
>>> +
>>> +  // 128-bit resource constants are used to identify buffers for loads and
>>> +  // stores.  The resource constant is composed of a 48-bit virtual address
>>> +  // plus 80-bits of information about the buffer.
>>> +
>>> +  // Set NumRecords to the maximum allowed value.  This way we don't need to
>>> +  // pass the number of records to he program and it should work for all
>>> +  // buffer sizes.
>>> +  //
>>> +  // XXX: Is there any disadvantage to doing this?
>>> +  //
>>> +  uint64_t RsrcWord2 = 0xFFFFFFFF; //NUM_RECORDS
>>> +
>>> +  unsigned DataFormat;
>>> +  switch (VT.getSizeInBits()) {
>>> +  case 32: DataFormat = V_008F0C_BUF_DATA_FORMAT_32; break;
>>> +  case 64: DataFormat = V_008F0C_BUF_DATA_FORMAT_32_32; break;
>>> +  default: llvm_unreachable("Unhandle type size in store");
>>> +  }
>>> +
>>> +  uint64_t RsrcWord3 =
>>> +    // Always set the destination select values to XYZW.
>>> +    // XXX: Is there any disadvantage to doing this?
>>> +    S_008F0C_DST_SEL_X(V_008F0C_SQ_SEL_X) |
>>> +    S_008F0C_DST_SEL_Y(V_008F0C_SQ_SEL_Y) |
>>> +    S_008F0C_DST_SEL_Z(V_008F0C_SQ_SEL_Z) |
>>> +    S_008F0C_DST_SEL_W(V_008F0C_SQ_SEL_W) |
>>> +    // NUM_FORMAT_UINT appears to work even if the storage type is float,
>>> +    // so we'll use it for everything.
>>> +    S_008F0C_NUM_FORMAT(V_008F0C_BUF_NUM_FORMAT_UINT) |
>>> +    S_008F0C_DATA_FORMAT(DataFormat);
>>> +
>>> +  SDValue RsrcWord23 = DAG.getConstant(RsrcWord2 | (RsrcWord3 << 32), MVT::i64);
>>> +
>>> +  SDValue SrcSrc = DAG.getNode(ISD::BUILD_PAIR, DL, MVT::i128,
>>> +                               VirtualAddress, RsrcWord23);
>>> +
>>> +  SDValue Ops[2];
>>> +  Ops[0] = DAG.getNode(AMDGPUISD::BUFFER_STORE, DL, MVT::Other, Chain,
>>> +                       Value, SrcSrc, DAG.getConstant(0, MVT::i32));
>>> +  Ops[1] = Chain;
>>> +
>>> +  return DAG.getMergeValues(Ops, 2, DL);
>>> +
>>> +}
>>> +
>>>    SDValue SITargetLowering::LowerSELECT_CC(SDValue Op, SelectionDAG &DAG) const {
>>>      SDValue LHS = Op.getOperand(0);
>>>      SDValue RHS = Op.getOperand(1);
>>> diff --git a/lib/Target/R600/SIISelLowering.h b/lib/Target/R600/SIISelLowering.h
>>> index 0411565..ef0a8dc 100644
>>> --- a/lib/Target/R600/SIISelLowering.h
>>> +++ b/lib/Target/R600/SIISelLowering.h
>>> @@ -27,6 +27,7 @@ class SITargetLowering : public AMDGPUTargetLowering {
>>>      void LowerSI_WQM(MachineInstr *MI, MachineBasicBlock &BB,
>>>                  MachineBasicBlock::iterator I, MachineRegisterInfo & MRI) const;
>>>    
>>> +  SDValue LowerSTORE(SDValue Op, SelectionDAG &DAG) const;
>>>      SDValue LowerSELECT_CC(SDValue Op, SelectionDAG &DAG) const;
>>>      SDValue LowerBRCOND(SDValue Op, SelectionDAG &DAG) const;
>>>    
>>> diff --git a/lib/Target/R600/SIInstrInfo.td b/lib/Target/R600/SIInstrInfo.td
>>> index d0cd468..02452ca 100644
>>> --- a/lib/Target/R600/SIInstrInfo.td
>>> +++ b/lib/Target/R600/SIInstrInfo.td
>>> @@ -26,6 +26,10 @@ def HI32 : SDNodeXForm<imm, [{
>>>      return CurDAG->getTargetConstant(N->getZExtValue() >> 32, MVT::i32);
>>>    }]>;
>>>    
>>> +def SIbuffer_store : SDNode<"AMDGPUISD::BUFFER_STORE",
>>> +                           SDTypeProfile<0, 3, [SDTCisPtrTy<1>, SDTCisInt<2>]>,
>>> +                           [SDNPHasChain, SDNPMayStore]>;
>>> +
>>>    def IMM8bitDWORD : ImmLeaf <
>>>      i32, [{
>>>        return (Imm & ~0x3FC) == 0;
>>> @@ -296,6 +300,28 @@ class MUBUF_Load_Helper <bits<7> op, string name, RegisterClass regClass,
>>>      let soffset = 128; // ZERO
>>>    }
>>>    
>>> +class MUBUF_Store_Helper <bits<7> op, string name, RegisterClass vdataClass,
>>> +                         ValueType VT> :
>>> +    MUBUF <op, (outs), (ins vdataClass:$vdata, SReg_128:$srsrc, VReg_32:$vaddr),
>>> +          name#" $vdata, $srsrc + $vaddr",
>>> +          [(SIbuffer_store (VT vdataClass:$vdata), (i128 SReg_128:$srsrc),
>>> +                                                    (i32 VReg_32:$vaddr))]> {
>>> +
>>> +  let mayLoad = 0;
>>> +  let mayStore = 1;
>>> +
>>> +  // Encoding
>>> +  let offset = 0;
>>> +  let offen = 1;
>>> +  let idxen = 0;
>>> +  let glc = 0;
>>> +  let addr64 = 0;
>>> +  let lds = 0;
>>> +  let slc = 0;
>>> +  let tfe = 0;
>>> +  let soffset = 128; // ZERO
>>> +}
>>> +
>>>    class MTBUF_Load_Helper <bits<3> op, string asm, RegisterClass regClass> : MTBUF <
>>>      op,
>>>      (outs regClass:$dst),
>>> diff --git a/lib/Target/R600/SIInstructions.td b/lib/Target/R600/SIInstructions.td
>>> index 44f35b9..a01e096 100644
>>> --- a/lib/Target/R600/SIInstructions.td
>>> +++ b/lib/Target/R600/SIInstructions.td
>>> @@ -414,8 +414,14 @@ def BUFFER_LOAD_FORMAT_XYZW : MUBUF_Load_Helper <
>>>    //def BUFFER_LOAD_DWORDX4 : MUBUF_DWORDX4 <0x0000000e, "BUFFER_LOAD_DWORDX4", []>;
>>>    //def BUFFER_STORE_BYTE : MUBUF_ <0x00000018, "BUFFER_STORE_BYTE", []>;
>>>    //def BUFFER_STORE_SHORT : MUBUF_ <0x0000001a, "BUFFER_STORE_SHORT", []>;
>>> -//def BUFFER_STORE_DWORD : MUBUF_ <0x0000001c, "BUFFER_STORE_DWORD", []>;
>>> -//def BUFFER_STORE_DWORDX2 : MUBUF_DWORDX2 <0x0000001d, "BUFFER_STORE_DWORDX2", []>;
>>> +
>>> +def BUFFER_STORE_DWORD : MUBUF_Store_Helper <
>>> +  0x0000001c, "BUFFER_STORE_DWORD", VReg_32, i32
>>> +>;
>>> +
>>> +def BUFFER_STORE_DWORDX2 : MUBUF_Store_Helper <
>>> +  0x0000001d, "BUFFER_STORE_DWORDX2", VReg_64, i64
>>> +>;
>>>    //def BUFFER_STORE_DWORDX4 : MUBUF_DWORDX4 <0x0000001e, "BUFFER_STORE_DWORDX4", []>;
>>>    //def BUFFER_ATOMIC_SWAP : MUBUF_ <0x00000030, "BUFFER_ATOMIC_SWAP", []>;
>>>    //def BUFFER_ATOMIC_CMPSWAP : MUBUF_ <0x00000031, "BUFFER_ATOMIC_CMPSWAP", []>;
>>> diff --git a/lib/Target/R600/SIRegisterInfo.td b/lib/Target/R600/SIRegisterInfo.td
>>> index 3dcad50..af74123 100644
>>> --- a/lib/Target/R600/SIRegisterInfo.td
>>> +++ b/lib/Target/R600/SIRegisterInfo.td
>>> @@ -151,7 +151,7 @@ def SReg_64 : RegisterClass<"AMDGPU", [i64, i1], 64,
>>>      (add SGPR_64, VCCReg, EXECReg)
>>>    >;
>>>    
>>> -def SReg_128 : RegisterClass<"AMDGPU", [v16i8], 128, (add SGPR_128)>;
>>> +def SReg_128 : RegisterClass<"AMDGPU", [v16i8, i128], 128, (add SGPR_128)>;
>>>    
>>>    def SReg_256 : RegisterClass<"AMDGPU", [v32i8], 256, (add SGPR_256)>;
>>>    
>>> diff --git a/test/CodeGen/R600/imm.ll b/test/CodeGen/R600/imm.ll
>>> index 20f9dec..0e19e78 100644
>>> --- a/test/CodeGen/R600/imm.ll
>>> +++ b/test/CodeGen/R600/imm.ll
>>> @@ -1,8 +1,5 @@
>>>    ; RUN: llc < %s -march=r600 -mcpu=SI | FileCheck %s
>>>    
>>> -; XXX: Enable once SI supports buffer stores
>>> -; XFAIL: *
>>> -
>>>    ; CHECK: @i64_imm
>>>    ; CHECK-NOT: S_MOV_B64
>>>    define void @i64_imm(i64 addrspace(1)* %out) {
>>> diff --git a/test/CodeGen/R600/store.ll b/test/CodeGen/R600/store.ll
>>> new file mode 100644
>>> index 0000000..e5811ea
>>> --- /dev/null
>>> +++ b/test/CodeGen/R600/store.ll
>>> @@ -0,0 +1,11 @@
>>> +; RUN: llc < %s -march=r600 -mcpu=redwood | FileCheck --check-prefix=EG-CHECK %s
>>> +; RUN: llc < %s -march=r600 -mcpu=SI | FileCheck --check-prefix=SI-CHECK %s
>>> +
>>> +; CHECK: @store_float
>>> +; EG-CHECK: RAT_WRITE_CACHELESS_32_eg T{{[0-9]+\.X, T[0-9]+\.X}}, 1
>>> +; SI-CHECK: BUFFER_STORE_DWORD
>>> +
>>> +define void @store_float(float addrspace(1)* %out, float %in) {
>>> +  store float %in, float addrspace(1)* %out
>>> +  ret void
>>> +}
>>> -- 1.7.3.4
>>>
>>>
>>> _______________________________________________
>>> llvm-commits mailing list
>>> llvm-commits at cs.uiuc.edu
>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits