R600/SI Patchset: Initial support for compute shaders

Mon Mar 11 12:20:07 PDT 2013

On Mon, Mar 11, 2013 at 05:24:36PM +0100, Christian König wrote:
> Am 11.03.2013 14:59, schrieb Tom Stellard:
> >On Mon, Mar 11, 2013 at 10:03:53AM +0100, Christian K?nig wrote:
> >>Am 08.03.2013 22:29, schrieb Tom Stellard:
> >>>Hi,
> >>>
> >>>The attached patches add support for the BUFFER_STORE instruction and
> >>>make it possible to run simple compute shaders on Southern Islands
> >>>hardware with the open source radeonsi GPU driver in
> >>>Mesa (http://www.mesa3d.org).
> >>>
> >>>-Tom
> >>>
> >>>0001-R600-SI-Avoid-generating-S_MOVs-with-64-bit-immediat.patch
> >>>
> >>>
> >>>>From aa3e075e2f0df66e2fe50018704aee28904155f0 Mon Sep 17 00:00:00 2001
> >>>From: Tom Stellard<thomas.stellard at amd.com>
> >>>Date: Wed, 6 Mar 2013 17:28:55 +0000
> >>>Subject: [PATCH 1/5] R600/SI: Avoid generating S_MOVs with 64-bit immediates
> >>>
> >>>SITargetLowering::analyzeImmediate() was converting the 64-bit values
> >>>to 32-bit and then checking if they were an inline immediate.  Some
> >>>of these conversions caused this check to succeed and produced
> >>>S_MOV instructions with 64-bit immediates, which are illegal.
> >>>---
> >>>   lib/Target/R600/SIISelLowering.cpp |   11 +++++++++--
> >>>   test/CodeGen/R600/imm.ll           |   34 ++++++++++++++++++++++++++++++++++
> >>>   2 files changed, 43 insertions(+), 2 deletions(-)
> >>>   create mode 100644 test/CodeGen/R600/imm.ll
> >>>
> >>>diff --git a/lib/Target/R600/SIISelLowering.cpp b/lib/Target/R600/SIISelLowering.cpp
> >>>index fead115..74b1dc5 100644
> >>>--- a/lib/Target/R600/SIISelLowering.cpp
> >>>+++ b/lib/Target/R600/SIISelLowering.cpp
> >>>@@ -426,9 +426,16 @@ int32_t SITargetLowering::analyzeImmediate(const SDNode *N) const {
> >>>       float F;
> >>>     } Imm;
> >>>-  if (const ConstantSDNode *Node = dyn_cast<ConstantSDNode>(N))
> >>>+  if (const ConstantSDNode *Node = dyn_cast<ConstantSDNode>(N)) {
> >>>+    if (Node->getValueType(0) == MVT::i64) {
> >>>+      int64_t Imm64 = Node->getZExtValue();
> >>>+      if (Imm64 >> 32) {
> >>>+        return -1;
> >>>+      }
> >>>+      Imm.I = Imm64;
> >>>+    }
> >>>       Imm.I = Node->getSExtValue();
> >>I wouldn't write it like this, setting Imm.I twice is a bit superfluous.
> >>Also I don't think we should look at the ValueType, maybe something like
> >>this should be more appropriate:
> >>
> >>if (const ConstantSDNode *Node = dyn_cast<ConstantSDNode>(N)) {
> >>    int64_t Imm64 = Node->getSExtValue();
> >>    if (Imm64 >> 32)
> >>      return -1;
> >>    Imm.I = Imm64;
> >>} else...
> >>
> >This make sense, but what if this function were used on a 32-bit
> >immediate?  Wouldn't if(Imm64 >> 32) be true for all negative integers?
> >If so, then we would miss some opportunities for using inline
> >immediates.
> 
> Ups, yeah your right. But checking for i64 type isn't 100% correct
> either, cause the inline immediates are definately sign extended.
> 
> >>That should also work if anybody ever has the stupid idea to put an i64 immediate onto a *_F64 operation.
> >>
> >>>[SNIP]
> >>Oh no, please not! Do we really need that? For the indirect stuff I put
> >>quite allot of effort into keeping the resource descriptors out of the
> >>LLVM backend.
> >>
> >>My main reason was that even if we today only use v1i32 or v2i32 (or
> >>v2f32, etc...) buffers that stuff really belongs into the driver and not
> >>the compile, cause the driver is really setting up the buffers
> >>(including handling all the restrictions for tilling, alignment, padding
> >>etc...).
> >>
> >I'm not really too familiar with tiling, which resource fields are used
> >for tiling?
> >
> >The reasons I wanted to construct the resource constant in the compiler
> >are:
> >
> >1. The ability to use the same virtual address for different typed loads.  For
> >example if I have 5 i32 loads from a buffer and LLVM optimizes them to
> >1 v4i32 load and 1 i32 load there is really no good way to tell the
> >driver to send different resource descriptors for each loaded type.
> 
> For this case I would suggest to just use BUFFER_LOAD_DWORD (X2, X4)
> instead, should save us allot of headache with the resource
> descriptors data and number format.
>

OK, I'll give this a try.

> >
> >2. Allows us to pass 64-bit pointer values to the shader and avoid
> >having to deal with arithmetic on 128-bit pointers.  This might not be
> >too hard to solve, but it would require some additional work.
> 
> Why not use the ADDR64 bit in the MUBUF for this instead? If I'm not
> completely wrong the resource destribtor can be just a "S_MOV_B64 0"
> in this case.
> 

This was my first approach, but even with the ADDR64 bit I couldn't get the
instructions to work unless I added the number and data format to the resource
descriptor. 

-Tom

> >
> >-Tom
> >
> >>>diff --git a/lib/Target/R600/SIISelLowering.cpp b/lib/Target/R600/SIISelLowering.cpp
> >>>index 74b1dc5..3f0e27d 100644
> >>>--- a/lib/Target/R600/SIISelLowering.cpp
> >>>+++ b/lib/Target/R600/SIISelLowering.cpp
> >>>@@ -16,6 +16,7 @@
> >>>   #include "AMDIL.h"
> >>>   #include "AMDGPU.h"
> >>>   #include "AMDILIntrinsicInfo.h"
> >>>+#include "SIDefines.h"
> >>>   #include "SIInstrInfo.h"
> >>>   #include "SIMachineFunctionInfo.h"
> >>>   #include "SIRegisterInfo.h"
> >>>@@ -49,6 +50,7 @@ SITargetLowering::SITargetLowering(TargetMachine &TM) :
> >>>     addRegisterClass(MVT::v4i32, &AMDGPU::VReg_128RegClass);
> >>>     addRegisterClass(MVT::v4f32, &AMDGPU::VReg_128RegClass);
> >>>+  addRegisterClass(MVT::i128, &AMDGPU::SReg_128RegClass);
> >>>     addRegisterClass(MVT::v8i32, &AMDGPU::VReg_256RegClass);
> >>>     addRegisterClass(MVT::v8f32, &AMDGPU::VReg_256RegClass);
> >>>@@ -68,6 +70,9 @@ SITargetLowering::SITargetLowering(TargetMachine &TM) :
> >>>     setTargetDAGCombine(ISD::SELECT_CC);
> >>>     setTargetDAGCombine(ISD::SETCC);
> >>>+
> >>>+  setOperationAction(ISD::STORE, MVT::i32, Custom);
> >>>+  setOperationAction(ISD::STORE, MVT::i64, Custom);
> >>>   }
> >>>   SDValue SITargetLowering::LowerFormalArguments(
> >>>@@ -236,6 +241,7 @@ SDValue SITargetLowering::LowerOperation(SDValue Op, SelectionDAG &DAG) const {
> >>>     default: return AMDGPUTargetLowering::LowerOperation(Op, DAG);
> >>>     case ISD::BRCOND: return LowerBRCOND(Op, DAG);
> >>>     case ISD::SELECT_CC: return LowerSELECT_CC(Op, DAG);
> >>>+  case ISD::STORE: return LowerSTORE(Op, DAG);
> >>>     }
> >>>     return SDValue();
> >>>   }
> >>>@@ -334,6 +340,63 @@ SDValue SITargetLowering::LowerBRCOND(SDValue BRCOND,
> >>>     return Chain;
> >>>   }
> >>>+SDValue SITargetLowering::LowerSTORE(SDValue Op, SelectionDAG &DAG) const {
> >>>+  StoreSDNode *StoreNode = cast<StoreSDNode>(Op);
> >>>+  SDValue Chain = Op.getOperand(0);
> >>>+  SDValue Value = Op.getOperand(1);
> >>>+  SDValue VirtualAddress = Op.getOperand(2);
> >>>+  EVT VT = Value.getValueType();
> >>>+  DebugLoc DL = Op.getDebugLoc();
> >>>+
> >>>+  if (StoreNode->getAddressSpace() != AMDGPUAS::GLOBAL_ADDRESS) {
> >>>+    return SDValue();
> >>>+  }
> >>>+
> >>>+  // 128-bit resource constants are used to identify buffers for loads and
> >>>+  // stores.  The resource constant is composed of a 48-bit virtual address
> >>>+  // plus 80-bits of information about the buffer.
> >>>+
> >>>+  // Set NumRecords to the maximum allowed value.  This way we don't need to
> >>>+  // pass the number of records to he program and it should work for all
> >>>+  // buffer sizes.
> >>>+  //
> >>>+  // XXX: Is there any disadvantage to doing this?
> >>>+  //
> >>>+  uint64_t RsrcWord2 = 0xFFFFFFFF; //NUM_RECORDS
> >>>+
> >>>+  unsigned DataFormat;
> >>>+  switch (VT.getSizeInBits()) {
> >>>+  case 32: DataFormat = V_008F0C_BUF_DATA_FORMAT_32; break;
> >>>+  case 64: DataFormat = V_008F0C_BUF_DATA_FORMAT_32_32; break;
> >>>+  default: llvm_unreachable("Unhandle type size in store");
> >>>+  }
> >>>+
> >>>+  uint64_t RsrcWord3 =
> >>>+    // Always set the destination select values to XYZW.
> >>>+    // XXX: Is there any disadvantage to doing this?
> >>>+    S_008F0C_DST_SEL_X(V_008F0C_SQ_SEL_X) |
> >>>+    S_008F0C_DST_SEL_Y(V_008F0C_SQ_SEL_Y) |
> >>>+    S_008F0C_DST_SEL_Z(V_008F0C_SQ_SEL_Z) |
> >>>+    S_008F0C_DST_SEL_W(V_008F0C_SQ_SEL_W) |
> >>>+    // NUM_FORMAT_UINT appears to work even if the storage type is float,
> >>>+    // so we'll use it for everything.
> >>>+    S_008F0C_NUM_FORMAT(V_008F0C_BUF_NUM_FORMAT_UINT) |
> >>>+    S_008F0C_DATA_FORMAT(DataFormat);
> >>>+
> >>>+  SDValue RsrcWord23 = DAG.getConstant(RsrcWord2 | (RsrcWord3 << 32), MVT::i64);
> >>>+
> >>>+  SDValue SrcSrc = DAG.getNode(ISD::BUILD_PAIR, DL, MVT::i128,
> >>>+                               VirtualAddress, RsrcWord23);
> >>>+
> >>>+  SDValue Ops[2];
> >>>+  Ops[0] = DAG.getNode(AMDGPUISD::BUFFER_STORE, DL, MVT::Other, Chain,
> >>>+                       Value, SrcSrc, DAG.getConstant(0, MVT::i32));
> >>>+  Ops[1] = Chain;
> >>>+
> >>>+  return DAG.getMergeValues(Ops, 2, DL);
> >>>+
> >>>+}
> >>>+
> >>>   SDValue SITargetLowering::LowerSELECT_CC(SDValue Op, SelectionDAG &DAG) const {
> >>>     SDValue LHS = Op.getOperand(0);
> >>>     SDValue RHS = Op.getOperand(1);
> >>>diff --git a/lib/Target/R600/SIISelLowering.h b/lib/Target/R600/SIISelLowering.h
> >>>index 0411565..ef0a8dc 100644
> >>>--- a/lib/Target/R600/SIISelLowering.h
> >>>+++ b/lib/Target/R600/SIISelLowering.h
> >>>@@ -27,6 +27,7 @@ class SITargetLowering : public AMDGPUTargetLowering {
> >>>     void LowerSI_WQM(MachineInstr *MI, MachineBasicBlock &BB,
> >>>                 MachineBasicBlock::iterator I, MachineRegisterInfo & MRI) const;
> >>>+  SDValue LowerSTORE(SDValue Op, SelectionDAG &DAG) const;
> >>>     SDValue LowerSELECT_CC(SDValue Op, SelectionDAG &DAG) const;
> >>>     SDValue LowerBRCOND(SDValue Op, SelectionDAG &DAG) const;
> >>>diff --git a/lib/Target/R600/SIInstrInfo.td b/lib/Target/R600/SIInstrInfo.td
> >>>index d0cd468..02452ca 100644
> >>>--- a/lib/Target/R600/SIInstrInfo.td
> >>>+++ b/lib/Target/R600/SIInstrInfo.td
> >>>@@ -26,6 +26,10 @@ def HI32 : SDNodeXForm<imm, [{
> >>>     return CurDAG->getTargetConstant(N->getZExtValue() >> 32, MVT::i32);
> >>>   }]>;
> >>>+def SIbuffer_store : SDNode<"AMDGPUISD::BUFFER_STORE",
> >>>+                           SDTypeProfile<0, 3, [SDTCisPtrTy<1>, SDTCisInt<2>]>,
> >>>+                           [SDNPHasChain, SDNPMayStore]>;
> >>>+
> >>>   def IMM8bitDWORD : ImmLeaf <
> >>>     i32, [{
> >>>       return (Imm & ~0x3FC) == 0;
> >>>@@ -296,6 +300,28 @@ class MUBUF_Load_Helper <bits<7> op, string name, RegisterClass regClass,
> >>>     let soffset = 128; // ZERO
> >>>   }
> >>>+class MUBUF_Store_Helper <bits<7> op, string name, RegisterClass vdataClass,
> >>>+                         ValueType VT> :
> >>>+    MUBUF <op, (outs), (ins vdataClass:$vdata, SReg_128:$srsrc, VReg_32:$vaddr),
> >>>+          name#" $vdata, $srsrc + $vaddr",
> >>>+          [(SIbuffer_store (VT vdataClass:$vdata), (i128 SReg_128:$srsrc),
> >>>+                                                    (i32 VReg_32:$vaddr))]> {
> >>>+
> >>>+  let mayLoad = 0;
> >>>+  let mayStore = 1;
> >>>+
> >>>+  // Encoding
> >>>+  let offset = 0;
> >>>+  let offen = 1;
> >>>+  let idxen = 0;
> >>>+  let glc = 0;
> >>>+  let addr64 = 0;
> >>>+  let lds = 0;
> >>>+  let slc = 0;
> >>>+  let tfe = 0;
> >>>+  let soffset = 128; // ZERO
> >>>+}
> >>>+
> >>>   class MTBUF_Load_Helper <bits<3> op, string asm, RegisterClass regClass> : MTBUF <
> >>>     op,
> >>>     (outs regClass:$dst),
> >>>diff --git a/lib/Target/R600/SIInstructions.td b/lib/Target/R600/SIInstructions.td
> >>>index 44f35b9..a01e096 100644
> >>>--- a/lib/Target/R600/SIInstructions.td
> >>>+++ b/lib/Target/R600/SIInstructions.td
> >>>@@ -414,8 +414,14 @@ def BUFFER_LOAD_FORMAT_XYZW : MUBUF_Load_Helper <
> >>>   //def BUFFER_LOAD_DWORDX4 : MUBUF_DWORDX4 <0x0000000e, "BUFFER_LOAD_DWORDX4", []>;
> >>>   //def BUFFER_STORE_BYTE : MUBUF_ <0x00000018, "BUFFER_STORE_BYTE", []>;
> >>>   //def BUFFER_STORE_SHORT : MUBUF_ <0x0000001a, "BUFFER_STORE_SHORT", []>;
> >>>-//def BUFFER_STORE_DWORD : MUBUF_ <0x0000001c, "BUFFER_STORE_DWORD", []>;
> >>>-//def BUFFER_STORE_DWORDX2 : MUBUF_DWORDX2 <0x0000001d, "BUFFER_STORE_DWORDX2", []>;
> >>>+
> >>>+def BUFFER_STORE_DWORD : MUBUF_Store_Helper <
> >>>+  0x0000001c, "BUFFER_STORE_DWORD", VReg_32, i32
> >>>+>;
> >>>+
> >>>+def BUFFER_STORE_DWORDX2 : MUBUF_Store_Helper <
> >>>+  0x0000001d, "BUFFER_STORE_DWORDX2", VReg_64, i64
> >>>+>;
> >>>   //def BUFFER_STORE_DWORDX4 : MUBUF_DWORDX4 <0x0000001e, "BUFFER_STORE_DWORDX4", []>;
> >>>   //def BUFFER_ATOMIC_SWAP : MUBUF_ <0x00000030, "BUFFER_ATOMIC_SWAP", []>;
> >>>   //def BUFFER_ATOMIC_CMPSWAP : MUBUF_ <0x00000031, "BUFFER_ATOMIC_CMPSWAP", []>;
> >>>diff --git a/lib/Target/R600/SIRegisterInfo.td b/lib/Target/R600/SIRegisterInfo.td
> >>>index 3dcad50..af74123 100644
> >>>--- a/lib/Target/R600/SIRegisterInfo.td
> >>>+++ b/lib/Target/R600/SIRegisterInfo.td
> >>>@@ -151,7 +151,7 @@ def SReg_64 : RegisterClass<"AMDGPU", [i64, i1], 64,
> >>>     (add SGPR_64, VCCReg, EXECReg)
> >>>   >;
> >>>-def SReg_128 : RegisterClass<"AMDGPU", [v16i8], 128, (add SGPR_128)>;
> >>>+def SReg_128 : RegisterClass<"AMDGPU", [v16i8, i128], 128, (add SGPR_128)>;
> >>>   def SReg_256 : RegisterClass<"AMDGPU", [v32i8], 256, (add SGPR_256)>;
> >>>diff --git a/test/CodeGen/R600/imm.ll b/test/CodeGen/R600/imm.ll
> >>>index 20f9dec..0e19e78 100644
> >>>--- a/test/CodeGen/R600/imm.ll
> >>>+++ b/test/CodeGen/R600/imm.ll
> >>>@@ -1,8 +1,5 @@
> >>>   ; RUN: llc < %s -march=r600 -mcpu=SI | FileCheck %s
> >>>-; XXX: Enable once SI supports buffer stores
> >>>-; XFAIL: *
> >>>-
> >>>   ; CHECK: @i64_imm
> >>>   ; CHECK-NOT: S_MOV_B64
> >>>   define void @i64_imm(i64 addrspace(1)* %out) {
> >>>diff --git a/test/CodeGen/R600/store.ll b/test/CodeGen/R600/store.ll
> >>>new file mode 100644
> >>>index 0000000..e5811ea
> >>>--- /dev/null
> >>>+++ b/test/CodeGen/R600/store.ll
> >>>@@ -0,0 +1,11 @@
> >>>+; RUN: llc < %s -march=r600 -mcpu=redwood | FileCheck --check-prefix=EG-CHECK %s
> >>>+; RUN: llc < %s -march=r600 -mcpu=SI | FileCheck --check-prefix=SI-CHECK %s
> >>>+
> >>>+; CHECK: @store_float
> >>>+; EG-CHECK: RAT_WRITE_CACHELESS_32_eg T{{[0-9]+\.X, T[0-9]+\.X}}, 1
> >>>+; SI-CHECK: BUFFER_STORE_DWORD
> >>>+
> >>>+define void @store_float(float addrspace(1)* %out, float %in) {
> >>>+  store float %in, float addrspace(1)* %out
> >>>+  ret void
> >>>+}
> >>>-- 1.7.3.4
> >>>
> >>>
> >>>_______________________________________________
> >>>llvm-commits mailing list
> >>>llvm-commits at cs.uiuc.edu
> >>>http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>