PATCH: R600/SI: Fix handling of constant initializers

Matt Arsenault Matthew.Arsenault at amd.com
Tue Jul 15 10:39:01 PDT 2014


On 07/15/2014 07:20 AM, Tom Stellard wrote:
> On Mon, Jul 14, 2014 at 02:43:27PM -0700, Matt Arsenault wrote:
>> On 07/14/2014 01:57 PM, Tom Stellard wrote:
>>>   From 6deb5706328bef08e419970e9a4505de172d63c9 Mon Sep 17 00:00:00 2001
>>> From: Tom Stellard<thomas.stellard at amd.com>
>>> Date: Mon, 14 Jul 2014 16:09:50 -0400
>>> Subject: [PATCH 3/3] R600/SI: Store constant initializer data in constant
>>>    memory
>>>
>>> This implements a solution for constant initializers suggested
>>> by Vadim Girlin, where we store the data after the shader code
>>> and then use the S_GETPC instruction to compute its address.
>>>
>>> This saves us the trouble of creating a new buffer for constant data
>>> and then having to pass the pointer to the kernel via user SGPRs or the
>>> input buffer.
>> This strategy sounds weird. I'm not exactly sure how the caches work,
>> but will this end up putting the constant initializer in the instruction
>> cache? That would be really, really bad.
>>
> Why would that be bad?  I'm not sure how the cache works either, but
> assuming it loads X bytes at a time into the cache, if the constant data
> wasn't there wouldn't it just be loading garbage data instead?  It
> shouldn't matter if the constants are in the cache, since the GPU will
> never attempt to execute them.
>
For large kernels, spilling out of the instruction cache is the most 
significant source of performance problems. If jumping back and forth 
between real code and this data stuck at the end of the instructions 
can evict instructions, I think that would be bad, though I don't know 
whether that is a realistic concern. I also don't think setting up 
special buffer arguments can be avoided forever, since printf and some 
other features will probably need them.
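
As a sanity check on how I'm reading the approach: as I understand the 
commit message, the initializer data gets appended after the kernel 
code, and its absolute address is rebuilt at run time from the PC, 
roughly like this (a minimal sketch with illustrative names, not code 
from the patch):

  #include <cstdint>

  // Minimal sketch, illustrative names only: the constant data is
  // appended after the last instruction, so its absolute address is
  // the PC returned by S_GETPC_B64 plus a byte offset that is only
  // known once the code has been emitted (the TI_CONSTDATA_START
  // fixup in the patch plays that role).
  uint64_t constDataAddress(uint64_t PCAfterGetPC,
                            int64_t OffsetToDataStart) {
    return PCAfterGetPC + OffsetToDataStart;
  }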


>>> +SDValue SITargetLowering::LowerGlobalAddress(AMDGPUMachineFunction *MFI,
>>> +                                             SDValue Op,
>>> +                                             SelectionDAG &DAG) const {
>>> +  GlobalAddressSDNode *GSD = cast<GlobalAddressSDNode>(Op);
>>> +
>>> +  if (GSD->getAddressSpace() != AMDGPUAS::CONSTANT_ADDRESS)
>>> +    return AMDGPUTargetLowering::LowerGlobalAddress(MFI, Op, DAG);
>>> +
>>> +  SDLoc DL(GSD);
>>> +  const GlobalValue *GV = GSD->getGlobal();
>>> +  MVT PtrVT = getPointerTy(GSD->getAddressSpace());
>>> +
>>> +  SDValue Ptr = DAG.getNode(AMDGPUISD::CONST_DATA_PTR, DL, PtrVT);
>>> +  SDValue GA = DAG.getTargetGlobalAddress(GV, DL, MVT::i32);
>>> +
>>> +  SDValue PtrLo = DAG.getNode(ISD::EXTRACT_ELEMENT, DL, MVT::i32, Ptr,
>>> +                              DAG.getConstant(0, MVT::i32));
>>> +  SDValue PtrHi = DAG.getNode(ISD::EXTRACT_ELEMENT, DL, MVT::i32, Ptr,
>>> +                              DAG.getConstant(1, MVT::i32));
>>> +
>>> +  SDValue Lo = DAG.getNode(ISD::ADDC, DL, DAG.getVTList(MVT::i32, MVT::Glue),
>>> +                           PtrLo, GA);
>>> +  SDValue Hi = DAG.getNode(ISD::ADDE, DL, DAG.getVTList(MVT::i32, MVT::Glue),
>>> +                           PtrHi, DAG.getConstant(0, MVT::i32),
>>> +                           SDValue(Lo.getNode(), 1));
>>> +  return DAG.getNode(ISD::BUILD_PAIR, DL, MVT::i64, Lo, Hi);
>> Why not just do the add in the pointer type and let the 64-bit add be
>> legalized normally? It would also be one less obstacle to adding support
>> for 32-bit pointers.
>>
> I did it this way to save a zext operation and also to save a register by
> using 0 as the high bits of the offset.  I can change this to do a 64-bit
> add and then add a target DAG combine if the generated code isn't
> great.
>
> -Tom
I'm not sure I see the problem there. Also, isn't there some issue with 
using BUILD_PAIR? You made a few commits a while ago changing BUILD_PAIR 
to bitcast + build_vector for some reason.
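
What I had in mind was just emitting the add in the pointer type and 
leaving the split to legalization, something like this (a rough, 
untested sketch against the hunk above, reusing the GV, PtrVT, DL and 
DAG values from it):

  // Rough, untested sketch: do the add in the pointer type and let
  // type legalization expand it into the 32-bit add/add-with-carry
  // pair when i64 adds aren't legal, instead of hand-building the
  // ADDC/ADDE/BUILD_PAIR chain.
  SDValue Ptr = DAG.getNode(AMDGPUISD::CONST_DATA_PTR, DL, PtrVT);
  SDValue GA = DAG.getTargetGlobalAddress(GV, DL, PtrVT);
  return DAG.getNode(ISD::ADD, DL, PtrVT, Ptr, GA);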


>>> +}
>>> +
>>>    SDValue SITargetLowering::LowerLOAD(SDValue Op, SelectionDAG &DAG) const {
>>>      SDLoc DL(Op);
>>>      LoadSDNode *Load = cast<LoadSDNode>(Op);
>>> diff --git a/lib/Target/R600/SIISelLowering.h b/lib/Target/R600/SIISelLowering.h
>>> index e25323a..4e1c949 100644
>>> --- a/lib/Target/R600/SIISelLowering.h
>>> +++ b/lib/Target/R600/SIISelLowering.h
>>> @@ -25,6 +25,8 @@ class SITargetLowering : public AMDGPUTargetLowering {
>>>                             SDValue Chain, unsigned Offset, bool Signed) const;
>>>      SDValue LowerSampleIntrinsic(unsigned Opcode, const SDValue &Op,
>>>                                   SelectionDAG &DAG) const;
>>> +  SDValue LowerGlobalAddress(AMDGPUMachineFunction *MFI, SDValue Op,
>>> +                             SelectionDAG &DAG) const override;
>>>      SDValue LowerLOAD(SDValue Op, SelectionDAG &DAG) const;
>>>      SDValue LowerSELECT(SDValue Op, SelectionDAG &DAG) const;
>>>      SDValue LowerSTORE(SDValue Op, SelectionDAG &DAG) const;
>>> diff --git a/lib/Target/R600/SIInstrInfo.cpp b/lib/Target/R600/SIInstrInfo.cpp
>>> index 455c890..15c9a5e 100644
>>> --- a/lib/Target/R600/SIInstrInfo.cpp
>>> +++ b/lib/Target/R600/SIInstrInfo.cpp
>>> @@ -361,6 +361,26 @@ bool SIInstrInfo::expandPostRAPseudo(MachineBasicBlock::iterator MI) const {
>>>        MI->eraseFromParent();
>>>        break;
>>>      }
>>> +  case AMDGPU::SI_CONSTDATA_PTR: {
>>> +    unsigned Reg = MI->getOperand(0).getReg();
>>> +    unsigned RegLo = RI.getSubReg(Reg, AMDGPU::sub0);
>>> +    unsigned RegHi = RI.getSubReg(Reg, AMDGPU::sub1);
>>> +
>>> +    BuildMI(MBB, MI, DL, get(AMDGPU::S_GETPC_B64), Reg);
>>> +
>>> +    // Add 32-bit offset from this instruction to the start of the constant data.
>>> +    BuildMI(MBB, MI, DL, get(AMDGPU::S_ADD_I32), RegLo)
>>> +            .addReg(RegLo)
>>> +            .addTargetIndex(AMDGPU::TI_CONSTDATA_START)
>>> +            .addReg(AMDGPU::SCC, RegState::Define | RegState::Implicit);
>>> +    BuildMI(MBB, MI, DL, get(AMDGPU::S_ADDC_U32), RegHi)
>>> +            .addReg(RegHi)
>>> +            .addImm(0)
>>> +            .addReg(AMDGPU::SCC, RegState::Define | RegState::Implicit)
>>> +            .addReg(AMDGPU::SCC, RegState::Implicit);
>>> +    MI->eraseFromParent();
>>> +    break;
>>> +  }
>>>      }
>>>      return true;
>>>    }
>>> diff --git a/lib/Target/R600/SIInstrInfo.td b/lib/Target/R600/SIInstrInfo.td
>>> index 7093db6..c1f22a5 100644
>>> --- a/lib/Target/R600/SIInstrInfo.td
>>> +++ b/lib/Target/R600/SIInstrInfo.td
>>> @@ -57,6 +57,10 @@ def SIsampleb : SDSample<"AMDGPUISD::SAMPLEB">;
>>>    def SIsampled : SDSample<"AMDGPUISD::SAMPLED">;
>>>    def SIsamplel : SDSample<"AMDGPUISD::SAMPLEL">;
>>>    
>>> +def SIconstdata_ptr : SDNode<
>>> +  "AMDGPUISD::CONST_DATA_PTR", SDTypeProfile <1, 0, [SDTCisVT<0, i64>]>
>>> +>;
>>> +
>>>    // Transformation function, extract the lower 32bit of a 64bit immediate
>>>    def LO32 : SDNodeXForm<imm, [{
>>>      return CurDAG->getTargetConstant(N->getZExtValue() & 0xffffffff, MVT::i32);
>>> diff --git a/lib/Target/R600/SIInstructions.td b/lib/Target/R600/SIInstructions.td
>>> index 4f480cd..81f8e15 100644
>>> --- a/lib/Target/R600/SIInstructions.td
>>> +++ b/lib/Target/R600/SIInstructions.td
>>> @@ -139,7 +139,11 @@ def S_SEXT_I32_I16 : SOP1_32 <0x0000001a, "S_SEXT_I32_I16",
>>>    ////def S_BITSET0_B64 : SOP1_BITSET0 <0x0000001c, "S_BITSET0_B64", []>;
>>>    ////def S_BITSET1_B32 : SOP1_BITSET1 <0x0000001d, "S_BITSET1_B32", []>;
>>>    ////def S_BITSET1_B64 : SOP1_BITSET1 <0x0000001e, "S_BITSET1_B64", []>;
>>> -def S_GETPC_B64 : SOP1_64 <0x0000001f, "S_GETPC_B64", []>;
>>> +def S_GETPC_B64 : SOP1 <
>>> +  0x0000001f, (outs SReg_64:$dst), (ins), "S_GETPC_B64 $dst", []
>>> +> {
>>> +  let SSRC0 = 0;
>>> +}
>>>    def S_SETPC_B64 : SOP1_64 <0x00000020, "S_SETPC_B64", []>;
>>>    def S_SWAPPC_B64 : SOP1_64 <0x00000021, "S_SWAPPC_B64", []>;
>>>    def S_RFE_B64 : SOP1_64 <0x00000022, "S_RFE_B64", []>;
>>> @@ -1681,6 +1685,16 @@ defm SI_SPILL_S128 : SI_SPILL_SGPR <SReg_128>;
>>>    defm SI_SPILL_S256 : SI_SPILL_SGPR <SReg_256>;
>>>    defm SI_SPILL_S512 : SI_SPILL_SGPR <SReg_512>;
>>>    
>>> +let Defs = [SCC] in {
>>> +
>>> +def SI_CONSTDATA_PTR : InstSI <
>>> +  (outs SReg_64:$dst),
>>> +  (ins),
>>> +  "", [(set SReg_64:$dst, (i64 SIconstdata_ptr))]
>>> +>;
>>> +
>>> +} // End Defs = [SCC]
>>> +
>>>    } // end IsCodeGenOnly, isPseudo
>>>    
>>>    } // end SubtargetPredicate = SI
>>> diff --git a/test/CodeGen/R600/gv-const-addrspace.ll b/test/CodeGen/R600/gv-const-addrspace.ll
>>> index db64a6f..074d908 100644
>>> --- a/test/CodeGen/R600/gv-const-addrspace.ll
>>> +++ b/test/CodeGen/R600/gv-const-addrspace.ll
>>> @@ -4,11 +4,11 @@
>>>    
>>>    @b = internal addrspace(2) constant [1 x i16] [ i16 7 ], align 2
>>>    
>>> -; XXX: Test on SI once 64-bit adds are supportes.
>>> -
>>>    @float_gv = internal unnamed_addr addrspace(2) constant [5 x float] [float 0.0, float 1.0, float 2.0, float 3.0, float 4.0], align 4
>>>    
>>>    ; FUNC-LABEL: @float
>>> +; FIXME: We should be using S_LOAD_DWORD here.
>>> +; SI: BUFFER_LOAD_DWORD
>>>    
>>>    ; EG-DAG: MOV {{\** *}}T2.X
>>>    ; EG-DAG: MOV {{\** *}}T3.X
>>> @@ -29,6 +29,9 @@ entry:
>>>    
>>>    ; FUNC-LABEL: @i32
>>>    
>>> +; FIXME: We should be using S_LOAD_DWORD here.
>>> +; SI: BUFFER_LOAD_DWORD
>>> +
>>>    ; EG-DAG: MOV {{\** *}}T2.X
>>>    ; EG-DAG: MOV {{\** *}}T3.X
>>>    ; EG-DAG: MOV {{\** *}}T4.X
>>> @@ -50,6 +53,7 @@ entry:
>>>    @struct_foo_gv = internal unnamed_addr addrspace(2) constant [1 x %struct.foo] [ %struct.foo { float 16.0, [5 x i32] [i32 0, i32 1, i32 2, i32 3, i32 4] } ]
>>>    
>>>    ; FUNC-LABEL: @struct_foo_gv_load
>>> +; SI: S_LOAD_DWORD
>>>    
>>>    define void @struct_foo_gv_load(i32 addrspace(1)* %out, i32 %index) {
>>>      %gep = getelementptr inbounds [1 x %struct.foo] addrspace(2)* @struct_foo_gv, i32 0, i32 0, i32 1, i32 %index
>>> @@ -64,6 +68,8 @@ define void @struct_foo_gv_load(i32 addrspace(1)* %out, i32 %index) {
>>>                                                                    <1 x i32> <i32 4> ]
>>>    
>>>    ; FUNC-LABEL: @array_v1_gv_load
>>> +; FIXME: We should be using S_LOAD_DWORD here.
>>> +; SI: BUFFER_LOAD_DWORD
>>>    define void @array_v1_gv_load(<1 x i32> addrspace(1)* %out, i32 %index) {
>>>      %gep = getelementptr inbounds [4 x <1 x i32>] addrspace(2)* @array_v1_gv, i32 0, i32 %index
>>>      %load = load <1 x i32> addrspace(2)* %gep, align 4
>>> diff --git a/test/CodeGen/R600/large-constant-initializer.ll b/test/CodeGen/R600/large-constant-initializer.ll
>>> index 552cd05..191b5c3 100644
>>> --- a/test/CodeGen/R600/large-constant-initializer.ll
>>> +++ b/test/CodeGen/R600/large-constant-initializer.ll
>>> @@ -1,6 +1,5 @@
>>> -; XFAIL: *
>>> -; REQUIRES: asserts
>>>    ; RUN: llc -march=r600 -mcpu=SI < %s
>>> +; CHECK: S_ENDPGM
>>>    
>>>    @gv = external unnamed_addr addrspace(2) constant [239 x i32], align 4
>>>    
>>> -- 
>>> 1.8.1.5
>>>
>>>



