[llvm-dev] How to describe the RegisterInfo?

Tue Aug 23 00:08:29 PDT 2016

Yes, the architecture is just as you described, something like the AMD GPU,
but Intel GPUs don't have separate register files for scalar and vector values.
In fact, I borrowed the idea of defining register tuples from
SIRegisterInfo.td in the AMDGPU backend.
But it seems that AMD GPUs mainly support i32/i64 register types, while Intel
GPUs also support byte/short register types.
So I have to start by defining the registers at the 'byte' level, and then
build up registers of the other types through RegisterTuples.
I thought RegisterTuples was a way of expressing register aliases in the
RegisterInfo.td file, but I am not sure whether I understand it correctly. My
first attempt is below (to keep things simple, I have removed some WORD/QWORD
register classes):
let Namespace = "IntelGPU" in {

foreach Index = 0-15 in {
  def sub#Index : SubRegIndex<32, !shl(Index, 5)>;
}
}

class IntelGPUReg<string n, bits<13> regIdx> : Register<n> {
  bits<2> HStride;
  bits<1> regFile;

  let Namespace = "IntelGPU";
  let HWEncoding{12-0}  = regIdx;
  let HWEncoding{15}    = regFile;
}
// here I define the whole 4096 byte registers
foreach Index = 0-4095 in {
  def Rb#Index : IntelGPUReg <"Rb"#Index, Index> {
    let regFile = 0;
  }
}

// b-->byte w-->word d-->dword q-->qword
// the set of uniform byte register
def gpr_b : RegisterClass<"IntelGPU", [i8], 8,
                          (sequence "Rb%u", 0, 4095)> {
  let AllocationPriority = 1;
}

def gpr_d : RegisterTuples<[sub0, sub1, sub2, sub3],
                              [(add (decimate gpr_b, 4)),
                               (add (decimate (shl gpr_b, 1), 4)),
                               (add (decimate (shl gpr_b, 2), 4)),
                               (add (decimate (shl gpr_b, 3), 4))]>;

// simd byte uses stride-2 registers, as stride 1 does not support useful ALU
// instructions
def gpr_b_simd8 : RegisterTuples<[sub0, sub1, sub2, sub3, sub4, sub5, sub6, sub7],
                                 [(add (decimate gpr_b, 16)),
                                  (add (decimate (shl gpr_b, 2), 16)),
                                  (add (decimate (shl gpr_b, 4), 16)),
                                  (add (decimate (shl gpr_b, 6), 16)),
                                  (add (decimate (shl gpr_b, 8), 16)),
                                  (add (decimate (shl gpr_b, 10), 16)),
                                  (add (decimate (shl gpr_b, 12), 16)),
                                  (add (decimate (shl gpr_b, 14), 16))]>;

def gpr_d_simd8 : RegisterTuples<[sub0, sub1, sub2, sub3, sub4, sub5, sub6, sub7],
                                [(add (decimate gpr_d, 8)),
                                 (add (decimate (shl gpr_d, 1), 8)),
                                 (add (decimate (shl gpr_d, 2), 8)),
                                 (add (decimate (shl gpr_d, 3), 8)),
                                 (add (decimate (shl gpr_d, 4), 8)),
                                 (add (decimate (shl gpr_d, 5), 8)),
                                 (add (decimate (shl gpr_d, 6), 8)),
                                 (add (decimate (shl gpr_d, 7), 8))]>;
def RegD_Uniform : RegisterClass<"IntelGPU", [i32, f32], 32, (add gpr_d)>;
def RegD_SIMD8 : RegisterClass<"IntelGPU", [i32, f32], 32, (add gpr_d_simd8)>;
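For readers less familiar with the set operators used in the definitions above, here is a toy Python model of what they select. It is only a sketch: plain lists stand in for register sets, the helper names `shl`, `decimate` and `register_tuples` are hypothetical stand-ins that mirror the TableGen operators, and nothing here is LLVM API.

```python
# Toy model of the TableGen set operators used in the definitions above.
# Plain Python lists stand in for register sets; nothing here is LLVM API.

def shl(regs, n):
    """(shl S, N): drop the first N elements of S."""
    return regs[n:]

def decimate(regs, n):
    """(decimate S, N): keep every N-th element of S, starting with the first."""
    return regs[::n]

def register_tuples(sub_lists):
    """RegisterTuples zips its sub-register lists, stopping at the shortest."""
    return list(zip(*sub_lists))

gpr_b = [f"Rb{i}" for i in range(4096)]

# gpr_d: four consecutive bytes per tuple.
gpr_d = register_tuples([decimate(shl(gpr_b, k), 4) for k in range(4)])

# gpr_b_simd8: eight bytes at stride 2, one tuple per 16 bytes.
gpr_b_simd8 = register_tuples([decimate(shl(gpr_b, 2 * k), 16) for k in range(8)])

# gpr_d_simd8: eight consecutive gpr_d tuples, i.e. 32 bytes per tuple.
gpr_d_simd8 = register_tuples([decimate(shl(gpr_d, k), 8) for k in range(8)])

print(len(gpr_d), gpr_d[1])
print(len(gpr_b_simd8), gpr_b_simd8[1])
print(len(gpr_d_simd8), gpr_d_simd8[1][0])
```

Under this model the 4096 byte registers yield 1024 dword tuples and 128 SIMD8 dword tuples, the latter matching the 128 32-byte hardware registers.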
This makes it easy for me to define the register alias information, but it
doesn't work!
TableGen exits and tells me: "error:Ran out of lanemask bits to
represent subregister sub1_then_sub1"
Does anybody know what's wrong here?
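One possible reading of the error, offered only as a guess: nesting tuples inside tuples creates composed sub-register indices such as sub1_then_sub1, and each distinct byte-level path may want its own lane-mask bit. The Python sketch below merely counts those paths. It assumes one bit per path and a 32-bit lane mask (`LANE_MASK_BITS` is a hypothetical constant); TableGen's real algorithm is more involved.

```python
# Rough count of byte-level sub-register paths implied by the tuples above.
# ASSUMPTION: one lane-mask bit per distinct path and a 32-bit lane mask;
# this is not TableGen's actual algorithm.
LANE_MASK_BITS = 32  # assumed width

# Paths that reach a byte register directly: gpr_d uses sub0..sub3 and
# gpr_b_simd8 uses sub0..sub7 on byte registers.
direct = {f"sub{i}" for i in range(8)}

# Paths through gpr_d_simd8: subN selects a gpr_d tuple, then subM a byte.
composed = {f"sub{n}_then_sub{m}" for n in range(8) for m in range(4)}

paths = direct | composed
print(len(paths), len(paths) > LANE_MASK_BITS)
```

If the assumption holds, the nested definition asks for more distinct lanes than the mask can represent, which would be consistent with the message naming a composed index.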

- Ruiling

2016-08-23 11:45 GMT+08:00 <escha at apple.com>:

> If I understand right, on this arch, ‘uniform’ refers to values that only
> take one lane of register file instead of SIMD-width lanes, and they
> *share* the same region of the register file as non-uniform values. This is
> in contrast to e.g. AMDGPU where SGPRs (scalar GPRs) and VGPRs are separate
> register files.
>
> If this understanding is correct, you may be able to define uniform and
> non-uniform registers separately, but make sure that one aliases the other,
> e.g. so that (if your SIMD width is 16) VGPR 20 overlaps SGPRs 320,
> 321, ..., 335. So you can have 128 vector registers, 16*128 uniforms, or a mix
> of the two.
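The aliasing arithmetic in the suggestion above can be sketched in a few lines of Python. The function name `overlapping_uniforms` is hypothetical, and the sketch assumes uniform registers are numbered consecutively within each vector register.

```python
def overlapping_uniforms(vgpr, simd_width=16):
    """Uniform (scalar) register numbers aliased by one vector register,
    assuming consecutive numbering of the uniform slots."""
    first = vgpr * simd_width
    return range(first, first + simd_width)

uniforms = overlapping_uniforms(20)
print(uniforms[0], uniforms[-1])

# Total uniform slots if all 128 vector registers were used as uniforms.
print(128 * 16)
```

For VGPR 20 at SIMD width 16 this gives uniform registers 320 through 335, as in the example.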
>
> (Maybe some of the AMDGPU maintainers have thoughts?)
>
> —escha
>
>
> On Aug 22, 2016, at 8:07 PM, Ruiling Song <ruiling.song83 at gmail.com>
> wrote:
>
> Hi Escha,
>
> Great to have your comment! Do you have any specific reason for advising
> against this?
> I am not sure whether I understand your point correctly. By "just model
> one thread",
> do you mean "only consider ONE of the 8/16 work-item lanes that run
> in lock-step"?
>
> In my case, that might mean I only need to define r0~r127 as
> registers for i32 (each r# is just enough for a SIMD8 i32 value).
> Then the register allocator never needs to allocate the
> sub-registers; it just operates on each register as a whole, right?
>
> Yes, that looks really easy for divergent registers. But I think I
> would then lose the ability
> to allocate uniform registers. Am I right? Is there any way to allocate
> uniform registers
> as well as divergent registers?
>
> Thanks!
> Ruiling
>
> 2016-08-23 0:32 GMT+08:00 <escha at apple.com>:
>
>>
>> On Aug 22, 2016, at 6:46 AM, Ruiling Song via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>> Hello Everyone,
>>
>> I am trying to write a new LLVM backend targeting Intel GPUs.
>> I would start by targeting the OpenCL language first.
>> But I am not very familiar with the LLVM backend infrastructure,
>> and I have some problems describing the RegisterInfo.
>>
>> An Intel GPU launches many hardware threads to run a GPGPU workload.
>> Each hardware thread has 128 registers (r0-r127), each 32 bytes in
>> size.
>> Each hardware thread may run in SIMD 8/16/32 mode, which maps to
>> 8/16/32 OpenCL work-items. The SIMD width is chosen at
>> compile time (normally according to register pressure; a bigger SIMD
>> width means higher register pressure).
>> Note that each instruction has its own exec-width, which may differ from
>> the program's SIMD width.
>> Normally we would allocate contiguous registers for a divergent value.
>> For example, in a program compiled as SIMD 8, a divergent float/i32
>> value needs 4 bytes * 8 = 32 bytes. But a 'short'
>> value
>> only needs 2 bytes * 8 = 16 bytes, which is half of a 32-byte register.
>> We may also allocate registers for 'uniform' values. A uniform value
>> only needs a type-sized register,
>> without multiplying by the SIMD width. A uniform float/i32 value only
>> needs 4 bytes of physical register space.
>> Thus a 32-byte register can hold up to 8 different uniform float/i32
>> values.
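The size arithmetic described above, as a small Python sketch. `REGISTER_BYTES` and `value_bytes` are hypothetical names introduced purely for illustration.

```python
REGISTER_BYTES = 32  # one hardware register (r0-r127) is 32 bytes

def value_bytes(type_bytes, simd_width, uniform=False):
    """Bytes of register space needed by one value.

    A divergent value needs one element per SIMD lane; a uniform value
    needs a single element regardless of the SIMD width.
    """
    return type_bytes if uniform else type_bytes * simd_width

print(value_bytes(4, 8))                # divergent float/i32 at SIMD 8
print(value_bytes(2, 8))                # divergent short at SIMD 8
print(value_bytes(4, 8, uniform=True))  # uniform float/i32
print(REGISTER_BYTES // value_bytes(4, 8, uniform=True))  # uniforms per register
```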
>>
>>
>> As a GPU backend maintainer, I strongly discourage trying to model the
>> total register bank of the GPU in LLVM. Just model one thread. This will
>> make things much, much easier.
>>
>>
>> —escha
>>
>
>
>