[llvm-dev] How to describe the RegisterInfo?

Wed Aug 24 07:13:10 PDT 2016

2016-08-24 11:55 GMT+08:00 Ruiling Song <ruiling.song83 at gmail.com>:

>
>
> 2016-08-24 2:34 GMT+08:00 Matthias Braun <mbraun at apple.com>:
>
>>
>> > On Aug 23, 2016, at 12:08 AM, Ruiling Song via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>> >
>> > Yes, the arch is just as you said, something like AMD GPU, but Intel
>> > GPU doesn't have separate register files for 'scalar/vector'.
>> > In fact, my idea of defining the register tuples was borrowed from
>> > SIRegisterInfo.td in AMD GPU.
>> > But it seems that AMD GPU mainly supports i32/i64 register types, while
>> > Intel GPU also supports byte/short register types.
>> > So I have to start defining the registers from the 'byte' type, and then
>> > build up the other register types through RegisterTuples.
>> > I thought RegisterTuples was a way of expressing register aliases in the
>> > RegisterInfo.td file. I am not sure whether I understand it correctly. My
>> > first trial was like below (to keep things simple, I removed some
>> > WORD/QWORD register classes):
>> > let Namespace = "IntelGPU" in {
>> >
>> > foreach Index = 0-15 in {
>> >   def sub#Index : SubRegIndex<32, !shl(Index, 5)>;
>> > }
>> > }
>> >
>> > class IntelGPUReg<string n, bits<13> regIdx> : Register<n> {
>> >   bits<2> HStride;
>> >   bits<1> regFile;
>> >
>> >   let Namespace = "IntelGPU";
>> >   let HWEncoding{12-0}  = regIdx;
>> >   let HWEncoding{15}    = regFile;
>> > }
>> > // here I define the whole 4096 byte registers
>> > foreach Index = 0-4095 in {
>> >   def Rb#Index : IntelGPUReg <"Rb"#Index, Index> {
>> >     let regFile = 0;
>> >   }
>> > }
>> >
>> > // b-->byte w-->word d-->dword q-->qword
>> > // the set of uniform byte register
>> > def gpr_b : RegisterClass<"IntelGPU", [i8], 8,
>> >                           (sequence "Rb%u", 0, 4095)> {
>> >   let AllocationPriority = 1;
>> > }
>> >
>> > def gpr_d : RegisterTuples<[sub0, sub1, sub2, sub3],
>> >                               [(add (decimate gpr_b, 4)),
>> >                                (add (decimate (shl gpr_b, 1), 4)),
>> >                                (add (decimate (shl gpr_b, 2), 4)),
>> >                                (add (decimate (shl gpr_b, 3), 4))]>;
>> >
>> > // simd byte uses stride-2 registers, as stride 1 does not support useful
>> > // ALU instructions
>> > def gpr_b_simd8 : RegisterTuples<[sub0, sub1, sub2, sub3, sub4, sub5,
>> sub6, sub7],
>> >                                  [(add (decimate gpr_b, 16)),
>> >                                   (add (decimate (shl gpr_b, 2), 16)),
>> >                                   (add (decimate (shl gpr_b, 4), 16)),
>> >                                   (add (decimate (shl gpr_b, 6), 16)),
>> >                                   (add (decimate (shl gpr_b, 8), 16)),
>> >                                   (add (decimate (shl gpr_b, 10), 16)),
>> >                                   (add (decimate (shl gpr_b, 12), 16)),
>> >                                   (add (decimate (shl gpr_b, 14),
>> 16))]>;
>> >
>> > def gpr_d_simd8 : RegisterTuples<[sub0, sub1, sub2, sub3, sub4, sub5,
>> sub6, sub7],
>> >                                 [(add (decimate gpr_d, 8)),
>> >                                  (add (decimate (shl gpr_d, 1), 8)),
>> >                                  (add (decimate (shl gpr_d, 2), 8)),
>> >                                  (add (decimate (shl gpr_d, 3), 8)),
>> >                                  (add (decimate (shl gpr_d, 4), 8)),
>> >                                  (add (decimate (shl gpr_d, 5), 8)),
>> >                                  (add (decimate (shl gpr_d, 6), 8)),
>> >                                  (add (decimate (shl gpr_d, 7), 8))]>;
>> > def RegD_Uniform : RegisterClass<"IntelGPU", [i32, f32], 32, (add
>> gpr_d)>;
>> > def RegD_SIMD8 : RegisterClass<"IntelGPU", [i32, f32], 32, (add
>> gpr_d_simd8)> {
>> > }
>> > This makes it easy for me to define the register alias information, but
>> > it does not work!
>> > TableGen exits and tells me: "error:Ran out of lanemask bits to
>> > represent subregister sub1_then_sub1"
>> > Does anybody know what's wrong here?
>>
>> Lanemasks are used in several places in the compiler to describe
>> live/dead subregister parts. That is, if you take your largest register
>> (which may be a tuple), how many different subregisters can you reach
>> from it? I would expect that in your example you can reach 8 gpr_d
>> registers from a gpr_d_simd8 through sub0-sub7, and from each gpr_d you
>> can reach 4 gpr_b registers through sub0-sub3. This should be fine with
>> 32 bits/lanes. I am not sure if that is the problem here, but I think you
>> should use different subregister indices for the byte access
>> (bsub0-bsub3) than you used for the higher-level tuples.
>>
>> You could also experiment with increasing the limit in TableGen and
>> changing the LaneBitmask typedef; however, this has possible implications
>> for memory use and performance of the register allocator, so it would be
>> good to find a way to avoid that.
>>
>> - Matthias
>>
> Hi Matthias,
>
> Thanks for your explanation. It really helps me! I tried it and confirmed
> that a 32-bit lanemask works for gpr_d_simd8 to reach 8 gpr_d registers
> through subd0-subd7 and then reach 4 gpr_b registers through sub0-sub3.
> Based on this, the new RegisterInfo.td looks like below. As there is only
> a 32-bit lanemask, I chose to define Rw# (word registers) instead of Rb#.
> I think that with word registers as a base, I can at least describe the
> simd8 QWord registers. But it does not work if I add the gpr_q_simd8
> registers.
> Following your advice, w0-w3 are used as the subregister indices for the
> low-level word access, and subd0-subd7 as the subregister indices for the
> second level (dword).
>
> let Namespace = "IntelGPU" in {
>
> foreach Index = 0-3 in {
>   def w#Index : SubRegIndex<16, !shl(Index, 4)>;
> }
> foreach Index = 0-7 in {
> //  def subw#Index : SubRegIndex<16, !shl(Index, 4)>;
>   def subd#Index : SubRegIndex<32, !shl(Index, 5)>;
> //  def subq#Index : SubRegIndex<64, !shl(Index, 6)>;
> }
> }
>
> class IntelGPUReg<string n, bits<13> regIdx> : Register<n> {
>   bits<2> HStride;
>   bits<1> regFile;
>
>   let Namespace = "IntelGPU";
>   let HWEncoding{12-0}  = regIdx;
>   let HWEncoding{15}    = regFile;
> }
> foreach Index = 0-2047 in {
>   def Rw#Index : IntelGPUReg <"Rw"#Index, !shl(Index, 1)> {
>     let regFile = 0;
>   }
> }
>
> // b-->byte w-->word d-->dword q-->qword
>
> def gpr_w : RegisterClass<"IntelGPU", [i16], 16,
>                           (sequence "Rw%u", 0, 2047)> {
>   let AllocationPriority = 1;
> }
>
> def gpr_d : RegisterTuples<[w0, w1],
>                            [(add (decimate gpr_w, 2)),
>                             (add (decimate (shl gpr_w, 1), 2))]>;
>
> def gpr_q : RegisterTuples<[w0, w1, w2, w3],
>                            [(add (decimate gpr_w, 4)),
>                             (add (decimate (shl gpr_w, 1), 4)),
>                             (add (decimate (shl gpr_w, 2), 4)),
>                             (add (decimate (shl gpr_w, 3), 4))]>;
>
> //def gpr_w_simd8 : RegisterTuples<[subw0, subw1, subw2, subw3, subw4,
> //                                 subw5, subw6, subw7],
> //                                [(add (decimate gpr_w, 8)),
> //                                 (add (decimate (shl gpr_w, 1), 8)),
> //                                 (add (decimate (shl gpr_w, 2), 8)),
> //                                 (add (decimate (shl gpr_w, 3), 8)),
> //                                 (add (decimate (shl gpr_w, 4), 8)),
> //                                 (add (decimate (shl gpr_w, 5), 8)),
> //                                 (add (decimate (shl gpr_w, 6), 8)),
> //                                 (add (decimate (shl gpr_w, 7), 8))]>;
>
> def gpr_d_simd8 : RegisterTuples<[subd0, subd1, subd2, subd3, subd4,
> subd5, subd6, subd7],
>                                 [(add (decimate gpr_d, 8)),
>                                  (add (decimate (shl gpr_d, 1), 8)),
>                                  (add (decimate (shl gpr_d, 2), 8)),
>                                  (add (decimate (shl gpr_d, 3), 8)),
>                                  (add (decimate (shl gpr_d, 4), 8)),
>                                  (add (decimate (shl gpr_d, 5), 8)),
>                                  (add (decimate (shl gpr_d, 6), 8)),
>                                  (add (decimate (shl gpr_d, 7), 8))]>;
>
> The issue comes from the definition below: using subd0-subd7 causes
> "llvm/utils/TableGen/CodeGenRegisters.cpp:1146: void
> llvm::CodeGenRegBank::computeComposites(): Assertion `Idx3 &&
> "Sub-register doesn't have an index"' failed"
> If changed to subq0-subq7, it reports "error:Ran out of lanemask bits
> to represent subregister subq4_then_w3"
> Am I defining the SubRegIndex wrongly, or am I misunderstanding something?
> Basically, I should use different SubRegIndex definitions when declaring
> gpr_w_simd8/gpr_d_simd8/gpr_q_simd8, as the subregs are of different
> sizes, right?
>
>
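To make the sizes and offsets in the quoted SubRegIndex definitions concrete, here is a small Python sketch (not TableGen) that computes the bit range each index covers. The helper names bit_range and covers are made up for this illustration; only the sizes and the !shl shift amounts are taken from the definitions above.

```python
# Sketch: bit ranges implied by SubRegIndex<size, !shl(Index, shift)>.
# bit_range and covers are hypothetical helpers, not part of LLVM.

def bit_range(size, index, shift):
    """Return (first_bit, end_bit) for SubRegIndex<size, !shl(index, shift)>."""
    offset = index << shift  # same arithmetic as TableGen's !shl(Index, shift)
    return offset, offset + size

def covers(outer, inner):
    """True if the bit range `inner` lies entirely inside `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

# w0-w3: SubRegIndex<16, !shl(Index, 4)>
w = {f"w{i}": bit_range(16, i, 4) for i in range(4)}
# subd0-subd7: SubRegIndex<32, !shl(Index, 5)>
subd = {f"subd{i}": bit_range(32, i, 5) for i in range(8)}

for name, rng in w.items():
    owners = [d for d, drng in subd.items() if covers(drng, rng)]
    print(name, rng, "lies inside", owners)
```

With these definitions, w2 and w3 occupy bits 32-63, which is exactly the range of subd1, so the word and dword indices describe overlapping positions.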

I did some simple debugging of the subq0-subq7 case by adding some logging
to CodeGenRegBank::computeSubRegLaneMasks():

  for (auto &Idx : SubRegIndices) {
    if (Idx.getComposites().empty()) {
      std::cout << "SubRegIndex " << Idx.getName() << " " << Bit << std::endl;

It looks like the following subregister lane masks were generated:
SubRegIndex w0 0
SubRegIndex w1 1
SubRegIndex w2 2
SubRegIndex w3 3
SubRegIndex subd7_then_w0 4
SubRegIndex subd7_then_w1 5
SubRegIndex subd6_then_w0 6
SubRegIndex subd6_then_w1 7
SubRegIndex subd5_then_w0 8
SubRegIndex subd5_then_w1 9
SubRegIndex subd4_then_w0 10
SubRegIndex subd4_then_w1 11
SubRegIndex subd3_then_w0 12
SubRegIndex subd3_then_w1 13
SubRegIndex subd2_then_w0 14
SubRegIndex subd2_then_w1 15
SubRegIndex subd1_then_w0 16
SubRegIndex subd1_then_w1 17
My question is: can subd1_then_w0 share the same lane mask as w2? The same
question applies to subd1_then_w1 and w3.
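
A rough Python sketch of the arithmetic behind this question. It rests on two assumptions that TableGen itself does not confirm here: first, that every leaf index gets its own lane-mask bit, as the log above suggests; second, that the subq indices would produce subqN_then_wM leaves following the same pattern as the subd entries in the log. The function names are made up for illustration.

```python
# Sketch only: counts lane-mask bits under the assumptions stated above.
LANE_BITS = 32  # width of the lane mask discussed in this thread

def leaves_without_sharing():
    """Leaf indices if every composite gets its own bit (assumed behaviour)."""
    leaves = [f"w{i}" for i in range(4)]
    # The log lists subd1..subd7 composites with w0/w1, and no subd0 entries.
    leaves += [f"subd{d}_then_w{i}" for d in range(1, 8) for i in range(2)]
    # Assumed by analogy: subq1..subq7 composites with w0..w3.
    leaves += [f"subq{q}_then_w{i}" for q in range(1, 8) for i in range(4)]
    return leaves

def lanes_with_sharing():
    """Hypothetical: one lane per distinct 16-bit word in the largest tuple."""
    word_bits = 16
    largest_tuple_bits = 8 * 64  # gpr_q_simd8: 8 qword registers
    return largest_tuple_bits // word_bits

print(len(leaves_without_sharing()), "lanes if nothing is shared")
print(lanes_with_sharing(), "lanes if equal word positions share a bit")
```

Under these assumptions the unshared count is 4 + 14 + 28 = 46, which exceeds 32, while sharing by word position would need exactly 32.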

- Ruiling

> def gpr_q_simd8 : RegisterTuples<[subd0, subd1, subd2, subd3, subd4,
> subd5, subd6, subd7],
>                                 [(add (decimate gpr_q, 8)),
>                                  (add (decimate (shl gpr_q, 1), 8)),
>                                  (add (decimate (shl gpr_q, 2), 8)),
>                                  (add (decimate (shl gpr_q, 3), 8)),
>                                  (add (decimate (shl gpr_q, 4), 8)),
>                                  (add (decimate (shl gpr_q, 5), 8)),
>                                  (add (decimate (shl gpr_q, 6), 8)),
>                                  (add (decimate (shl gpr_q, 7), 8))]>;
>
> def RegD_Uniform : RegisterClass<"IntelGPU", [i32, f32], 32, (add gpr_d)>;
> def RegD_SIMD8 : RegisterClass<"IntelGPU", [i32, f32], 32, (add
> gpr_d_simd8)>;
> def RegQ_Uniform : RegisterClass<"IntelGPU", [i64, f64], 64, (add gpr_q)>;
> def RegQ_SIMD8 : RegisterClass<"IntelGPU", [i64, f64], 64, (add
> gpr_q_simd8)>;
>
> - Ruiling
>