<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">2016-08-24 11:55 GMT+08:00 Ruiling Song <span dir="ltr"><<a href="mailto:ruiling.song83@gmail.com" target="_blank">ruiling.song83@gmail.com</a>></span>:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><br><div class="gmail_extra"><div><div class="h5"><br><div class="gmail_quote">2016-08-24 2:34 GMT+08:00 Matthias Braun <span dir="ltr"><<a href="mailto:mbraun@apple.com" target="_blank">mbraun@apple.com</a>></span>:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div><br>


> On Aug 23, 2016, at 12:08 AM, Ruiling Song via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>> wrote:<br>


><br>


> Yes, the architecture is just as you said, something like the AMD GPU, but the Intel GPU doesn't have separate register files for 'scalar/vector'.<br>


> In fact my idea of defining the register tuples was borrowed from SIRegisterInfo.td in the AMD GPU backend.<br>


> But it seems that the AMD GPU mainly supports i32/i64 register types, while the Intel GPU also supports byte/short register types.<br>


> So I have to start by defining registers of 'byte' type, and then build up registers of the other types through RegisterTuples.<br>


> I thought RegisterTuples was a way of expressing register aliases in the RegisterInfo.td file. I am not sure whether I understand it correctly. My first attempt was as below (to keep things simple, I removed some WORD/QWORD register classes):<br>


> let Namespace = "IntelGPU" in {<br>


><br>


> foreach Index = 0-15 in {<br>


>   def sub#Index : SubRegIndex<32, !shl(Index, 5)>;<br>


> }<br>


> }<br>


><br>


> class IntelGPUReg<string n, bits<13> regIdx> : Register<n> {<br>


>   bits<2> HStride;<br>


>   bits<1> regFile;<br>


><br>


>   let Namespace = "IntelGPU";<br>


>   let HWEncoding{12-0}  = regIdx;<br>


>   let HWEncoding{15}    = regFile;<br>


> }<br>


> // here I define the whole 4096 byte registers<br>


> foreach Index = 0-4095 in {<br>


>   def Rb#Index : IntelGPUReg <"Rb"#Index, Index> {<br>


>     let regFile = 0;<br>


>   }<br>


> }<br>


><br>


> // b-->byte w-->word d-->dword q-->qword<br>


> // the set of uniform byte register<br>


> def gpr_b : RegisterClass<"IntelGPU", [i8], 8,<br>


>                           (sequence "Rb%u", 0, 4095)> {<br>


>   let AllocationPriority = 1;<br>


> }<br>


><br>


> def gpr_d : RegisterTuples<[sub0, sub1, sub2, sub3],<br>


>                               [(add (decimate gpr_b, 4)),<br>


>                                (add (decimate (shl gpr_b, 1), 4)),<br>


>                                (add (decimate (shl gpr_b, 2), 4)),<br>


>                                (add (decimate (shl gpr_b, 3), 4))]>;<br>


><br>


> // simd byte use stride 2 register as stride 1 does not support useful ALU instruction<br>


> def gpr_b_simd8 : RegisterTuples<[sub0, sub1, sub2, sub3, sub4, sub5, sub6, sub7],<br>


>                                  [(add (decimate gpr_b, 16)),<br>


>                                   (add (decimate (shl gpr_b, 2), 16)),<br>


>                                   (add (decimate (shl gpr_b, 4), 16)),<br>


>                                   (add (decimate (shl gpr_b, 6), 16)),<br>


>                                   (add (decimate (shl gpr_b, 8), 16)),<br>


>                                   (add (decimate (shl gpr_b, 10), 16)),<br>


>                                   (add (decimate (shl gpr_b, 12), 16)),<br>


>                                   (add (decimate (shl gpr_b, 14), 16))]>;<br>


><br>


> def gpr_d_simd8 : RegisterTuples<[sub0, sub1, sub2, sub3, sub4, sub5, sub6, sub7],<br>


>                                 [(add (decimate gpr_d, 8)),<br>


>                                  (add (decimate (shl gpr_d, 1), 8)),<br>


>                                  (add (decimate (shl gpr_d, 2), 8)),<br>


>                                  (add (decimate (shl gpr_d, 3), 8)),<br>


>                                  (add (decimate (shl gpr_d, 4), 8)),<br>


>                                  (add (decimate (shl gpr_d, 5), 8)),<br>


>                                  (add (decimate (shl gpr_d, 6), 8)),<br>


>                                  (add (decimate (shl gpr_d, 7), 8))]>;<br>


> def RegD_Uniform : RegisterClass<"IntelGPU", [i32, f32], 32, (add gpr_d)>;<br>


> def RegD_SIMD8 : RegisterClass<"IntelGPU", [i32, f32], 32, (add gpr_d_simd8)> {<br>


> }<br>


> This makes it easy for me to define the register alias information. But it doesn't work!<br>


> TableGen exits and tells me: "error:Ran out of lanemask bits to represent subregister sub1_then_sub1"<br>


> Does anybody know what's wrong here?<br>


<br>


</div></div>Lanemasks are used in several places in the compiler to describe live/dead subregister parts. That is, if you take your largest register (which may be a tuple), how many different subregisters can you reach from it? I would expect that in your example you can reach 8 gpr_d registers from a gpr_d_simd8 through sub0-sub7, and 4 gpr_b registers from each gpr_d through sub0-sub3. This should be fine with 32 bits/lanes. I am not sure if that is the problem here, but I think you should use different subregister indices for the byte access (bsub0-bsub3) than you used for the higher-level tuples.<br>


<br>


You could also experiment with increasing the limit in TableGen and changing the LaneBitmask typedef. However, this may affect the memory use and performance of the register allocator, so it would be better to find a way to avoid it.<br>


<span><font color="#888888"><br>


- Matthias<br>


</font></span></blockquote></div></div></div>Hi Matthias,<br><br></div><div class="gmail_extra">Thanks for your explanation. It really helps! I tried it and confirmed that a 32-bit lanemask works for gpr_d_simd8 to reach 8 gpr_d registers through subd0-subd7, and then 4 gpr_b registers through sub0-sub3.<br>Based on this, the new RegisterInfo.td looks like the one below. As the lanemask has only 32 bits, I chose to define Rw# (word registers) instead of Rb#. I thought that with word registers as the base I could at least describe a simd8 QWord register. But it does not work once I add the gpr_q_simd8 register.<br>Following your advice, w0-w3 are used as the low-level subregister indices to access words, and subd0-subd7 as the second-level subregister indices for dwords.<span class=""><br><br>let Namespace = "IntelGPU" in {<br><br></span>foreach Index = 0-3 in {<br>  def w#Index : SubRegIndex<16, !shl(Index, 4)>;<br>}<br>foreach Index = 0-7 in {<br>//  def subw#Index : SubRegIndex<16, !shl(Index, 4)>;<span class=""><br>  def subd#Index : SubRegIndex<32, !shl(Index, 5)>;<br></span>//  def subq#Index : SubRegIndex<64, !shl(Index, 6)>;<span class=""><br>}<br>}<br><br>class IntelGPUReg<string n, bits<13> regIdx> : Register<n> {<br>  bits<2> HStride;<br>  bits<1> regFile;<br><br>  let Namespace = "IntelGPU";<br>  let HWEncoding{12-0}  = regIdx;<br>  let HWEncoding{15}    = regFile;<br>}<br></span>foreach Index = 0-2047 in {<br>  def Rw#Index : IntelGPUReg <"Rw"#Index, !shl(Index, 1)> {<span class=""><br>    let regFile = 0;<br>  }<br>}<br><br>// b-->byte w-->word d-->dword q-->qword<br><br></span>def gpr_w : RegisterClass<"IntelGPU", [i16], 16,<br>                          (sequence "Rw%u", 0, 2047)> {<br>  let AllocationPriority = 1;<br>}<br><br>def gpr_d : RegisterTuples<[w0, w1],<br>                           [(add (decimate gpr_w, 2)),<br>                            (add (decimate (shl gpr_w, 1), 2))]>;<br><br>def gpr_q : RegisterTuples<[w0, w1, w2, w3],<br>                           
[(add (decimate gpr_w, 4)),<br>                            (add (decimate (shl gpr_w, 1), 4)),<br>                            (add (decimate (shl gpr_w, 2), 4)),<br>                            (add (decimate (shl gpr_w, 3), 4))]>;<br><br>//def gpr_w_simd8 : RegisterTuples<[subw0, subw1, subw2, subw3, subw4, subw5, subw6, subw7],<br>//                            <wbr>    [(add (decimate gpr_w, 8)),<br>//                            <wbr>     (add (decimate (shl gpr_w, 1), 8)),<br>//                            <wbr>     (add (decimate (shl gpr_w, 2), 8)),<br>//                            <wbr>     (add (decimate (shl gpr_w, 3), 8)),<br>//                            <wbr>     (add (decimate (shl gpr_w, 4), 8)),<br>//                            <wbr>     (add (decimate (shl gpr_w, 5), 8)),<br>//                            <wbr>     (add (decimate (shl gpr_w, 6), 8)),<br>//                            <wbr>     (add (decimate (shl gpr_w, 7), 8))]>;<br><br>def gpr_d_simd8 : RegisterTuples<[subd0, subd1, subd2, subd3, subd4, subd5, subd6, subd7],<span class=""><br>                              <wbr>  [(add (decimate gpr_d, 8)),<br>                              <wbr>   (add (decimate (shl gpr_d, 1), 8)),<br>                              <wbr>   (add (decimate (shl gpr_d, 2), 8)),<br>                              <wbr>   (add (decimate (shl gpr_d, 3), 8)),<br>                              <wbr>   (add (decimate (shl gpr_d, 4), 8)),<br>                              <wbr>   (add (decimate (shl gpr_d, 5), 8)),<br>                              <wbr>   (add (decimate (shl gpr_d, 6), 8)),<br>                              <wbr>   (add (decimate (shl gpr_d, 7), 8))]>;<br><br></span></div><div class="gmail_extra">The issue shows up in the definition below: using subd0-subd7 causes "llvm/utils/TableGen/<wbr>CodeGenRegisters.cpp:1146: void llvm::CodeGenRegBank::<wbr>computeComposites(): Assertion `Idx3 && "Sub-register doesn't have an index"' failed"<br></div><div class="gmail_extra">If 
changed to subq0-subq7, it will report "error:Ran out of lanemask bits to represent subregister subq4_then_w3"<br></div><div class="gmail_extra">Am I defining the SubRegIndex wrongly? Or am I misunderstanding something?<br></div><div class="gmail_extra">Basically, I should use a different SubRegIndex when declaring gpr_w_simd8/gpr_d_simd8/gpr_q_<wbr>simd8, as the subregs are of different sizes, right?<br> </div></div></blockquote><div> </div><div>I did some simple debugging of the subq0~subq7 case by adding some logging in CodeGenRegBank::computeSubRegLaneMasks():</div><div><div>for (auto &Idx : SubRegIndices) {</div><div>  if (Idx.getComposites().empty()) {</div><div>    std::cout << std::string("SubRegIndex ") << Idx.getName()  << " "<< Bit << std::endl;</div></div><div><br></div><div>It looks like the following subreg lane masks were generated:</div><div><div>SubRegIndex w0 0</div><div>SubRegIndex w1 1</div><div>SubRegIndex w2 2</div><div>SubRegIndex w3 3</div><div>SubRegIndex subd7_then_w0 4</div><div>SubRegIndex subd7_then_w1 5</div><div>SubRegIndex subd6_then_w0 6</div><div>SubRegIndex subd6_then_w1 7</div><div>SubRegIndex subd5_then_w0 8</div><div>SubRegIndex subd5_then_w1 9</div><div>SubRegIndex subd4_then_w0 10</div><div>SubRegIndex subd4_then_w1 11</div><div>SubRegIndex subd3_then_w0 12</div><div>SubRegIndex subd3_then_w1 13</div><div>SubRegIndex subd2_then_w0 14</div><div>SubRegIndex subd2_then_w1 15</div><div>SubRegIndex subd1_then_w0 16</div><div>SubRegIndex subd1_then_w1 17</div></div><div>My question is: can subd1_then_w0 share the same lane mask as w2? 
and the same question for subd1_then_w1 and w3.</div><div><br></div><div>- Ruiling</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_extra"></div><div class="gmail_extra">def gpr_q_simd8 : RegisterTuples<[subd0, subd1, subd2, subd3, subd4, subd5, subd6, subd7],<br>                              <wbr>  [(add (decimate gpr_q, 8)),<br>                              <wbr>   (add (decimate (shl gpr_q, 1), 8)),<br>                              <wbr>   (add (decimate (shl gpr_q, 2), 8)),<br>                              <wbr>   (add (decimate (shl gpr_q, 3), 8)),<br>                              <wbr>   (add (decimate (shl gpr_q, 4), 8)),<br>                              <wbr>   (add (decimate (shl gpr_q, 5), 8)),<br>                              <wbr>   (add (decimate (shl gpr_q, 6), 8)),<br>                              <wbr>   (add (decimate (shl gpr_q, 7), 8))]>;<span class=""><br><br>def RegD_Uniform : RegisterClass<"IntelGPU", [i32, f32], 32, (add gpr_d)>;<br></span>def RegD_SIMD8 : RegisterClass<"IntelGPU", [i32, f32], 32, (add gpr_d_simd8)>;<br>def RegQ_Uniform : RegisterClass<"IntelGPU", [i64, f64], 64, (add gpr_q)>;<br>def RegQ_SIMD8 : RegisterClass<"IntelGPU", [i64, f64], 64, (add gpr_q_simd8)>;<br><br></div><div class="gmail_extra">- Ruiling<br></div></div>


</blockquote></div><br></div></div>
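<div>To make the lane-counting question concrete, here is a rough counting model in Python. It is only a sketch and not TableGen's actual algorithm. It assumes that every leaf sub-register index name gets its own lane bit, as the logged output suggests, and that slot 0 of a tuple reuses the plain w0..w3 names (the log shows no subd0_then_w* entries). The helper names lanes_by_name and lanes_by_offset and the tuple descriptions are hypothetical, made up for this illustration. The second helper counts lanes the other way, one per distinct word offset, which is the kind of sharing asked about for subd1_then_w0 and w2.</div>

```python
# Toy counting model only; this is NOT TableGen's actual algorithm.
LANE_BITS = 32  # lane mask width discussed in the thread

# Each entry is (index prefix, number of outer slots, words per slot).
GPR_Q = ("q", 1, 4)       # uniform qword: w0..w3
D_SIMD8 = ("subd", 8, 2)  # 8 dwords, each with w0..w1
Q_SIMD8 = ("subq", 8, 4)  # 8 qwords, each with w0..w3


def lanes_by_name(tuples):
    """One lane per distinct leaf index name, as the logged output suggests.

    Assumption: slot 0 reuses the plain w<i> names, because the log shows
    no subd0_then_w<i> entries.
    """
    names = set()
    for prefix, slots, words in tuples:
        for slot in range(slots):
            for word in range(words):
                if slot == 0:
                    names.add(f"w{word}")
                else:
                    names.add(f"{prefix}{slot}_then_w{word}")
    return names


def lanes_by_offset(tuples):
    """One lane per distinct word offset: the sharing asked about."""
    offsets = set()
    for _prefix, slots, words in tuples:
        for slot in range(slots):
            for word in range(words):
                offsets.add(slot * words + word)
    return offsets


if __name__ == "__main__":
    logged = [GPR_Q, D_SIMD8]
    full = [GPR_Q, D_SIMD8, Q_SIMD8]
    print("logged setup, by name:", len(lanes_by_name(logged)))
    print("with gpr_q_simd8, by name:", len(lanes_by_name(full)))
    print("with gpr_q_simd8, by offset:", len(lanes_by_offset(full)))
    print("fits in mask by name:", LANE_BITS >= len(lanes_by_name(full)))
```

<div>Under these assumptions the model gives 18 lanes for the logged setup, which matches bits 0 to 17 in the output above, and 46 lanes once gpr_q_simd8 is added with subq0-subq7, which is consistent with running out of a 32-bit mask. Counting by word offset instead gives exactly 32. Whether TableGen can be made to share lanes that way is the open question here; the model only shows what the two counting schemes would need.</div>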