[llvm-dev] How to describe the RegisterInfo?
Ruiling Song via llvm-dev
llvm-dev at lists.llvm.org
Tue Aug 23 20:55:18 PDT 2016
2016-08-24 2:34 GMT+08:00 Matthias Braun <mbraun at apple.com>:
>
> > On Aug 23, 2016, at 12:08 AM, Ruiling Song via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
> >
> > Yes, the arch is just as you said, something like the AMD GPU, but the
> > Intel GPU doesn't have separate register files for scalar and vector.
> > In fact, I borrowed the idea of defining register tuples from
> > SIRegisterInfo.td in the AMD GPU backend.
> > But the AMD GPU seems to support mainly i32/i64 register types, while the
> > Intel GPU also supports byte/short register types.
> > So I have to start by defining the registers at 'byte' granularity, and then
> > build up the other register types through RegisterTuples.
> > I thought RegisterTuples was a way of expressing register aliases in the
> > RegisterInfo.td file, but I am not sure whether I understand it correctly. My
> > first attempt is below (to keep things simple, I removed some WORD/QWORD
> > register classes):
> > let Namespace = "IntelGPU" in {
> >
> > foreach Index = 0-15 in {
> > def sub#Index : SubRegIndex<32, !shl(Index, 5)>;
> > }
> > }
> >
> > class IntelGPUReg<string n, bits<13> regIdx> : Register<n> {
> > bits<2> HStride;
> > bits<1> regFile;
> >
> > let Namespace = "IntelGPU";
> > let HWEncoding{12-0} = regIdx;
> > let HWEncoding{15} = regFile;
> > }
> > // here I define the whole 4096 byte registers
> > foreach Index = 0-4095 in {
> > def Rb#Index : IntelGPUReg <"Rb"#Index, Index> {
> > let regFile = 0;
> > }
> > }
> >
> > // b-->byte w-->word d-->dword q-->qword
> > // the set of uniform byte register
> > def gpr_b : RegisterClass<"IntelGPU", [i8], 8,
> > (sequence "Rb%u", 0, 4095)> {
> > let AllocationPriority = 1;
> > }
> >
> > def gpr_d : RegisterTuples<[sub0, sub1, sub2, sub3],
> > [(add (decimate gpr_b, 4)),
> > (add (decimate (shl gpr_b, 1), 4)),
> > (add (decimate (shl gpr_b, 2), 4)),
> > (add (decimate (shl gpr_b, 3), 4))]>;
> >
> > // simd byte use stride 2 register as stride 1 does not support useful ALU instruction
> > def gpr_b_simd8 : RegisterTuples<[sub0, sub1, sub2, sub3, sub4, sub5, sub6, sub7],
> > [(add (decimate gpr_b, 16)),
> > (add (decimate (shl gpr_b, 2), 16)),
> > (add (decimate (shl gpr_b, 4), 16)),
> > (add (decimate (shl gpr_b, 6), 16)),
> > (add (decimate (shl gpr_b, 8), 16)),
> > (add (decimate (shl gpr_b, 10), 16)),
> > (add (decimate (shl gpr_b, 12), 16)),
> > (add (decimate (shl gpr_b, 14), 16))]>;
> >
> > def gpr_d_simd8 : RegisterTuples<[sub0, sub1, sub2, sub3, sub4, sub5, sub6, sub7],
> > [(add (decimate gpr_d, 8)),
> > (add (decimate (shl gpr_d, 1), 8)),
> > (add (decimate (shl gpr_d, 2), 8)),
> > (add (decimate (shl gpr_d, 3), 8)),
> > (add (decimate (shl gpr_d, 4), 8)),
> > (add (decimate (shl gpr_d, 5), 8)),
> > (add (decimate (shl gpr_d, 6), 8)),
> > (add (decimate (shl gpr_d, 7), 8))]>;
> > def RegD_Uniform : RegisterClass<"IntelGPU", [i32, f32], 32, (add gpr_d)>;
> > def RegD_SIMD8 : RegisterClass<"IntelGPU", [i32, f32], 32, (add gpr_d_simd8)> {
> > }
> > This makes it easy for me to define the register alias information, but it
> > doesn't work!
> > TableGen exits with: "error:Ran out of lanemask bits to represent
> > subregister sub1_then_sub1"
> > Does anybody know what's wrong here?
>
> Lanemasks are used in several places in the compiler to describe live/dead
> subregister parts. That is, if you take your largest register (which may be a
> tuple), how many different subregisters can you reach from it? I would
> expect that in your example you can reach 8 gpr_d registers from a
> gpr_d_simd8 through sub0-sub7, and from each gpr_d you can reach 4 gpr_b
> registers through sub0-sub3. This should be fine with 32 bits/lanes. I am
> not sure if that is the problem here, but I think you should use different
> subregister indices for the byte access (bsub0-bsub3) than the ones you used
> for the higher-level tuples.
>
> You could also experiment with increasing the limit in TableGen and
> changing the LaneBitmask typedef. However, this has possible implications for
> memory use and performance of the register allocator, so it would be good to
> find a way to avoid that.
>
> - Matthias
>
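A rough way to check the lane budget described above is to multiply the fan-out at each tuple level. The sketch below is illustrative Python only, not anything TableGen runs. The names `leaf_lanes` and `LANE_MASK_BITS` are hypothetical, and it assumes one lane-mask bit per leaf position and the 32-bit mask width mentioned in the thread.

```python
# Hypothetical model: one lane-mask bit per leaf sub-register position.
LANE_MASK_BITS = 32  # mask width as discussed in the thread (assumption)


def leaf_lanes(fanouts):
    """Multiply the fan-out of each tuple level to get the leaf count."""
    total = 1
    for fanout in fanouts:
        total *= fanout
    return total


# gpr_d_simd8: 8 gpr_d elements, each built from 4 gpr_b registers.
simd8_dword_over_bytes = leaf_lanes([8, 4])
fits = simd8_dword_over_bytes <= LANE_MASK_BITS
print(simd8_dword_over_bytes, fits)
```

Under this simplified model the 8 x 4 layout uses exactly the available budget, so any further level or additional leaf index would exceed it.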
Hi Matthias,
Thanks for your explanation. It really helps me! I tried it and confirmed that
a 32-bit lanemask works for gpr_d_simd8 to reach 8 gpr_d registers through
subd0-subd7 and then 4 gpr_b registers each through sub0-sub3.
Based on this, the new RegisterInfo.td looks like below. As there are only
32 lanemask bits, I chose to define Rw# (word registers) instead of Rb#.
I thought that with word registers as the base, I could at least describe the
SIMD8 QWord registers. But it does not work once I add the gpr_q_simd8 register.
Following your advice, w0-w3 are used as the subregister indices at the lowest
level to access words, and subd0-subd7 as the subregister indices at the second
level for dwords.
let Namespace = "IntelGPU" in {
foreach Index = 0-3 in {
def w#Index : SubRegIndex<16, !shl(Index, 4)>;
}
foreach Index = 0-7 in {
// def subw#Index : SubRegIndex<16, !shl(Index, 4)>;
def subd#Index : SubRegIndex<32, !shl(Index, 5)>;
// def subq#Index : SubRegIndex<64, !shl(Index, 6)>;
}
}
class IntelGPUReg<string n, bits<13> regIdx> : Register<n> {
bits<2> HStride;
bits<1> regFile;
let Namespace = "IntelGPU";
let HWEncoding{12-0} = regIdx;
let HWEncoding{15} = regFile;
}
foreach Index = 0-2047 in {
def Rw#Index : IntelGPUReg <"Rw"#Index, !shl(Index, 1)> {
let regFile = 0;
}
}
// b-->byte w-->word d-->dword q-->qword
def gpr_w : RegisterClass<"IntelGPU", [i16], 16,
(sequence "Rw%u", 0, 2047)> {
let AllocationPriority = 1;
}
def gpr_d : RegisterTuples<[w0, w1],
[(add (decimate gpr_w, 2)),
(add (decimate (shl gpr_w, 1), 2))]>;
def gpr_q : RegisterTuples<[w0, w1, w2, w3],
[(add (decimate gpr_w, 4)),
(add (decimate (shl gpr_w, 1), 4)),
(add (decimate (shl gpr_w, 2), 4)),
(add (decimate (shl gpr_w, 3), 4))]>;
//def gpr_w_simd8 : RegisterTuples<[subw0, subw1, subw2, subw3, subw4,
//                                  subw5, subw6, subw7],
// [(add (decimate gpr_w, 8)),
// (add (decimate (shl gpr_w, 1), 8)),
// (add (decimate (shl gpr_w, 2), 8)),
// (add (decimate (shl gpr_w, 3), 8)),
// (add (decimate (shl gpr_w, 4), 8)),
// (add (decimate (shl gpr_w, 5), 8)),
// (add (decimate (shl gpr_w, 6), 8)),
// (add (decimate (shl gpr_w, 7), 8))]>;
def gpr_d_simd8 : RegisterTuples<[subd0, subd1, subd2, subd3, subd4, subd5,
subd6, subd7],
[(add (decimate gpr_d, 8)),
(add (decimate (shl gpr_d, 1), 8)),
(add (decimate (shl gpr_d, 2), 8)),
(add (decimate (shl gpr_d, 3), 8)),
(add (decimate (shl gpr_d, 4), 8)),
(add (decimate (shl gpr_d, 5), 8)),
(add (decimate (shl gpr_d, 6), 8)),
(add (decimate (shl gpr_d, 7), 8))]>;
The problem shows up in the gpr_q_simd8 definition below. Using subd0-subd7
causes:
"llvm/utils/TableGen/CodeGenRegisters.cpp:1146: void
llvm::CodeGenRegBank::computeComposites(): Assertion `Idx3 && "Sub-register
doesn't have an index"' failed"
If I change it to subq0-subq7, it reports "error:Ran out of lanemask bits
to represent subregister subq4_then_w3"
Am I defining the SubRegIndex wrong, or have I misunderstood something?
Basically, I should use different SubRegIndex definitions when declaring
gpr_w_simd8/gpr_d_simd8/gpr_q_simd8, as the subregs are of different sizes,
right?
def gpr_q_simd8 : RegisterTuples<[subd0, subd1, subd2, subd3, subd4, subd5,
subd6, subd7],
[(add (decimate gpr_q, 8)),
(add (decimate (shl gpr_q, 1), 8)),
(add (decimate (shl gpr_q, 2), 8)),
(add (decimate (shl gpr_q, 3), 8)),
(add (decimate (shl gpr_q, 4), 8)),
(add (decimate (shl gpr_q, 5), 8)),
(add (decimate (shl gpr_q, 6), 8)),
(add (decimate (shl gpr_q, 7), 8))]>;
def RegD_Uniform : RegisterClass<"IntelGPU", [i32, f32], 32, (add gpr_d)>;
def RegD_SIMD8 : RegisterClass<"IntelGPU", [i32, f32], 32, (add
gpr_d_simd8)>;
def RegQ_Uniform : RegisterClass<"IntelGPU", [i64, f64], 64, (add gpr_q)>;
def RegQ_SIMD8 : RegisterClass<"IntelGPU", [i64, f64], 64, (add
gpr_q_simd8)>;
- Ruiling
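One possible reading of the "Ran out of lanemask bits" error in the word-based layout is that the two SIMD8 tuple families together need more leaf paths than a 32-bit mask can hold. The Python sketch below only does that arithmetic. It is a simplified upper bound, assuming every (outer index, word index) path needs its own bit and ignoring any sharing TableGen may perform. The names `families`, `leaf_paths` and `LANE_MASK_BITS` are hypothetical.

```python
LANE_MASK_BITS = 32  # mask width as discussed in the thread (assumption)

# (outer tuple size, word leaves per element) for each SIMD8 tuple family.
families = {
    "gpr_d_simd8": (8, 2),  # 8 dwords, each made of 2 words (w0-w1)
    "gpr_q_simd8": (8, 4),  # 8 qwords, each made of 4 words (w0-w3)
}


def leaf_paths(outer, inner):
    """Number of distinct outer-then-inner paths down to word registers."""
    return outer * inner


per_family = {name: leaf_paths(*shape) for name, shape in families.items()}
total = sum(per_family.values())

for name, count in per_family.items():
    print(name, count, count <= LANE_MASK_BITS)
print("combined", total, total <= LANE_MASK_BITS)
```

In this model each family fits on its own, but the combination does not, which would be consistent with the error appearing only after gpr_q_simd8 is added.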