[llvm-dev] [RFC] Array Register Files

Sun Oct 7 12:56:19 PDT 2018

[reordering a bit]

On Sun, Oct 7, 2018 at 6:45 PM Jacob Lifshay <programmerjake at gmail.com> wrote:

> This sounds like it would also be useful for allocating the consecutive
> registers needed to implement vectors in SimpleV, a parallelism
> extension to RISC-V.

 yes it does.  comments inline, below

> On Sun, Oct 7, 2018, 10:39 Nicolai Hähnle via llvm-dev <llvm-dev at lists.llvm.org> wrote:

>> The motivation is that the existing infrastructure really isn't a good
>> fit for the AMDGPU backend. We have ~104 scalar registers and 256 vector
>> registers.

 256 vector registers is the number that RISC-V RVV future extensions
plan to have.  i think the RVV working group would really like to know
of any plans in this area.

>> 1. The order of register matters: if we use only a subset of registers,
>> and that subset is on the low end of the range, we can run more work in
>> parallel on the hardware. (This is modeled with regalloc priorities today.)
>>
>> 2. We can indirectly index into both register files. If a function has
>> an alloca'd array of 17 values, we may want to lower that as 17
>> consecutive registers and access the array using indirect access
>> instructions.

 SV (the custom extension jacob kindly mentioned) also has register
redirection.  it's slightly different in that it's completely
transparent: all *SCALAR* operations, if a register is marked in the
redirection table, will (a) be redirected through a key-value table
from 5 bits (the standard RISC-V register file size) to 6 bits and (b)
checked to see if it's a vector.

>> 3. Even this aside, the number of register classes we'd really need is
>> quite staggering. We have machine instructions taking operands with
>> anywhere from 1 to 12 consecutive registers.

 is that length part of the instruction or is it set up by way of an
indirect status register of some kind?

>> What I'd like to have
>> ---------------------
>> I'd like to introduce the notion of array register files. Physical
>> registers in an ARF would be described by
>>
>> - the register array ID
>> - the starting index of the "register"
>> - the ending index of the "register" (inclusive)

 let me think that through from the SV extension perspective.  with
SV, i am following the example of RVV.  there is one "global" status
register (example assembly code can be seen here
https://www.sigarch.org/simd-instructions-considered-harmful/)   i did
give serious consideration to having a per-register length... am still
mulling that one over as it takes up a lot of extra CSR space.

 so if such a notion existed, then the SV-LLVM back-end would need to
look at the table, go "hmm, this ARF entry requires a length of [e.g.]
6, the current global length is 4, let's push that on the stack and
change the global".  which would be perfectly manageable.

 so yes, as long as those entries in the table cover contiguous
registers, _great_.

 also, one thing: do you envisage the ARF table covering *overlapping*
registers?  the reason i ask is because SV doesn't care if they are
(i.e. it's the compiler's problem to sort out).

 i.e.:
* ARF1 = { id: 0, start=r5, end=r9}
* ARF2 = { id: 1, start=r6, end=r10}

the reason i ask is because on the isa-dev list jacob proposed an
algorithm that could actually be implemented by issuing a parallel-add
involving two (equal length) overlapping ARF entries: add ARF1, ARF2.

>> This can be conveniently encoded in the 31 bits we effectively have
>> available for physical registers (e.g. 13 bits for start / end, 5 bits
>> for register array ID).

 cool.  SV is 6 bits for start / end, 6 bits for length.

>> Register array ID 0 would be reserved for the familiar, TableGen-defined
>> registers (we would still have some of those in AMDGPU, even with this
>> change).

 i'm not familar with what tablegen is, so can't comment.

>> It would necessarily have to be possible to generate register classes
>> for ARFs both in TableGen and on-the-fly while compiling. Base register
>> classes would be defined by:
>>
>> - the register array ID
>> - the minimum start / maximum end index of registers in the class
>> - the size of registers in the class
>> - the alignment of registers in the class (i.e., registers must start at
>> multiples of N, where N is a power of two)

 ok so you lost me a little, here.  so are the 256 vector registers
sub-divideable into FP16/32/64 for example?  and, if so, would it be a
good idea to have the size be part of the ARF table?  (id, start, end,
bitwidth)?

 or, are you talking about having *another* structure that has that
bitwidth?  i'm slightly confused as to what the difference between the
ARF start/end and the above-mentioned  "mnimum start / maxim end index
of registers in the class" would be.

>> ... and then register classes might be the union of such register
>> classes and traditional register classes.

 that sounds scary.

>>
>> (For example, in AMDGPU we would have a register class that includes all
>> register from the scalar array, with size 2, starting at an even offset,
>> union'd with a class containing some special registers such as VCC.)

 oh right!  yes, i get it.  no, i like it.  in SV there is an
additional entry per register which specifies the bitwidth: default,
half, doubled, and 8-bit.  requires only 2 bits to express.

 with the union system that you propose, the different possible
bitwidths, even though the same underlying registers are used, could
be represented.

>> A similar scheme would have to be used for sub-register indices.
>>
>> I haven't dug too deeply into this yet, and clearly there are quite a
>> number of thorny issues that need to be addressed -- it's a rather big
>> project.

 yes :)

>> But so far I'm not aware of anything absolutely fundamental
>> that would prevent doing this.
>>
>>
>> What I'm asking you at this point
>> ---------------------------------
>> Like I said, I haven't actually started any of this work (and probably
>> won't for some time).
>>
>> However, if you have any fundamental objections to such a change, please
>> speak up now before I or anybody else embarks on this project. I want to
>> be confident that people are okay with the general direction.

 yyeah, it's quite timely.

>> Also, if you have any advice or suggestions (maybe an alternative that
>> would also fit the requirements of the AMDGPU backend), I'd be happy to
>> hear about it!

 a couple of questions: i don't know enough about AMDGPU to be able to
say technically if the approach will work specifically for that
architecture: i can however raise some questions about the general
approach.

 (1) does the AMDGPU have the concept of predication, and if so has a
data structure been considered to cover it.

 (2) what's the difference between the start/end indices of the ARF
and the "class"?

 (3) what's the reason for putting the bitwidth (and alignment) into
the "class", i.e. why have the "class" rather than just have one data
structure containing everything: the ARF?

l.