[llvm-dev] [RFC] Array Register Files

Nicolai Hähnle via llvm-dev llvm-dev at lists.llvm.org
Mon Oct 8 06:54:13 PDT 2018


On 07.10.2018 21:56, Luke Kenneth Casson Leighton wrote:
> [reordering a bit]
> 
> On Sun, Oct 7, 2018 at 6:45 PM Jacob Lifshay <programmerjake at gmail.com> wrote:
> 
>> This sounds like it would also be useful for allocating the consecutive
>> registers needed to implement vectors in SimpleV, a parallelism
>> extension to RISC-V.
> 
>   yes it does.  comments inline, below
> 
>> On Sun, Oct 7, 2018, 10:39 Nicolai Hähnle via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> 
>>> The motivation is that the existing infrastructure really isn't a good
>>> fit for the AMDGPU backend. We have ~104 scalar registers and 256 vector
>>> registers.
> 
>   256 vector registers is the number that RISC-V RVV future extensions
> plan to have.  i think the RVV working group would really like to know
> of any plans in this area.
> 
>>> 1. The order of register matters: if we use only a subset of registers,
>>> and that subset is on the low end of the range, we can run more work in
>>> parallel on the hardware. (This is modeled with regalloc priorities today.)
>>>
>>> 2. We can indirectly index into both register files. If a function has
>>> an alloca'd array of 17 values, we may want to lower that as 17
>>> consecutive registers and access the array using indirect access
>>> instructions.
> 
>   SV (the custom extension jacob kindly mentioned) also has register
> redirection.  it's slightly different in that it's completely
> transparent: all *SCALAR* operations, if a register is marked in the
> redirection table, will (a) be redirected through a key-value table
> from 5 bits (the standard RISC-V register file size) to 6 bits and (b)
> checked to see if it's a vector.

I've just skimmed what's in SV, and it seems a bit different from what 
AMDGPU has, although perhaps it could be put to the same use. There are 
definitely similarities.

AMDGPU doesn't have any of the fancy unrolling of instructions; there's 
simply a mode in which an offset from a special register is added to 
certain register references. E.g., if an instruction references vector 
register v7 and the special register contains 13, it'll reference v20 
instead. (The exact details of how to enable this mode differ between 
hardware generations.)
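
Roughly, the addressing mode amounts to this (a conceptual sketch only; 
the function name is made up, and the real enable mechanism and 
encoding differ per hardware generation):

   // Conceptual model of the indirect addressing mode: the instruction
   // encodes a base vector register, and hardware adds the contents of
   // a special offset register before the access.
   unsigned effectiveVectorReg(unsigned EncodedReg, unsigned Offset) {
     return EncodedReg + Offset;  // e.g. v7 with offset 13 -> v20
   }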


>>> 3. Even this aside, the number of register classes we'd really need is
>>> quite staggering. We have machine instructions taking operands with
>>> anywhere from 1 to 12 consecutive registers.
> 
>   is that length part of the instruction or is it set up by way of an
> indirect status register of some kind?

It's complicated :)

As far as LLVM is concerned, it's always part of the MachineInstr to 
keep things sane.

In hardware, the things that use up to 12 consecutive registers are 
texture sampling instructions. The binary instruction encoding only 
contains the type of sampling (e.g., whether user-specified gradients 
are used), but not the dimensionality of the texture (1D vs. 2D vs. 3D), 
which also affects the number of source registers. The dimensionality 
of the texture is determined by its image descriptor, which is a 
MachineOperand (a 256-bit scalar register, which is really just a 
sequence of 8 consecutive 32-bit scalar registers, denoted in assembly 
as e.g. s[8:15]).


>>> What I'd like to have
>>> ---------------------
>>> I'd like to introduce the notion of array register files. Physical
>>> registers in an ARF would be described by
>>>
>>> - the register array ID
>>> - the starting index of the "register"
>>> - the ending index of the "register" (inclusive)
> 
>   let me think that through from the SV extension perspective.  with
> SV, i am following the example of RVV.  there is one "global" status
> register (example assembly code can be seen here
> https://www.sigarch.org/simd-instructions-considered-harmful/)   i did
> give serious consideration to having a per-register length... am still
> mulling that one over as it takes up a lot of extra CSR space.
> 
>   so if such a notion existed, then the SV-LLVM back-end would need to
> look at the table, go "hmm, this ARF entry requires a length of [e.g.]
> 6, the current global length is 4, let's push that on the stack and
> change the global".  which would be perfectly manageable.
> 
>   so yes, as long as those entries in the table cover contiguous
> registers, _great_.
> 
>   also, one thing: do you envisage the ARF table covering *overlapping*
> registers?  the reason i ask is because SV doesn't care if they are
> (i.e. it's the compiler's problem to sort out).
> 
>   i.e.:
> * ARF1 = { id: 0, start=r5, end=r9}
> * ARF2 = { id: 1, start=r6, end=r10}
> 
> the reason i ask is because on the isa-dev list jacob proposed an
> algorithm that could actually be implemented by issuing a parallel-add
> involving two (equal length) overlapping ARF entries: add ARF1, ARF2.

Maybe I misunderstand you, but I don't envision having an explicit "ARF 
table" like that in LLVM, precisely because I want to get away from 
opaque physreg indices.

For your examples above, the "registers" (in the LLVM CodeGen sense) 
would be represented as (and using 12 bits for start / end just to make 
the hex look nicer):

ARF1 = 0x01009005
ARF2 = 0x0100a006

(Assuming that 0x01 is used as the integer array ID; 0x02 could then be 
used for RISC-V float registers.)

So your `add ARF1, ARF2` could perhaps be represented in LLVM after 
register allocation as a MachineInstr whose MachineOperands are simply 
registers with the above numeric representation.
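
To make the bit layout concrete, here's a rough sketch of packing and 
unpacking helpers matching the hex examples above (the 12-bit start/end 
and 8-bit array ID fields are just for readability; the real scheme 
would squeeze into the 31 available bits, e.g. 13 + 13 + 5):

   #include <cstdint>

   // Illustrative encoding matching the examples above:
   //   encodeARF(1, 5, 9)  == 0x01009005   // r5..r9  in array 1
   //   encodeARF(1, 6, 10) == 0x0100a006   // r6..r10 in array 1
   uint32_t encodeARF(uint32_t ArrayID, uint32_t Start, uint32_t End) {
     return (ArrayID << 24) | (End << 12) | Start;
   }

   uint32_t arfStart(uint32_t Reg)   { return Reg & 0xfff; }
   uint32_t arfEnd(uint32_t Reg)     { return (Reg >> 12) & 0xfff; }
   uint32_t arfArrayID(uint32_t Reg) { return Reg >> 24; }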

I have no idea how you'd handle the indirection in LLVM. That's a whole 
other can of worms that AMDGPU luckily doesn't have to deal with. :)

In particular, I don't see how what I'm talking about would help you 
with managing slots in the indirection CAM. It would only help you get a 
more natural "encoding" of the underlying (already redirected) registers 
that instructions are using, for the purpose of register allocation etc.

Also, there are a whole bunch of issues with overlapping registers when 
it comes to live intervals etc., and I admit I haven't thought through 
all the implications. I'd like to make it all work, though, and I 
consider that to be one of the most important things to dig into if I 
actually start this. I think it'd be nice to avoid sub-registers 
altogether in this representation (simply always using the canonical 
representation), but we'll see how feasible that is.


>>> This can be conveniently encoded in the 31 bits we effectively have
>>> available for physical registers (e.g. 13 bits for start / end, 5 bits
>>> for register array ID).
> 
>   cool.  SV is 6 bits for start / end, 6 bits for length.
> 
>>> Register array ID 0 would be reserved for the familiar, TableGen-defined
>>> registers (we would still have some of those in AMDGPU, even with this
>>> change).
> 
>   i'm not familiar with what tablegen is, so can't comment.
> 
>>> It would necessarily have to be possible to generate register classes
>>> for ARFs both in TableGen and on-the-fly while compiling. Base register
>>> classes would be defined by:
>>>
>>> - the register array ID
>>> - the minimum start / maximum end index of registers in the class
>>> - the size of registers in the class
>>> - the alignment of registers in the class (i.e., registers must start at
>>> multiples of N, where N is a power of two)
> 
>   ok so you lost me a little, here.  so are the 256 vector registers
> sub-divideable into FP16/32/64 for example?  and, if so, would it be a
> good idea to have the size be part of the ARF table?  (id, start, end,
> bitwidth)?

Since we have a SIMT/SPMD/whatever-you-want-to-call-it model in AMDGPU, 
we mostly tend to think of the vector registers as simply 32-bit 
registers. In reality, they have 64 lanes of 32 bits each.

Unlike CPU SIMD architectures, we don't represent 64-bit values as 32 
lanes of 64 bits; instead, 64-bit values are represented as two 
consecutive "32-bit" registers. So we store 32-bit values in v0, v1, 
..., and 64-bit values in v[0:1], v[1:2], ...

That way, every thread of the original program stays within one lane of 
the SIMD structure.

For 16-bit values, we use half-registers, i.e. each 16-bit value 
occupies either the low or the high half of a "32-bit" vector register.


>   or, are you talking about having *another* structure that has that
> bitwidth?  i'm slightly confused as to what the difference between the
> ARF start/end and the above-mentioned "minimum start / maximum end index
> of registers in the class" would be.

The "min start / max end" is kind of a detail thing which we might not 
actually need for AMDGPU. I was thinking that some architecture might 
have encodings that limit the reachable range of instruction operands 
(kind of like the RISC-V compressed instructions thing).

To describe such MachineInstr opcodes in an LLVM backend, you need 
register classes with a few more restrictions.


>>> ... and then register classes might be the union of such register
>>> classes and traditional register classes.
> 
>   that sounds scary.

I know. But unfortunately necessary :)


>>> (For example, in AMDGPU we would have a register class that includes all
>>> registers from the scalar array, with size 2, starting at an even offset,
>>> union'd with a class containing some special registers such as VCC.)
> 
>   oh right!  yes, i get it.  no, i like it.  in SV there is an
> additional entry per register which specifies the bitwidth: default,
> half, doubled, and 8-bit.  requires only 2 bits to express.
> 
>   with the union system that you propose, the different possible
> bitwidths, even though the same underlying registers are used, could
> be represented.
> 
>>> A similar scheme would have to be used for sub-register indices.
>>>
>>> I haven't dug too deeply into this yet, and clearly there are quite a
>>> number of thorny issues that need to be addressed -- it's a rather big
>>> project.
> 
>   yes :)
> 
>>> But so far I'm not aware of anything absolutely fundamental
>>> that would prevent doing this.
>>>
>>>
>>> What I'm asking you at this point
>>> ---------------------------------
>>> Like I said, I haven't actually started any of this work (and probably
>>> won't for some time).
>>>
>>> However, if you have any fundamental objections to such a change, please
>>> speak up now before I or anybody else embarks on this project. I want to
>>> be confident that people are okay with the general direction.
> 
>   yyeah, it's quite timely.
> 
>>> Also, if you have any advice or suggestions (maybe an alternative that
>>> would also fit the requirements of the AMDGPU backend), I'd be happy to
>>> hear about it!
> 
>   a couple of questions: i don't know enough about AMDGPU to be able to
> say technically if the approach will work specifically for that
> architecture: i can however raise some questions about the general
> approach.
> 
>   (1) does the AMDGPU have the concept of predication, and if so has a
> data structure been considered to cover it.

Yes, although it's mostly implied because we do the whole SIMT/SPMD 
thing. There's a dedicated EXEC register which masks lanes for all vector 
instructions.
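
In pseudo-code, a vector instruction behaves roughly like this (a sketch 
only; lane count and element type simplified):

   #include <cstdint>

   // EXEC is a 64-bit mask; a vector op only writes lanes whose bit is
   // set, which is how per-lane predication / control flow is handled.
   void maskedVectorAdd(uint32_t Dst[64], const uint32_t A[64],
                        const uint32_t B[64], uint64_t Exec) {
     for (unsigned Lane = 0; Lane < 64; ++Lane)
       if (Exec & (uint64_t(1) << Lane))
         Dst[Lane] = A[Lane] + B[Lane];
   }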


>   (2) what's the difference between the start/end indices of the ARF
> and the "class"?
>
>   (3) what's the reason for putting the bitwidth (and alignment) into
> the "class", i.e. why have the "class" rather than just have one data
> structure containing everything: the ARF?

For both of those, I hope the above already clarified it a bit, but the 
general answer is that a "register class" describes constraints, e.g. on 
the operands of specific machine instructions, which the register 
allocator has to respect.

For example: if you had 256 registers, but some instruction encodings 
are limited to the low 64 registers, and you're looking at an 
instruction that takes an operand of 4 consecutive registers, you'd use 
a register class with "min start = 0, max end = 66" (assuming the 
limitation is purely an encoding limitation, i.e. only the start 
register has to be encodable, so the group can extend up to register 66).

The register allocator would then allocate 4 consecutive registers in 
that range, e.g. it could come up with "register start = 17, end = 20".

(In the AMDGPU case, we also have alignment restrictions; e.g. scalar 
4-register descriptors must be aligned to multiples of 4: s[0:3], 
s[4:7], ...)
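
As a sketch, such a base register class would be little more than a 
constraint record that the allocator checks candidate registers against 
(struct and function names here are made up for illustration):

   // Illustrative constraint record for an ARF-based register class.
   struct ARFRegClass {
     unsigned ArrayID;   // which register array
     unsigned MinStart;  // lowest allowed start index
     unsigned MaxEnd;    // highest allowed end index (inclusive)
     unsigned Size;      // number of consecutive registers
     unsigned Align;     // start index must be a multiple of this
   };

   // Does the register group [Start, Start + Size - 1] in ArrayID
   // satisfy the class?  For the example above, {ID, MinStart = 0,
   // MaxEnd = 66, Size = 4, Align = 1} accepts Start = 17 (covering
   // registers 17..20).
   bool contains(const ARFRegClass &RC, unsigned ArrayID, unsigned Start) {
     unsigned End = Start + RC.Size - 1;
     return ArrayID == RC.ArrayID && Start >= RC.MinStart &&
            End <= RC.MaxEnd && Start % RC.Align == 0;
   }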

Cheers,
Nicolai


> 
> l.
> 


-- 
Learn how the world really is,
but never forget how it ought to be.

