[llvm-dev] [RFC] Array Register Files

Luke Kenneth Casson Leighton via llvm-dev llvm-dev at lists.llvm.org
Mon Oct 8 10:53:27 PDT 2018


[hi nicolai, i am so sorry this is enormous! please do take your time.
also, is there a resource somewhere, a wiki of some description, where
the team at AMD would be happy to document things collaboratively? i
would be happy to host a temporary home for such a document, if that
would help.]

On Mon, Oct 8, 2018 at 2:54 PM Nicolai Hähnle <nhaehnle at gmail.com> wrote:

> >   SV (the custom extension jacob kindly mentioned) also has register
> > redirection.  it's slightly different in that it's completely
> > transparent: all *SCALAR* operations, if a register is marked in the

> I've just skimmed what's in SV,

  oh! sorry, i forgot to give you a reference (and a caveat: it's
still at the stage of being implemented in a simulator, which is
rapidly hammering down on "that would be nice to have" and turning it
into "this is far more practical, let's go with that", if you get my
drift).

> and it seems a bit different from what
> AMDGPU has, although perhaps it could be put to the same use. There are
> definitely similarities.

 not knowing the full details of AMDGPU, that's interesting to hear.

> AMDGPU doesn't have any of the fancy unrolling of instructions;

 oh? intriguing: when you mentioned that AMDGPU is a vector processor,
i assumed that it would have vector lengths... hmm, oh, i know what you
mean: the "implicit hardware macro-loop unrolling" thing, where this:

 SETVLENGTH 3
 add r0, r4 # normally this is a scalar add

 actually gets translated to this:

 for i in range(3):
     add r(0+i), r(4+i) # 3 scalar adds

it was the simplest way i could think of that would mean an
implementor could take a standard "scalar" RISC implementation and
turn it into a "vector" one.  no *actual* modification of the ALUs, at
all.  all the SV logic sits in between the decoder phase and the
instruction issue phase.
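
 to make that concrete, here's a minimal python sketch (the names are
entirely made up, so treat it as illustrative only) of what that
in-between logic does:

# one scalar op with VL set expands into VL element-wise scalar ops;
# the ALUs themselves are completely unmodified
def issue_vectorised(op, rd, rs, VL, issue_queue):
    for i in range(VL):
        issue_queue.append((op, rd + i, rs + i))

issue_queue = []
issue_vectorised("add", 0, 4, 3, issue_queue)
# issue_queue: [("add", 0, 4), ("add", 1, 5), ("add", 2, 6)]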

> there's
> simply a mode in which an offset from a special register is added to
> certain register references. E.g., if an instruction references vector
> register v7 and the special register contains 13, it'll reference v20
> instead. (The exact details of how to enable this mode differs between
> hardware generations.)

  okay.  so this is similar to SV's "redirection" table: 5-bit key,
6-bit value (plus an element width and a boolean "scalar/vector"
flag).  that way, the *standard* scalar ops (normally restricted to
32 regs) can redirect to any of 64 actual target regs.

in this way, there's no need to have a scalar-vector opcode, a
vector-scalar opcode, a vector-vector opcode *and* a scalar-scalar
opcode: it's just... all one opcode.
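
 a tiny python sketch of how i think of that redirection CAM (the
names are mine, not from the spec, so purely illustrative):

# 5-bit architectural regnum -> (6-bit actual regnum, elwidth, is_vector)
redirect = {}

def remap(regnum):
    if regnum in redirect:
        return redirect[regnum]
    return (regnum, "default", False)  # no entry: plain scalar behaviour

redirect[7] = (20, "default", True)    # r7 now redirects to actual r20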

> >>> 3. Even this aside, the number of register classes we'd really need is
> >>> quite staggering. We have machine instructions taking operands with
> >>> anywhere from 1 to 12 consecutive registers.
> >
> >   is that length part of the instruction or is it set up by way of an
> > indirect status register of some kind?
>
> It's complicated :)

:)

> As far as LLVM is concerned, it's always part of the MachineInstr to
> keep things sane.

 ok.  so, some state gets set up, and the instruction's meaning gets
changed? (that's how SV does it).  or is it more complex than that?

> In hardware, the things that use up to 12 consecutive registers are
> texture sampling instructions.

 ok!  so, here, if an instruction is issued, some state has to be set
(somewhere), and, if set, the instruction uses registers s[8] through
to s[15] instead of s[8] through to s[12]?  the important thing being,
it's *not* the *instruction* that gets modified, it's the state
associated with the *register* that's modified, is that right? (for a
given approximation of "right")

 (because, if so, that's pretty much exactly what SV does)

> The binary instruction encoding only
> contains the type of sampling (e.g., whether user-specified gradients
> are used), but not the dimensionality of the texture (1D vs. 2D vs. 3D),
> which also affects the number of source registers.

 that's fascinating.  extremely sophisticated.


> >>> I'd like to introduce the notion of array register files. Physical
> >>> registers in an ARF would be described by
> >>>
> >>> - the register array ID
> >>> - the starting index of the "register"
> >>> - the ending index of the "register" (inclusive)
> > [....]
> > * ARF1 = { id: 0, start=r5, end=r9}
> > * ARF2 = { id: 1, start=r6, end=r10}

> Maybe I misunderstand you, but I don't envision having an explicit "ARF
> table" like that in LLVM, precisely because I want to get away from
> opaque physreg indices.
>
> For your examples above, the "registers" (in the LLVM CodeGen sense)
> would be represented as (and using 12 bits for start / end just to make
> the hex look nicer):
>
> ARF1 = 0x01009005
> ARF2 = 0x0100a006

 ok so just bit-field representations.  that works.  SV (presently)
has that global vector-length rather than an explicit per-register
length; that's not a problem: it just means that the (global) VL state
would need to be pushed onto the stack if ever it needed to be changed.

> (Assuming that 0x01 is used as the integer array ID; 0x02 could then be
> used for RISC-V float registers.)

 oh.   err... ah so the ID is a *type* indicator rather than an actual
indicator of a register *number*?  type 0x01 = int, type 0x02 = float.
okaaay, got it.

 and... if it's a scalar it would obviously be represented as
"0x01009009", since the start would equal the end (inclusive).

and ID for AMDGPU would have 0x03 for "texture" register, for example?
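
 (to check my own understanding, a little python sketch of that
packing, using your 12-bit start/end fields with the ID in the top
byte:)

def pack_arf(array_id, start, end):
    # ID in bits 24-31, end in bits 12-23, start in bits 0-11
    return (array_id << 24) | (end << 12) | start

def unpack_arf(v):
    return ((v >> 24) & 0xff, v & 0xfff, (v >> 12) & 0xfff)

assert pack_arf(0x01, 5, 9) == 0x01009005   # ARF1 above
assert pack_arf(0x01, 6, 10) == 0x0100a006  # ARF2 above
assert pack_arf(0x01, 9, 9) == 0x01009009   # a scalar: start == end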

 *thinks*... hmmm... if that table were to be expanded out in full,
for SV, it would contain... 63+62+...+1 = 2016 entries.  x0 is
hardwired to zero in RISC-V (so there is no sense "vectorising" that);
x1 could have VL set anywhere from 1 to 63 (to mean "when you use x1,
please do a hardware-loop issuing scalar instructions on x1, then x2,
then x3 ..... up to x63"); then x2 could have VL set anywhere from 1
to 62 (again, hitting the limit at x63); and so on.

 it's a very, very large table which, judging from the consecutive
overlapping / re-use in AMDGPU, would not actually give a genuine "if
you use this ARF you still have *all* possible other entries remaining
and immediately usable" guarantee, if you get my drift.

 i.e. if a specific texture ARF were to be used which had a length of
12, it's *NOT* just that *one* ARF that would have to be marked as "in
use": its use would actually knock *multiple* holes in the ARF table,
because all the entries which referred *to* the underlying actual
registers covered by that length range would also need to be marked as
"unavailable".

would you concur?
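
 in python terms (a sketch of the idea only, not a proposal for the
actual LLVM data structures), the hole-knocking would look something
like:

from collections import namedtuple

ARF = namedtuple("ARF", ["array_id", "start", "end"])  # end inclusive

def overlaps(a, b):
    return (a.array_id == b.array_id and
            a.start <= b.end and b.start <= a.end)

def allocate(chosen, available):
    # using one ARF removes it *and* every other ARF that shares any
    # of the underlying registers in its [start, end] range
    for arf in list(available):
        if arf == chosen or overlaps(arf, chosen):
            available.discard(arf)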

> I have no idea how you'd handle the indirection in LLVM. That's a whole
> other can of worms that AMDGPU luckily doesn't have to deal with. :)

 :)

> In particular, I don't see how what I'm talking about would help you
> with managing slots in the indirection CAM. It would only help you get a
> more natural "encoding" of the underlying (already redirected) registers
> that instructions are using, for the purpose of register allocation etc.

 okaaay, so you're considering setting up the indirection for a
particular one-shot purpose (a texture rendering, for example) and not
changing it, ever, until that execution is completed?

 which seems like an absolutely perfectly reasonable and rational
thing to do, and has the advantage of making the fact that redirection
exists completely sane rather than absolutely mind-bendingly NUTS :)

 i'd been thinking of designing some conventions for SV's registers;
two in particular stand out: one is LOAD-Multi/STORE-Multi (for
function entry and exit), and the other is for userspace-kernelspace
context-switches: set VL=63, do a stack-store operation on x1, and
BLAM, the entire register file is saved to the stack with a single
instruction.
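
 in pseudo-python (regs and memory being stand-ins, obviously), that
context-save trick looks like this:

def context_save(regs, memory, sp):
    # one scalar store of x1 with VL=63 becomes a hardware loop
    # storing x1..x63 (x0 is hardwired to zero, so there is no
    # point saving it)
    VL = 63
    for i in range(VL):
        memory[sp + i * 8] = regs[1 + i]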

 going beyond that: setting some caller-callee conventions as well, so
that certain ranges of registers are used as function call parameters,
some as temporaries and so on, all makes perfect sense: set it up
once, and leave it *well* alone.

> Also, there are a whole bunch of issues with overlapping registers when
> it comes to live intervals etc., and I admit I haven't thought through
> all the implications.

 i can honestly say that i feel the most sensible thing there would be
to leave overlaps to assembly writers for now.  unless you happen to
have access to a couple of 190 IQ geniuses at AMD... who are planning
to stay there for life :)

 i worked for Aspex Semiconductors back in 2003 (well before they were
bought by Ericsson), and the registers there... well... um... they
were constructed dynamically from 2-bit ALUs.  so hypothetically you
could have the entire processor operate on a single 8192-bit-long
"register", or you could break it down as far as 4096 separate 2-bit
SIMD registers.

 it was insane.

 we actually had to hand-write assembly algorithms in spreadsheets of
all things, emulating all possible permutations of register sizes,
calculating the time taken by the algorithm if the register size was
2, or 4, or 8 .... all because the memory load / store time was fixed
(obviously), but any given algorithm's duration varied with the
register size.  to know the optimal algorithm speed you *literally*
had to go through all possible permutations.

 productivity was measured in days per line of code.  no, not lines of
code per day: DAYS per line of code.

 bottom line: if there's *any way* that you can get the registers set
up (even if it's by hand or by convention) in advance, and not change
their configurations except when that particular algorithm is done and
out of the CPU, i'd really *really* recommend and concur with that
approach, initially.

 even if the proposed backend were to automate the generation of
instructions based on pre-arranged (hand-allocated) register
allocations, that would be infinitely better than how things had to be
done with the Aspex ASP.

 once that's in place, later you could look at running optimisation
loops (dumb brute-force ones, even) which test different register
allocations, and, through a rapid iteration process (or just
brute-force search) get a better handle on things.


> >   ok so you lost me a little, here.  so are the 256 vector registers
> > sub-divideable into FP16/32/64 for example?  and, if so, would it be a
> > good idea to have the size be part of the ARF table?  (id, start, end,
> > bitwidth)?
>
> Since we have a SIMT/SPMD/whatever-you-want-to-call-it model in AMDGPU,
> we mostly tend to think of the vector registers as simply 32-bit
> registers. In reality, they have 64 lanes of 32 bits each.
>
> Unlike CPU SIMD architectures, we don't represent 64-bit values as 32
> lanes of 64 bits; instead, 64-bit values are represented as two
> consecutive "32-bit" registers. So we store 32-bit values in v0, v1,
> ..., and 64-bit values in v[0:1], v[1:2], ...

 ok so *exactly* how SV works :)  except not with the overlaps, perhaps.

 in SV it's permitted to break down as far as 8-bit, although i
would not recommend actually breaking the register file itself down
into 64 x 8 separate and distinct 8-bit "registers".  [if done at that
depth, the register-renaming needed for an OoO microarchitecture would
be a huge burden, as a single 64-bit operation would, in hardware,
need to carry EIGHT renamed registers with it.  not nice]

 so, i said "x1" earlier... actually that is as follows:

union {
    unsigned char  b_reg[64*8];  /* 8-bit  view: 512 elements */
    unsigned short s_reg[64*4];  /* 16-bit view: 256 elements */
    unsigned int   i_reg[64*2];  /* 32-bit view: 128 elements */
    unsigned long  l_reg[64];    /* 64-bit view:  64 elements */
} *regs;

and so x1 would be accessed as either:

regs->b_reg[1*8] OR
regs->s_reg[1*4] OR
regs->i_reg[1*2] OR
regs->l_reg[1]

but it is important to note that there is no way to directly access
regs->b_reg[3], for example [except through vector predication].
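
 the index calculation implied by that union is worth spelling out (a
python sketch): register n always starts at byte offset n*8, so at a
given element width the *only* directly-addressable element index for
register n is:

def reg_index(n, ew_bytes):
    assert ew_bytes in (1, 2, 4, 8)
    return n * (8 // ew_bytes)

assert reg_index(1, 1) == 8   # regs->b_reg[1*8]
assert reg_index(1, 2) == 4   # regs->s_reg[1*4]
assert reg_index(1, 4) == 2   # regs->i_reg[1*2]
# b_reg[3] is not of the form n*8, hence unreachable directly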

in SV it is highly likely that implementors would go "wtf? i am *not*
subdividing the register file into 64*8 separate and distinct
registers!" and would instead settle for a predicated SIMD ALU at some
depth that they feel comfortable with (similar to AMDGPU's choice,
below, of 32-bit).

so effectively very, very similar to AMDGPU, i think?  except where
AMDGPU would allow v[0:1], v[1:2], v[2:3], SV would *not* allow
v[1:2]: it would, in effect, *only* allow v[0:1] and v[2:3], and i
believe this may have been what you were referring to with the "class
restrictions" (below).

HOWEVER... it just occurred to me: on an RV32 system, setting
"elwidth = default x 2" on a given register would give pretty much the
*exact* same behaviour in SV as in AMDGPU.  in that particular case,
an "add x2, x6" would result in x2+x3 being treated as a single 64-bit
operand, and x6+x7 likewise.  in AMDGPU terminology it would be
equivalent to "ADD v[2:3], v[6:7]".
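
 a sketch of that in python, purely to illustrate the pairing (not
how the hardware would actually do it):

def pair_read(regs32, n):
    # register n names the pair (x{n}, x{n+1}), treated as one
    # 64-bit operand on an RV32 register file
    return regs32[n] | (regs32[n + 1] << 32)

def pair_write(regs32, n, value):
    regs32[n] = value & 0xffffffff
    regs32[n + 1] = (value >> 32) & 0xffffffff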

also... please don't freak out, i'm having difficulty coping with the
microarchitectural implications myself... in both SV and RVV it's
possible to set *different* element widths on the source operands and
destination operands, and the microarchitecture has the fun job of
working out how to convert and sign-extend between the different
element widths.
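
 the conversion itself is conceptually simple (a python sketch; the
microarchitectural cost is the scary part):

def convert(value, src_bits, dst_bits):
    # sign-extend from the source width, then truncate to the
    # destination width
    value &= (1 << src_bits) - 1
    if value & (1 << (src_bits - 1)):
        value -= 1 << src_bits
    return value & ((1 << dst_bits) - 1)

assert convert(0xFF, 8, 16) == 0xFFFF  # -1 stays -1
assert convert(0x7F, 8, 16) == 0x007F  # +127 stays +127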


> That way, every thread of the original program stays within one lane of
> the SIMD structure.

 indeed.  SV by contrast is designed to fit either on a [predicated!]
SIMD microarchitecture *or* a multi-issue superscalar one.  if a
multi-issue superscalar architecture is to be used, that "parallel
loop" you saw pseudo-code for in the spec would just... push
individual element-wise "unpackaged" scalar ops into the *standard*
scalar instruction FIFO.

 if a SIMD microarchitecture is chosen, all that happens is that, if
the SIMD width is 4, then 4 bits of predicate are pushed down along
with each batch of 4 operands into the SIMD ALU.  at the end of a loop
(where the last batch is only 3, 2 or 1 elements long), implicit
predication masks off the unused lanes, automatically.

 goodbyyye SIMD corner-cases :)
https://www.sigarch.org/simd-instructions-considered-harmful/
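
 the tail-masking is easiest to see in python (a sketch of the
concept only):

def batches(VL, simd_width=4):
    # each batch of operands carries its own predicate bits; the
    # final short batch masks off the unused lanes automatically
    for base in range(0, VL, simd_width):
        lanes = min(simd_width, VL - base)
        yield base, (1 << lanes) - 1

assert list(batches(7)) == [(0, 0b1111), (4, 0b0111)]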

 point being: as far as the compiler / assembly writer is concerned,
SV keeps the underlying microarchitecture opaque.  apart from speed
differences, you genuinely don't care.


> For 16-bit, we use half-registers, i.e. each 16-bit value uses either
> the low or high half of a "32-bit" vector register.

 okay, so, starting to get SIMD-like, there.  and i see you have
predication, below, or "lane masking".

> >   or, are you talking about having *another* structure that has that
> > bitwidth?  i'm slightly confused as to what the difference between the
> > ARF start/end and the above-mentioned "minimum start / maximum end
> > index of registers in the class" would be.
>
> The "min start / max end" is kind of a detail thing which we might not
> actually need for AMDGPU. I was thinking that some architecture might
> have encodings that limit the reachable range of instruction operands
> (kind of like the RISC-V compressed instructions thing).

 from what i've seen, traditional vector architectures typically
don't: their life is complicated enough as it is.  for SV, on the
other hand, it would actually be necessary to explicitly add logic to
exclude the C (compressed) instructions, because they're actually a
"remapper" to standard 32-bit opcodes.  it's just that some of the
remapping has, as you say, restricted range.

 ok so yes, having a way to express that restricted range, excellent idea.

> To describe such MachineInstr opcodes in an LLVM backend, you need
> register classes with a few more restrictions.
>
>
> >>> ... and then register classes might be the union of such register
> >>> classes and traditional register classes.
> >
> >   that sounds scary.
>
> I know. But unfortunately necessary :)

 i'm starting to get where you're coming from.  it hasn't totally sunk
in... i *believe* it may be the union thing i expressed above (and the
architectural similarities between AMDGPU and SV are closer than we
initially thought).

 so far, i believe i've been able to determine that:

* where the AMDGPU register files are broken down into 32-bit
segments, SV can break down as far as 8-bit (although implementors are
free to not do so, and instead to implement a SIMD microarchitecture
*as long as it is a predicated one*)
* where the AMDGPU register file naming conventions are always at the
32-bit level, SV's can be 32, 64 or 128 bits, depending on whether
the implementation is RV32, RV64 or RV128.


> >   (1) does the AMDGPU have the concept of predication, and if so has a
> > data structure been considered to cover it.
>
> Yes, although it's mostly implied because we do the whole SIMT/SPMD
> thing. There's a dedicated EXEC register which masks lanes for all vector
> instructions.

 ok, so this is where SV definitely, significantly differs: there's a
secondary CSR (control and status register) "predication" table, where
(in english) the logic goes as follows:

 "before the instruction is executed, check to see if the destination
register has an entry in the "predication" key-value store.  if it
does, use the listed target register as a predicate".

 in this way it is possible to have any and *all* registers
predicated.  in combination with redirection, it would even be
possible for one operation to be predicated (targeting the same
underlying registers) and the very next operation *not* be
predicated... or for the predicate to be inverted... but using the
*exact* same underlying registers.

 so that would allow if-then-else constructs to be done, for example,
without needing a global mask instruction in between them.
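
 a python sketch of that lookup (again, with invented names):

pred_table = {}  # dest regnum -> (regnum holding the mask, invert flag)

def get_predicate(dest, regs):
    if dest in pred_table:
        mask_reg, invert = pred_table[dest]
        mask = regs[mask_reg]
        return ~mask if invert else mask
    return ~0    # no entry: all elements enabled

# an if-then-else: the "then" and "else" destinations get entries
# pointing at the *same* mask register, differing only in the invert
# flag, so no global mask flip is needed in between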


> >   (2) what's the difference between the start/end indices of the ARF
> > and the "class"?
>  >
> >   (3) what's the reason for putting the bitwidth (and alignment) into
> > the "class", i.e. why have the "class" rather than just have one data
> > structure containing everything: the ARF?
>
> For both of those, I hope the above already clarified it a bit,

 yes.  getting there.

> but the
> general answer is that "register class" describes constraints e.g. on
> the operands of specific machine instructions which the register
> allocator would follow.

 makes sense.  i wouldn't have thought of that approach.

> For example: If you had 256 registers, but some instruction encodings
> are limited to the low 64 registers, and you're looking at an
> instruction that takes consecutive 4-element registers, you'd use a
> register class with "min start = 0, max end = 66" (assuming the
> limitation is purely an encoding limitation).
>
> The register allocator would then allocate some consecutive 4 registers
> in that range, e.g. it could come up with "register start = 17, end = 20".
>
> (In the AMDGPU case, we also have alignment restrictions; e.g. scalar
> 4-register descriptors must be aligned to multiples of 4: s[0:3],
> s[4:7], ...)

 yes.  so there would effectively be a similar restriction for SV [1]:
if a register's element-width CAM entry is set to "16-bit", for
example, then any 64-bit or 32-bit "add" would actually be interpreted
as a 16-bit add...

... however, looking back up at that union table i wrote (for
reference), with an instruction "add x13, x13" it would *NOT* be
possible to access regs->s_reg[13], for example; it would *only* be
possible to access regs->s_reg[13*4].  likewise, if the entry was set
to 8-bit, it would only be possible to access regs->b_reg[13*8].

this is really, really cool.  it would, i think, be great to get ARM's
input here, as well as the RISC-V RVV working group's input, to see
how and whether those respective architectures would fit into the data
structure(s) envisaged.  also, i am bcc'ing a couple of people from
other companies that i know are doing RISC-V vectorisation
architectures.  they may also be interested.

also, can i suggest: just as SV needs the trick of putting the
"global" vector length onto the stack (as an implementation detail)
where AMDGPU can just set up the lengths of registers once and only
once, that we consider creating a similarly "general-purpose" scheme
for predication, even though things would be reversed for AMDGPU
compared to SV: it would be AMDGPU having to save/restore a "global
register", where SV could set up a table once and only once.

other architectures may have something in between (or just different).
Nyuzi (by Jeff Bush), for example, i know actually has the predication
as a fixed part of the opcode: 4 bits of the (64-bit) opcode are
reserved as a mask.  RVV has only the one predicate register (v1)
https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc
- however its use is specifically coded (activated) in every single
vector opcode (2 bits are reserved in every opcode to indicate whether
it's to be used at all, inverted or not, and so on).  i don't know
enough about ARM SVE to say what they might have?

i feel it would be good to properly recognise predication as an
integral part of the backend data structures (even if some
architectures had to save/restore global state to get the "full"
features), because predication is fundamentally how parallel
architectures appear to do branches.

many *many* apologies, this took a long time to write, i do appreciate
that this is a complex topic.

l.

 [1] [i need to document this] although maybe it's not a good idea to
mention it: because exceptions can occur in the middle of a parallel
operation (one of the "real" scalar ops requiring e.g. a virtual
memory trap), the "loop" you spotted in the spec pseudo-code actually
has to be re-entrant, and that in turn means that the "element offset
index" has to be saveable and restorable state.  i am reluctant to
mention it in case compiler writers decide to abuse it to overcome the
alignment limitations :)
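
 (a python sketch of that re-entrancy, with a hypothetical trap type,
just to show why the element index must be architectural state:)

class VirtualMemoryTrap(Exception):
    pass

def run_vector_op(element_op, state):
    while state.element_index < state.VL:
        try:
            element_op(state.element_index)  # one scalar element op
        except VirtualMemoryTrap:
            return False  # element_index preserved; re-enter later
        state.element_index += 1
    state.element_index = 0  # loop complete: reset for the next op
    return True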

