[llvm-dev] [RFC] Array Register Files

Mon Oct 8 20:42:52 PDT 2018

This thread is now mixing two very different notions of length and 
arrays, and your requirements sound very different from Nicolai's. What 
you're describing also seems to rest on a misunderstanding of what AMDGPU 
does.

So, for clarity, a brief summary of the AMDGPU situation:

(Up to) "~104" scalar regs at 32b each, (up to) 256 vector regs at 2048b 
each (subdivided into 64 lanes of 32b each). Fixed vector length, not 
configurable by app. Every vector operation is implicitly predicated by 
EXEC mask (stored as 64 bits split across 2 scalar registers, one bit 
per lane).
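
To make the predication concrete, here is a rough model in Python. It 
is only a sketch: the function names are made up for illustration, the 
"add" stands in for any vector operation, and it assumes that lanes 
whose EXEC bit is clear simply keep their previous contents.

```python
LANES = 64            # fixed vector length: 64 lanes of 32b each
MASK32 = 0xFFFFFFFF

def exec_mask(exec_lo, exec_hi):
    """Join the two 32-bit scalar halves into one 64-bit lane mask."""
    return ((exec_hi & MASK32) << 32) | (exec_lo & MASK32)

def masked_add_u32(dst, src0, src1, exec_lo, exec_hi):
    """Hypothetical vector add: only lanes enabled in EXEC are written."""
    mask = exec_mask(exec_lo, exec_hi)
    out = list(dst)
    for lane in range(LANES):
        if (mask >> lane) & 1:
            out[lane] = (src0[lane] + src1[lane]) & MASK32
    return out

# Lanes 0, 2 and 32 are active; every other lane keeps its old value.
old = [0xDEAD] * LANES
result = masked_add_u32(old, list(range(LANES)), [1] * LANES,
                        exec_lo=0b101, exec_hi=0b1)
```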

There are also some packed operations that operate on individual 32b 
lanes as packed 2x16b or 4x8b values, but predication etc. is always on 
the 32b lanes. There is type conversion when loading and storing for a 
variety of formats, but the in-register format is quite constrained.
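
As an illustration of what "packed 2x16b" means within a single 32b 
lane, a minimal sketch follows. The function name is hypothetical, and 
the wrap-around behaviour of each 16-bit half is an assumption made for 
the example, not a statement about any particular instruction.

```python
def packed_add_2x16(a, b):
    """Add two 32-bit lane values as two independent 16-bit halves."""
    lo = ((a & 0xFFFF) + (b & 0xFFFF)) & 0xFFFF                  # low half wraps alone
    hi = (((a >> 16) & 0xFFFF) + ((b >> 16) & 0xFFFF)) & 0xFFFF  # no carry in from low
    return (hi << 16) | lo

# The low half overflows without disturbing the high half.
lane_value = packed_add_2x16(0x0001FFFF, 0x00010001)
```

The point is that the halves never interact, while predication still 
decides per 32b lane whether the whole result is written.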

The "array" part here is that many AMDGPU instructions will access many 
registers; not so much that they necessarily correspond to arrays in the 
sense an "end user" would use the term (although that's possible, as 
Nicolai mentions). To give an example of mundane "arrays", pointers are 
64-bit and generally need to be specified as a pair of scalar 32-bit 
registers.
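
A small sketch of that pointer case, with hypothetical helper names and 
assuming the low half goes into the lower-numbered register of the pair:

```python
MASK32 = 0xFFFFFFFF

def split_pointer(addr):
    """Split a 64-bit address into (low, high) 32-bit register values."""
    if not 0 <= addr < (1 << 64):
        raise ValueError("address does not fit in 64 bits")
    return addr & MASK32, addr >> 32

def join_pointer(lo, hi):
    """Rebuild the 64-bit address from the register pair."""
    return ((hi & MASK32) << 32) | (lo & MASK32)

# Model a few scalar registers and place the pointer in s[2:3].
sgprs = [0] * 8
sgprs[2], sgprs[3] = split_pointer(0x00007F1234567890)
```

The two registers are only an "array" while something consumes them as a 
64-bit value; before and after that they are ordinary 32b registers.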

The most extreme examples are generally to do with texture sampling. For 
example, the instruction

   image_sample_d v[0:3], v[4:11], s[0:7], s[8:11], dmask:0xf

writes the 4 vector registers v0 through v3 (holding the "r", "g", "b" 
and "a" channels of the result); v4 through v11 specify the location to 
sample as well as derivatives of the location with respect to screen x 
and y coordinates (in graphics usage); s0 through s7 contain a 256-bit 
description of the texture resource (address, format, width, height, 
depth if 3D, array count if an array, number of mip levels, and so 
forth), and s8 through s11 contain a 128-bit description of the 
"sampler" (additional parameters describing how to perform the texture 
sampling process).

However, none of these registers are actually _special_ - all of these 
are "standard" registers. Nor do they stay grouped as 8, or 12, or however 
many registers for the entire runtime of the kernel. Values are just grouped 
into these consecutive registers prior to the sample instruction, and 
return values are likewise grouped. But this represents nothing 
"fundamental" about these values; it's just that some instructions 
access lots of registers and require them to be consecutive (and, in 
some cases, to have initial indices that are multiples of 2 or 4) 
primarily for instruction size reasons. (The instruction above 
references 12 vector registers and 12 scalar registers; specifying 24 
individual register numbers in the instruction encoding would be 
prohibitive.)
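
To illustrate the "consecutive, sometimes aligned" constraint, here is a 
sketch that parses the operands of the sample instruction above and 
counts the registers they cover. RegRange and parse_operand are 
hypothetical names, and the align parameter is an assumption standing in 
for whatever alignment a given operand actually requires.

```python
import re
from typing import NamedTuple

class RegRange(NamedTuple):
    bank: str    # "v" for vector, "s" for scalar
    first: int   # index of the first register in the range
    last: int    # index of the last register (inclusive)

    @property
    def count(self):
        return self.last - self.first + 1

def parse_operand(text, align=1):
    """Parse 'v[4:11]' style operands and check the start alignment."""
    m = re.fullmatch(r"([vs])\[(\d+):(\d+)\]", text.strip())
    if m is None:
        raise ValueError(f"not a register range: {text!r}")
    bank, first, last = m.group(1), int(m.group(2)), int(m.group(3))
    if last < first:
        raise ValueError(f"empty range: {text!r}")
    if first % align != 0:
        raise ValueError(f"{text!r} must start at a multiple of {align}")
    return RegRange(bank, first, last)

operands = [parse_operand(t)
            for t in ("v[0:3]", "v[4:11]", "s[0:7]", "s[8:11]")]
vector_regs = sum(r.count for r in operands if r.bank == "v")
scalar_regs = sum(r.count for r in operands if r.bank == "s")
```

Each operand is fully described by a start index and a count, which is 
why requiring consecutive registers keeps the instruction small.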

What you're describing for RVV so far sounds _very_ different, and 
sounds to me like it wants somewhat different abstractions. For example, 
the RVV version sounds like it needs to be very aware of the ARF you 
mention, and plan out how it's occupied around whole kernels, whereas 
the AMDGPU use case needs values to be packed in certain ways on the 
lead-up to things like sampling instructions (or when dealing with 
scalar 64-bit data), but otherwise mostly treats everything as 
individual registers.

-Fabian

On 10/8/2018 7:02 PM, Luke Kenneth Casson Leighton via llvm-dev wrote:
> nicolai, hi,
> 
> couple things occurred to me, after writing this out
> https://libre-riscv.org/llvm_vector_backend/
> 
> (1) a way to express the link between "what's wanted" and "what's
> available" is needed.  i.e. there needs to be a key-value store.  as
> they stand, proposed ARF and Reg classes only express "what's
> available", they don't express "what's wanted".
> 
> (2) really SV and RVV both absolutely critically require that "length"
> CSR (VL) to be part of the data structures, in order for the registers
> to actually be "arrays", at all.  if there is no length specified (at
> the "what's wanted" level), there's no way for the backend to
> determine "what's available".
> 
> (3) if the length of an array is specified as part of the data
> structures, microarchitectures that don't have that concept can simply
> set that to "1" in all data structures.  i *think* that means that for
> AMDGPU standard vector regs, length would be 1, and for those special
> shader registers, it would be 12.  or 1-12.  or whatever they had been
> globally set to for the duration of the application lifetime.
> 
> (4) VL ties in with robin kruppe's intermediary representation RFC
> (initially designed for RVV). i think it's important to get in touch
> with him on that.
> 
> (5) the idea of unioning traditional register classes is a good one: i
> would hesitate to special-case that.  if the ARF and Reg classes are
> not capable of expressing the traditional register classes, i would
> say that there's something wrong with how the ARF and Reg classes are
> designed.
> 
> l.