[llvm-dev] [RFC] Array Register Files

Mon Oct 8 11:08:03 PDT 2018

On 08.10.2018 19:16, Matthias Braun wrote:
> First: To help the discussion not get sidetracked too much: We do support vector registers today, if your architecture needs to model consecutive registers or tuples you can do this today. Targets like AMDGPU, Hexagon, Apples GPU target do this and it works for them.

Right. With the caveat that working on AMDGPU is what drove me to bring 
this up in the first place :)

> With that out of the way, there are some situations in which it doesn't work as well as it could, and this is what this discussion is about.
> Today you need to specify tuples upfront in tablegen, you need a separate register class for every size of tuple and easily end up with thousands of tuple registers.

Almost 40k of them, to be precise, if we were serious about actually 
using all features of the hardware in AMDGPU. We are not doing that today.

>> On Oct 7, 2018, at 10:39 AM, Nicolai Hähnle via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>
>> Hi all,
>>
>> There's a rather major piece of work that's been in the back of my mind for a while now. Before I actually start any work on it, I'd like to hear people's opinions, if any.
>>
>> tl;dr: I'd like to augment the CodeGen physical register / register class infrastructure with a mechanism to represent large regular register files more efficiently.
>>
>> The motivation is that the existing infrastructure really isn't a good fit for the AMDGPU backend. We have ~104 scalar registers and 256 vector registers. In addition to the sheer number of registers, there are some qualitative factors that set us apart from most (all?) other backends:
>>
>> 1. The order of register matters: if we use only a subset of registers, and that subset is on the low end of the range, we can run more work in parallel on the hardware. (This is modeled with regalloc priorities today.)
>>
>> 2. We can indirectly index into both register files. If a function has an alloca'd array of 17 values, we may want to lower that as 17 consecutive registers and access the array using indirect access instructions.
> Just to rephrase your point, because in principle you can start supporting such an indirect access feature today: The problem is that making a register classes for N-tuple registers creates a huge amount of registers and forces you to specify them in tablegen up-front.

Right.

>> 3. Even this aside, the number of register classes we'd really need is quite staggering. We have machine instructions taking operands with anywhere from 1 to 12 consecutive registers.
>>
>> Modeling this as register classes with sub-registers is clearly not a good match semantically, let alone from a compile time performance point of view. Today, we take effectively a runtime performance hit in some cases due to not properly modeling the properties of our register files.
> We should make an effort to name concrete issues. So we can judge what the upside of all the work would be. Things I can think of right now:
> - Algorithms modeled based on physregs and/or using MCRegAliasIterator are typically so slow as to be unusable for GPUs today because of the huge number of registers. I remember a number of instances where we had to rewrite code to be based on register units to get reasonable compile time performance.
> 
> Is there anything else?

I imagine there's a benefit to register allocation if we could just be 
honest about the linear nature of the register file. Doing a good job of 
allocating differently-sized registers from the same file is hard enough 
as it is, doing so in an unstructured sea of register units as opposed 
to a linearly ordered register file certainly doesn't help.

>> What I'd like to have
>> ---------------------
>> I'd like to introduce the notion of array register files. Physical registers in an ARF would be described by
>>
>> - the register array ID
>> - the starting index of the "register"
>> - the ending index of the "register" (inclusive)
>>
>> This can be conveniently encoded in the 31 bits we effectively have available for physical registers (e.g. 13 bits for start / end, 5 bits for register array ID).
>>
>> Register array ID 0 would be reserved for the familiar, TableGen-defined registers (we would still have some of those in AMDGPU, even with this change).
>>
>> It would necessarily have to be possible to generate register classes for ARFs both in TableGen and on-the-fly while compiling. Base register classes would be defined by:
>>
>> - the register array ID
>> - the minimum start / maximum end index of registers in the class
>> - the size of registers in the class
>> - the alignment of registers in the class (i.e., registers must start at multiples of N, where N is a power of two)
>>
>> ... and then register classes might be the union of such register classes and traditional register classes.
>>
>> (For example, in AMDGPU we would have a register class that includes all register from the scalar array, with size 2, starting at an even offset, union'd with a class containing some special registers such as VCC.)
>>
>> A similar scheme would have to be used for sub-register indices.
> I think this would be mainly about creating register classes in an ad-hoc fashion:
> - I think we need matching register classes for all classes.
> - If you have a register class anyway, then you don't need the fancy encoding anymore and can just stay with register numbers in the class.
> - The change compared to today would be that we do not have pre-computed list of physical registers but make them up on-demand.

The advantage of the start / end encoding is that checking for physical 
register containment / overlap / etc. becomes trivial.

>> I haven't dug too deeply into this yet, and clearly there are quite a number of thorny issues that need to be addressed -- it's a rather big project. But so far I'm not aware of anything absolutely fundamental that would prevent doing this.
> 
> This can become tricky fast...
> - Would we keep todays notion of register hierarchies as well?
> - Will this scheme support arbitrary register constraints?

What do you mean by this?

> - Would we keep supporting interfaces like listing the registers contained in a class? Doing MCRegAliasIterator queries (would that even make sense when you create ad-hoc classes on demand?)
> - How about code using MCRegisterInfo::getNumRegs()? How about MachineRegisterInfo maintaining double linked lists for all defs/uses of a register?

All good questions, and I certainly don't claim to have answers to them yet!

I can tell you thought that at least in AMDGPU, the uses of 
MCRegisterInfo::getNumRegs() and MCRegAliasIterator seem to be more 
accidental and due to the way the current design happens to be. It looks 
like they could be replaced by something more elegant with the scheme I 
described.

I haven't looked at other targets yet, though it goes without saying 
that I don't want to ignore them.

>> What I'm asking you at this point
>> ---------------------------------
>> Like I said, I haven't actually started any of this work (and probably won't for some time).
>>
>> However, if you have any fundamental objections to such a change, please speak up now before I or anybody else embarks on this project. I want to be confident that people are okay with the general direction.
> 
> I think this will be a gigantic project.

Agreed.

> I also fear it willl not help the complexity of the backend infrastructure, since so far it seems we would keep all of the existing mechanisms for dealing with hierarchies around as well and just add one more possibility for tuple registers...
> So while I do see the benefits here especially for GPU targets, I am also undecided/have a bad feeling about adding more complexity into generic LLVM infrastructure.

Point taken.

But what if we can actually transition *everybody* to something that 
expresses registers as (start, end) pairs over an underlying linear 
"array register file"? Maybe it sounds even crazier at first, but bear 
with me ;)

Consider x86, for example. What if we said that we think of the base x86 
register file (with rax, rbx, etc.) as an array of 16 * 8 = 128 bytes. 
In this scheme, we'd have:

AL = (0, 0)
AH = (1, 1)
AX = (0, 1)
EAX = (0, 3)
RAX = (0, 7)
BL = (8, 8)

and so on. Probably we'd want to define many register classes as 
explicit lists, although there's something to be said for defining 
32-bit registers as: "size = 4, start = 0 (mod 8), max_end = 63"

Then everybody would benefit from fast physical register overlap check, 
at least.

We may even be able to get rid of the whole concept of subregisters, if 
we can find a good data structure for looking up defs and uses of 
intervals in the underlying linear register file.

An important question is: Is there a target that has registers which 
*cannot* be expressed as intervals over an underlying linear array of 
register units?

Admittedly all of this is largely me brainstorming and trying to bounce 
ideas off you and others.

Cheers,
Nicolai
-- 
Lerne, wie die Welt wirklich ist,
Aber vergiss niemals, wie sie sein sollte.