[llvm-dev] [RFC] Array Register Files

Matthias Braun via llvm-dev llvm-dev at lists.llvm.org
Mon Oct 8 10:16:13 PDT 2018


First: To help the discussion not get sidetracked too much: we do support vector registers today; if your architecture needs to model consecutive registers or tuples, you can do this today. Targets like AMDGPU, Hexagon, and Apple's GPU target do this, and it works for them.

With that out of the way, there are some situations in which it doesn't work as well as it could, and that is what this discussion is about.
Today you need to specify tuples up front in TableGen, you need a separate register class for every tuple size, and you easily end up with thousands of tuple registers.
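
To put a rough number on "thousands", here is a back-of-the-envelope count in plain C++ (just a sketch: it assumes every start offset is allowed and uses the ~104 scalar / 256 vector registers and 1-12 operand widths from Nicolai's mail below; real targets prune this with alignment restrictions, but the order of magnitude stands):

  #include <cstdio>

  int main() {
    const unsigned ArraySizes[] = {104, 256}; // scalar and vector register files
    unsigned Total = 0;
    for (unsigned NumRegs : ArraySizes)
      for (unsigned TupleSize = 1; TupleSize <= 12; ++TupleSize)
        Total += NumRegs - TupleSize + 1; // tuples of TupleSize consecutive regs
    std::printf("tuple registers to enumerate: %u\n", Total); // prints 4188
    return 0;
  }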

> On Oct 7, 2018, at 10:39 AM, Nicolai Hähnle via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> 
> Hi all,
> 
> There's a rather major piece of work that's been in the back of my mind for a while now. Before I actually start any work on it, I'd like to hear people's opinions, if any.
> 
> tl;dr: I'd like to augment the CodeGen physical register / register class infrastructure with a mechanism to represent large regular register files more efficiently.
> 
> The motivation is that the existing infrastructure really isn't a good fit for the AMDGPU backend. We have ~104 scalar registers and 256 vector registers. In addition to the sheer number of registers, there are some qualitative factors that set us apart from most (all?) other backends:
> 
> 1. The order of registers matters: if we use only a subset of registers, and that subset is on the low end of the range, we can run more work in parallel on the hardware. (This is modeled with regalloc priorities today.)
> 
> 2. We can indirectly index into both register files. If a function has an alloca'd array of 17 values, we may want to lower that as 17 consecutive registers and access the array using indirect access instructions.
Just to rephrase your point, since in principle you could start supporting such an indirect access feature today: the problem is that creating register classes for N-tuple registers produces a huge number of registers and forces you to specify them all in TableGen up front.

> 
> 3. Even this aside, the number of register classes we'd really need is quite staggering. We have machine instructions taking operands with anywhere from 1 to 12 consecutive registers.
> 
> Modeling this as register classes with sub-registers is clearly not a good match semantically, let alone from a compile time performance point of view. Today, we take effectively a runtime performance hit in some cases due to not properly modeling the properties of our register files.
We should make an effort to name concrete issues, so we can judge what the upside of all this work would be. Things I can think of right now:
- Algorithms that operate on physregs and/or use MCRegAliasIterator are typically so slow as to be unusable for GPU targets today because of the huge number of registers. I remember a number of instances where we had to rewrite code to work on register units instead to get reasonable compile-time performance (see the sketch below).
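
To illustrate what those rewrites typically looked like, here is a minimal sketch (not actual LLVM code) of the same liveness query done per physical register via MCRegAliasIterator versus per register unit; on a GPU-sized register file with tuples, the alias walk is what kills compile time:

  #include "llvm/ADT/BitVector.h"
  #include "llvm/MC/MCRegisterInfo.h"
  using namespace llvm;

  // Physreg-based: every query walks all aliases of Reg; a wide tuple on a
  // GPU target overlaps a huge number of other tuple registers.
  bool isLiveViaAliases(unsigned Reg, const BitVector &LiveRegs,
                        const MCRegisterInfo &MCRI) {
    for (MCRegAliasIterator AI(Reg, &MCRI, /*IncludeSelf=*/true); AI.isValid();
         ++AI)
      if (LiveRegs.test(*AI))
        return true;
    return false;
  }

  // Regunit-based: a register is live iff one of its (few) register units is
  // live; the number of units does not grow with the number of tuples.
  bool isLiveViaRegUnits(unsigned Reg, const BitVector &LiveUnits,
                         const MCRegisterInfo &MCRI) {
    for (MCRegUnitIterator U(Reg, &MCRI); U.isValid(); ++U)
      if (LiveUnits.test(*U))
        return true;
    return false;
  }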

Is there anything else?

> 
> 
> What I'd like to have
> ---------------------
> I'd like to introduce the notion of array register files. Physical registers in an ARF would be described by
> 
> - the register array ID
> - the starting index of the "register"
> - the ending index of the "register" (inclusive)
> 
> This can be conveniently encoded in the 31 bits we effectively have available for physical registers (e.g. 13 bits for start / end, 5 bits for register array ID).
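
To make the proposed encoding concrete, a packing sketch in plain C++ could look like the following (the field names and the exact 5/13/13 split are just the example numbers from above, nothing here is fixed yet):

  #include <cassert>
  #include <cstdint>

  struct ARFReg {
    unsigned ArrayID; // 0 reserved for the TableGen-defined registers
    unsigned Start;   // first element of the array covered by this register
    unsigned End;     // last element covered (inclusive)
  };

  static constexpr unsigned IdxBits = 13, IdBits = 5; // 5 + 13 + 13 = 31 bits

  uint32_t encodeARFReg(const ARFReg &R) {
    assert(R.ArrayID < (1u << IdBits) && R.Start <= R.End &&
           R.End < (1u << IdxBits) && "ARF register out of range");
    return (R.ArrayID << (2 * IdxBits)) | (R.Start << IdxBits) | R.End;
  }

  ARFReg decodeARFReg(uint32_t Enc) {
    return {Enc >> (2 * IdxBits), (Enc >> IdxBits) & ((1u << IdxBits) - 1),
            Enc & ((1u << IdxBits) - 1)};
  }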
> 
> Register array ID 0 would be reserved for the familiar, TableGen-defined registers (we would still have some of those in AMDGPU, even with this change).
> 
> It would necessarily have to be possible to generate register classes for ARFs both in TableGen and on-the-fly while compiling. Base register classes would be defined by:
> 
> - the register array ID
> - the minimum start / maximum end index of registers in the class
> - the size of registers in the class
> - the alignment of registers in the class (i.e., registers must start at multiples of N, where N is a power of two)
> 
> ... and then register classes might be the union of such register classes and traditional register classes.
> 
> (For example, in AMDGPU we would have a register class that includes all register from the scalar array, with size 2, starting at an even offset, union'd with a class containing some special registers such as VCC.)
> 
> A similar scheme would have to be used for sub-register indices.
I think this would be mainly about creating register classes in an ad-hoc fashion:
- I think we would still need matching register classes for all of these registers.
- If you have a register class anyway, then you don't need the fancy encoding anymore and can just stay with register numbers in the class.
- The change compared to today would be that we do not have a pre-computed list of physical registers but make them up on demand.
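
Just to make that a bit more concrete, this is roughly what I imagine such an on-demand class descriptor could look like (a sketch with invented names; membership becomes a range/size/alignment check instead of a pre-computed register list):

  struct ARFRegClass {
    unsigned ArrayID;   // which register array this class draws from
    unsigned MinStart;  // minimum allowed start index
    unsigned MaxEnd;    // maximum allowed end index (inclusive)
    unsigned Size;      // number of consecutive array elements per register
    unsigned AlignLog2; // registers must start at a multiple of 1 << AlignLog2

    bool contains(unsigned ArrID, unsigned Start, unsigned End) const {
      return ArrID == ArrayID && Start >= MinStart && End <= MaxEnd &&
             End - Start + 1 == Size && (Start & ((1u << AlignLog2) - 1)) == 0;
    }
  };

  // Roughly the example from the mail, picking array ID 1 for the scalar file
  // and assuming its ~104 registers occupy indices 0..103; the union with
  // special registers like VCC is left out of this sketch.
  const ARFRegClass SGPR64Like = {/*ArrayID=*/1, /*MinStart=*/0,
                                  /*MaxEnd=*/103, /*Size=*/2, /*AlignLog2=*/1};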

> 
> I haven't dug too deeply into this yet, and clearly there are quite a number of thorny issues that need to be addressed -- it's a rather big project. But so far I'm not aware of anything absolutely fundamental that would prevent doing this.

This can become tricky fast...
- Would we keep today's notion of register hierarchies as well?
- Will this scheme support arbitrary register constraints?
- Would we keep supporting interfaces like listing the registers contained in a class, or MCRegAliasIterator queries (would those even make sense when you create ad-hoc classes on demand)?
- How about code using MCRegisterInfo::getNumRegs()? How about MachineRegisterInfo maintaining doubly linked lists for all defs/uses of a register?

> 
> 
> What I'm asking you at this point
> ---------------------------------
> Like I said, I haven't actually started any of this work (and probably won't for some time).
> 
> However, if you have any fundamental objections to such a change, please speak up now before I or anybody else embarks on this project. I want to be confident that people are okay with the general direction.

I think this will be a gigantic project. I also fear it will not help with the complexity of the backend infrastructure, since so far it seems we would keep all of the existing mechanisms for dealing with register hierarchies around as well and just add one more possibility for tuple registers...
So while I do see the benefits here, especially for GPU targets, I am also undecided and have a bad feeling about adding more complexity to the generic LLVM infrastructure.

- Matthias

