[LLVMdev] TargetRegisterInfo and "infinite" register files

Mon May 16 10:00:10 PDT 2011

On May 16, 2011, at 6:52 AM, Justin Holewinski wrote:

> Currently, the TableGen register info files for all of the back-ends define concrete registers and divide them into logical register classes.  I would like to get some input from the LLVM experts around here on how best to map this model to an architecture that does *not* have a concrete, pre-defined register file.  The architecture is PTX, which is more of an intermediate form than a final assembly language.  The format is essentially three-address code, with "virtual" registers instead of "physical" registers.  After PTX code generation, the PTX assembly is compiled to a device binary with a proprietary tool (ptxas) that does final register allocation (based on device and user constraints).  However, exploiting register re-use at the LLVM/PTX level has shown performance improvement over blindly using a new "physical" register for each def and letting ptxas figure out all of the register allocation details, so I would like to take advantage of the LLVM register allocation infrastructure if at all possible.

What kind of optimizations can ptxas do? Can it hoist computations out of loops? Can it split live ranges? Can it coalesce live ranges?

> Generally stated, I would like to solve the register allocation problem as "allocate the minimum number of registers from an arbitrary set without spill code" instead of the more traditional "allocate the minimum number of registers from a fixed set."

It's a common misconception, but that is not what LLVM's register allocators do. They try to minimize the amount of executed spill code given the fixed set of registers.

I wouldn't recommend dynamically growing the register file. You are likely to get super-linear compile time, and it is not clear that register allocation would achieve anything compared to simply outputting virtual registers. Surely, ptxas' register allocator can reuse a register for non-overlapping live ranges. That is all you would get out of this.
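
To make "reuse a register for non-overlapping live ranges" concrete, here is a minimal standalone sketch. It is not LLVM or ptxas code. It assumes straight-line code where each value has a single half-open live interval, and the names LiveRange and assignRegisters are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model: a value is live on the half-open interval [start, end).
struct LiveRange {
  int start;
  int end;
};

// Greedy interval colouring: visit ranges by increasing start point and put
// each one in the lowest-numbered register whose previous occupant has ended.
// Returns one register number per input range, in input order.
std::vector<int> assignRegisters(const std::vector<LiveRange> &ranges) {
  std::vector<std::size_t> order(ranges.size());
  for (std::size_t i = 0; i < order.size(); ++i)
    order[i] = i;
  std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
    return ranges[a].start < ranges[b].start;
  });

  std::vector<int> regEnd;  // regEnd[r] = end of the last range placed in r
  std::vector<int> assignment(ranges.size(), -1);
  for (std::size_t idx : order) {
    assert(ranges[idx].start < ranges[idx].end && "empty live range");
    int chosen = -1;
    for (std::size_t r = 0; r < regEnd.size(); ++r) {
      if (regEnd[r] <= ranges[idx].start) {  // register r is free again
        chosen = static_cast<int>(r);
        break;
      }
    }
    if (chosen < 0) {  // every register is busy: grow the register file
      chosen = static_cast<int>(regEnd.size());
      regEnd.push_back(0);
    }
    regEnd[chosen] = ranges[idx].end;
    assignment[idx] = chosen;
  }
  return assignment;
}
```

With an unbounded register file this is the whole job: no splitting, no remat, no spill decisions. Any allocator that handles non-overlapping ranges does the same thing.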

> The current implementation defines an arbitrary set of registers that the register allocator can use during code-gen.  This works, but is not scalable.  If the register allocator runs out of registers, spill code must be generated.  However, the "optimal" solution in this case would be to extend the register file.  A few alternatives I have come up with are:
> 	1. Bypass register allocation completely and just emit virtual registers,

This is worth a try. It is possible you want to run LLVM's 2-addr, phi-elim, and coalescer passes first.

> 	2. Remove register definitions from the TableGen files and create them at run-time using the virtual register counts as an upper bound on the number of registers needed, or

Don't do that.

> 	3. Keep a small set of pre-defined physical registers, and craft spill code that really just puts a new register definition in the final PTX and copies to/from this register when spilling/restoring is needed

This could also work. Spill slots actually do what you want. The register allocator tries to use as few as possible as long as performance doesn't suffer. Later, StackSlotColoring will merge non-overlapping stack slot ranges to save more space.

> I hesitate to use (1) or (3) as they rely too heavily on the final ptxas tool to perform reasonable register allocation, which may not lead to optimal code.  Option (2) seems promising, though I worry about the feasibility of the approach.  Specifically, I am not yet sure if generating TargetRegisterInfo and TargetRegisterClass instances on-the-fly will fit into the existing architecture.
> 
> Any thoughts from the experts out there?  Specifically, I am interested in any non-trivial pros/cons for any of these approaches, or any new approaches I have not considered.

Sorry to be backwards, but I think you should try (1) or (3).

Simply outputting virtual registers seems like a reasonable thing to do if PTX is really an intermediate form. LLVM's instruction selector and phi-elim tend to emit a lot of copies, so you probably want to run the coalescer before emission. That will minimize the number of copies. This is also the fastest thing you can do.

There are two reasons you may want to run the register allocator anyway:

- Coalescing is very aggressive. It creates long, interfering live ranges. If ptxas doesn't have live range splitting, you may benefit from LLVM's.

- Passes like LICM and CSE will increase register pressure by hoisting redundant computations. If ptxas cannot rematerialize these computations in high register pressure situations, LLVM's register allocator can help you.

Note that if you always make sure there are 'enough' physical registers, the register allocator will never split live ranges or rematerialize computations. That's why (2) doesn't buy you anything over (1).
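
A rough way to see this, as a standalone sketch rather than anything taken from LLVM: in a simplified model with one register class and one half-open live interval per value, the allocator only has work to do when the peak number of simultaneously live values exceeds the number of physical registers. The names Interval, maxPressure and wouldNeedSpill are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical model: a value is live on the half-open interval [start, end).
struct Interval {
  int start;
  int end;
};

// Peak number of simultaneously live values, found with a sweep over the
// interval endpoints.
int maxPressure(const std::vector<Interval> &live) {
  std::vector<std::pair<int, int>> events;
  for (const Interval &iv : live) {
    assert(iv.start < iv.end && "empty live range");
    events.push_back({iv.start, +1});
    events.push_back({iv.end, -1});
  }
  // Pairs sort by position, then by delta, so at the same position an end
  // (-1) is processed before a start (+1). That matches half-open intervals.
  std::sort(events.begin(), events.end());
  int current = 0, peak = 0;
  for (const auto &e : events) {
    current += e.second;
    peak = std::max(peak, current);
  }
  return peak;
}

// In this model, spilling, splitting and remat only come into play when
// pressure exceeds the register count.
bool wouldNeedSpill(const std::vector<Interval> &live, int numPhysRegs) {
  return maxPressure(live) > numPhysRegs;
}
```

If numPhysRegs is always grown to at least maxPressure(live), wouldNeedSpill never returns true, and the interesting parts of the allocator never run.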

Use LLVM's register allocator like this:

- Provide a realistic number of physical registers. Make it similar to the target architecture, but aim low.

- Map spill slots to PTX registers. That means 'spilling' is really a no-op, except you get live range splitting and remat. If you implement TII::canFoldMemoryOperand() and TII::foldMemoryOperandImpl(), there will be no inserted loads and stores.

The result should be code that is easy to register allocate for ptxas with some live ranges that obviously should go in registers, and some that obviously should spill. There will be a number of live ranges that can go either way, depending on the actual number of registers targeted.
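
As a sketch of what "map spill slots to PTX registers" could look like at emission time: the helpers below are standalone string builders, not part of LLVM or of any existing PTX back-end. The %spill register prefix and the fixed .b32 type are assumptions for illustration. A real back-end would do this in its frame-index and spill/reload hooks and pick the type per register class.

```cpp
#include <cassert>
#include <string>

// Hypothetical naming scheme: spill slot N lives in PTX register %spillN.
std::string spillSlotRegister(unsigned slot) {
  return "%spill" + std::to_string(slot);
}

// One declaration covering all spill slots, using PTX's parameterized
// register-name syntax (declares %spill0 .. %spill<numSlots-1>).
std::string declareSpillRegisters(unsigned numSlots) {
  if (numSlots == 0)
    return "";
  return ".reg .b32 %spill<" + std::to_string(numSlots) + ">;";
}

// A "store to stack slot" becomes a plain register-to-register move.
std::string emitSpill(unsigned slot, const std::string &srcReg) {
  assert(!srcReg.empty() && "source register name required");
  return "mov.b32 " + spillSlotRegister(slot) + ", " + srcReg + ";";
}

// A "load from stack slot" is the move in the other direction.
std::string emitReload(const std::string &dstReg, unsigned slot) {
  assert(!dstReg.empty() && "destination register name required");
  return "mov.b32 " + dstReg + ", " + spillSlotRegister(slot) + ";";
}
```

For example, emitSpill(1, "%r4") gives "mov.b32 %spill1, %r4;". When memory operands are folded as described above, even these moves go away and instructions name the %spill registers directly.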

/jakob