[LLVMdev] TargetRegisterInfo and "infinite" register files

Mon May 16 11:27:26 PDT 2011

On Mon, May 16, 2011 at 1:00 PM, Jakob Stoklund Olesen <stoklund at 2pi.dk> wrote:

>
> On May 16, 2011, at 6:52 AM, Justin Holewinski wrote:
>
> > Currently, the TableGen register info files for all of the back-ends
> define concrete registers and divide them into logical register classes.  I
> would like to get some input from the LLVM experts around here on how best
> to map this model to an architecture that does *not* have a concrete,
> pre-defined register file.  The architecture is PTX, which is more of an
> intermediate form than a final assembly language.  The format is essentially
> three-address code, with "virtual" registers instead of "physical"
> registers.  After PTX code generation, the PTX assembly is compiled to a
> device binary with a proprietary tool (ptxas) that does final register
> allocation (based on device and user constraints).  However, exploiting
> register re-use at the LLVM/PTX level has shown performance improvement over
> blindly using a new "physical" register for each def and letting ptxas
> figure out all of the register allocation details, so I would like to take
> advantage of the LLVM register allocation infrastructure if at all possible.
>
> What kind of optimizations can ptxas do? Can it hoist computations out of
> loops? Can it split live ranges? Can it coalesce live ranges?
>

Part of my problem is that ptxas is proprietary software, so it's
essentially a black box to me.  It appears to do a reasonable job, but I've
also seen cases where PTX-level register re-use led to better device
register utilization.

>
> > Generally stated, I would like to solve the register allocation problem
> as "allocate the minimum number of registers from an arbitrary set without
> spill code" instead of the more traditional "allocate the minimum number of
> registers from a fixed set."
>
> It's a common misconception, but that is not what LLVM's register
> allocators do. They try to minimize the amount of executed spill code given
> the fixed set of registers.
>
> I wouldn't recommend dynamically growing the register file. You are likely
> to get super-linear compile time, and it is not clear that register
> allocation would achieve anything compared to simply outputting virtual
> registers. Surely, ptxas' register allocator can reuse a register for
> non-overlapping live ranges. That is all you would get out of this.
>

That makes sense.  If the LLVM register allocators do not actively try to
minimize register usage, then I see how there would not be a win here.
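
To make the "reuse a register for non-overlapping live ranges" point concrete, here is a toy model in Python. It is only a sketch, not LLVM or ptxas code: live ranges are assumed to be half-open [start, end) instruction-index intervals, and the function name is made up for illustration. Under those assumptions, the number of registers needed without any spilling is the peak number of values live at the same time.

```python
def max_live(ranges):
    """Peak number of simultaneously live values.

    Each range is a hypothetical half-open interval (start, end):
    the value is live from instruction `start` up to, but not
    including, instruction `end`.
    """
    events = []
    for start, end in ranges:
        events.append((start, 1))   # value becomes live
        events.append((end, -1))    # value dies
    # Tuples sort by position first; at equal positions the -1 (death)
    # sorts before the +1 (birth), so a register freed at index i can be
    # reused by a value defined at index i.
    events.sort()
    live = peak = 0
    for _, delta in events:
        live += delta
        peak = max(peak, live)
    return peak
```

For example, the four ranges (0, 4), (1, 3), (4, 6), (5, 8) never have more than two values live at once, so two registers suffice even though there are four virtual registers.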

>
> > The current implementation defines an arbitrary set of registers that the
> register allocator can use during code-gen.  This works, but is not
> scalable.  If the register allocator runs out of registers, spill code must
> be generated.  However, the "optimal" solution in this case would be to
> extend the register file.  A few alternatives I have come up with are:
> >       • Bypass register allocation completely and just emit virtual
> registers,
>
> This is worth a try. It is possible you want to run LLVM's 2-addr,
> phi-elim, and coalescer passes first.
>

I definitely need to look into those passes some more.  I just hesitate to
ignore the LLVM register allocator since I have seen it generate better
final code (post-ptxas).

>
> >       • Remove register definitions from the TableGen files and create
> them at run-time using the virtual register counts as an upper bound on the
> number of registers needed, or
>
> Don't do that.
>

I see now why that would be sub-optimal.

>
> >       • Keep a small set of pre-defined physical registers, and craft
> spill code that really just puts a new register definition in the final PTX
> and copies to/from this register when spilling/restoring is needed
>
> This could also work. Spill slots actually do what you want. The register
> allocator tries to use as few as possible as long as performance doesn't
> suffer. Later, StackSlotColoring will merge non-overlapping stack slot
> ranges to save more space.
>

That's good to know.  I was hoping LLVM did something like that, but I have
not really scanned that code too thoroughly yet.
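
As a rough illustration of merging non-overlapping slot ranges, the sketch below greedily packs intervals into shared slots. It is a toy model in Python under my own assumptions (half-open [start, end) ranges, a hypothetical function name) and is not the algorithm StackSlotColoring actually implements.

```python
def color_slots(ranges):
    """Assign each half-open (start, end) range to a shared slot.

    Returns a list of slot numbers, one per input range, such that
    two ranges share a slot only if they do not overlap.
    """
    # Visit ranges in order of increasing start position.
    order = sorted(range(len(ranges)), key=lambda i: ranges[i][0])
    slot_end = []                      # slot_end[s]: end of the last range in slot s
    assignment = [None] * len(ranges)
    for i in order:
        start, end = ranges[i]
        for s, busy_until in enumerate(slot_end):
            if busy_until <= start:    # slot s is free again, reuse it
                slot_end[s] = end
                assignment[i] = s
                break
        else:
            slot_end.append(end)       # no free slot, open a new one
            assignment[i] = len(slot_end) - 1
    return assignment
```

With the ranges (0, 4), (1, 3), (4, 6), (5, 8), the first and third range share one slot and the second and fourth share another, so four spilled values need only two slots.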

>
> > I hesitate to use (1) or (3) as they rely too heavily on the final ptxas
> tool to perform reasonable register allocation, which may not lead to
> optimal code.  Option (2) seems promising, though I worry about the
> feasibility of the approach.  Specifically, I am not yet sure if generating
> TargetRegisterInfo and TargetRegisterClass instances on-the-fly will fit
> into the existing architecture.
> >
> > Any thoughts from the experts out there?  Specifically, I am interested
> in any non-trivial pros/cons for any of these approaches, or any new
> approaches I have not considered.
>
> Sorry to be backwards, but I think you should try (1) or (3).
>
> Simply outputting virtual registers seems like a reasonable thing to do if
> PTX is really an intermediate form. LLVM's instruction selector and phi-elim
> tend to emit a lot of copies, so you probably want to run the coalescer
> before emission. That will minimize the number of copies. This is also the
> fastest thing you can do.
>
> There are two reasons you may want to run the register allocator anyway:
>
> - Coalescing is very aggressive. It creates long, interfering live ranges.
> If ptxas doesn't have live range splitting, you may benefit from LLVM's.
>
> - Passes like LICM and CSE will increase register pressure by hoisting
> redundant computations. If ptxas cannot rematerialize these computations in
> high register pressure situations, LLVM's register allocator can help you.
>
> Note that if you always make sure there are 'enough' physical registers,
> the register allocator will never split live ranges or rematerialize
> computations. That's why (2) doesn't buy you anything over (1).
>

Interesting.  I was working under the assumption that the register
allocators tried to minimize register use.

>
> Use LLVM's register allocator like this:
>
> - Provide a realistic number of physical registers. Make it similar to the
> target architecture, but aim low.
>

Sounds reasonable.

>
> - Map spill slots to PTX registers. That means 'spilling' is really a noop,
> except you get live range splitting and remat. If you implement
> TII::canFoldMemoryOperand() and TII::foldMemoryOperandImpl(), there will be
> no inserted loads and stores.
>

That's good to know.
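
A minimal sketch of what "map spill slots to PTX registers" could look like at emission time, written as a toy Python emitter rather than real backend code. The assumptions are mine: every slot holds a 32-bit value, the `%spill` register prefix and the (kind, register, slot) operation tuples are hypothetical, and this models only the unfolded case where a spill or reload becomes a register-to-register `mov`.

```python
def emit_spill_as_register(num_slots, ops, prefix="%spill"):
    """Emit PTX text that treats spill slots as extra registers.

    num_slots: number of spill slots the allocator ended up using.
    ops: list of (kind, reg, slot) tuples, where kind is "store"
         (spill reg to slot) or "load" (reload reg from slot).
    prefix: hypothetical name prefix for the slot registers.
    """
    lines = []
    if num_slots:
        # Parameterized declaration: declares prefix0 .. prefix(num_slots-1).
        lines.append(f".reg .b32 {prefix}<{num_slots}>;")
    for kind, reg, slot in ops:
        slot_reg = f"{prefix}{slot}"
        if kind == "store":
            lines.append(f"mov.b32 {slot_reg}, {reg};")   # spill becomes a copy
        elif kind == "load":
            lines.append(f"mov.b32 {reg}, {slot_reg};")   # reload becomes a copy
        else:
            raise ValueError(f"unknown spill operation: {kind}")
    return lines
```

If the memory operands are folded as Jakob suggests, even these copies would disappear, because the using instruction would name the slot register directly.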

>
> The result should be code that is easy to register allocate for ptxas with
> some live ranges that obviously should go in registers, and some that
> obviously should spill. There will be a number of live ranges that can go
> either way, depending on the actual number of registers targeted.
>

This was definitely very informative!  Thanks for the information!

>
> /jakob
>
>

-- 

Thanks,

Justin Holewinski