<br><br><div class="gmail_quote">On Mon, May 16, 2011 at 1:00 PM, Jakob Stoklund Olesen <span dir="ltr">&lt;<a href="mailto:stoklund@2pi.dk">stoklund@2pi.dk</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<div class="im"><br>

On May 16, 2011, at 6:52 AM, Justin Holewinski wrote:<br>

<br>

> Currently, the TableGen register info files for all of the back-ends define concrete registers and divide them into logical register classes.  I would like to get some input from the LLVM experts around here on how best to map this model to an architecture that does *not* have a concrete, pre-defined register file.  The architecture is PTX, which is more of an intermediate form than a final assembly language.  The format is essentially three-address code, with "virtual" registers instead of "physical" registers.  After PTX code generation, the PTX assembly is compiled to a device binary with a proprietary tool (ptxas) that does final register allocation (based on device and user constraints).  However, exploiting register re-use at the LLVM/PTX level has shown performance improvements over blindly using a new "physical" register for each def and letting ptxas figure out all of the register allocation details, so I would like to take advantage of the LLVM register allocation infrastructure if at all possible.<br>


<br>

</div>What kind of optimizations can ptxas do? Can it hoist computations out of loops? Can it split live ranges? Can it coalesce live ranges?<br></blockquote><div><br>Part of my problem is that ptxas is proprietary software, so it's essentially a black box to me.  It appears to do a reasonable job, but I've also seen cases where PTX-level register re-use led to better device register utilization.</div>

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<div class="im"><br>

> Generally stated, I would like to solve the register allocation problem as "allocate the minimum number of registers from an arbitrary set without spill code" instead of the more traditional "allocate the minimum number of registers from a fixed set."<br>


<br>

</div>It's a common misconception, but that is not what LLVM's register allocators do. They try to minimize the amount of executed spill code given the fixed set of registers.<br>

<br>

I wouldn't recommend dynamically growing the register file. You are likely to get super-linear compile time, and it is not clear that register allocation would achieve anything compared to simply outputting virtual registers. Surely, ptxas' register allocator can reuse a register for non-overlapping live ranges. That is all you would get out of this.<br>
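<br>
As a rough illustration of what "reuse a register for non-overlapping live ranges" means, here is a minimal sketch. It assumes straight-line code in which each live range is a single half-open interval [start, end); the function name and the data layout are hypothetical and say nothing about how ptxas or LLVM are actually implemented.<br>

```python
import heapq

def assign_registers(live_ranges):
    """Give each live range a register index, reusing a register as soon as
    its previous occupant is dead.

    live_ranges maps a value name to (start, end); the value is live on the
    half-open interval [start, end). Returns (assignment, registers_used).
    """
    assignment = {}
    active = []     # min-heap of (end, register) for ranges that are still live
    free = []       # min-heap of register indices available for reuse
    next_reg = 0    # a new register is created only when none is free

    # Visit live ranges in order of increasing start point.
    for name, (start, end) in sorted(live_ranges.items(), key=lambda kv: kv[1]):
        # Release every register whose live range ended at or before `start`.
        while active and active[0][0] <= start:
            _, reg = heapq.heappop(active)
            heapq.heappush(free, reg)
        if free:
            reg = heapq.heappop(free)
        else:
            reg = next_reg
            next_reg += 1
        assignment[name] = reg
        heapq.heappush(active, (end, reg))
    return assignment, next_reg
```

With single-interval live ranges, the number of registers this uses equals the largest number of simultaneously live values, so there is nothing further to gain from growing the register file on demand.<br>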

</blockquote><div><br></div><div>That makes sense.  If the LLVM register allocators do not actively try to minimize register usage, then I see how there would not be a win here.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">


<div class="im"><br>

> The current implementation defines an arbitrary set of registers that the register allocator can use during code-gen.  This works, but is not scalable.  If the register allocator runs out of registers, spill code must be generated.  However, the "optimal" solution in this case would be to extend the register file.  A few alternatives I have come up with are:<br>


>       (1) Bypass register allocation completely and just emit virtual registers,<br>

<br>

</div>This is worth a try. It is possible you want to run LLVM's 2-addr, phi-elim, and coalescer passes first.<br></blockquote><div><br></div><div>I definitely need to look into those passes some more.  I just hesitate to ignore the LLVM register allocator since I have seen it generate better final code (post-ptxas).</div>

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<br>

>       (2) Remove register definitions from the TableGen files and create them at run-time using the virtual register counts as an upper bound on the number of registers needed, or<br>

<br>

Don't do that.<br></blockquote><div><br></div><div>I see now why that would be sub-optimal.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<br>

>       (3) Keep a small set of pre-defined physical registers, and craft spill code that really just puts a new register definition in the final PTX and copies to/from this register when spilling/restoring is needed<br>


<br>

This could also work. Spill slots actually do what you want. The register allocator tries to use as few as possible as long as performance doesn't suffer. Later, StackSlotColoring will merge non-overlapping stack slot ranges to save more space.<br>

</blockquote><div><br></div><div>That's good to know.  I was hoping LLVM did something like that, but I have not really scanned that code too thoroughly yet.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">


<div class="im"><br>

> I hesitate to use (1) or (3) as they rely too heavily on the final ptxas tool to perform reasonable register allocation, which may not lead to optimal code.  Option (2) seems promising, though I worry about the feasibility of the approach.  Specifically, I am not yet sure if generating TargetRegisterInfo and TargetRegisterClass instances on-the-fly will fit into the existing architecture.<br>


><br>

> Any thoughts from the experts out there?  Specifically, I am interested in any non-trivial pros/cons for any of these approaches, or any new approaches I have not considered.<br>

<br>

</div>Sorry to be backwards, but I think you should try (1) or (3).<br>

<br>

Simply outputting virtual registers seems like a reasonable thing to do if PTX is really an intermediate form. LLVM's instruction selector and phi-elim tend to emit a lot of copies, so you probably want to run the coalescer before emission. That will minimize the number of copies. This is also the fastest thing you can do.<br>


<br>

There are two reasons you may want to run the register allocator anyway:<br>

<br>

- Coalescing is very aggressive. It creates long, interfering live ranges. If ptxas doesn't have live range splitting, you may benefit from LLVM's.<br>

<br>

- Passes like LICM and CSE will increase register pressure by hoisting redundant computations. If ptxas cannot rematerialize these computations in high register pressure situations, LLVM's register allocator can help you.<br>


<br>

Note that if you always make sure there are 'enough' physical registers, the register allocator will never split live ranges or rematerialize computations. That's why (2) doesn't buy you anything over (1).<br>

</blockquote><div><br></div><div>Interesting.  I was working under the assumption that the register allocators tried to minimize register use.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">


<br>

Use LLVM's register allocator like this:<br>

<br>

- Provide a realistic number of physical registers. Make it similar to the target architecture, but aim low.<br></blockquote><div><br></div><div>Sounds reasonable.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">


<br>

- Map spill slots to PTX registers. That means 'spilling' is really a no-op, except you get live range splitting and rematerialization. If you implement TII::canFoldMemoryOperand() and TII::foldMemoryOperandImpl(), there will be no inserted loads and stores.<br>
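<br>
A toy sketch of the spill-slot mapping, written in Python rather than as real backend code. The operation tuples, the function name, and the %spill register prefix are all hypothetical; in an actual backend this lowering would presumably live in the target's spill and reload hooks. Each spill store or reload becomes a plain PTX mov to or from a dedicated register, and one .reg declaration covers all slots that were used.<br>

```python
def lower_spill_slots(ops, ptx_type="b32", prefix="%spill"):
    """Rewrite spill stores/reloads as PTX register moves.

    ops is a list of tuples (a made-up format for this sketch):
      ("store", slot, src_reg)  spill src_reg into stack slot `slot`
      ("load", dst_reg, slot)   reload stack slot `slot` into dst_reg
      ("inst", text)            any other instruction, passed through unchanged
    Returns a list of PTX-like text lines.
    """
    body = []
    num_slots = 0
    for op in ops:
        kind = op[0]
        if kind == "store":
            _, slot, src = op
            body.append(f"mov.{ptx_type} {prefix}{slot}, {src};")
        elif kind == "load":
            _, dst, slot = op
            body.append(f"mov.{ptx_type} {dst}, {prefix}{slot};")
        else:
            body.append(op[1])
            continue
        # Track how many slot registers must be declared.
        num_slots = max(num_slots, slot + 1)
    # One parameterized declaration covers %spill0 .. %spill(num_slots - 1).
    decl = [f".reg .{ptx_type} {prefix}<{num_slots}>;"] if num_slots else []
    return decl + body
```

Because the "slots" are just more virtual registers from the point of view of ptxas, it remains free to keep them in device registers or spill them itself.<br>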

</blockquote><div><br></div><div>That's good to know.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<br>

The result should be code that is easy for ptxas to register-allocate: some live ranges obviously belong in registers, and some obviously should spill. A number of live ranges can go either way, depending on the actual number of registers targeted.<br>

</blockquote><div><br></div><div>This was definitely very informative!  Thanks for the information!</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<font color="#888888"><br>

/jakob<br>

<br>

</font></blockquote></div><br><br clear="all"><br>-- <br><br><div>Thanks,</div><div><br></div><div>Justin Holewinski</div><br>