[LLVMdev] Whole-function isel

Chris Lattner clattner at apple.com
Mon Mar 24 23:38:18 PDT 2008

On Mar 24, 2008, at 10:47 PM, Christopher Lamb wrote:
> I know that this has been discussed (at least in passing) a few  
> times on the list, but I couldn't locate a bug for it. Have any  
> architectural plans been made for it?

Funny you bring this up.  Evan and I were tossing around crazy ideas  
about this just today.  If you're interested, maybe we should get  
together for lunch or coffee or something.

> Are there architectural roadblocks with the current LLVM  
> infrastructure that will complicate the process? What has  
> demotivated the implementation of this so far (is it not that big a  
> benefit on important targets, too much time/effort for the payoff)?

There is a huge payoff to doing this or even making progress on it.
It is an elegant solution to many problems.  There are two current
issues:

1) There are some scalability issues (in compile time) with large  
selectiondags.  Roman has been tackling this recently, so this will  
hopefully be a non-issue really soon.  In any case, it would be very
reasonable to continue doing single-MBB selectiondags when in "-fast"
mode (i.e. -O0) and do whole function ones when compile time is worth  
it (-O2+ etc).

2) The bigger issue is one of scheduling.  The great thing about whole  
function isel is that nothing up to scheduling needs to care about  
what block something is in (as long as chain nodes pin side-effecting  
instrs to their block of course).  Things like machine-code loop  
invariant code motion, CSE of values from different blocks, sinking  
and many other details are easily handled, and directly in the  
selection dags: this is really cool! :)  The problem is that
*something* has to decide what block to put all the operations in.  The
most logical thing to do here is to have a pass, either part of sched
or right before it, that assigns a block to each node (which is a
target instr node), and then have sched remain single-BB at a time
(for now; it could obviously be improved later to work on traces etc).
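As a toy illustration (nothing here is LLVM API; the dominator tree and
the node-to-users representation are invented for the sketch), such a
pre-sched placement pass could pick, for each node, the nearest block
that dominates every block containing one of its users:

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch: blocks form a dominator tree via `idom`
// indices; block 0 is the entry and is its own idom.
struct DomTree {
  std::vector<int> idom;   // idom[b] = immediate dominator of block b
  std::vector<int> depth;  // depth[b] = depth in the dominator tree

  int commonDominator(int a, int b) const {
    // Walk the deeper block upward until the two paths meet.
    while (a != b) {
      if (depth[a] < depth[b]) b = idom[b];
      else                     a = idom[a];
    }
    return a;
  }
};

// Assign a node to the single block that dominates every block
// containing one of its users -- the placement decision the pass
// described above would make before a per-block scheduler runs.
int assignBlock(const DomTree &DT, const std::vector<int> &userBlocks) {
  int b = userBlocks.front();
  for (int ub : userBlocks)
    b = DT.commonDominator(b, ub);
  return b;
}
```

Placing the node at the nearest common dominator is the conservative
"correct" answer; the PRE-like subtlety discussed next is about when a
*better* (less dominating, less frequently executed) block exists.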

The problem is that in full generality, this is a problem very similar  
to PRE: assigning blocks to operations optimally is complex and  
somewhat subtle.  An excellent place to start looking at this is Cliff  
Click's thesis, which is available online (and incidentally inspired a  
lot of selection dag design).  In his thesis, he has a hacky solution  
that is probably good enough, certainly to start out with.

The reason this hasn't been done so far is that we (people at apple)  
haven't been able to justify the "big project" of tackling this when  
there are still tons of little projects that are lower hanging for our  
needs. We will almost certainly do this someday (and would have  
already done it if it were blocking some project of ours), but would  
love to have someone work on it in the shorter term.  The thing that  
Evan and I were discussing today is that there are a variety of ways  
to get intermediate results without tackling the full generality of  
the problem:

1) One annoying problem for targets like PPC and ARM (which promote i8/ 
i16 to i32) can be solved with a really simple hack.  Consider a  
switch on i16 that codegens to a branch tree.  This turns into a bunch
of small blocks: the first typically zexts the value and does an
unsigned comparison against the range, and each subsequent block does
an unsigned comparison against its subportion of the range.  The
problem is that right now, the subsequent comparisons just see that
the i16 value is promoted, which means the top bits are unknown.  This
means that each block ends up doing tons of redundant "and"s to clear
out the top bits before its comparison.  This is obviously a special
case of a general problem: uses of a value in blocks other than the
def's block can't see properties of the value.
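In source terms, the branch tree in question looks roughly like this
(plain C++ standing in for the generated compare blocks; the thresholds
are arbitrary).  On a target that promotes i16 to i32, each `if` after
the first lives in its own block, and without cross-block known-bits
each one must re-mask the high 16 bits before comparing:

```cpp
#include <cstdint>

int classify(uint16_t x) {
  // First block: sees the zext directly, so it knows the top bits
  // are zero.  Later blocks only see "promoted i32, top bits
  // unknown" and would each emit a redundant and-with-0xFFFF.
  if (x < 100)   return 0;
  if (x < 1000)  return 1;
  if (x < 10000) return 2;
  return 3;
}
```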

This problem (and similar ones) boils down to not knowing info about a
cross-block promoted value.  This sort of thing can be solved with a
very simple hack by adding a few maps to SDISel.  The idea is that, in  
ISel, each cross-block CopyToReg for a Vreg can call ComputeMaskedBits  
and ComputeSignBits and record the known live out bit+sign information  
for the vreg.  Later, if someone calls ComputeMaskedBits or  
ComputeSignBits and it recurses up to a CopyFromReg from the vreg, the  
information can be provided from its definition.
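A self-contained model of those maps might look like the following
(the names echo the real ComputeMaskedBits query, but none of this is
the LLVM interface; masks are fixed at 32 bits for the sketch):

```cpp
#include <cstdint>
#include <map>

// Record, at each cross-block CopyToReg, which bits of the vreg are
// known zero; a later query that recurses up to a CopyFromReg consults
// the map instead of giving up with "all bits unknown".
struct CrossBlockBits {
  std::map<unsigned, uint32_t> KnownZero; // vreg -> known-zero mask

  // Called when selecting the block that defines the vreg.
  void recordLiveOut(unsigned VReg, uint32_t KnownZeroMask) {
    KnownZero[VReg] = KnownZeroMask;
  }

  // Called when a query reaches CopyFromReg in a later block.
  uint32_t knownZeroBits(unsigned VReg) const {
    auto It = KnownZero.find(VReg);
    return It == KnownZero.end() ? 0 : It->second; // 0 = nothing known
  }

  // The redundant "and" from the switch example is removable iff the
  // mask only clears bits already known to be zero.
  bool andIsRedundant(unsigned VReg, uint32_t AndMask) const {
    return (~AndMask & ~knownZeroBits(VReg)) == 0;
  }
};
```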

This elegantly and cheaply solves the problem.  The reason I call it a  
hack is that it means you get different code for a function based on  
the order that blocks are selected (i.e. it is better to compile the  
defs of values before the uses), which is annoying, but actually  
probably just fine in practice.  Codegen remains deterministic, of
course.

2) An intermediate step to whole function isel is to do region based  
isel with regions limited by what makes the sched problem simple.  One  
very interesting class of region is a Single-Entry-Multiple-Exit  
region.  It would be straightforward to break each function into SEME
regions and select them one at a time, allowing them to compile to  
multiple MBBs.

In addition to the obvious dag combiner and isel improvements that  
would come from larger regions, SEME regions encapsulate a number of  
interesting things (such as switch lowering).  Having SEME regions  
would allow us to remove the block juggling in SDISel: switches would  
just lower to multiple blocks in one dag.  SEME regions are also  
totally trivial for the sched problem: because your region is just a  
tree, all you have to do is move a computation "up" the tree towards  
the root to the place that dominates all its uses.  Another  
interesting thing about SEME regions is that you don't have to worry  
about phi nodes etc.

OTOH, SEME regions are still limiting, so we would eventually want to  
do whole function (or at least "arbitrary region") isel.  SEME regions  
do handle "sinking" well, but don't handle LICM (because you can't  
have a whole loop in the region) and don't handle the diamond pattern
that "select" lowers to when a target doesn't support cmov.  Wouldn't
it be nice if legalize could lower select directly for targets (by  
inserting blocks into the DAG), instead of having to do the "custom  
sched inserter" hack? :)  Wouldn't it be nice if dag combine could  
actually itself form cmovs from diamond shapes in the cfg? :)
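For concreteness, here is the diamond and the select it is equivalent
to, written as plain C++ rather than DAG nodes (a cmov-less target
expands the second form into the first; the hoped-for combine would go
the other way):

```cpp
// The diamond that "select" lowers to when there is no cmov: two
// branch targets and a join block with the phi.
int diamond(bool c, int a, int b) {
  int r;
  if (c) r = a;   // then-block
  else   r = b;   // else-block
  return r;       // join block: r is the phi of the two stores
}

// The cmov/select form a dag combiner could turn the diamond back
// into, once it can see the whole shape in one dag.
int asSelect(bool c, int a, int b) {
  return c ? a : b;
}
```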

At the end of the day, attacking a subclass of the "big problem" is a  
really interesting way (to me) to incrementally tackle this problem,  
and I strongly encourage anyone interested in this to start with SEME  
regions first.  This allows us to figure out things like "how do we  
represent the start of a bb" (probably a new chained node), "how does  
the sched pass set the block that an instruction goes in" (probably a  
new field in SDNode), and allows us to do some cleanups (e.g. switch  
lowering) without tackling some of the other hard problems.  Once SEME  
regions are working really well, we can tackle the rest of the hard  
problems, because they are obviously worthwhile doing eventually.

> In toying with some code generators I've been frustrated in getting  
> the code gen prepare pass + isel to implement the heuristics I need  
> to match some addressing modes and I believe that whole-function  
> isel might be the answer.

I would love to see any progress in this area.  It is clearly an  
important thing to tackle, and it is blocking other interesting  
improvements in the code generator.  It would also allow us to  
eliminate a significant amount of weirdness that exists to hack around  
this (e.g. switch lowering).  That said, even when we *can* do whole  
function isel, I think we should keep the ability to handle subregions  
of the function for -O0, at least until SD stuff scales perfectly :).

