[LLVMdev] Greedy Register Allocation in LLVM 3.0

Jakob Stoklund Olesen stoklund at 2pi.dk
Mon Sep 26 09:49:37 PDT 2011

On Sep 26, 2011, at 4:22 AM, Leo Romanoff wrote:
> 1) Do you have any plans to publish something more formal and detailed about this algorithm, e.g. a paper or something similar? 

I wasn't planning on that, but I might put something together in the long dark California winter nights.

> It would be nice to better understand how this algorithm relates to well-known algorithms described in the literature, e.g. linear scan, graph coloring by Chaitin-Briggs, iterated register coalescing, register allocation by priority-based coloring by Chow and Hennessy, and whether it is an extension of one of those algorithms or a totally new one. It would also be nice if the terminology used by this description followed the published literature, as use of different terminology can be misleading. For example, it is rather well known that the linear scan register allocator of LLVM was not a real linear scan allocator, as it allowed backtracking and re-coloring; it was de facto a special case of a graph-coloring allocator. But due to its name, many
> published research papers assumed that it was a real linear scan implementation and took it as a representative example of such allocators.

So much for peer review ;-)

Much of the terminology in LLVM's register allocation framework comes from the linear scan literature, as far as I know.  Sometimes it doesn't make any sense.  For example, our LiveInterval class represents a sorted list of intervals.  Each interval is called a LiveRange.  I am hoping to clean this stuff up gradually.

Most articles on register allocation start with the premise that register allocation is isomorphic to graph coloring.  This is quite wrong in the real world where we must deal with overlapping register classes and aliasing registers.  The terminology is often based on this false premise, and that makes me reluctant to adopt it.

Ideally, I would like to settle on a language that is as close as possible to the established conventions without being misleading.

I don't think you can find any production quality compiler using a 'pure' published register allocator. The devil is in the details, and the published algorithms are too simple.

Anyway, the 'basic' register allocator is quite similar to Chow and Hennessy's priority-based coloring.  They separate out the unconstrained live ranges first, while basic just drops everything into the priority queue.  Unconstrained live ranges only make sense if you assume graph coloring isomorphism.

You could compare the greedy allocator to George and Appel's iterated register coalescing running in reverse.  Greedy coalesces live ranges aggressively, and splits them on demand.

> 2) A comparison of the quality of generated allocations  and allocator performance with well-known register allocation algorithms would be very interesting to better understand pros and cons of this new algorithm. Do you have any numbers at hand? If not, do you plan to produce a more thorough analysis?

Unfortunately, it would be too much work to make a fair comparison.  As I said, the published algorithms are not good enough for a real compiler. Each would require a lot of work before producing decent results. You would need to work around the missing support for overlapping register classes and register aliases, and you would need to handle pre-colored registers and call-clobbered registers somehow.

The greedy allocator was designed to handle these real-world problems as part of the core algorithm. It was my judgment that adapting something like George and Appel's iterated register coalescing to the real world would be just as much work as starting from scratch.  It was my hope that handling the real-world problems from the start would lead to a cleaner design.

GCC's new integrated register allocator was designed by adapting Chaitin-Briggs to the real world, as far as I know.  That is the fairest comparison I can think of.

> 3) It is stated in the blog that this allocator is much simpler than the old linear scan implementation of LLVM. Can you provide a bit more information?
>    For example:
>     - How many lines of code it is and how it compares in this regard to the old linear scan?

A lot of source files are involved, and some are shared, so I don't have these numbers.

The worst linear scan code is the rewriter: 2600 lines in VirtRegRewriter.cpp, plus parts of LiveIntervalAnalysis.cpp.

The hard part of greedy is the live range splitting code in SplitKit.cpp at 1400 lines.

The rest is of comparable size, but greedy gets you global live range splitting for the same price.

>     - Are there any weird places like the former SimpleRegisterCoalescer, which was not that simple at all and had extremely weird logic ;-)

Greedy is still using the same coalescer.  It has been renamed RegisterCoalescer for honesty. ;-)

Since greedy handles pre-colored registers as part of its algorithm, the coalescer no longer needs to handle physical register coalescing.  This code can be removed once linear scan is gone.

I suspect that parts of SplitKit.cpp might make people squint. As the author, I find it perfectly clear, of course.

>     - You explained that it produces better code than linear scan in most situations, which is cool! But it would also be interesting to know in which pathological cases it still loses to linear scan, and why

The regressions I have seen have mostly been luck.  That is, neither heuristic knows what it is doing, and one gets it just right by accident.

I have a vague sense that linear scan sometimes does better on certain high-register-pressure coloring problems. It is hard to quantify, though.

>     - Can you provide a rough estimate of the algorithmic complexity of the new algorithm? Is it O(n), O(n^2) or something else?

Something like O(N log M), where N is the number of live ranges and M is the number of contiguous live range segments. I wouldn't be surprised if the worst-case behavior is much worse.

The complexity of global live range splitting depends on the topology of the CFG spanned by the live range, but each split is sub-linear to linear in the number of blocks spanned.

Local live range splitting has a known quadratic behavior in large basic blocks that we still need to eliminate. It rarely shows up in practice.

