[LLVMdev] Fast register allocation

Jakob Stoklund Olesen stoklund at 2pi.dk
Mon May 17 17:43:13 PDT 2010


As you may have noticed, I have been working on a fast register allocator over the past week. The new allocator is intended to replace the local allocator for debug builds.

Both allocators use a local allocation strategy. Each basic block is scanned from top to bottom, and virtual registers are assigned to physical registers as they appear. There are no live registers between blocks. Everything is spilled at the end of each block. This strategy works well for unoptimized code where live ranges are short.
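To make the strategy concrete, here is a rough standalone sketch. The types (VirtReg, PhysReg, Instr) are made-up stand-ins rather than the real LLVM classes, and reloads and the out-of-registers case are left out; it only shows the top-to-bottom scan and the spill-everything-at-the-end step.

// Sketch of block-local allocation with simplified stand-in types.
// VirtReg/PhysReg are plain ints and Instr is a toy struct; none of this is
// the real LLVM API, and spilling when no register is free is not handled.
#include <cstdio>
#include <map>
#include <set>
#include <vector>

using VirtReg = int;
using PhysReg = int;
struct Instr { std::vector<VirtReg> Uses, Defs; };

void allocateBlock(const std::vector<Instr> &Block, std::set<PhysReg> FreeRegs) {
  std::map<VirtReg, PhysReg> LiveVirtRegs; // assignments live only within this block

  auto assign = [&](VirtReg V) -> PhysReg {
    auto It = LiveVirtRegs.find(V);
    if (It != LiveVirtRegs.end())
      return It->second;                   // already has a physical register
    PhysReg P = *FreeRegs.begin();         // a real allocator would spill if empty
    FreeRegs.erase(FreeRegs.begin());
    LiveVirtRegs[V] = P;
    return P;
  };

  // Scan the block top to bottom, assigning registers as values appear.
  for (const Instr &I : Block) {
    for (VirtReg V : I.Uses)
      std::printf("  use   vreg%d in phys%d\n", V, assign(V));
    for (VirtReg V : I.Defs)
      std::printf("  def   vreg%d in phys%d\n", V, assign(V));
  }

  // Nothing is live across the block boundary: spill everything at the end.
  for (const auto &Entry : LiveVirtRegs)
    std::printf("  spill vreg%d from phys%d\n", Entry.first, Entry.second);
}

int main() {
  std::vector<Instr> Block = {{{}, {1028}}, {{1028}, {1029}}, {{1029}, {}}};
  allocateBlock(Block, {1, 2, 3});
}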

The fast allocator (RAFast) uses a few tricks to run faster than the local allocator (RALocal) while producing slightly better code.

When RALocal allocates a physical register, it first checks if any aliases are in use to avoid clobbering a subregister. RAFast uses a working set strategy where recently killed registers are kept around for quick allocation without the alias scan.

RALocal uses a least-recently-used algorithm for allocating registers. RAFast uses a whatever-I-can-get-my-hands-on-quickly algorithm.
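Roughly, the difference between the two policies looks like this. The register file, alias table, and function names below are invented for the example; only the shape of the logic follows the description above.

// Standalone sketch of the two selection policies, with made-up types and a
// made-up alias table.
#include <deque>
#include <vector>

using PhysReg = int;

struct RegFile {
  std::vector<std::vector<PhysReg>> AliasesOf; // AliasesOf[R]: regs overlapping R
  std::vector<bool> InUse;                     // InUse[R]: R currently holds a value

  bool aliasesFree(PhysReg R) const {          // the scan RAFast tries to avoid
    for (PhysReg A : AliasesOf[R])
      if (InUse[A]) return false;
    return !InUse[R];
  }
};

// RALocal-flavoured: walk registers from least to most recently used and take
// the first one whose aliases are all free.
PhysReg allocLRU(const RegFile &RF, const std::deque<PhysReg> &LRUOrder) {
  for (PhysReg R : LRUOrder)
    if (RF.aliasesFree(R)) return R;
  return -1;                                   // caller would have to spill
}

// RAFast-flavoured: registers that were just killed form a working set and are
// handed out again immediately, skipping the alias scan; fall back to scanning
// the whole register class only when the working set is empty.
PhysReg allocFast(const RegFile &RF, std::vector<PhysReg> &RecentlyKilled,
                  const std::vector<PhysReg> &ClassRegs) {
  if (!RecentlyKilled.empty()) {
    PhysReg R = RecentlyKilled.back();         // whatever is at hand
    RecentlyKilled.pop_back();
    return R;
  }
  for (PhysReg R : ClassRegs)                  // first acceptable register wins
    if (RF.aliasesFree(R)) return R;
  return -1;
}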

RALocal folds spills and restores into other instructions. RAFast doesn't bother because debug builds have very few spills.

RAFast uses more aggressive hinting by peeking at future instructions. This helps reduce the number of register copies when setting up parameters for a call:

	%reg1028<def> = MOV32rm <fi#2>, 1, %reg0, 0, %reg0
	%reg1029<def> = MOV32rm <fi#2>, 1, %reg0, 4, %reg0
	ADJCALLSTACKDOWN64 0, %RSP<imp-def>, %EFLAGS<imp-def>, %RSP<imp-use>
	%reg1030<def> = LEA64r %RIP, 1, %reg0, <ga:@.str>
	%reg1031<def> = MOV8r0 %EFLAGS<imp-def,dead>
	%RDI<def> = MOV64rr %reg1030
	%ESI<def> = MOV32rr %reg1028
	%EDX<def> = MOV32rr %reg1029
	%AL<def> = MOV8rr %reg1031
	CALL64pcrel32 <ga:@printf>, %RDI, %ESI, %EDX, %AL, %RAX<imp-def>, %RDX<imp-def>, %RSI<imp-def>, %RDI<imp-def>, %RSP<imp-use>, ...

When finding a register for %reg1028, RAFast sees that it is later copied to %ESI. It is allocated %ESI and the copy disappears:

	%ESI<def> = MOV32rm <fi#2>, 1, %reg0, 0, %reg0
	%EDX<def> = MOV32rm <fi#2>, 1, %reg0, 4, %reg0
	ADJCALLSTACKDOWN64 0, %RSP<imp-def>, %EFLAGS<imp-def>, %RSP<imp-use>
	%RDI<def> = LEA64r %RIP, 1, %reg0, <ga:@.str>
	%AL<def> = MOV8r0 %EFLAGS<imp-def,dead>
	CALL64pcrel32 <ga:@printf>, %RDI<kill>, %ESI<kill>, %EDX<kill>, %AL<kill>, %RAX<imp-def>, %RDX<imp-def>, %RSI<imp-def>, %RDI<imp-def>, %RSP<imp-use>, ... 
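The lookahead itself is simple. Here is a rough sketch with made-up stand-in types (not the real MachineInstr API): when a virtual register is defined, scan forward in the block; if the value is later copied into a physical register that is still free, hand out that register so the copy becomes redundant.

// Sketch of the lookahead hint with simplified stand-in types.
#include <cstddef>
#include <optional>
#include <vector>

using VirtReg = int;
using PhysReg = int;

struct Instr {
  bool IsCopy = false;
  VirtReg Src = -1;   // copy source (virtual register)
  PhysReg Dst = -1;   // copy destination (physical register)
};

// Returns the physical register to prefer for V, if a later copy suggests one.
std::optional<PhysReg> lookaheadHint(const std::vector<Instr> &Block,
                                     std::size_t DefIdx, VirtReg V,
                                     const std::vector<bool> &PhysInUse) {
  for (std::size_t I = DefIdx + 1; I < Block.size(); ++I) {
    const Instr &MI = Block[I];
    if (MI.IsCopy && MI.Src == V && !PhysInUse[MI.Dst])
      return MI.Dst;          // e.g. %reg1028 -> %ESI in the listing above
  }
  return std::nullopt;        // no usable hint; allocate the normal way
}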

The aggressive coalescing also reduces register pressure, and there is a lot less spilling.
In a debug build of SPECCPU2006/403.gcc it results in 8% fewer instructions:

 Local    Fast
------- -------
 153623  258567 regalloc     - Number of copies coalesced
  24975   18381 regalloc     - Number of loads added
  19361   13864 regalloc     - Number of stores added
1109880 1018144 asm-printer  - Number of machine instrs printed

The fast allocator is about three times faster than the old local allocator.
It reduces the overall code generation time from 7.0s to 5.9s for 403.gcc:

RALocal: Total Execution Time: 6.9976 seconds (7.1717 wall clock)

---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
2.7376 ( 42.1%)   0.2554 ( 51.8%)   2.9931 ( 42.8%)   2.9930 ( 41.7%)  X86 DAG->DAG Instruction Selection
1.4644 ( 22.5%)   0.0073 (  1.5%)   1.4717 ( 21.0%)   1.4714 ( 20.5%)  Local Register Allocator
0.5407 (  8.3%)   0.1298 ( 26.3%)   0.6705 (  9.6%)   0.8463 ( 11.8%)  X86 AT&T-Style Assembly Printer

RAFast: Total Execution Time: 5.9255 seconds (6.0734 wall clock)

---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
2.7223 ( 50.1%)   0.2574 ( 52.5%)   2.9796 ( 50.3%)   2.9796 ( 49.1%)  X86 DAG->DAG Instruction Selection
0.5241 (  9.6%)   0.1237 ( 25.2%)   0.6479 ( 10.9%)   0.8003 ( 13.2%)  X86 AT&T-Style Assembly Printer
0.4526 (  8.3%)   0.0063 (  1.3%)   0.4589 (  7.7%)   0.4584 (  7.5%)  Fast Register Allocator


The fast allocator can be enabled for clang -O0 with a small patch:

diff --git a/lib/Frontend/CodeGenAction.cpp b/lib/Frontend/CodeGenAction.cpp
index 86005f2..cd2d2f0 100644
--- a/lib/Frontend/CodeGenAction.cpp
+++ b/lib/Frontend/CodeGenAction.cpp
@@ -322,7 +322,7 @@ bool BackendConsumer::AddEmitPasses() {
 
   // Set register scheduler & allocation policy.
   RegisterScheduler::setDefault(createDefaultScheduler);
-  RegisterRegAlloc::setDefault(Fast ? createLocalRegisterAllocator :
+  RegisterRegAlloc::setDefault(Fast ? createFastRegisterAllocator :
                                createLinearScanRegisterAllocator);
 
   // Create the code generator passes.

I have tested clang self-hosting on x86 and x86-64 with this patch.
On the nightly test suite, -regalloc=fast performs as well as -regalloc=local. See PR7154.

I expect we may see some problems with the register scavenger on ARM because -regalloc=fast doesn't currently add <imp-def,dead> flags on call-clobbered registers. That significantly increases register pressure as seen by the scavenger.
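As a toy illustration (simplified stand-ins, not the real RegScavenger), the difference the flag makes is roughly this: a clobber marked <imp-def,dead> is free again right after the call, while a plain <imp-def> keeps the register looking live.

// Toy model of how call clobbers affect liveness as seen by a scavenger.
#include <set>
#include <vector>

using PhysReg = int;

struct Clobber { PhysReg Reg; bool IsDead; };  // one implicit-def operand

std::set<PhysReg> liveAfterCall(std::set<PhysReg> Live,
                                const std::vector<Clobber> &Clobbers) {
  for (const Clobber &C : Clobbers) {
    if (C.IsDead)
      Live.erase(C.Reg);   // <imp-def,dead>: the value dies at the call
    else
      Live.insert(C.Reg);  // <imp-def> only: stays live, inflating pressure
  }
  return Live;
}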

Share and Enjoy!
/jakob




