[PATCH] D17235: [Greedy Regalloc] Reg splitting based on loops

Fri Feb 12 21:40:25 PST 2016

wmi created this revision.
wmi added a reviewer: qcolombet.
wmi added subscribers: llvm-commits, davidxl.
wmi set the repository for this revision to rL LLVM.
Herald added subscribers: qcolombet, MatzeB.

The patch is to add reg splitting based on loops, especially hot loops. It is a supplement to current region splitting.

During analysis of PR26597, PR24278 and PR24348, we find some typical senarios which cannot be well handled by current region splitting.

* Vreg1 has reference inside loop1 and lives through loop2. Vreg2 is both defined and used inside loop2. Vreg1 gets a physical register before Vreg2's allocation. Loop2 has high register pressure, so vreg2 cannot find available physical register. Vreg1 has higher weight than vreg2 because it is used in loop1 (The hotness of loop1 and loop2 are comparable), so vreg2 cannot evict vreg1 and can only be spilled.
  However, it is better to split vreg1 around loop2. If vreg1 is splitted around loop2, the vreg1' generated around loop2 will only have define and use at the loop preheader and exit block (cold compared with loop body), which means it has low weight and can be evicted by vreg2. After vreg2 evicts vreg1', vreg2 will get physical register and there is less spill in loop2.
  Current region splitting will not try splitting vreg1 when allocating for vreg2 -- it will only try splitting vreg2.  
  PR26597 belongs to this case.

* Both vreg1 and vreg2 have references inside a loop. The reference of vreg2 in the loop can be rematerialized but vreg1 cannot, which indicates vreg2 should have lower weight than vreg1.  But in reality vreg2 still has higher weight than vreg1 because it has shorter live range than vreg1, which is another factor affecting live range weight.
  After vreg2 gets physical register, vreg1 cannot evict vreg2 because it has lower weight. vreg1 cannot be rematerialized so there will be a spill inserted in the loop.
  For this case, it is also better to split vreg1 around loop, so vreg1' generated by split around loop will have higher weight than vreg2 because it is shorter. vreg1' will evict vreg2 and vreg2 reference in loop will be rematerialized. There will be less spill in loop2.  
  Current region splitting will also split vreg1, but the splitting point is inside the loop. That is because vreg1 has reference inside loop not interfering with a physical register. In order to let vreg1 references inside loop be allocated to physical register as many as possible, region based spliitting choose to split inside loop. Although the splitted vreg1' inside loop evict vreg2 and get physical register in the end, there are extra reg movs left inside the loop.    
  PR24278 belongs to this case.

* Vreg1 has only 1 reference in loop1 but vreg2 has 4 reference in loop1. vreg1 also has many reference in loop2 so overall vreg1 has higher weight than vreg2. Loop1 has high register pressure and only one physical register is left for vreg1 and vreg2. Finally vreg1 get the physical register because it has higher weight. However, it is unoptimal for loop1 because vreg2 actually has more reference in loop1.    
   For this case, it is better to split both vreg1 and vreg2 around loop1, so the weight of vreg1' and vreg2' around the loop after split can be computed all based on the references inside loop1, and then vreg2' will have higher weight than vreg1' and it will get the physical register.  
   Current region splitting only tries to find holes from the live ranges of physical register being used. It will not try the splitting here.
   This is an assumed case. I didn't try to catch such testcase from testsuites but I think it probably exists.

>From the cases analysis above, to optimize register allocation for a hot loop with high register pressure to the best, we may want to split all the virtregs which live across the loop at the loop boundary. In such way, the hot loop will become an independent regalloc region isolated from regalloc outside the region. The patch here implements such splitting.

The logic to implement splitting around loop has two major steps. The first step is: If a vreg lives across a hot loop and it has real reference inside the loop, we will only split the vreg around the loop. If the new vreg around loop generated after splitting still cannot get physical register, we will try the second step - to split all the vregs living across the loop once in all. The intuition of the first step is, if the new vreg around loop after splitting still cannot get physical register, it tells us definitively the loop has high register pressure, and only then we will try the second step.

Another part necessary is to split critical edges for loop entry and exit so that the splitting can happen at the loop boundary. A separate pass is added to do that. I find it is also helpful for phi-elimination and shrinkwrapping.

Performance on x86-64:
0.5% improvement on avg for google internal benchmarks.
Some improvements on llvm testsuite, no apparent regression saw.
  SingleSource/Benchmarks/Shootout/matrix: 10.54%
  SingleSource/Benchmarks/Adobe-C++/stepanov_vector: 3.41%
  MultiSource/Benchmarks/sim/sim: 7.31%
  MultiSource/Applications/siod/siod: 9.59%
  (The above four are improved because of reg splitting around loop)
  SingleSource/Benchmarks/Shootout-C++/ackermann: 21.3%
  SingleSource/Benchmarks/BenchmarkGame/recursive: 11.02%
  (These two are improved because splitting critical edges enables more shrinkwrapping)
  MultiSource/Benchmarks/ASC_Sequoia/CrystalMk/CrystalMk: 1.73%
  (This testcase is improved because after splitting critical edges, phi-elimination can generate a copy outside loop)

Compile time:
There is 1.12% compile time increase for spec2006 all C/C++ benchmarks. For one file I looked at, the increased compile is majorly spent on propagateSiblingValues. When I use -split-spill-mode=size for both base and comparison, the compile time increase drops to 0.8%. I will work on it and try to reduce the compile time increase more. 

Problems to solve - Better hints handling.
Loop splitting introduces more vreg copies at the loop boundary. It depends on register allocation to coalesce such copies using hints. However, current hint setting and using in regalloc is not optimal.
A typical case we found is like this: vreg2 and vreg3 are splitted from vreg1. vreg1 has a hint of R8 so vreg2 and vreg3 both have hint of R8. However vreg2 finally gets R10 assigned and then vreg3 also gets R10 because its hint changes from R8 to R10. At this time, the hint of vreg2 is still R8. Depriving R10 from vreg2 will not be regarded as a BrokenHint, which leads to easy deprivation of R10 from vreg2. The copy between vreg2 and vreg3 cannot be removed.
The case we saw was at the loop boundary in a cold outer loop so it didn't cause perf regression, but it is possible that it happened elsewhere I didn't notice.
The hint handling problem exposed here is a general issue and may worth some effort to improve.  

Repository:
  rL LLVM

http://reviews.llvm.org/D17235

Files:
  include/llvm/CodeGen/Passes.h
  include/llvm/InitializePasses.h
  lib/CodeGen/CMakeLists.txt
  lib/CodeGen/CodeGen.cpp
  lib/CodeGen/Passes.cpp
  lib/CodeGen/RegAllocGreedy.cpp
  lib/CodeGen/SplitCritEdges.cpp
  lib/CodeGen/SplitKit.cpp
  lib/CodeGen/SplitKit.h

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D17235.47896.patch
Type: text/x-patch
Size: 33370 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20160213/687ca206/attachment-0001.bin>