[LLVMdev] Poor register allocation (constants causing spilling)
Philip Reames
listmail at philipreames.com
Tue Jul 14 21:52:18 PDT 2015
Robert,
Thanks for sharing your findings and analysis. For the moment, I'll
leave the technical discussion for Quentin and those more familiar with
the regalloc code, but I wanted to express interest in getting this
fixed and say thanks.
Has there been a bug filed for this yet?
Philip
On 07/14/2015 10:19 AM, Robert Lougher wrote:
> Hi,
>
> While investigating a performance issue with an internal codebase I
> came across what looks to be poor register allocation. I have
> constructed a small(ish) reproducible which demonstrates the issue
> (see test.ll attached).
>
> I have spent some time going through the register allocator to
> understand what is happening. I have also experimented with some
> small changes to try and improve the allocation. Unfortunately, the
> full report is fairly long and detailed. However, in short, I found
> that not splitting rematerializable live-ranges led to significantly
> better register allocation, and an overall performance improvement of
> 3%.
>
> *** The Problem
>
> Compile the attached testcase as follows:
>
> llc -mcpu=btver2 test.ll
>
> Examining the assembly in test.s, we can see a constant being loaded
> into %xmm8 (the second instruction in foo). Tracing the constant, we
> can see the following:
>
> foo:
> ...
> vmovaps .LCPI0_0(%rip), %xmm8 # xmm8 = [6.366197e-01,6.366197e-01,...]
> ...
> vmulps %xmm8, %xmm0, %xmm1 # first use of constant
> vmovaps %xmm8, %xmm9 # move constant into another register
> ...
> vmovaps %xmm0, -40(%rsp) # 16-byte Spill
> vmovaps %xmm9, %xmm0 # move constant into vacated register
> ...
> vmulps %xmm0, %xmm3, %xmm2 # second use of constant
> ...
> vmulps %xmm0, %xmm5, %xmm3 # last use of constant
> ...
>
> The constant is used 3 times within a live range of 86 instructions.
> Although the constant is clearly rematerializable, the allocator has
> gone to some length to keep it in a register, and it has spilled a
> value to the stack. It would have been cheaper to simply fold the
> constant load into the 3 uses.
>
> This is not the only example. Later on we can see this:
>
> vmovaps .LCPI0_1(%rip), %xmm6 # xmm6 = [2147483648,2147483648,...]
> vxorps %xmm6, %xmm2, %xmm3
> ...
> vandps %xmm6, %xmm5, %xmm2
> ...
> vmovaps %xmm1, -56(%rsp) # 16-byte Spill
> vmovaps %xmm6, %xmm1
> ...
> vmovaps -56(%rsp), %xmm0 # 16-byte Reload
> ...
> vxorps %xmm1, %xmm3, %xmm4
> ...
>
> Here, we have a spill and reload to keep the constant in a register
> for a single use!
>
> *** Live-Range Splitting
>
> Looking at the first constant, we can see that the copies are inserted
> by the greedy register allocator when it splits the live range. The
> main entry point is selectOrSplit. In brief:
>
> 1) The constant is assigned and evicted several times. Eventually
> selectOrSplit decides to split the live range, creating two new
> intervals (A and B). At the boundary between the intervals, a copy is
> inserted between the two virtual registers.
>
> 2) The virtual register for the first interval (A) is assigned to XMM8.
>
> 3) The virtual register for the second interval (B) is assigned and
> evicted several times before selectOrSplit decides to split again (C
> and D).
>
> 4) The virtual register for interval C is assigned to XMM9.
>
> 5) The virtual register for interval D is assigned to XMM0.
>
> At the boundary between A and C we have a copy from XMM8 to XMM9, and
> at the boundary between C and D we have a copy from XMM9 to XMM0.
>
> The spill is the result of several evictions. To assign interval D to
> XMM0, another virtual register was evicted. That register had itself
> previously evicted another value, which was subsequently spilled.
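>
> To make the structure concrete, here is a minimal standalone sketch of
> the one property that matters for the rest of this mail (these are not
> LLVM's data structures; all names below are invented for illustration):
> each split inserts a COPY at the boundary, and the remainder interval
> is then defined by that COPY rather than by the original constant-pool
> load.
>
> #include <string>
> #include <utility>
> #include <vector>
>
> struct LiveIntervalModel {
>   std::string Name;          // e.g. "A", "C", "D"
>   std::string DefOpcode;     // opcode of the defining instruction
>   std::pair<int, int> Range; // [start, end) in instruction numbers
> };
>
> // Split Parent at SplitPoint: the first half keeps the original
> // definition, the second half is defined by the COPY inserted at the
> // boundary between the two new virtual registers.
> std::vector<LiveIntervalModel> splitAt(const LiveIntervalModel &Parent,
>                                        int SplitPoint) {
>   LiveIntervalModel First{Parent.Name + ".0", Parent.DefOpcode,
>                           {Parent.Range.first, SplitPoint}};
>   LiveIntervalModel Second{Parent.Name + ".1", "COPY",
>                            {SplitPoint, Parent.Range.second}};
>   return {First, Second};
> }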
>
> *** Spill Weights
>
> The main question is why the register allocator decides to split the
> constant live range and evict other, non-rematerializable registers.
> This comes down to the calculation of the spill weight, which is used
> to determine whether one interval should evict another.
> If the live interval is rematerializable the spill weight is halved
> (multiplied by 0.5). From what I can see, this calculation is the
> only place in the greedy register allocator (as opposed to the
> spiller) where rematerializability is considered.
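>
> As a rough sketch of the shape of that adjustment (paraphrased from
> memory, not a quote of CalcSpillWeights.cpp; the helper name is made
> up):
>
> // Rematerializable values are cheaper to recompute than to spill and
> // reload, so their weight is reduced, making them easier to evict.
> float adjustSpillWeight(float TotalWeight, bool TriviallyRematDef) {
>   if (TriviallyRematDef)
>     TotalWeight *= 0.5f; // the only remat-related adjustment
>   return TotalWeight;
> }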
>
> So rematerializable intervals should be less important than
> non-rematerializable intervals, assuming a similar number of uses, size
> of range, etc. However, there appears to be a flaw:
>
> * A rematerializable interval once split is no longer rematerializable *
>
> The isRematerializable check in CalcSpillWeights.cpp uses the target
> instruction info to check that the machine instruction for the live
> interval definition is trivially rematerializable. In the case of
> interval A above, the definition is a load from the constant-pool.
> This is trivially rematerializable and the spill weight is halved.
> However, in the remainder intervals B, C and D the definition is a
> copy. This is not trivially rematerializable, and the spill weight is
> unadjusted. This means these ranges are considered as expensive as
> non-rematerializable ranges.
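>
> A small standalone model of the check makes the flaw obvious (the names
> are made up; the real code queries the target instruction info on the
> interval's defining instruction):
>
> #include <string>
>
> struct IntervalDef {
>   std::string Opcode; // opcode of the interval's defining instruction
> };
>
> bool looksRematerializable(const IntervalDef &Def) {
>   // Interval A is defined by the constant-pool load: true, weight halved.
>   // Intervals B, C and D are defined by the COPY inserted when splitting:
>   // false, so they compete on equal terms with genuinely expensive values.
>   return Def.Opcode == "CONSTPOOL_LOAD"; // stand-in for the TII query
> }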
>
> *** Experiment One
>
> In contrast to the spill weight calculation, the inline spiller can
> trace rematerializability back through the copies resulting from live
> range splitting (what it calls sibling values). The first experiment
> modified the greedy register allocator so that it would not split a
> rematerializable interval more than once (i.e. instead of splitting
> interval B into C and D it is spilled). The purpose of this
> experiment was to confirm that the inline spiller was able to follow
> the copy and rematerialize the load from the constant-pool. In this
> case, the load was folded into the two remaining uses (in fact, the
> use in interval A was also folded as it was the single remaining use).
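>
> In the same spirit, here is a sketch of what tracing rematerializability
> back through the split copies means (made-up structures, not the
> spiller's actual sibling-value code):
>
> #include <map>
> #include <string>
>
> struct DefInfo {
>   std::string Opcode;  // "COPY" or the original defining opcode
>   std::string CopySrc; // interval copied from, when Opcode is "COPY"
> };
>
> bool rematThroughCopies(const std::string &Name,
>                         const std::map<std::string, DefInfo> &Defs) {
>   const DefInfo *D = &Defs.at(Name);
>   while (D->Opcode == "COPY") // walk D -> C -> A along the split copies
>     D = &Defs.at(D->CopySrc);
>   return D->Opcode == "CONSTPOOL_LOAD"; // the original def is the load
> }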
>
> *** Experiment Two
>
> This experiment modified the greedy register allocator so that the
> spill weights for split rematerializable ranges were adjusted
> correctly (a hack rather than a proper fix). The hope was that with
> the "correct" spill weight the relative priorities of the live ranges
> would result in better allocation (see end for full statistics).
>
> Unmodified:
>
> Spilled 3 values to the stack
> 124 instructions in foo
>
> Modified:
>
> Spilled 1 value to the stack
> 113 instructions in foo
>
> Unfortunately, the performance showed only minimal improvement.
>
> *** Experiment Three
>
> The code gives no explanation for the spill weight adjustment value of
> 0.5. This experiment was designed to determine what adjustment factor
> was needed to obtain no spills. This point was reached with an
> adjustment factor for rematerializable intervals of 0.3.
>
> Modified:
>
> No spills to the stack
> 111 instructions in foo
>
> Again, the performance showed only minimal improvement.
>
> *** Experiment Four
>
> In the previous experiments, splitting was still occurring, leading to
> register copies. In this final experiment the register allocator was
> modified so that rematerializable ranges are not split at all, but are
> spilled if they cannot be assigned (obviously, the spiller will
> rematerialize the values).
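>
> Conceptually the change is just a different stage transition for
> rematerializable ranges (a sketch with made-up names, not the actual
> patch):
>
> enum class RAStage { Assign, Evict, Split, Spill };
>
> // Unmodified allocator: always try splitting before spilling.
> // Experiment Four: rematerializable ranges skip splitting entirely and
> // go straight to the spiller, which rematerializes or folds the load.
> RAStage nextStageAfterEviction(bool Rematerializable) {
>   return Rematerializable ? RAStage::Spill : RAStage::Split;
> }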
>
> Modified:
>
> No spills to the stack
> 106 instructions in foo
>
> This version contains considerably more folded loads than the previous
> experiments, and shows a 3% speed improvement.
>
> *** Conclusion
>
> I had hoped to get to this point by fixing the spill weight
> calculation rather than by changing the register allocator logic, as
> the latter seems a more intrusive change. However, results from SPEC
> CPU2006 show the performance difference of most tests to be in the
> noise (<1%). Depending on architecture (btver2 or Haswell), libquantum
> and mcf show a minor improvement. The only consistent slow-down (both
> btver2 and Haswell) is lbm, which shows a decrease of 1-2%. Here, I
> can see that not splitting rematerializable ranges produces an extra
> spill in the hot loop (5 rather than 4). However, the greedy register
> allocator allocates half the number of spill slots (5 rather than 10);
> when splitting, the spiller appears to be able to remove some spills
> as redundant.
>
> Obviously much more work needs to be done before a patch based on
> these experimental results can be submitted. For example, I have done
> no testing on non-X86 targets. However, at this point I would welcome
> opinions and suggestions!
>
> Thanks,
> Rob.
>
> *** Register Allocator Statistics
>
> Unmodified:
>
> 32 regalloc - Number of copies inserted for splitting
> 2 regalloc - Number of folded loads
> 3 regalloc - Number of folded stack accesses
> 19 regalloc - Number of identity moves eliminated after rewriting
> 18 regalloc - Number of instructions deleted by DCE
> 158 regalloc - Number of interferences evicted
> 280 regalloc - Number of new live ranges queued
> 311 regalloc - Number of registers assigned
> 173 regalloc - Number of registers unassigned
> 2 regalloc - Number of reloads inserted
> 8 regalloc - Number of rematerialized defs for spilling
> 3 regalloc - Number of rematerialized defs for splitting
> 7 regalloc - Number of single use loads folded after DCE
> 3 regalloc - Number of spill slots allocated
> 13 regalloc - Number of spilled live ranges
> 3 regalloc - Number of spills inserted
> 31 regalloc - Number of split local live ranges
> 31 regalloc - Number of splits finished
> 29 regalloc - Number of splits that were simple
>
> Experiment Two:
>
> 26 regalloc - Number of copies inserted for splitting
> 9 regalloc - Number of folded loads
> 10 regalloc - Number of folded stack accesses
> 12 regalloc - Number of identity moves eliminated after rewriting
> 21 regalloc - Number of instructions deleted by DCE
> 88 regalloc - Number of interferences evicted
> 181 regalloc - Number of new live ranges queued
> 224 regalloc - Number of registers assigned
> 102 regalloc - Number of registers unassigned
> 4 regalloc - Number of rematerialized defs for spilling
> 8 regalloc - Number of single use loads folded after DCE
> 1 regalloc - Number of spill slots allocated
> 10 regalloc - Number of spilled live ranges
> 4 regalloc - Number of spilled snippets
> 1 regalloc - Number of spills inserted
> 26 regalloc - Number of split local live ranges
> 26 regalloc - Number of splits finished
> 26 regalloc - Number of splits that were simple
>
> Experiment Three:
>
> 26 regalloc - Number of copies inserted for splitting
> 10 regalloc - Number of folded loads
> 10 regalloc - Number of folded stack accesses
> 13 regalloc - Number of identity moves eliminated after rewriting
> 20 regalloc - Number of instructions deleted by DCE
> 83 regalloc - Number of interferences evicted
> 174 regalloc - Number of new live ranges queued
> 220 regalloc - Number of registers assigned
> 98 regalloc - Number of registers unassigned
> 3 regalloc - Number of rematerialized defs for spilling
> 7 regalloc - Number of single use loads folded after DCE
> 9 regalloc - Number of spilled live ranges
> 4 regalloc - Number of spilled snippets
> 26 regalloc - Number of split local live ranges
> 26 regalloc - Number of splits finished
> 26 regalloc - Number of splits that were simple
>
> Experiment Four:
>
> 21 regalloc - Number of folded loads
> 21 regalloc - Number of folded stack accesses
> 2 regalloc - Number of identity moves eliminated after rewriting
> 7 regalloc - Number of instructions deleted by DCE
> 42 regalloc - Number of interferences evicted
> 49 regalloc - Number of new live ranges queued
> 148 regalloc - Number of registers assigned
> 42 regalloc - Number of registers unassigned
> 7 regalloc - Number of spilled live ranges
>
> --
> Robert Lougher
> SN Systems - Sony Computer Entertainment Group
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev