[LLVMdev] Poor register allocation (constants causing spilling)
Philip Reames
listmail at philipreames.com
Tue Jul 14 21:52:18 PDT 2015
Robert,
Thanks for sharing your findings and analysis. For the moment, I'll
leave the technical discussion for Quentin and those more familiar with
the regalloc code, but I wanted to express interest in getting this
fixed and say thanks.
Has there been a bug filed for this yet?
Philip
On 07/14/2015 10:19 AM, Robert Lougher wrote:
> Hi,
>
> While investigating a performance issue with an internal codebase I
> came across what looks to be poor register allocation. I have
> constructed a small(ish) reproducible which demonstrates the issue
> (see test.ll attached).
>
> I have spent some time going through the register allocator to
> understand what is happening. I have also experimented with some
> small changes to try and improve the allocation. Unfortunately, the
> full report is fairly long and detailed. However, in short, I found
> that not splitting rematerializable live-ranges led to significantly
> better register allocation, and an overall performance improvement of
> 3%.
>
> *** The Problem
>
> Compile the attached testcase as follows:
>
> llc -mcpu=btver2 test.ll
>
> Examining the assembly in test.s, we can see a constant being loaded
> into %xmm8 (the second instruction in foo). Tracing the constant, we
> can see the following:
>
> foo:
> ...
> vmovaps .LCPI0_0(%rip), %xmm8 # xmm8 = [6.366197e-01,6.366197e-01,...]
> ...
> vmulps %xmm8, %xmm0, %xmm1 # first use of constant
> vmovaps %xmm8, %xmm9 # move constant into another register
> ...
> vmovaps %xmm0, -40(%rsp) # 16-byte Spill
> vmovaps %xmm9, %xmm0 # move constant into vacated register
> ...
> vmulps %xmm0, %xmm3, %xmm2 # second use of constant
> ...
> vmulps %xmm0, %xmm5, %xmm3 # last use of constant
> ...
>
> The constant is used 3 times within a live range of 86 instructions.
> Although the constant is clearly rematerializable, the allocator has
> gone to some length to keep it in a register, and it has spilled a
> value to the stack. It would have been cheaper to simply fold the
> constant load into the 3 uses.
>
> This is not the only example. Later on we can see this:
>
> vmovaps .LCPI0_1(%rip), %xmm6 # xmm6 = [2147483648,2147483648,...]
> vxorps %xmm6, %xmm2, %xmm3
> ...
> vandps %xmm6, %xmm5, %xmm2
> ...
> vmovaps %xmm1, -56(%rsp) # 16-byte Spill
> vmovaps %xmm6, %xmm1
> ...
> vmovaps -56(%rsp), %xmm0 # 16-byte Reload
> ...
> vxorps %xmm1, %xmm3, %xmm4
> ...
>
> Here, we have a spill and reload to keep the constant in a register
> for a single use!
>
> *** Live-Range Splitting
>
> Looking at the first constant, we can see that the copies are inserted
> by the greedy register allocator when it splits the live range. The
> main entry point is selectOrSplit. In brief:
>
> 1) The constant is assigned and evicted several times. Eventually
> selectOrSplit decides to split the live range, creating two new
> intervals (A and B). At the boundary between the intervals, a copy is
> inserted between the two virtual registers.
>
> 2) The virtual register for the first interval (A) is assigned to XMM8.
>
> 3) The virtual register for the second interval (B) is assigned and
> evicted several times before selectOrSplit decides to split again (C
> and D).
>
> 4) The virtual register for interval C is assigned to XMM9.
>
> 5) The virtual register for interval D is assigned to XMM0.
>
> At the boundary between A and C we have a copy from XMM8 to XMM9, and
> at the boundary between C and D we have a copy from XMM9 to XMM0.
>
> The spill is the result of several evictions. To assign interval D to
> XMM0, another virtual register was evicted. That register had itself
> previously evicted another value, which was subsequently spilled.
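>
> To make the structure concrete, here is a minimal standalone sketch of
> the one property that matters for the rest of this mail (these are not
> LLVM's data structures; all names below are invented for illustration):
> each split inserts a COPY at the boundary, and the remainder interval
> is then defined by that COPY rather than by the original constant-pool
> load.
>
> #include <string>
> #include <utility>
> #include <vector>
>
> struct LiveIntervalModel {
>   std::string Name;          // e.g. "A", "C", "D"
>   std::string DefOpcode;     // opcode of the defining instruction
>   std::pair<int, int> Range; // [start, end) in instruction numbers
> };
>
> // Split Parent at SplitPoint: the first half keeps the original
> // definition, the second half is defined by the COPY inserted at the
> // boundary between the two new virtual registers.
> std::vector<LiveIntervalModel> splitAt(const LiveIntervalModel &Parent,
>                                        int SplitPoint) {
>   LiveIntervalModel First{Parent.Name + ".0", Parent.DefOpcode,
>                           {Parent.Range.first, SplitPoint}};
>   LiveIntervalModel Second{Parent.Name + ".1", "COPY",
>                            {SplitPoint, Parent.Range.second}};
>   return {First, Second};
> }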
>
> *** Spill Weights
>
> The main question is why the register allocator decides to split the
> constant live range and evict other, non-rematerializable registers.
> This comes down to the calculation of the spill weight, which is used
> to determine whether one interval should evict another.
> If the live interval is rematerializable the spill weight is halved
> (multiplied by 0.5). From what I can see, this calculation is the
> only place in the greedy register allocator (as opposed to the
> spiller) where rematerializability is considered.
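>
> As a rough sketch of the shape of that adjustment (paraphrased from
> memory, not a quote of CalcSpillWeights.cpp; the helper name is made
> up):
>
> // Rematerializable values are cheaper to recompute than to spill and
> // reload, so their weight is reduced, making them easier to evict.
> float adjustSpillWeight(float TotalWeight, bool TriviallyRematDef) {
>   if (TriviallyRematDef)
>     TotalWeight *= 0.5f; // the only remat-related adjustment
>   return TotalWeight;
> }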
>
> So rematerializable intervals should be less important than
> non-rematerializable intervals, assuming a similar number of uses, size
> of range, etc. However, there appears to be a flaw:
>
> * A rematerializable interval once split is no longer rematerializable *
>
> The isRematerializable check in CalcSpillWeights.cpp uses the target
> instruction info to check that the machine instruction for the live
> interval definition is trivially rematerializable. In the case of
> interval A above, the definition is a load from the constant-pool.
> This is trivially rematerializable and the spill weight is halved.
> However, in the remainder intervals B, C and D the definition is a
> copy. This is not trivially rematerializable, and the spill weight is
> unadjusted. This means these ranges are considered as expensive as
> non-rematerializable ranges.
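>
> A small standalone model of the check makes the flaw obvious (the names
> are made up; the real code queries the target instruction info on the
> interval's defining instruction):
>
> #include <string>
>
> struct IntervalDef {
>   std::string Opcode; // opcode of the interval's defining instruction
> };
>
> bool looksRematerializable(const IntervalDef &Def) {
>   // Interval A is defined by the constant-pool load: true, weight halved.
>   // Intervals B, C and D are defined by the COPY inserted when splitting:
>   // false, so they compete on equal terms with genuinely expensive values.
>   return Def.Opcode == "CONSTPOOL_LOAD"; // stand-in for the TII query
> }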
>
> *** Experiment One
>
> In contrast to the spill weight calculation, the inline spiller can
> trace rematerializability back through the copies resulting from live
> range splitting (what it calls sibling values). The first experiment
> modified the greedy register allocator so that it would not split a
> rematerializable interval more than once (i.e. instead of splitting
> interval B into C and D it is spilled). The purpose of this
> experiment was to confirm that the inline spiller was able to follow
> the copy and rematerialize the load from the constant-pool. In this
> case, the load was folded into the two remaining uses (in fact, the
> use in interval A was also folded as it was the single remaining use).
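>
> In the same spirit, here is a sketch of what tracing rematerializability
> back through the split copies means (made-up structures, not the
> spiller's actual sibling-value code):
>
> #include <map>
> #include <string>
>
> struct DefInfo {
>   std::string Opcode;  // "COPY" or the original defining opcode
>   std::string CopySrc; // interval copied from, when Opcode is "COPY"
> };
>
> bool rematThroughCopies(const std::string &Name,
>                         const std::map<std::string, DefInfo> &Defs) {
>   const DefInfo *D = &Defs.at(Name);
>   while (D->Opcode == "COPY") // walk D -> C -> A along the split copies
>     D = &Defs.at(D->CopySrc);
>   return D->Opcode == "CONSTPOOL_LOAD"; // the original def is the load
> }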
>
> *** Experiment Two
>
> This experiment modified the greedy register allocator so that the
> spill weights for split rematerializable ranges were adjusted
> correctly (a hack rather than a proper fix). The hope was that with
> the "correct" spill weight the relative priorities of the live ranges
> would result in better allocation (see end for full statistics).
>
> Unmodified:
>
> Spilled 3 values to the stack
> 124 instructions in foo
>
> Modified:
>
> Spilled 1 value to the stack
> 113 instructions in foo
>
> Unfortunately, the performance showed only minimal improvement.
>
> *** Experiment Three
>
> The code gives no explanation for the spill weight adjustment value of
> 0.5. This experiment was designed to determine what adjustment factor
> was needed to obtain no spills. This point was reached with an
> adjustment factor for rematerializable intervals of 0.3.
>
> Modified:
>
> No spills to the stack
> 111 instructions in foo
>
> Again, the performance showed only minimal improvement.
>
> *** Experiment Four
>
> In the previous experiments, splitting was still occurring, leading to
> register copies. In this final experiment the register allocator was
> modified so that rematerializable ranges are not split at all, but are
> spilled if they cannot be assigned (obviously, the spiller will
> rematerialize the values).
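>
> Conceptually the change is just a different stage transition for
> rematerializable ranges (a sketch with made-up names, not the actual
> patch):
>
> enum class RAStage { Assign, Evict, Split, Spill };
>
> // Unmodified allocator: always try splitting before spilling.
> // Experiment Four: rematerializable ranges skip splitting entirely and
> // go straight to the spiller, which rematerializes or folds the load.
> RAStage nextStageAfterEviction(bool Rematerializable) {
>   return Rematerializable ? RAStage::Spill : RAStage::Split;
> }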
>
> Modified:
>
> No spills to the stack
> 106 instructions in foo
>
> This version contains considerably more folded loads than the previous
> experiments, and shows a 3% speed improvement.
>
> *** Conclusion
>
> I had hoped to get to this point by fixing the spill weight
> calculation rather than by changing the register allocator logic, as
> the latter seems a more intrusive change. However, results from SPEC
> CPU2006 show the performance difference of most tests to be in the
> noise (<1%). Depending on architecture (btver2 or Haswell), libquantum
> and mcf show a minor improvement. The only consistent slow-down (both
> btver2 and Haswell) is lbm, which shows a decrease of 1-2%. Here, I
> can see that not splitting rematerializable ranges produces an extra
> spill in the hot loop (5 rather than 4). However, the greedy register
> allocator allocates half the number of spill slots (5 rather than 10);
> when splitting, the spiller appears to be able to remove some spills
> as redundant.
>
> Obviously much more work needs to be done before a patch based on
> these experimental results can be submitted. For example, I have done
> no testing on non-X86 targets. However, at this point I would welcome
> opinions and suggestions!
>
> Thanks,
> Rob.
>
> *** Register Allocator Statistics
>
> Unmodified:
>
> 32 regalloc - Number of copies inserted for splitting
> 2 regalloc - Number of folded loads
> 3 regalloc - Number of folded stack accesses
> 19 regalloc - Number of identity moves eliminated after rewriting
> 18 regalloc - Number of instructions deleted by DCE
> 158 regalloc - Number of interferences evicted
> 280 regalloc - Number of new live ranges queued
> 311 regalloc - Number of registers assigned
> 173 regalloc - Number of registers unassigned
> 2 regalloc - Number of reloads inserted
> 8 regalloc - Number of rematerialized defs for spilling
> 3 regalloc - Number of rematerialized defs for splitting
> 7 regalloc - Number of single use loads folded after DCE
> 3 regalloc - Number of spill slots allocated
> 13 regalloc - Number of spilled live ranges
> 3 regalloc - Number of spills inserted
> 31 regalloc - Number of split local live ranges
> 31 regalloc - Number of splits finished
> 29 regalloc - Number of splits that were simple
>
> Experiment Two:
>
> 26 regalloc - Number of copies inserted for splitting
> 9 regalloc - Number of folded loads
> 10 regalloc - Number of folded stack accesses
> 12 regalloc - Number of identity moves eliminated after rewriting
> 21 regalloc - Number of instructions deleted by DCE
> 88 regalloc - Number of interferences evicted
> 181 regalloc - Number of new live ranges queued
> 224 regalloc - Number of registers assigned
> 102 regalloc - Number of registers unassigned
> 4 regalloc - Number of rematerialized defs for spilling
> 8 regalloc - Number of single use loads folded after DCE
> 1 regalloc - Number of spill slots allocated
> 10 regalloc - Number of spilled live ranges
> 4 regalloc - Number of spilled snippets
> 1 regalloc - Number of spills inserted
> 26 regalloc - Number of split local live ranges
> 26 regalloc - Number of splits finished
> 26 regalloc - Number of splits that were simple
>
> Experiment Three:
>
> 26 regalloc - Number of copies inserted for splitting
> 10 regalloc - Number of folded loads
> 10 regalloc - Number of folded stack accesses
> 13 regalloc - Number of identity moves eliminated after rewriting
> 20 regalloc - Number of instructions deleted by DCE
> 83 regalloc - Number of interferences evicted
> 174 regalloc - Number of new live ranges queued
> 220 regalloc - Number of registers assigned
> 98 regalloc - Number of registers unassigned
> 3 regalloc - Number of rematerialized defs for spilling
> 7 regalloc - Number of single use loads folded after DCE
> 9 regalloc - Number of spilled live ranges
> 4 regalloc - Number of spilled snippets
> 26 regalloc - Number of split local live ranges
> 26 regalloc - Number of splits finished
> 26 regalloc - Number of splits that were simple
>
> Experiment Four:
>
> 21 regalloc - Number of folded loads
> 21 regalloc - Number of folded stack accesses
> 2 regalloc - Number of identity moves eliminated after rewriting
> 7 regalloc - Number of instructions deleted by DCE
> 42 regalloc - Number of interferences evicted
> 49 regalloc - Number of new live ranges queued
> 148 regalloc - Number of registers assigned
> 42 regalloc - Number of registers unassigned
> 7 regalloc - Number of spilled live ranges
>
> --
> Robert Lougher
> SN Systems - Sony Computer Entertainment Group
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev