[LLVMdev] Poor register allocation (constants causing spilling)

Tue Jul 14 10:19:23 PDT 2015

Hi,

While investigating a performance issue with an internal codebase I
came across what looks to be poor register allocation.  I have
constructed a small(ish) reproducible which demonstrates the issue
(see test.ll attached).

I have spent some time going through the register allocator to
understand what is happening.  I have also experimented with some
small changes to try and improve the allocation.  Unfortunately, the
full report is fairly long and detailed.  However, in short, I found
that not splitting rematerializable live-ranges lead to significantly
better register allocation, and an overall performance improvement of
3%.

*** The Problem

Compile the attached testcase as follows:

llc -mcpu=btver2 test.ll

Examining the assembly in test.s we can see a constant is being loaded
into %xmm8 (second instruction in foo).  Tracing the constant we can
see the following:

foo:
    ...
    vmovaps    .LCPI0_0(%rip), %xmm8 # xmm8 = [6.366197e-01,6.366197e-01,...]
    ...
    vmulps     %xmm8, %xmm0, %xmm1   # first use of constant
    vmovaps    %xmm8, %xmm9          # move constant into another register
    ...
    vmovaps    %xmm0, -40(%rsp)      # 16-byte Spill
    vmovaps    %xmm9, %xmm0          # move constant into vacated register
    ...
    vmulps     %xmm0, %xmm3, %xmm2   # second use of constant
    ...
    vmulps     %xmm0, %xmm5, %xmm3   # last use of constant
    ...

The constant is used 3 times within a live range of 86 instructions.
Although the constant is clearly rematerializable, the allocator has
gone to some length to keep it in a register, and it has spilled a
value to the stack.  It would have been cheaper to simply fold the
constant load into the 3 uses.

This is not the only example.  Later on we can see this:

    vmovaps .LCPI0_1(%rip), %xmm6   # xmm6 = [2147483648,2147483648,...]
    vxorps  %xmm6, %xmm2, %xmm3
    ...
    vandps  %xmm6, %xmm5, %xmm2
    ...
    vmovaps %xmm1, -56(%rsp)        # 16-byte Spill
    vmovaps %xmm6, %xmm1
    ...
    vmovaps -56(%rsp), %xmm0        # 16-byte Reload
    ...
    vxorps  %xmm1, %xmm3, %xmm4
    ...

Here, we have a spill and reload to keep the constant in a register
for a single use!

*** Live-Range Splitting

Looking at the first constant, we can see that the copies are inserted
by the greedy register allocator when it splits the live range.  The
main entry point is selectOrSplit.  In brief:

1) The constant is assigned and evicted several times.  Eventually
selectOrSplit decides to split the live-range.  This splits the
interval in two, creating two new intervals (A and B).  At the
boundary of the intervals a copy is inserted that copies between the
virtual registers.

2) The virtual register for the first interval (A) is assigned to XMM8.

3) The virtual register for the second interval (B) is assigned and
evicted several times before selectOrSplit decides to split again (C
and D).

4) The virtual register for interval C is assigned to XMM9.

5) The virtual register for interval D is assigned to XMM0.

At the boundary between A and C we have a copy from XMM8 to XMM9, and
at the boundary between C and D we have a copy from XMM9 to XMM0.

The spill is the result of several evictions.  To assign interval D to
XMM0 another virtual register was evicted.  This virtual register had
previously caused an eviction, which was subsequently spilled.

*** Spill Weights

The main question is why does the register allocator decide to split
the constant live range and evict other non-rematerializable
registers?  This comes down to the calculation of the spill weight
which is used to determine whether one interval should evict another.
If the live interval is rematerializable the spill weight is halved
(multiplied by 0.5).  From what I can see, this calculation is the
only place in the greedy register allocator (as opposed to the
spiller) where rematerializability is considered.

So rematerializable intervals should be less important than
non-rematerializable intervals, assuming similar number of uses, size
of range, etc.  However, there appears to be a flaw:

* A rematerializable interval once split is no longer rematerializable *

The isRematerializable check in CalcSpillWeights.cpp uses the target
instruction info to check that the machine instruction for the live
interval definition is trivially rematerializable.  In the case of
interval A above, the definition is a load from the constant-pool.
This is trivially rematerializable and the spill weight is halved.
However, in the remainder intervals B, C and D the definition is a
copy.  This is not trivially rematerializable, and the spill weight is
unadjusted.  This means these ranges are considered as expensive as
non-rematerializable ranges.

*** Experiment One

In constrast to the spill weight calculation, the inline spiller can
trace rematerializability back through the copies resulting from live
range splitting (what it calls sibling values).  The first experiment
modified the greedy register allocator so that it would not split a
rematerializable interval more than once (i.e. instead of splitting
interval B into C and D it is spilled).  The purpose of this
experiment was to confirm that the inline spiller was able to follow
the copy and rematerialize the load from the constant-pool.  In this
case, the load was folded into the two remaining uses (in fact, the
use in interval A was also folded as it was the single remaining use).

*** Experiment Two

This experiment modified the greedy register allocator so that the
spill weights for split rematerializable ranges were adjusted
correctly (a hack rather than a proper fix).  The hope was that with
the "correct" spill weight the relevant priorities of the live ranges
would result in better allocation (see end for full statistics).

Unmodified:

Spilled 3 values to the stack
124 instructions in foo

Modified:

Spilled 1 value to the stack
113 instructions in foo

Unfortunately, the performance showed only minimal improvement.

*** Experiment Three

The code gives no explanation for the spill weight adjustment value of
0.5.  This experiment was designed to determine what adjustment factor
was needed to obtain no spills.  This point was reached with an
adjustment factor for rematerializable intervals of 0.3.

Modified:

No spills to the stack
111 instructions in foo

Again, the performance showed only minimal improvement.

*** Experiment Four

In the previous experiments splitting was still occurring leading to
register copies.  In this final experiment the register allocator was
modified so that rematerializable ranges are not split at all, but are
spilled if they cannot be assigned (obviously, the spiller will
rematerialize the values).

Modified:

No spills to the stack
106 instructions in foo

This version contains considerably more folded loads than the previous
experiments, and shows a 3% speed improvement.

*** Conclusion

I had hoped to get to this point by fixing the spill weight
calculation rather than having to make changes to the register
allocator logic as this seems a more intrusive change.  However,
results from SPEC CPU2006 shows the performance difference of most
tests to be in the noise (<1%).  Depending on architecture (btver2 or
Haswell), libquantum and mcf shows minor improvement.  The only
consistent slow-down (both btver2 and Haswell) is lbm, which shows a
decrease of 1-2%.  Here, I can see that not spilling rematerializable
ranges produces an extra spill in the hot loop (5 rather than 4).
However, the greedy register allocator allocates half the number of
spill slots (5 rather than 10) but when splitting the spiller appears
to be able to remove spills as redundant.

Obviously much more work needs to be done before a patch based on
these experimental results can be submitted.  For example, I have done
no testing on non-X86 targets.  However, at this point I would welcome
opinions and suggestions!

Thanks,
Rob.

*** Register Allocator Statistics

Unmodified:

  32 regalloc      - Number of copies inserted for splitting
   2 regalloc      - Number of folded loads
   3 regalloc      - Number of folded stack accesses
  19 regalloc      - Number of identity moves eliminated after rewriting
  18 regalloc      - Number of instructions deleted by DCE
 158 regalloc      - Number of interferences evicted
 280 regalloc      - Number of new live ranges queued
 311 regalloc      - Number of registers assigned
 173 regalloc      - Number of registers unassigned
   2 regalloc      - Number of reloads inserted
   8 regalloc      - Number of rematerialized defs for spilling
   3 regalloc      - Number of rematerialized defs for splitting
   7 regalloc      - Number of single use loads folded after DCE
   3 regalloc      - Number of spill slots allocated
  13 regalloc      - Number of spilled live ranges
   3 regalloc      - Number of spills inserted
  31 regalloc      - Number of split local live ranges
  31 regalloc      - Number of splits finished
  29 regalloc      - Number of splits that were simple

Experiment Two:

  26 regalloc      - Number of copies inserted for splitting
   9 regalloc      - Number of folded loads
  10 regalloc      - Number of folded stack accesses
  12 regalloc      - Number of identity moves eliminated after rewriting
  21 regalloc      - Number of instructions deleted by DCE
  88 regalloc      - Number of interferences evicted
 181 regalloc      - Number of new live ranges queued
 224 regalloc      - Number of registers assigned
 102 regalloc      - Number of registers unassigned
   4 regalloc      - Number of rematerialized defs for spilling
   8 regalloc      - Number of single use loads folded after DCE
   1 regalloc      - Number of spill slots allocated
  10 regalloc      - Number of spilled live ranges
   4 regalloc      - Number of spilled snippets
   1 regalloc      - Number of spills inserted
  26 regalloc      - Number of split local live ranges
  26 regalloc      - Number of splits finished
  26 regalloc      - Number of splits that were simple

Experiment Three:

  26 regalloc      - Number of copies inserted for splitting
  10 regalloc      - Number of folded loads
  10 regalloc      - Number of folded stack accesses
  13 regalloc      - Number of identity moves eliminated after rewriting
  20 regalloc      - Number of instructions deleted by DCE
  83 regalloc      - Number of interferences evicted
 174 regalloc      - Number of new live ranges queued
 220 regalloc      - Number of registers assigned
  98 regalloc      - Number of registers unassigned
   3 regalloc      - Number of rematerialized defs for spilling
   7 regalloc      - Number of single use loads folded after DCE
   9 regalloc      - Number of spilled live ranges
   4 regalloc      - Number of spilled snippets
  26 regalloc      - Number of split local live ranges
  26 regalloc      - Number of splits finished
  26 regalloc      - Number of splits that were simple

Experiment Four:

  21 regalloc      - Number of folded loads
  21 regalloc      - Number of folded stack accesses
   2 regalloc      - Number of identity moves eliminated after rewriting
   7 regalloc      - Number of instructions deleted by DCE
  42 regalloc      - Number of interferences evicted
  49 regalloc      - Number of new live ranges queued
 148 regalloc      - Number of registers assigned
  42 regalloc      - Number of registers unassigned
   7 regalloc      - Number of spilled live ranges

--
Robert Lougher
SN Systems - Sony Computer Entertainment Group
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.ll
Type: application/octet-stream
Size: 10693 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150714/9759562e/attachment.obj>