[llvm-dev] Strange regalloc behaviour: one more available register causes much worse allocation
Nirav Davé via llvm-dev
llvm-dev at lists.llvm.org
Wed Dec 5 09:13:57 PST 2018
This has cropped up before on X86 (
https://bugs.llvm.org/show_bug.cgi?id=26810 /
https://reviews.llvm.org/rL316295), and there's at least a partial
mitigation there. (I recently ran into an eviction change on X86 when
trying variants of a MachineScheduler change, but couldn't reproduce it
after the mitigation patch landed.)
I suggest you try enabling enableAdvancedRASplitCost() for ARM and seeing
if that helps.
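
For what it's worth, the hook lives on TargetSubtargetInfo and X86 already
overrides it, so trying it on the ARM side should just be a one-line
override along these lines (a sketch by analogy with the X86 change, not a
tested patch):

  // In the target's Subtarget class (sketch): opt in to RAGreedy's more
  // expensive split-cost calculation, which accounts for the extra copies
  // that broken eviction chains introduce when comparing split candidates.
  // X86Subtarget already returns true here.
  bool enableAdvancedRASplitCost() const override { return true; }
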
-Nirav
On Wed, Dec 5, 2018 at 10:46 AM John Brawn via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> Preamble
> --------
>
> While working on an IR-level optimisation completely unrelated to register
> allocation, I happened to trigger some really strange register allocator
> behaviour, causing a large regression in bzip2 in spec2006. I've been
> trying to fix that regression before getting the optimisation patch
> committed, because I don't want to regress spec2006. However, I'm
> basically fumbling in the dark: I don't yet know how or why the register
> allocator is making the decisions it does, so I thought I'd send an email
> to see if anyone has any advice.
>
>
> The problem
> -----------
>
> Attached are (zipped, as llvm-dev has a 100kb message limit):
> * bzip2_regression.ll (reduced from bzip2 in spec2006 after being compiled
>   with some patches that I'm working on), which demonstrates the problem.
> * 0001-AArch64-JumpTableDest-scratch-register-isn-t-earlycl.patch, which
>   causes the problem.
> * without_patch_regalloc.txt, the regalloc debug log for
>   llc -mcpu=cortex-a57 bzip2_regression.ll without the patch applied.
> * with_patch_regalloc.txt, the same log but with the patch applied.
> Note that the patch is not correct, but it happens to be a useful way of
> provoking the problem.
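>
> For reference, an equivalent log can be produced with an assertions-enabled
> build of llc, along these lines (-debug-only is only available with
> asserts, and the redirection is just one way of capturing the debug
> output, which goes to stderr):
>
>   llc -mcpu=cortex-a57 -debug-only=regalloc bzip2_regression.ll \
>     -o /dev/null 2> regalloc.txt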
>
> Without the patch, the assembly generated with llc -mcpu=cortex-a57 looks
> fine, but with the patch we get this (which comes from the block
> bb.17.switchdest13):
>
> .LBB0_16:
> mov x29, x24
> mov w24, w20
> mov w20, w19
> mov w19, w7
> mov w7, w6
> mov w6, w5
> mov w5, w2
> mov x2, x18
> mov w18, w15
> orr w15, wzr, #0x1c
> str w15, [x8, #8]
> mov w0, wzr
> mov w15, w18
> mov x18, x2
> mov w2, w5
> mov w5, w6
> mov w6, w7
> mov w7, w19
> mov w19, w20
> mov w20, w24
> mov x24, x29
> b .LBB0_3
>
> It looks like the orr and str have barged in and said "we're using w15!",
> and all the rest of the registers have meekly moved out of the way and
> then moved back again at the end. If the orr and str had used w29 instead
> then none of this would have happened.
>
> What the patch does is make one of the input operands to the
> JumpTableDest32 pseudo-instruction no longer be marked as earlyclobber;
> in other words, it means we have one extra register free compared to
> without the patch. You would expect that more free registers means better
> register allocation, but in this case it appears that it doesn't.
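>
> As an aside, here is why that flag matters to the allocator (a
> self-contained sketch against the MachineInstr API, for illustration only;
> it is not part of the patch): a register operand marked earlyclobber has
> to be kept disjoint from the registers of the instruction's other
> operands, so each such operand costs one extra register across the
> instruction.
>
>   #include "llvm/CodeGen/MachineInstr.h"
>   #include "llvm/CodeGen/MachineOperand.h"
>
>   // Count the register operands marked earlyclobber. Clearing the flag
>   // on one operand, which is what the patch effectively does, reduces
>   // this count and so frees one register across the pseudo-instruction.
>   static unsigned countEarlyClobberOperands(const llvm::MachineInstr &MI) {
>     unsigned N = 0;
>     for (const llvm::MachineOperand &MO : MI.operands())
>       if (MO.isReg() && MO.isEarlyClobber())
>         ++N;
>     return N;
>   }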
>
> Note: this problem can happen without the patch, but the test case is
> much, much larger and manifested itself as -fno-omit-frame-pointer giving
> a better allocation than -fomit-frame-pointer. This patch was actually my
> first attempt at fixing this (as I'd noticed that we were unnecessarily
> keeping an extra register live across the JumpTableDest8).
>
>
> What's going on
> ---------------
>
> What this block looks like after live range splitting has happened is:
>
> 7352B bb.17.switchdest13:
> ; predecessors: %bb.3
> successors: %bb.30(0x80000000); %bb.30(100.00%)
>
> 7360B %390:gpr32 = COPY $wzr
> 7364B %434:gpr64 = COPY %432:gpr64
> 7368B %429:gpr32 = COPY %427:gpr32
> 7376B %424:gpr32 = COPY %422:gpr32
> 7384B %419:gpr32 = COPY %417:gpr32
> 7392B %414:gpr32 = COPY %412:gpr32
> 7400B %409:gpr32 = COPY %407:gpr32
> 7408B %404:gpr32 = COPY %402:gpr32
> 7416B %399:gpr64 = COPY %397:gpr64
> 7424B %394:gpr32 = COPY %392:gpr32
> 7528B %253:gpr32 = MOVi32imm 28
> 7536B STRWui %253:gpr32, %182:gpr64common, 2 :: (store 4 into %ir.106, align 8)
> 7752B %392:gpr32 = COPY %394:gpr32
> 7756B %397:gpr64 = COPY %399:gpr64
> 7764B %402:gpr32 = COPY %404:gpr32
> 7768B %407:gpr32 = COPY %409:gpr32
> 7776B %412:gpr32 = COPY %414:gpr32
> 7780B %417:gpr32 = COPY %419:gpr32
> 7788B %422:gpr32 = COPY %424:gpr32
> 7792B %427:gpr32 = COPY %429:gpr32
> 7800B %432:gpr64 = COPY %434:gpr64
> 7808B %373:gpr64sp = IMPLICIT_DEF
> 7816B %374:gpr64sp = IMPLICIT_DEF
> 8048B B %bb.30
>
> Looking at the debug output of the register allocator, the sequence of
> events which kicks things off is:
> %223 assigned to w0
> %283 evicts %381 from w15
> %381 requeued for second round
> %253 assigned to w15
> %381 split for w15 in 4 bundles into %391-%395
> %391, %392, %395 are not local intervals
> %393 is the local interval for bb.11.switchdest09
> %394 is the local interval for bb.17.switchdest13
> %392 assigned to w15
> %391 evicts %376 from w18
> %394 assigned to w18
> %376 split into %396-%400
> and then %396 evicts something, which is split into something which
> evicts something else, and so on until we're done.
>
> Looking at what happens when the patch isn't applied, the difference is:
> %223 cannot be assigned to w0, evicts %381 from w15
> %381 requeued for second round
> %283 assigned to w15
> %253 assigned to w15
> %381 split for w15 in 1 bundle into %391 and %392
> Neither is a local interval
> %391 evicts %380 from w2
> %392 assigned to w2
>
> So it looks like the difference is that, with the patch, we happen to
> split %381 in a way that causes the split intervals to be allocated such
> that we get a pair of copies in bb.17.switchdest13, and this causes a
> cascade effect where we repeatedly do the same thing with a whole load of
> other registers.
>
>
> Possible Solutions
> ------------------
>
> So there are two ways I can think of to fix this:
> * Make %381 be split in the same way that it is without the patch, which
>   I think means deciding that there's only 1 bundle for w15. Does anyone
>   know where and how exactly these bundles are decided?
> * Try to change how evicted / split registers are allocated in some way.
>   Things I've tried:
>   * In RAGreedy::enqueue, reduce the score of unspillable local intervals,
>     and in RAGreedy::evictInterference put evicted registers into stage
>     RS_Split immediately. This causes %381 to be split immediately instead
>     of being requeued, and then makes %391 have a higher score than %253,
>     causing it to be allocated before it. This works, but ends up causing
>     an extra spill.
>   * In RAGreedy::splitAroundRegion, put global intervals into stage
>     RS_Split immediately. This makes the chain of evictions after %396
>     not happen, but that gives us one extra spill and we still get one
>     pair of copies in bb.17.switchdest13.
>   * In RAGreedy::evictInterference, put evicted registers into a new
>     RS_Evicted stage, which is like RS_Assign but can't evict anything.
>     This seemed to give OK results, but it was a mess and I didn't
>     understand what I was doing, so I threw it away.
>   * Turn on the ConsiderLocalIntervalCost option, as it's supposed to
>     help with eviction chains like this. Unfortunately it doesn't work
>     here, as it's a non-local interval that's causing the eviction chain.
>     I tried making it also handle non-local intervals, but couldn't
>     figure out how.
>   * Turn on TRI->reverseLocalAssignment() (see the sketch after this
>     list). This seemed to work, but I'm not sure why, and reading its
>     description it may not be the correct solution (it's described as an
>     option to reduce the time the register allocator takes, not to give
>     better allocation). The benchmark results are also overall slightly
>     worse.
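>
> For that last experiment, the hook already exists on TargetRegisterInfo
> (defaulting to false), so the whole change is a one-line target override
> along these lines (a minimal sketch; that AArch64RegisterInfo is the
> right place for it is an assumption):
>
>   // In AArch64RegisterInfo (sketch): allocate local live ranges so that
>   // shorter ones are generally assigned first. The hook is documented as
>   // a compile-time optimisation for register-rich targets rather than a
>   // code-quality knob, which is why I'm not sure it's the right fix.
>   bool reverseLocalAssignment() const override { return true; }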
>
> Any ideas on what the right approach to fixing this is?
>
> John
>