<div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div>This has cropped up before in X86 (<a href="https://bugs.llvm.org/show_bug.cgi?id=26810">https://bugs.llvm.org/show_bug.cgi?id=26810</a> / <a href="https://reviews.llvm.org/rL316295">https://reviews.llvm.org/rL316295</a>), and there's at least a partial mitigation </div><div>(I recently ran into an eviction change on X86 when trying variants of a MachineScheduler change, but couldn't find a reproduction after the landed patch).<br></div><div><br></div><div>I suggest you try enabling enableAdvancedRASplitCost() for ARM and seeing if that helps.</div><div><br></div><div>-Nirav</div><div dir="ltr"><br></div><div dir="ltr"><br></div></div></div></div></div></div><br><div class="gmail_quote"><div dir="ltr">On Wed, Dec 5, 2018 at 10:46 AM John Brawn via llvm-dev &lt;<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Preamble<br>
--------<br>
<br>
While working on an IR-level optimisation completely unrelated to register<br>
allocation I happened to trigger some really strange register allocator<br>
behaviour, causing a large regression in bzip2 in spec2006. I've been trying<br>
to fix that regression before getting the optimisation patch committed, because<br>
I don't want to regress spec2006. However, I'm basically fumbling in the dark, as<br>
I don't yet know how or why the register allocator is making the decisions it<br>
does, so I thought I'd send an email to see if anyone has any advice.<br>
<br>
<br>
The problem<br>
-----------<br>
<br>
Attached are (zipped, as llvm-dev has a 100kb message limit):<br>
* bzip2_regression.ll (reduced from bzip2 in spec2006 after being compiled with<br>
some patches that I'm working on) which demonstrates the problem.<br>
* 0001-AArch64-JumpTableDest-scratch-register-isn-t-earlycl.patch which causes<br>
the problem.<br>
* without_patch_regalloc.txt, the regalloc debug log for llc -mcpu=cortex-a57<br>
bzip2_regression.ll without the patch applied.<br>
* with_patch_regalloc.txt, the same log but with the patch applied.<br>
Note that the patch is not correct, but it happens to be a useful way of<br>
provoking the problem.<br>
<br>
Without the patch, generating assembly with llc -mcpu=cortex-a57 gives code that<br>
looks fine, but with the patch we get this (which comes from the block<br>
bb.17.switchdest13):<br>
<br>
.LBB0_16:<br>
mov x29, x24<br>
mov w24, w20<br>
mov w20, w19<br>
mov w19, w7<br>
mov w7, w6<br>
mov w6, w5<br>
mov w5, w2<br>
mov x2, x18<br>
mov w18, w15<br>
orr w15, wzr, #0x1c<br>
str w15, [x8, #8]<br>
mov w0, wzr<br>
mov w15, w18<br>
mov x18, x2<br>
mov w2, w5<br>
mov w5, w6<br>
mov w6, w7<br>
mov w7, w19<br>
mov w19, w20<br>
mov w20, w24<br>
mov x24, x29<br>
b .LBB0_3<br>
<br>
It looks like the orr and str have barged in and said "we're using w15!" and all<br>
the rest of the registers have meekly moved out of the way and then moved back<br>
again at the end. If the orr and str had used w29 instead then none of this<br>
would have happened.<br>
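<br>
To put a number on that shuffle, here is a minimal Python sketch that replays the<br>
block on a toy register file. It assumes one value per register number (the w/x<br>
width difference is ignored) and that x29 really is free here; the function names<br>
are made up for illustration.<br>
<br>
```python
# Toy register file: one value per register number, ignoring w/x widths.
# CHAIN lists the registers in the order the block shuffles them.
CHAIN = [29, 24, 20, 19, 7, 6, 5, 2, 18, 15]


def with_copies(regs):
    """Replay .LBB0_16 as emitted: save chain, orr/str via w15, restore chain."""
    regs = dict(regs)
    moves = 0
    # mov x29, x24; mov w24, w20; ...; mov w18, w15
    for dst, src in zip(CHAIN, CHAIN[1:]):
        regs[dst] = regs[src]
        moves += 1
    regs[15] = 0x1C      # orr w15, wzr, #0x1c
    stored = regs[15]    # str w15, [x8, #8]
    regs[0] = 0          # mov w0, wzr
    # mov w15, w18; mov x18, x2; ...; mov x24, x29
    for dst, src in zip(CHAIN[::-1], CHAIN[-2::-1]):
        regs[dst] = regs[src]
        moves += 1
    return regs, stored, moves


def with_scratch(regs):
    """Hypothetical alternative: orr/str use register 29 directly, no copies."""
    regs = dict(regs)
    regs[29] = 0x1C      # orr w29, wzr, #0x1c
    stored = regs[29]    # str w29, [x8, #8]
    regs[0] = 0          # mov w0, wzr
    return regs, stored, 0


# Arbitrary distinct starting values so that any misplaced copy would show up.
initial = {n: 1000 + n for n in range(31)}
state_a, stored_a, moves_a = with_copies(initial)
state_b, stored_b, moves_b = with_scratch(initial)
```
<br>
In this toy model the emitted block spends 18 moves to end up in the same state,<br>
apart from x29, as the version that needs none.<br>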
<br>
What the patch does is make one of the input operands to the JumpTableDest32<br>
pseudo-instruction be not marked as earlyclobber, or in other words it means we<br>
have one extra register free compared to without the patch. You would expect<br>
more free registers to mean better register allocation, but in this case it<br>
appears they don't.<br>
<br>
Note: this problem can happen without the patch, but the test case is much much<br>
larger and manifested itself as -fno-omit-frame-pointer giving a better<br>
allocation than -fomit-frame-pointer. This patch was actually my first attempt<br>
at fixing this (as I'd noticed that we were unnecessarily keeping an extra<br>
register live across the JumpTableDest8).<br>
<br>
<br>
What's going on<br>
---------------<br>
<br>
What this block looks like after live range splitting has happened is:<br>
<br>
7352B bb.17.switchdest13:<br>
; predecessors: %bb.3<br>
successors: %bb.30(0x80000000); %bb.30(100.00%)<br>
<br>
7360B %390:gpr32 = COPY $wzr<br>
7364B %434:gpr64 = COPY %432:gpr64<br>
7368B %429:gpr32 = COPY %427:gpr32<br>
7376B %424:gpr32 = COPY %422:gpr32<br>
7384B %419:gpr32 = COPY %417:gpr32<br>
7392B %414:gpr32 = COPY %412:gpr32<br>
7400B %409:gpr32 = COPY %407:gpr32<br>
7408B %404:gpr32 = COPY %402:gpr32<br>
7416B %399:gpr64 = COPY %397:gpr64<br>
7424B %394:gpr32 = COPY %392:gpr32<br>
7528B %253:gpr32 = MOVi32imm 28<br>
7536B STRWui %253:gpr32, %182:gpr64common, 2 :: (store 4 into %ir.106, align 8)<br>
7752B %392:gpr32 = COPY %394:gpr32<br>
7756B %397:gpr64 = COPY %399:gpr64<br>
7764B %402:gpr32 = COPY %404:gpr32<br>
7768B %407:gpr32 = COPY %409:gpr32<br>
7776B %412:gpr32 = COPY %414:gpr32<br>
7780B %417:gpr32 = COPY %419:gpr32<br>
7788B %422:gpr32 = COPY %424:gpr32<br>
7792B %427:gpr32 = COPY %429:gpr32<br>
7800B %432:gpr64 = COPY %434:gpr64<br>
7808B %373:gpr64sp = IMPLICIT_DEF<br>
7816B %374:gpr64sp = IMPLICIT_DEF<br>
8048B B %bb.30<br>
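<br>
The save/restore pattern in this listing can be checked mechanically. The Python<br>
sketch below (the helper name round_trip_pairs is hypothetical, and the slot<br>
indexes are dropped from the input) scans the COPY lines of the block and reports<br>
every case where a value is copied into a temporary and later copied straight back.<br>
<br>
```python
import re

# COPY lines of bb.17.switchdest13 from the listing, slot indexes removed.
BLOCK = """\
%390:gpr32 = COPY $wzr
%434:gpr64 = COPY %432:gpr64
%429:gpr32 = COPY %427:gpr32
%424:gpr32 = COPY %422:gpr32
%419:gpr32 = COPY %417:gpr32
%414:gpr32 = COPY %412:gpr32
%409:gpr32 = COPY %407:gpr32
%404:gpr32 = COPY %402:gpr32
%399:gpr64 = COPY %397:gpr64
%394:gpr32 = COPY %392:gpr32
%392:gpr32 = COPY %394:gpr32
%397:gpr64 = COPY %399:gpr64
%402:gpr32 = COPY %404:gpr32
%407:gpr32 = COPY %409:gpr32
%412:gpr32 = COPY %414:gpr32
%417:gpr32 = COPY %419:gpr32
%422:gpr32 = COPY %424:gpr32
%427:gpr32 = COPY %429:gpr32
%432:gpr64 = COPY %434:gpr64
"""

# Matches copies between two virtual registers; the copy from $wzr is skipped.
VREG_COPY = re.compile(r"%(\d+):\w+ = COPY %(\d+):\w+")


def round_trip_pairs(text):
    """Return (restored, temporary) for each copy that undoes an earlier copy."""
    seen = set()
    pairs = []
    for match in VREG_COPY.finditer(text):
        dst, src = int(match.group(1)), int(match.group(2))
        if (src, dst) in seen:
            pairs.append((dst, src))
        seen.add((dst, src))
    return pairs


pairs = round_trip_pairs(BLOCK)
```
<br>
Every virtual-register copy in the block turns out to be half of such a pair, so<br>
all of them exist only to carry values across the MOVi32imm and STRWui.<br>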
<br>
Looking at the debug output of the register allocator, the sequence of events<br>
which kicks things off is:<br>
%223 assigned to w0<br>
%283 evicts %381 from w15<br>
%381 requeued for second round<br>
%253 assigned to w15<br>
%381 split for w15 in 4 bundles into %391-%395<br>
%391, %392, %395 are not local intervals<br>
%393 is the local interval for bb.11.switchdest09<br>
%394 is the local interval for bb.17.switchdest13<br>
%392 assigned to w15<br>
%391 evicts %376 from w18<br>
%394 assigned to w18<br>
%376 split into %396-%400<br>
and then %396 evicts something which is split into something which evicts<br>
something etc. until we're done.<br>
<br>
Looking at what happens when this patch isn't applied, the difference is:<br>
%223 cannot be assigned to w0, evicts %381 from w15<br>
%381 requeued for second round<br>
%283 assigned to w15<br>
%253 assigned to w15<br>
%381 split for w15 in 1 bundle into %391 and %392<br>
Neither is a local interval<br>
%391 evicts %380 from w2<br>
%392 assigned to w2<br>
<br>
So it looks like the difference is that with the patch we happen to split %381<br>
in a way that causes the split intervals to be allocated such that we get a pair<br>
of copies in bb.17.switchdest13, and this causes a cascade effect where we<br>
repeatedly do the same thing with a whole load of other registers.<br>
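<br>
As a rough illustration of how one eviction can turn into a cascade, here is a toy<br>
Python allocator. It is only a sketch and not RAGreedy: it assumes every interval<br>
interferes with every other, does no splitting, and uses made-up weights, interval<br>
names (A to D) and register names (r0 to r3).<br>
<br>
```python
from collections import deque


def allocate(order, weight, allowed):
    """Toy allocator: take a free register, else evict the lightest lighter holder."""
    owner = {}        # physical register -> interval currently holding it
    evictions = []    # (evictor, evictee, register) in the order they happen
    unassigned = []   # intervals that found neither a free register nor a victim
    queue = deque(order)
    while queue:
        cur = queue.popleft()
        free = [r for r in allowed[cur] if r not in owner]
        if free:
            owner[free[0]] = cur
            continue
        # Only strictly lighter holders may be evicted, so the total weight held
        # in registers grows with every eviction and the loop must terminate.
        victims = [r for r in allowed[cur] if weight[cur] > weight[owner[r]]]
        if not victims:
            unassigned.append(cur)
            continue
        reg = min(victims, key=lambda r: weight[owner[r]])
        evictee = owner[reg]
        evictions.append((cur, evictee, reg))
        owner[reg] = cur
        queue.append(evictee)   # the evicted interval is requeued
    return owner, evictions, unassigned


# Hypothetical intervals: each may only use a small window of registers.
weight = {"A": 5, "B": 4, "C": 3, "D": 2}
allowed = {
    "A": ["r0"],
    "B": ["r0", "r1"],
    "C": ["r1", "r2"],
    "D": ["r2", "r3"],
}
owner, evictions, unassigned = allocate(["B", "C", "D", "A"], weight, allowed)
```
<br>
In this example A arriving last pushes B, C and D each one register along, even<br>
though r3 was free the whole time.<br>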
<br>
<br>
Possible Solutions<br>
------------------<br>
<br>
So there are two ways I can think of to fix this:<br>
* Make %381 be split in the same way that it is without the patch, which I<br>
think means deciding that there's only 1 bundle for w15. Does anyone know<br>
where and how exactly these bundles are decided?<br>
* Try and change how evicted / split registers are allocated in some way.<br>
Things I've tried:<br>
* In RAGreedy::enqueue reduce the score of unspillable local intervals, and in<br>
RAGreedy::evictInterference put evicted registers into stage RS_Split<br>
immediately. This causes %381 to be split immediately instead of being<br>
requeued, and gives %391 a higher score than %253, so %391 is allocated<br>
first. This works, but ends up causing an extra spill.<br>
* In RAGreedy::splitAroundRegion put global intervals into stage RS_Split<br>
immediately. This makes the chain of evictions after %396 not happen, but<br>
that gives us one extra spill and we still get one pair of copies in<br>
bb.17.switchdest13.<br>
* In RAGreedy::evictInterference put evicted registers into a new RS_Evicted<br>
stage, which is like RS_Assign but can't evict anything. This seemed to give<br>
OK results but was a mess and I didn't understand what I was doing, so I<br>
threw it away.<br>
* Turn on the ConsiderLocalIntervalCost option, as it's supposed to help with<br>
eviction chains like this. Unfortunately it doesn't work as it's a non-local<br>
interval that's causing the eviction chain. I tried making it also handle<br>
non-local intervals, but couldn't figure out how to.<br>
* Turn on TRI->reverseLocalAssignment(). This seemed to work, but I'm not sure<br>
why, and going by its description it may not be the correct solution<br>
(it's described as being an option to reduce the time the register allocator<br>
takes, not to give better allocation). The benchmark results are also<br>
overall slightly worse.<br>
<br>
Any ideas on what the right approach to fixing this is?<br>
<br>
John<br>
<br>
</blockquote></div>