<div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div>This has cropped up before in X86 (<a href="https://bugs.llvm.org/show_bug.cgi?id=26810">https://bugs.llvm.org/show_bug.cgi?id=26810</a> / <a href="https://reviews.llvm.org/rL316295">https://reviews.llvm.org/rL316295</a>), and there's at least a partial mitigation </div><div>(I recently ran into an eviction chain on X86 when trying variants of a MachineScheduler change, but couldn't reproduce it after the patch landed).<br></div><div><br></div><div>I suggest you try enabling enableAdvancedRASplitCost() for ARM and seeing if that helps.</div><div><br></div><div>-Nirav</div><div dir="ltr"><br></div><div dir="ltr"><br></div></div></div></div></div></div><br><div class="gmail_quote"><div dir="ltr">On Wed, Dec 5, 2018 at 10:46 AM John Brawn via llvm-dev &lt;<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Preamble<br>

--------<br>

<br>

While working on an IR-level optimisation completely unrelated to register<br>

allocation I happened to trigger some really strange register allocator<br>

behaviour causing a large regression in bzip2 in spec2006. I've been trying<br>

to fix that regression before getting the optimisation patch committed, because<br>

I don't want to regress spec2006, but I'm basically fumbling in the dark because<br>

I don't yet know how or why the register allocator is making the decisions it<br>

does and I thought I'd send an email to see if anyone has any advice.<br>

<br>

<br>

The problem<br>

-----------<br>

<br>

Attached are (zipped, as llvm-dev has a 100kb message limit):<br>

 * bzip2_regression.ll (reduced from bzip2 in spec2006 after being compiled with<br>

   some patches that I'm working on) which demonstrates the problem.<br>

 * 0001-AArch64-JumpTableDest-scratch-register-isn-t-earlycl.patch which causes<br>

   the problem.<br>

 * without_patch_regalloc.txt, the regalloc debug log for llc -mcpu=cortex-a57<br>

   bzip2_regression.ll without the patch applied.<br>

 * with_patch_regalloc.txt, the same log but with the patch applied.<br>

Note that the patch is not correct, but it happens to be a useful way of<br>

provoking the problem.<br>

<br>

Without the patch, the assembly generated with llc -mcpu=cortex-a57 looks<br>

fine, but with the patch we get this (which comes from the block<br>

bb.17.switchdest13):<br>

<br>

.LBB0_16:<br>

        mov     x29, x24<br>

        mov     w24, w20<br>

        mov     w20, w19<br>

        mov     w19, w7<br>

        mov     w7, w6<br>

        mov     w6, w5<br>

        mov     w5, w2<br>

        mov     x2, x18<br>

        mov     w18, w15<br>

        orr     w15, wzr, #0x1c<br>

        str     w15, [x8, #8]<br>

        mov     w0, wzr<br>

        mov     w15, w18<br>

        mov     x18, x2<br>

        mov     w2, w5<br>

        mov     w5, w6<br>

        mov     w6, w7<br>

        mov     w7, w19<br>

        mov     w19, w20<br>

        mov     w20, w24<br>

        mov     x24, x29<br>

        b       .LBB0_3<br>

<br>

It looks like the orr and str have barged in and said "we're using w15!" and all<br>

the rest of the registers have meekly moved out of the way and then moved back<br>

again at the end. If the orr and str had used w29 instead then none of this<br>

would have happened.<br>
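<br>
To make the "moved out of the way and then moved back" point concrete, here is a<br>
minimal sketch that replays the block on a toy register file. It is only an<br>
illustration: registers are tracked by number, the w/x width distinction is<br>
ignored (so it assumes every value fits in 32 bits), and the starting values<br>
are made up.<br>

```python
# Toy model of .LBB0_16. Registers are tracked by number only; the w/x width
# distinction is ignored, which assumes every value fits in 32 bits.
SHUFFLE_OUT = [(29, 24), (24, 20), (20, 19), (19, 7), (7, 6),
               (6, 5), (5, 2), (2, 18), (18, 15)]  # (dst, src) pairs
# The second half of the block undoes the moves in reverse order.
SHUFFLE_BACK = [(src, dst) for (dst, src) in reversed(SHUFFLE_OUT)]


def run_block(regs):
    """Execute the block on a {register number: value} dict."""
    regs = dict(regs)
    for dst, src in SHUFFLE_OUT:
        regs[dst] = regs[src]
    regs[15] = 0x1C          # orr w15, wzr, #0x1c
    stored = regs[15]        # str w15, [x8, #8] only reads w15
    regs[0] = 0              # mov w0, wzr
    for dst, src in SHUFFLE_BACK:
        regs[dst] = regs[src]
    return regs, stored


def changed_registers(before, after):
    """List the registers whose value differs between the two states."""
    return sorted(r for r in before if before[r] != after[r])


# Arbitrary distinct starting values (hypothetical, for illustration only).
initial = {r: 1000 + r for r in range(31)}
final, stored = run_block(initial)
print(changed_registers(initial, final), stored)
```

Under these assumptions the only registers that end up changed are 0 (set to<br>
zero on purpose) and 29 (left holding the old value of x24), so the 18 moves<br>
exist purely to free up w15 for the constant.<br>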

<br>

What the patch does is make one of the input operands to the JumpTableDest32<br>

pseudo-instruction be not marked as earlyclobber, or in other words it means we<br>

have one extra register free compared to without the patch. And you would<br>

expect that more free registers = better register allocation, but in this case<br>

that appears not to hold.<br>

<br>

Note: this problem can happen without the patch, but the test case is much<br>

larger, and there it manifested itself as -fno-omit-frame-pointer giving a better<br>

allocation than -fomit-frame-pointer. This patch was actually my first attempt<br>

at fixing this (as I'd noticed that we were unnecessarily keeping an extra<br>

register live across the JumpTableDest8).<br>

<br>

<br>

What's going on<br>

---------------<br>

<br>

What this block looks like after live range splitting has happened is:<br>

<br>

  7352B bb.17.switchdest13:<br>

        ; predecessors: %bb.3<br>

          successors: %bb.30(0x80000000); %bb.30(100.00%)<br>

<br>

  7360B   %390:gpr32 = COPY $wzr<br>

  7364B   %434:gpr64 = COPY %432:gpr64<br>

  7368B   %429:gpr32 = COPY %427:gpr32<br>

  7376B   %424:gpr32 = COPY %422:gpr32<br>

  7384B   %419:gpr32 = COPY %417:gpr32<br>

  7392B   %414:gpr32 = COPY %412:gpr32<br>

  7400B   %409:gpr32 = COPY %407:gpr32<br>

  7408B   %404:gpr32 = COPY %402:gpr32<br>

  7416B   %399:gpr64 = COPY %397:gpr64<br>

  7424B   %394:gpr32 = COPY %392:gpr32<br>

  7528B   %253:gpr32 = MOVi32imm 28<br>

  7536B   STRWui %253:gpr32, %182:gpr64common, 2 :: (store 4 into %ir.106, align 8)<br>

  7752B   %392:gpr32 = COPY %394:gpr32<br>

  7756B   %397:gpr64 = COPY %399:gpr64<br>

  7764B   %402:gpr32 = COPY %404:gpr32<br>

  7768B   %407:gpr32 = COPY %409:gpr32<br>

  7776B   %412:gpr32 = COPY %414:gpr32<br>

  7780B   %417:gpr32 = COPY %419:gpr32<br>

  7788B   %422:gpr32 = COPY %424:gpr32<br>

  7792B   %427:gpr32 = COPY %429:gpr32<br>

  7800B   %432:gpr64 = COPY %434:gpr64<br>

  7808B   %373:gpr64sp = IMPLICIT_DEF<br>

  7816B   %374:gpr64sp = IMPLICIT_DEF<br>

  8048B   B %bb.30<br>
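<br>
A rough way to spot this pattern mechanically is to look for virtual registers<br>
that are copied to a new vreg and later copied straight back. The sketch below<br>
does that over the block text above; the regular expression and the helper name<br>
round_trip_copies are my own and only cover the "%a:class = COPY %b:class" form<br>
seen in this dump, not MIR in general.<br>

```python
import re

# The block as printed above (slot indexes kept, terminator lines omitted).
MIR_BLOCK = """\
7360B   %390:gpr32 = COPY $wzr
7364B   %434:gpr64 = COPY %432:gpr64
7368B   %429:gpr32 = COPY %427:gpr32
7376B   %424:gpr32 = COPY %422:gpr32
7384B   %419:gpr32 = COPY %417:gpr32
7392B   %414:gpr32 = COPY %412:gpr32
7400B   %409:gpr32 = COPY %407:gpr32
7408B   %404:gpr32 = COPY %402:gpr32
7416B   %399:gpr64 = COPY %397:gpr64
7424B   %394:gpr32 = COPY %392:gpr32
7528B   %253:gpr32 = MOVi32imm 28
7536B   STRWui %253:gpr32, %182:gpr64common, 2
7752B   %392:gpr32 = COPY %394:gpr32
7756B   %397:gpr64 = COPY %399:gpr64
7764B   %402:gpr32 = COPY %404:gpr32
7768B   %407:gpr32 = COPY %409:gpr32
7776B   %412:gpr32 = COPY %414:gpr32
7780B   %417:gpr32 = COPY %419:gpr32
7788B   %422:gpr32 = COPY %424:gpr32
7792B   %427:gpr32 = COPY %429:gpr32
7800B   %432:gpr64 = COPY %434:gpr64
"""

# Matches vreg-to-vreg copies only; copies from physical registers are skipped.
COPY_RE = re.compile(r"(%\d+):\w+ = COPY (%\d+):\w+")


def round_trip_copies(mir_text):
    """Return (saved, original) vreg pairs that are copied out and back."""
    pending = {}  # saved vreg -> the vreg it was copied from
    pairs = []
    for line in mir_text.splitlines():
        match = COPY_RE.search(line)
        if not match:
            continue
        dst, src = match.group(1), match.group(2)
        if pending.get(src) == dst:
            # src was defined as a copy of dst, and is now copied back into it.
            pairs.append((src, dst))
            del pending[src]
        else:
            pending[dst] = src
    return pairs


pairs = round_trip_copies(MIR_BLOCK)
print(len(pairs), pairs[0])
```

On this block it reports nine round trips, one for each value that was split<br>
around the MOVi32imm / STRWui pair.<br>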

<br>

Looking at the debug output of the register allocator, the sequence of events<br>

which kicks things off is:<br>

 %223 assigned to w0<br>

 %283 evicts %381 from w15<br>

 %381 requeued for second round<br>

 %253 assigned to w15<br>

 %381 split for w15 in 4 bundles into %391-%395<br>

  %391, %392, %395 are not local intervals<br>

  %393 is the local interval for bb.11.switchdest09<br>

  %394 is the local interval for bb.17.switchdest13<br>

 %392 assigned to w15<br>

 %391 evicts %376 from w18<br>

 %394 assigned to w18<br>

 %376 split into %396-%400<br>

and then %396 evicts something which is split into something which evicts<br>

something etc. until we're done.<br>

<br>

Looking at what happens when this patch isn't applied the difference is:<br>

 %223 cannot be assigned to w0, evicts %381 from w15<br>

 %381 requeued for second round<br>

 %283 assigned to w15<br>

 %253 assigned to w15<br>

 %381 split for w15 in 1 bundle into %391 and %392<br>

  Neither is a local interval<br>

 %391 evicts %380 from w2<br>

 %392 assigned to w2<br>

<br>

So it looks like the difference is that with the patch we happen to split %381<br>

in a way that causes the split intervals to be allocated such that we get a pair<br>

of copies in bb.17.switchdest13, and this causes a cascade effect where we<br>

repeatedly do the same thing with a whole load of other registers.<br>

<br>

<br>

Possible Solutions<br>

------------------<br>

<br>

So there are two ways I can think of to fix this:<br>

 * Make %381 be split in the same way that it is without the patch, which I<br>

   think means deciding that there's only 1 bundle for w15. Does anyone know<br>

   where and how exactly these bundles are decided?<br>

 * Try and change how evicted / split registers are allocated in some way.<br>

   Things I've tried:<br>

  * In RAGreedy::enqueue reduce the score of unspillable local intervals, and in<br>

    RAGreedy::evictInterference put evicted registers into stage RS_Split<br>

    immediately. This causes %381 to be split immediately instead of being<br>

    requeued, and then makes %391 have a higher score than %253 causing it to<br>

    be allocated before it. This works, but ends up causing an extra spill.<br>

  * In RAGreedy::splitAroundRegion put global intervals into stage RS_Split<br>

    immediately. This makes the chain of evictions after %396 not happen, but<br>

    that gives us one extra spill and we still get one pair of copies in<br>

    bb.17.switchdest13.<br>

  * In RAGreedy::evictInterference put evicted registers into a new RS_Evicted<br>

    stage, which is like RS_Assign but can't evict anything. This seemed to give<br>

    OK results but was a mess and I didn't understand what I was doing, so I<br>

    threw it away.<br>

  * Turn on the ConsiderLocalIntervalCost option, as it's supposed to help with<br>

    eviction chains like this. Unfortunately it doesn't work as it's a non-local<br>

    interval that's causing the eviction chain. I tried making it also handle<br>

    non-local intervals, but couldn't figure out how to.<br>

  * Turn on TRI->reverseLocalAssignment(). This seemed to work, but I'm not sure<br>

    why, and going by its description it may not be the correct solution<br>

    (it's described as being an option to reduce the time the register allocator<br>

    takes, not to give better allocation). The benchmark results are also<br>

    overall slightly worse.<br>

<br>

Any ideas on what the right approach to fixing this is?<br>

<br>

John<br>

<br>


</blockquote></div>