<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">


<head>


<meta http-equiv="Content-Type" content="text/html; charset=utf-8">


<meta name="Generator" content="Microsoft Word 14 (filtered medium)">


<style><!--


/* Font Definitions */


@font-face


        {font-family:Calibri;


        panose-1:2 15 5 2 2 2 4 3 2 4;}


@font-face


        {font-family:Tahoma;


        panose-1:2 11 6 4 3 5 4 4 2 4;}


/* Style Definitions */


p.MsoNormal, li.MsoNormal, div.MsoNormal


        {margin:0cm;


        margin-bottom:.0001pt;


        font-size:12.0pt;


        font-family:"Times New Roman","serif";}


a:link, span.MsoHyperlink


        {mso-style-priority:99;


        color:blue;


        text-decoration:underline;}


a:visited, span.MsoHyperlinkFollowed


        {mso-style-priority:99;


        color:purple;


        text-decoration:underline;}


span.EmailStyle17


        {mso-style-type:personal-reply;


        font-family:"Calibri","sans-serif";


        color:#1F497D;}


.MsoChpDefault


        {mso-style-type:export-only;


        font-family:"Calibri","sans-serif";


        mso-fareast-language:EN-US;}


@page WordSection1


        {size:612.0pt 792.0pt;


        margin:72.0pt 72.0pt 72.0pt 72.0pt;}


div.WordSection1


        {page:WordSection1;}


--></style>


</head>


<body lang="EN-GB" link="blue" vlink="purple">


<div class="WordSection1">


<p class="MsoNormal"><span style="font-size:11.0pt;font-family:'Calibri','sans-serif';color:#1F497D">enableAdvancedRASplitCost() does the same thing as ConsiderLocalIntervalCost, but as a<o:p></o:p></span></p>


<p class="MsoNormal"><span style="font-size:11.0pt;font-family:'Calibri','sans-serif';color:#1F497D">subtarget option instead of a command-line option, and, as I said, it doesn’t help here because<o:p></o:p></span></p>


<p class="MsoNormal"><span style="font-size:11.0pt;font-family:'Calibri','sans-serif';color:#1F497D">the eviction chain is caused by a non-local interval (RAGreedy::splitCanCauseEvictionChain<o:p></o:p></span></p>


<p class="MsoNormal"><span style="font-size:11.0pt;font-family:'Calibri','sans-serif';color:#1F497D">only considers the local interval for a single block, and it’s unclear to me how to make it<o:p></o:p></span></p>


<p class="MsoNormal"><span style="font-size:11.0pt;font-family:'Calibri','sans-serif';color:#1F497D">handle a non-local interval).<o:p></o:p></span></p>


<p class="MsoNormal"><span style="font-size:11.0pt;font-family:'Calibri','sans-serif';color:#1F497D"><o:p>&nbsp;</o:p></span></p>


<p class="MsoNormal"><span style="font-size:11.0pt;font-family:'Calibri','sans-serif';color:#1F497D">John<o:p></o:p></span></p>


<p class="MsoNormal"><span style="font-size:11.0pt;font-family:'Calibri','sans-serif';color:#1F497D"><o:p>&nbsp;</o:p></span></p>


<div style="border:none;border-left:solid blue 1.5pt;padding:0cm 0cm 0cm 4.0pt">


<div>


<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0cm 0cm 0cm">


<p class="MsoNormal"><b><span lang="EN-US" style="font-size:10.0pt;font-family:'Tahoma','sans-serif'">From:</span></b><span lang="EN-US" style="font-size:10.0pt;font-family:'Tahoma','sans-serif'"> Nirav Davé [mailto:niravd@google.com]


<br>


<b>Sent:</b> 05 December 2018 17:14<br>


<b>To:</b> John Brawn<br>


<b>Cc:</b> llvm-dev; nd<br>


<b>Subject:</b> Re: [llvm-dev] Strange regalloc behaviour: one more available register causes much worse allocation<o:p></o:p></span></p>


</div>


</div>


<p class="MsoNormal"><o:p> </o:p></p>


<div>


<div>


<div>


<div>


<div>


<div>


<p class="MsoNormal">This has cropped up before in X86 (<a href="https://bugs.llvm.org/show_bug.cgi?id=26810">https://bugs.llvm.org/show_bug.cgi?id=26810</a> /


<a href="https://reviews.llvm.org/rL316295">https://reviews.llvm.org/rL316295</a>), and there's at least a partial mitigation <o:p></o:p></p>


</div>


<div>


<p class="MsoNormal">(I recently ran into an eviction chain on X86 when trying variants of a MachineScheduler change, but couldn't find a reproduction after the patch landed).<o:p></o:p></p>


</div>


<div>


<p class="MsoNormal"><o:p> </o:p></p>


</div>


<div>


<p class="MsoNormal">I suggest you try enabling enableAdvancedRASplitCost() for ARM and seeing if that helps.<o:p></o:p></p>


</div>


<div>


<p class="MsoNormal"><o:p> </o:p></p>


</div>


<div>


<p class="MsoNormal">-Nirav<o:p></o:p></p>


</div>


<div>


<p class="MsoNormal"><o:p> </o:p></p>


</div>


<div>


<p class="MsoNormal"><o:p> </o:p></p>


</div>


</div>


</div>


</div>


</div>


</div>


<p class="MsoNormal"><o:p> </o:p></p>


<div>


<div>


<p class="MsoNormal">On Wed, Dec 5, 2018 at 10:46 AM John Brawn via llvm-dev &lt;<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>&gt; wrote:<o:p></o:p></p>


</div>


<blockquote style="border:none;border-left:solid #CCCCCC 1.0pt;padding:0cm 0cm 0cm 6.0pt;margin-left:4.8pt;margin-right:0cm">


<p class="MsoNormal">Preamble<br>


--------<br>


<br>


While working on an IR-level optimisation completely unrelated to register<br>


allocation I happened to trigger some really strange register allocator<br>


behaviour causing a large regression in bzip2 in spec2006. I've been trying<br>


to fix that regression before getting the optimisation patch committed, because<br>


I don't want to regress spec2006, but I'm basically fumbling in the dark because<br>


I don't yet know how or why the register allocator is making the decisions it<br>


does and I thought I'd send an email to see if anyone has any advice.<br>


<br>


<br>


The problem<br>


-----------<br>


<br>


Attached are (zipped, as llvm-dev has a 100kb message limit):<br>


 * bzip2_regression.ll (reduced from bzip2 in spec2006 after being compiled with<br>


   some patches that I'm working on) which demonstrates the problem.<br>


 * 0001-AArch64-JumpTableDest-scratch-register-isn-t-earlycl.patch which causes<br>


   the problem.<br>


 * without_patch_regalloc.txt, the regalloc debug log for llc -mcpu=cortex-a57<br>


   bzip2_regression.ll without the patch applied.<br>


 * with_patch_regalloc.txt, the same log but with the patch applied.<br>


Note that the patch is not correct, but it happens to be a useful way of<br>


provoking the problem.<br>


<br>


Without the patch, the assembly generated by llc -mcpu=cortex-a57 looks<br>


fine, but with the patch we get this (which comes from the block<br>


bb.17.switchdest13):<br>


<br>


.LBB0_16:<br>


        mov     x29, x24<br>


        mov     w24, w20<br>


        mov     w20, w19<br>


        mov     w19, w7<br>


        mov     w7, w6<br>


        mov     w6, w5<br>


        mov     w5, w2<br>


        mov     x2, x18<br>


        mov     w18, w15<br>


        orr     w15, wzr, #0x1c<br>


        str     w15, [x8, #8]<br>


        mov     w0, wzr<br>


        mov     w15, w18<br>


        mov     x18, x2<br>


        mov     w2, w5<br>


        mov     w5, w6<br>


        mov     w6, w7<br>


        mov     w7, w19<br>


        mov     w19, w20<br>


        mov     w20, w24<br>


        mov     x24, x29<br>


        b       .LBB0_3<br>


<br>


It looks like the orr and str have barged in and said "we're using w15!" and all<br>


the rest of the registers have meekly moved out of the way and then moved back<br>


again at the end. If the orr and str had used w29 instead, then none of this<br>


would have happened.<br>
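<br>
As a sanity check on that reading, here is a minimal Python sketch that symbolically<br>
executes the block above. It assumes wN and xN can be treated as one register<br>
(ignoring the zero-extension that a wN write performs), and the names BLOCK, reg_num<br>
and net_effect are purely illustrative, not anything from LLVM.<br>

```python
import re

# The instructions of .LBB0_16, copied from the listing above.
BLOCK = """
mov x29, x24
mov w24, w20
mov w20, w19
mov w19, w7
mov w7, w6
mov w6, w5
mov w5, w2
mov x2, x18
mov w18, w15
orr w15, wzr, #0x1c
str w15, [x8, #8]
mov w0, wzr
mov w15, w18
mov x18, x2
mov w2, w5
mov w5, w6
mov w6, w7
mov w7, w19
mov w19, w20
mov w20, w24
mov x24, x29
b .LBB0_3
"""


def reg_num(name):
    # Assumption: w15 and x15 are the same register; width is ignored.
    return int(name[1:])


def net_effect(asm):
    # Every register starts out holding a symbolic value "vN".
    regs = {n: "v%d" % n for n in range(31)}
    for line in asm.strip().splitlines():
        parts = re.split(r"[\s,]+", line.strip())
        op = parts[0]
        if op == "mov":
            dst, src = parts[1], parts[2]
            regs[reg_num(dst)] = "0" if src == "wzr" else regs[reg_num(src)]
        elif op == "orr":
            # orr wN, wzr, #imm just materialises the immediate.
            regs[reg_num(parts[1])] = parts[3].lstrip("#")
        # str and b write no general-purpose register.
    # Report only the registers whose value differs from the start.
    return {n: v for n, v in regs.items() if v != "v%d" % n}


print(net_effect(BLOCK))
```

Under those assumptions the only registers that differ at the end of the block are<br>
x0 (zeroed) and x29 (left holding the old x24), so the 18 register-to-register moves<br>
exist purely to free up w15 for the constant.<br>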


<br>


What the patch does is make one of the input operands to the JumpTableDest32<br>


pseudo-instruction be not marked as earlyclobber, or in other words it means we<br>


have one extra register free compared to without the patch. You would<br>


expect more free registers to mean better register allocation, but in this case<br>


it appears to make the allocation worse.<br>


<br>


Note: this problem can happen without the patch, but that test case is much<br>


larger; there it manifested itself as -fno-omit-frame-pointer giving a better<br>


allocation than -fomit-frame-pointer. This patch was actually my first attempt<br>


at fixing this (as I'd noticed that we were unnecessarily keeping an extra<br>


register live across the JumpTableDest8).<br>


<br>


<br>


What's going on<br>


---------------<br>


<br>


What this block looks like after live range splitting has happened is:<br>


<br>


  7352B bb.17.switchdest13:<br>


        ; predecessors: %bb.3<br>


          successors: %bb.30(0x80000000); %bb.30(100.00%)<br>


<br>


  7360B   %390:gpr32 = COPY $wzr<br>


  7364B   %434:gpr64 = COPY %432:gpr64<br>


  7368B   %429:gpr32 = COPY %427:gpr32<br>


  7376B   %424:gpr32 = COPY %422:gpr32<br>


  7384B   %419:gpr32 = COPY %417:gpr32<br>


  7392B   %414:gpr32 = COPY %412:gpr32<br>


  7400B   %409:gpr32 = COPY %407:gpr32<br>


  7408B   %404:gpr32 = COPY %402:gpr32<br>


  7416B   %399:gpr64 = COPY %397:gpr64<br>


  7424B   %394:gpr32 = COPY %392:gpr32<br>


  7528B   %253:gpr32 = MOVi32imm 28<br>


  7536B   STRWui %253:gpr32, %182:gpr64common, 2 :: (store 4 into %ir.106, align 8)<br>


  7752B   %392:gpr32 = COPY %394:gpr32<br>


  7756B   %397:gpr64 = COPY %399:gpr64<br>


  7764B   %402:gpr32 = COPY %404:gpr32<br>


  7768B   %407:gpr32 = COPY %409:gpr32<br>


  7776B   %412:gpr32 = COPY %414:gpr32<br>


  7780B   %417:gpr32 = COPY %419:gpr32<br>


  7788B   %422:gpr32 = COPY %424:gpr32<br>


  7792B   %427:gpr32 = COPY %429:gpr32<br>


  7800B   %432:gpr64 = COPY %434:gpr64<br>


  7808B   %373:gpr64sp = IMPLICIT_DEF<br>


  7816B   %374:gpr64sp = IMPLICIT_DEF<br>


  8048B   B %bb.30<br>


<br>


Looking at the debug output of the register allocator, the sequence of events<br>


that kicks things off is:<br>


 %223 assigned to w0<br>


 %283 evicts %381 from w15<br>


 %381 requeued for second round<br>


 %253 assigned to w15<br>


 %381 split for w15 in 4 bundles into %391-%395<br>


  %391, %392, %395 are not local intervals<br>


  %393 is the local interval for bb.11.switchdest09<br>


  %394 is the local interval for bb.17.switchdest13<br>


 %392 assigned to w15<br>


 %391 evicts %376 from w18<br>


 %394 assigned to w18<br>


 %376 split into %396-%400<br>


and then %396 evicts something which is split into something which evicts<br>


something etc. until we're done.<br>


<br>


Looking at what happens when this patch isn't applied the difference is:<br>


 %223 cannot be assigned to w0, evicts %381 from w15<br>


 %381 requeued for second round<br>


 %283 assigned to w15<br>


 %253 assigned to w15<br>


 %381 split for w15 in 1 bundle into %391 and %392<br>


  Neither is a local interval<br>


 %391 evicts %380 from w2<br>


 %392 assigned to w2<br>


<br>


So it looks like the difference is that with the patch we happen to split %381<br>


in a way that causes the split intervals to be allocated such that we get a pair<br>


of copies in bb.17.switchdest13, and this causes a cascade effect where we<br>


repeatedly do the same thing with a whole load of other registers.<br>
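<br>
The "pair of copies" symptom can be detected mechanically. The following is a rough<br>
Python sketch, assuming the MIR text format shown earlier; MIR, COPY_RE and<br>
round_trip_pairs are illustrative names and not part of any LLVM tool. It reports<br>
every vreg that is copied to a temporary and later copied straight back.<br>

```python
import re

# COPY lines of bb.17.switchdest13 after splitting, taken from the dump above.
MIR = """
7360B   %390:gpr32 = COPY $wzr
7364B   %434:gpr64 = COPY %432:gpr64
7368B   %429:gpr32 = COPY %427:gpr32
7376B   %424:gpr32 = COPY %422:gpr32
7384B   %419:gpr32 = COPY %417:gpr32
7392B   %414:gpr32 = COPY %412:gpr32
7400B   %409:gpr32 = COPY %407:gpr32
7408B   %404:gpr32 = COPY %402:gpr32
7416B   %399:gpr64 = COPY %397:gpr64
7424B   %394:gpr32 = COPY %392:gpr32
7752B   %392:gpr32 = COPY %394:gpr32
7756B   %397:gpr64 = COPY %399:gpr64
7764B   %402:gpr32 = COPY %404:gpr32
7768B   %407:gpr32 = COPY %409:gpr32
7776B   %412:gpr32 = COPY %414:gpr32
7780B   %417:gpr32 = COPY %419:gpr32
7788B   %422:gpr32 = COPY %424:gpr32
7792B   %427:gpr32 = COPY %429:gpr32
7800B   %432:gpr64 = COPY %434:gpr64
"""

# Matches vreg-to-vreg copies only; the COPY from $wzr is skipped.
COPY_RE = re.compile(r"%(\d+):\w+ = COPY %(\d+):\w+")


def round_trip_pairs(mir):
    seen = set()   # (dst, src) of every copy seen so far
    pairs = []     # (temporary, original) for each copy-out / copy-back pair
    for dst, src in COPY_RE.findall(mir):
        if (src, dst) in seen:
            pairs.append((int(src), int(dst)))
        seen.add((dst, src))
    return pairs


pairs = round_trip_pairs(MIR)
print(len(pairs), pairs)
```

On this block it finds nine such pairs, one for each link in the chain.<br>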


<br>


<br>


Possible Solutions<br>


------------------<br>


<br>


So there are two ways I can think of to fix this:<br>


 * Make %381 be split in the same way that it is without the patch, which I<br>


   think means deciding that there's only 1 bundle for w15. Does anyone know<br>


   where and how exactly these bundles are decided?<br>


 * Try and change how evicted / split registers are allocated in some way.<br>


   Things I've tried:<br>


  * In RAGreedy::enqueue reduce the score of unspillable local intervals, and in<br>


    RAGreedy::evictInterference put evicted registers into stage RS_Split<br>


    immediately. This causes %381 to be split immediately instead of being<br>


    requeued, and then makes %391 have a higher score than %253 causing it to<br>


    be allocated before it. This works, but ends up causing an extra spill.<br>


  * In RAGreedy::splitAroundRegion put global intervals into stage RS_Split<br>


    immediately. This makes the chain of evictions after %396 not happen, but<br>


    that gives us one extra spill and we still get one pair of copies in<br>


    bb.17.switchdest13.<br>


  * In RAGreedy::evictInterference put evicted registers into a new RS_Evicted<br>


    stage, which is like RS_Assign but can't evict anything. This seemed to give<br>


    OK results but was a mess and I didn't understand what I was doing, so I<br>


    threw it away.<br>


  * Turn on the ConsiderLocalIntervalCost option, as it's supposed to help with<br>


    eviction chains like this. Unfortunately it doesn't work as it's a non-local<br>


    interval that's causing the eviction chain. I tried making it also handle<br>


    non-local intervals, but couldn't figure out how to.<br>


  * Turn on TRI->reverseLocalAssignment(). This seemed to work, but I'm not sure<br>


    why, and judging by its description it may not be the correct solution<br>


    (it's described as an option to reduce the time the register allocator<br>


    takes, not to give better allocation). The benchmark results are also<br>


    slightly worse overall.<br>


<br>


Any ideas on what the right approach to fixing this is?<br>


<br>


John<br>


<br>


<o:p></o:p></p>


</blockquote>


</div>


</div>


</div>


</body>


</html>