<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 14 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:Tahoma;
panose-1:2 11 6 4 3 5 4 4 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0cm;
margin-bottom:.0001pt;
font-size:12.0pt;
font-family:"Times New Roman","serif";}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:purple;
text-decoration:underline;}
span.EmailStyle17
{mso-style-type:personal-reply;
font-family:"Calibri","sans-serif";
color:#1F497D;}
.MsoChpDefault
{mso-style-type:export-only;
font-family:"Calibri","sans-serif";
mso-fareast-language:EN-US;}
@page WordSection1
{size:612.0pt 792.0pt;
margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="EN-GB" link="blue" vlink="purple">
<div class="WordSection1">
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:'Calibri','sans-serif';color:#1F497D">enableAdvancedRASplitCost() does the same thing as ConsiderLocalIntervalCost, but as a<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:'Calibri','sans-serif';color:#1F497D">subtarget option instead of a command-line option, and as I’ve said it doesn’t help because<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:'Calibri','sans-serif';color:#1F497D">it’s a non-local interval causing the eviction chain (RAGreedy::splitCanCauseEvictionChain<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:'Calibri','sans-serif';color:#1F497D">only considers the local interval for a single block, and it’s unclear to me how to make it<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:'Calibri','sans-serif';color:#1F497D">handle a non-local interval).<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:'Calibri','sans-serif';color:#1F497D"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:'Calibri','sans-serif';color:#1F497D">John<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:'Calibri','sans-serif';color:#1F497D"><o:p> </o:p></span></p>
<div style="border:none;border-left:solid blue 1.5pt;padding:0cm 0cm 0cm 4.0pt">
<div>
<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0cm 0cm 0cm">
<p class="MsoNormal"><b><span lang="EN-US" style="font-size:10.0pt;font-family:'Tahoma','sans-serif'">From:</span></b><span lang="EN-US" style="font-size:10.0pt;font-family:'Tahoma','sans-serif'"> Nirav Davé [mailto:niravd@google.com]
<br>
<b>Sent:</b> 05 December 2018 17:14<br>
<b>To:</b> John Brawn<br>
<b>Cc:</b> llvm-dev; nd<br>
<b>Subject:</b> Re: [llvm-dev] Strange regalloc behaviour: one more available register causes much worse allocation<o:p></o:p></span></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div>
<div>
<div>
<div>
<div>
<p class="MsoNormal">This has cropped up before in X86 (<a href="https://bugs.llvm.org/show_bug.cgi?id=26810">https://bugs.llvm.org/show_bug.cgi?id=26810</a> /
<a href="https://reviews.llvm.org/rL316295">https://reviews.llvm.org/rL316295</a>), and there's at least a partial mitigation <o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">(I recently ran into an eviction chain on X86 when trying variants of a MachineScheduler change, but couldn't reproduce it once the landed patch was in place).<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">I suggest you try enabling enableAdvancedRASplitCost() for ARM and seeing if that helps.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">-Nirav<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
</div>
</div>
</div>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div>
<p class="MsoNormal">On Wed, Dec 5, 2018 at 10:46 AM John Brawn via llvm-dev &lt;<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>&gt; wrote:<o:p></o:p></p>
</div>
<blockquote style="border:none;border-left:solid #CCCCCC 1.0pt;padding:0cm 0cm 0cm 6.0pt;margin-left:4.8pt;margin-right:0cm">
<p class="MsoNormal">Preamble<br>
--------<br>
<br>
While working on an IR-level optimisation completely unrelated to register<br>
allocation I happened to trigger some really strange register allocator<br>
behaviour causing a large regression in bzip2 in spec2006. I've been trying<br>
to fix that regression before getting the optimisation patch committed, because<br>
I don't want to regress spec2006, but I'm basically fumbling in the dark because<br>
I don't yet know how or why the register allocator is making the decisions it<br>
does and I thought I'd send an email to see if anyone has any advice.<br>
<br>
<br>
The problem<br>
-----------<br>
<br>
Attached are (zipped, as llvm-dev has a 100kb message limit):<br>
* bzip2_regression.ll (reduced from bzip2 in spec2006 after being compiled with<br>
some patches that I'm working on) which demonstrates the problem.<br>
* 0001-AArch64-JumpTableDest-scratch-register-isn-t-earlycl.patch which causes<br>
the problem.<br>
* without_patch_regalloc.txt, the regalloc debug log for llc -mcpu=cortex-a57<br>
bzip2_regression.ll without the patch applied.<br>
* with_patch_regalloc.txt, the same log but with the patch applied.<br>
Note that the patch is not correct, but it happens to be a useful way of<br>
provoking the problem.<br>
<br>
Without the patch, the assembly generated by llc -mcpu=cortex-a57 looks<br>
fine, but with the patch we get this (which comes from the block<br>
bb.17.switchdest13):<br>
<br>
.LBB0_16:<br>
mov x29, x24<br>
mov w24, w20<br>
mov w20, w19<br>
mov w19, w7<br>
mov w7, w6<br>
mov w6, w5<br>
mov w5, w2<br>
mov x2, x18<br>
mov w18, w15<br>
orr w15, wzr, #0x1c<br>
str w15, [x8, #8]<br>
mov w0, wzr<br>
mov w15, w18<br>
mov x18, x2<br>
mov w2, w5<br>
mov w5, w6<br>
mov w6, w7<br>
mov w7, w19<br>
mov w19, w20<br>
mov w20, w24<br>
mov x24, x29<br>
b .LBB0_3<br>
<br>
It looks like the orr and str have barged in and said "we're using w15!" and all<br>
the rest of the registers have meekly moved out of the way and then moved back<br>
again at the end. If the orr and str had used w29 instead then none of this<br>
would have happened.<br>
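</p>
<p class="MsoNormal">As a rough sanity check of that reading, here is a minimal Python sketch (not part of the original attachments) that replays the moves from the .LBB0_16 listing above. It assumes wN and xN can be treated as the same physical register N and gives each register a made-up symbolic starting value; both are simplifications for illustration only.</p>
<pre>
```python
# Register-to-register moves transcribed from the .LBB0_16 listing,
# written as (destination, source) register numbers. Treating wN and xN
# as the same register N is an assumption made to keep the model small.
MOVES_BEFORE = [(29, 24), (24, 20), (20, 19), (19, 7), (7, 6),
                (6, 5), (5, 2), (2, 18), (18, 15)]
MOVES_AFTER = [(15, 18), (18, 2), (2, 5), (5, 6), (6, 7),
               (7, 19), (19, 20), (20, 24), (24, 29)]


def run_block():
    # Give every register a distinct, purely symbolic starting value.
    regs = {n: "v%d" % n for n in range(31)}
    initial = dict(regs)
    for dst, src in MOVES_BEFORE:      # mov dst, src
        regs[dst] = regs[src]
    regs[15] = 28                      # orr w15, wzr, #0x1c
    stored = regs[15]                  # str w15, [x8, #8]
    regs[0] = 0                        # mov w0, wzr
    for dst, src in MOVES_AFTER:       # mov dst, src
        regs[dst] = regs[src]
    # Registers whose final value differs from their starting value.
    changed = sorted(n for n in regs if regs[n] != initial[n])
    return stored, changed


stored, changed = run_block()
print(stored, changed)  # prints: 28 [0, 29]
```
</pre>
<p class="MsoNormal">In this model the eighteen register-to-register moves cancel out: the only registers that end up different are w0, which the block sets itself, and x29, which is left holding a copy of x24. That is consistent with x29 having been free to hold the constant in the first place.</p>
<p class="MsoNormal">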
<br>
What the patch does is make one of the input operands to the JumpTableDest32<br>
pseudo-instruction be not marked as earlyclobber, or in other words it means we<br>
have one extra register free compared to without the patch. You would expect<br>
more free registers to mean better register allocation, but in this case it<br>
appears they don't.<br>
<br>
Note: this problem can happen without the patch, but the test case is much<br>
larger, and there it showed up as -fno-omit-frame-pointer giving a better<br>
allocation than -fomit-frame-pointer. This patch was actually my first attempt<br>
at fixing this (as I'd noticed that we were unnecessarily keeping an extra<br>
register live across the JumpTableDest8).<br>
<br>
<br>
What's going on<br>
---------------<br>
<br>
What this block looks like after live range splitting has happened is:<br>
<br>
7352B bb.17.switchdest13:<br>
; predecessors: %bb.3<br>
successors: %bb.30(0x80000000); %bb.30(100.00%)<br>
<br>
7360B %390:gpr32 = COPY $wzr<br>
7364B %434:gpr64 = COPY %432:gpr64<br>
7368B %429:gpr32 = COPY %427:gpr32<br>
7376B %424:gpr32 = COPY %422:gpr32<br>
7384B %419:gpr32 = COPY %417:gpr32<br>
7392B %414:gpr32 = COPY %412:gpr32<br>
7400B %409:gpr32 = COPY %407:gpr32<br>
7408B %404:gpr32 = COPY %402:gpr32<br>
7416B %399:gpr64 = COPY %397:gpr64<br>
7424B %394:gpr32 = COPY %392:gpr32<br>
7528B %253:gpr32 = MOVi32imm 28<br>
7536B STRWui %253:gpr32, %182:gpr64common, 2 :: (store 4 into %ir.106, align 8)<br>
7752B %392:gpr32 = COPY %394:gpr32<br>
7756B %397:gpr64 = COPY %399:gpr64<br>
7764B %402:gpr32 = COPY %404:gpr32<br>
7768B %407:gpr32 = COPY %409:gpr32<br>
7776B %412:gpr32 = COPY %414:gpr32<br>
7780B %417:gpr32 = COPY %419:gpr32<br>
7788B %422:gpr32 = COPY %424:gpr32<br>
7792B %427:gpr32 = COPY %429:gpr32<br>
7800B %432:gpr64 = COPY %434:gpr64<br>
7808B %373:gpr64sp = IMPLICIT_DEF<br>
7816B %374:gpr64sp = IMPLICIT_DEF<br>
8048B B %bb.30<br>
<br>
Looking at the debug output of the register allocator, the sequence of events<br>
which kicks things off is<br>
%223 assigned to w0<br>
%283 evicts %381 from w15<br>
%381 requeued for second round<br>
%253 assigned to w15<br>
%381 split for w15 in 4 bundles into %391-%395<br>
%391, %392, %395 are not local intervals<br>
%393 is the local interval for bb.11.switchdest09<br>
%394 is the local interval for bb.17.switchdest13<br>
%392 assigned to w15<br>
%391 evicts %376 from w18<br>
%394 assigned to w18<br>
%376 split into %396-%400<br>
and then %396 evicts something which is split into something which evicts<br>
something etc. until we're done.<br>
<br>
Looking at what happens when this patch isn't applied the difference is:<br>
%223 cannot be assigned to w0, evicts %381 from w15<br>
%381 requeued for second round<br>
%283 assigned to w15<br>
%253 assigned to w15<br>
%381 split for w15 in 1 bundle into %391 and %392<br>
Neither is a local interval<br>
%391 evicts %380 from w2<br>
%392 assigned to w2<br>
<br>
So it looks like the difference is that with the patch we happen to split %381<br>
in a way that causes the split intervals to be allocated such that we get a pair<br>
of copies in bb.17.switchdest13, and this causes a cascade effect where we<br>
repeatedly do the same thing with a whole load of other registers.<br>
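</p>
<p class="MsoNormal">The "pair of copies" pattern can be counted mechanically. The following Python sketch (an illustration, not an LLVM tool) takes the instructions of bb.17.switchdest13 as printed above, with slot indexes and the memory-operand comment dropped, and looks for every "a = COPY b" that is later undone by "b = COPY a". The regular expression and the helper name round_trip_pairs are made up for this example.</p>
<pre>
```python
import re

# Instructions transcribed from the bb.17.switchdest13 listing above.
BLOCK = """\
%390:gpr32 = COPY $wzr
%434:gpr64 = COPY %432:gpr64
%429:gpr32 = COPY %427:gpr32
%424:gpr32 = COPY %422:gpr32
%419:gpr32 = COPY %417:gpr32
%414:gpr32 = COPY %412:gpr32
%409:gpr32 = COPY %407:gpr32
%404:gpr32 = COPY %402:gpr32
%399:gpr64 = COPY %397:gpr64
%394:gpr32 = COPY %392:gpr32
%253:gpr32 = MOVi32imm 28
STRWui %253:gpr32, %182:gpr64common, 2
%392:gpr32 = COPY %394:gpr32
%397:gpr64 = COPY %399:gpr64
%402:gpr32 = COPY %404:gpr32
%407:gpr32 = COPY %409:gpr32
%412:gpr32 = COPY %414:gpr32
%417:gpr32 = COPY %419:gpr32
%422:gpr32 = COPY %424:gpr32
%427:gpr32 = COPY %429:gpr32
%432:gpr64 = COPY %434:gpr64
"""

# Matches only copies between two virtual registers (so not COPY $wzr).
COPY_RE = re.compile(r"(%\d+):\w+ = COPY (%\d+):\w+")


def round_trip_pairs(block):
    # Collect (dst, src) for every vreg-to-vreg COPY, in program order.
    copies = [m.groups() for m in COPY_RE.finditer(block)]
    pairs = []
    for i, (dst, src) in enumerate(copies):
        # A round trip is "dst = COPY src" followed later by "src = COPY dst".
        if (src, dst) in copies[i + 1:]:
            pairs.append((src, dst))
    return pairs


pairs = round_trip_pairs(BLOCK)
print(len(pairs))  # prints: 9
```
</pre>
<p class="MsoNormal">All nine vreg-to-vreg copies before the store are undone after it, including the %392 / %394 pair that the log shows being assigned to w15 and w18.</p>
<p class="MsoNormal">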
<br>
<br>
Possible Solutions<br>
------------------<br>
<br>
So there are two ways I can think of to fix this:<br>
* Make %381 be split in the same way that it is without the patch, which I<br>
think means deciding that there's only 1 bundle for w15. Does anyone know<br>
where and how exactly these bundles are decided?<br>
* Try and change how evicted / split registers are allocated in some way.<br>
Things I've tried:<br>
* In RAGreedy::enqueue reduce the score of unspillable local intervals, and in<br>
RAGreedy::evictInterference put evicted registers into stage RS_Split<br>
immediately. This causes %381 to be split immediately instead of being<br>
requeued, and then makes %391 have a higher score than %253 causing it to<br>
be allocated before it. This works, but ends up causing an extra spill.<br>
* In RAGreedy::splitAroundRegion put global intervals into stage RS_Split<br>
immediately. This makes the chain of evictions after %396 not happen, but<br>
that gives us one extra spill and we still get one pair of copies in<br>
bb.17.switchdest13.<br>
* In RAGreedy::evictInterference put evicted registers into a new RS_Evicted<br>
stage, which is like RS_Assign but can't evict anything. This seemed to give<br>
OK results but was a mess and I didn't understand what I was doing, so I<br>
threw it away.<br>
* Turn on the ConsiderLocalIntervalCost option, as it's supposed to help with<br>
eviction chains like this. Unfortunately it doesn't work as it's a non-local<br>
interval that's causing the eviction chain. I tried making it also handle<br>
non-local intervals, but couldn't figure out how to.<br>
* Turn on TRI->reverseLocalAssignment(). This seemed to work, but I'm not sure<br>
why and reading the description of that it may not be the correct solution<br>
(it's described as being an option to reduce the time the register allocator<br>
takes, not to give better allocation). The benchmark results are also<br>
overall slightly worse.<br>
<br>
Any ideas on what the right approach to fixing this is?<br>
<br>
John<br>
<br>
<o:p></o:p></p>
</blockquote>
</div>
</div>
</div>
</body>
</html>