<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>Chiming in to this discussion just a little bit late...</p>
<p>I share Andy's concerns about whether the proposed trace
algorithm subtly improves on the existing code. I'm perfectly
happy to be convinced, but the burden of proof here is high. <br>
</p>
<p>Thinking about the current algorithm, I find it helpful to think
in terms of extended basic blocks. At it's most basic, the
existing code (and your alternate proposal unless I'm misreading
it) eagerly forms extended basic blocks with a single exit, but
potentially many early exits before a single N-way terminator. At
this stage, we're only handling "obvious" cases where the exit
path is substantially colder than the fall-through which continues
within the same EBB.<br>
</p>
<p>The "interesting bits" of both the current code and your new
algorithm can be phrased as merging these EBBs to form
super-blocks. Once we start merging EBBs, we have to allow
multiple entries into our super-blocks. One nice property of
loops, triangles, and diamonds is that control flow *entirely
within* a super-block is invisible to other super-blocks. Note
there's, for example, a policy choice as to whether cold loop
blocks are placed into the loops super-block (hiding the control
flow) or into a separate super-block (outlining the cold path far
away).</p>
<p>I suspect the current code could be substantially improved by
making the abstractions of EBBs and super-blocks explicit in the
code. Then the heuristic pieces would be much more obvious and
easier to evaluate in isolation. <br>
</p>
<p>Philip<br>
</p>
<div class="moz-cite-prefix">On 09/18/2017 06:05 PM, Andrew Trick
via llvm-dev wrote:<br>
</div>
<blockquote type="cite"
cite="mid:BEA03C92-43DF-485E-AE8F-55F341940C24@apple.com">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<br class="">
<div>
<blockquote type="cite" class="">
<div class="">On Sep 18, 2017, at 5:17 PM, Kyle Butt <<a
href="mailto:iteratee@google.com" class=""
moz-do-not-send="true">iteratee@google.com</a>> wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<div dir="ltr" class=""><br class="">
<div class="gmail_extra"><br class="">
<div class="gmail_quote">On Mon, Sep 18, 2017 at 1:16
PM, Andrew Trick <span dir="ltr" class=""><<a
href="mailto:atrick@apple.com" target="_blank"
class="cremed" moz-do-not-send="true">atrick@apple.com</a>></span>
wrote:<br class="">
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div style="word-wrap:break-word" class=""><br
class="">
<div class="">
<blockquote type="cite" class="">
<div class="">
<div class="h5">
<div class="">On Sep 14, 2017, at 6:53 PM,
Kyle Butt via llvm-dev <<a
href="mailto:llvm-dev@lists.llvm.org"
target="_blank" class="cremed"
moz-do-not-send="true">llvm-dev@lists.llvm.org</a>>
wrote:</div>
<br
class="m_-4149091655691126002Apple-interchange-newline">
</div>
</div>
<div class="">
<div class="">
<div class="h5">
<div dir="ltr" class="">
<div style="font-size:12.8px" class="">I
plan on rewriting the block
placement algorithm to proceed by
traces.</div>
<div style="font-size:12.8px" class=""><br
class="">
</div>
<div style="font-size:12.8px" class="">A
trace is a chain of blocks where
each block in the chain may fall
through to</div>
<div style="font-size:12.8px" class="">the
successor in the chain.</div>
<div style="font-size:12.8px" class=""><br
class="">
</div>
<div style="font-size:12.8px" class="">The
overall algorithm would be to first
produce traces for a function, and
then</div>
<div style="font-size:12.8px" class="">order
those traces to try and get cache
locality.</div>
<div style="font-size:12.8px" class=""><br
class="">
</div>
<div style="font-size:12.8px" class="">Currently
block placement uses a greedy single
step approach to layout. It</div>
<div style="font-size:12.8px" class="">produces
chains working from inner to outer
loops. Unlike a trace, a chain may</div>
<div style="font-size:12.8px" class="">contain
non-fallthrough edges. This causes
problems with loop layout. The main</div>
<div style="font-size:12.8px" class="">problems
with loop layout are: loop rotation
and cold blocks in a loop.</div>
<div style="font-size:12.8px" class=""><br
class="">
</div>
<div style="font-size:12.8px" class="">Overview
of proposed solution:</div>
<div style="font-size:12.8px" class=""><br
class="">
</div>
<div style="font-size:12.8px" class="">Phase
1:</div>
<div style="font-size:12.8px" class="">Greedily
produce a set of traces through the
function. A trace is a list of</div>
<div style="font-size:12.8px" class="">blocks
with each block in the list falling
through (possibly conditionally) to</div>
<div style="font-size:12.8px" class="">the
next block in the list. Loop
rotation will occur naturally in
this phase via</div>
<div style="font-size:12.8px" class="">the
triangle replacement algorithm
below. Handling single trace loops
requires a</div>
<div style="font-size:12.8px" class="">tweak,
see the detailed design.</div>
<div style="font-size:12.8px" class=""><br
class="">
</div>
<div style="font-size:12.8px" class="">Phase
2:</div>
<div style="font-size:12.8px" class="">After
producing what we believe are the
best traces, they need to be
ordered.</div>
<div style="font-size:12.8px" class="">They
will be ordered topologically,
except that traces that are cold
enough (As</div>
<div style="font-size:12.8px" class="">measured
by their warmest block) will be
floated later, This may push them
out</div>
<div style="font-size:12.8px" class="">of
a loop or to the end of the
function.</div>
<div style="font-size:12.8px" class=""><br
class="">
</div>
<div style="font-size:12.8px" class="">Detailed
Design</div>
<div style="font-size:12.8px" class=""><br
class="">
</div>
<div style="font-size:12.8px" class="">Note
whenever an edge is used as a
number, I am referring to the edge
frequency.</div>
<div style="font-size:12.8px" class=""><br
class="">
</div>
<div style="font-size:12.8px" class="">Phase
1: Producing traces</div>
<div style="font-size:12.8px" class="">Traces
are produced according to the
following algorithm:</div>
<div style="font-size:12.8px" class=""> *
Sort the edges according to weight,
stable-sorting them according the
incoming</div>
<div style="font-size:12.8px" class="">block
and edge ordering.</div>
<div style="font-size:12.8px" class=""> *
Place each block in a trace of
length 1.</div>
<div style="font-size:12.8px" class=""> *
For each edge in order:</div>
<div style="font-size:12.8px" class="">
* If the source is at the end of a
trace, and the target is at the
beginning</div>
<div style="font-size:12.8px" class="">
of a trace, glue those 2 traces
into 1 longer trace.</div>
<div style="font-size:12.8px" class="">
* If an edge has a target or
source in the middle of another
trace, consider</div>
<div style="font-size:12.8px" class="">
tail duplication. The benefit
calculation is the same as the
existing</div>
<div style="font-size:12.8px" class="">
code.</div>
<div style="font-size:12.8px" class="">
* If an edge has a source or
target in the middle, check them to
see if they</div>
<div style="font-size:12.8px" class="">
can be replaced as a triangle.
(Triangle replacement described
below)</div>
<div style="font-size:12.8px" class="">
* Compare the benefit of
choosing the edge, along with any
triangles</div>
<div style="font-size:12.8px" class="">
found, with the cost of
breaking the existing edges.</div>
<div style="font-size:12.8px" class="">
* If it is a net benefit,
perform the switch.</div>
<div style="font-size:12.8px" class=""> *
Triangle checking:</div>
<div style="font-size:12.8px" class="">
Consider a trace in 2 parts:
A1->A2, and the current edge
under consideration</div>
<div style="font-size:12.8px" class="">
is A1->B (the case for C->A2
is mirror, and both may need to be
done)</div>
<div style="font-size:12.8px" class="">
* First find the best alternative
C->B</div>
<div style="font-size:12.8px" class="">
* Check for an alternative for A2:
D->A2</div>
<div style="font-size:12.8px" class="">
* Find D's best Alternative:
D->E</div>
<div style="font-size:12.8px" class="">
* Compare the frequencies:
A1->A2 + C->B + D->E vs
A1->B + D->A2</div>
<div style="font-size:12.8px" class="">
* If the 2nd sum is bigger, do the
switch.</div>
<div style="font-size:12.8px" class="">
* Loop Rotation Tweak:</div>
<div style="font-size:12.8px" class="">
If A contains a backedge
A2->A1, then when considering
A1->B or C->A2, we</div>
<div style="font-size:12.8px" class="">
can include that backedge in the
gain:</div>
<div style="font-size:12.8px" class="">
A1->A2 + C->D + E->B vs
A1->B + C->A2 + A2->A</div>
<div style="font-size:12.8px" class=""><br
class="">
</div>
<div style="font-size:12.8px" class="">Phase
2: Order traces.</div>
<div style="font-size:12.8px" class="">First
we compute the frequency of a trace
by finding the max frequency of any
of</div>
<div style="font-size:12.8px" class="">its
blocks.</div>
<div style="font-size:12.8px" class="">Then
we attempt to place the traces
topologically. When a trace cannot
be placed</div>
<div style="font-size:12.8px" class="">topologically,
we prefer warmer traces first.</div>
<div style="font-size:12.8px" class=""><br
class="">
</div>
<div style="font-size:12.8px" class="">Questions
and comments welcome.</div>
</div>
</div>
</div>
______________________________<wbr class="">_________________<br
class="">
</div>
</blockquote>
<br class="">
</div>
<div class="">This algorithm should be easy enough
to implement and experiment with out of tree. A
reasonable goal would be to identify cases that
the current, more sophisticated algorithm are
handling poorly. At that point, it would be much
more useful to fix the current algorithm’s
heuristics rather than introducing another less
sophisticated alternative.</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
<div class="">I wrote many of the current algorithms
heuristics, and they feel like hacks to work around
specific problems.</div>
<div class=""><br class="">
</div>
<div class="">What I'm proposing is more, not less
sophisticated. Handling loop rotation as part of a
general lookahead problem is more sophisticated than
the existing method of laying out the loop and
praying that one of the n rotations will be a good
one.</div>
<div class="">Even the tail-duplication support I
added is kind of a hack. It would be better to do
block cloning to handle cases where tail-duplication
currently fails</div>
</div>
</div>
</div>
</div>
</blockquote>
<div><br class="">
</div>
The “Algo2” that you referred to is the least sophisticated
layout algorithm I can imagine. I’ve always found it to be
extremely unsatisfying when it comes to partial profile,
unbiased branches, or general debugging sanity.</div>
<div><br class="">
</div>
<div>When I say the existing algorithm is more sophisticated, I
mean that it is designed to meet a couple basic requirements:</div>
<div>- contiguous loops</div>
<div>- topologically ordered regions with a loop</div>
<div>With the exception that clearly cold paths are moved to the
end of the function.</div>
<div><br class="">
</div>
<div>I’m not sure how well your "triangle look-ahead” algorithm
works, but if you can also meet those requirements then I would
say it’s also very sophisticated. I haven’t spent any time
evaluating the current LLVM algorithm, so won’t defend it over
another approach that meets the same goals.</div>
<div><br class="">
</div>
<div>I am very curious to see examples of situations where your
approach yields better performance.</div>
<div><br class="">
<blockquote type="cite" class="">
<div dir="ltr" class="">
<div class="gmail_extra">
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div style="word-wrap:break-word" class="">
<div class="">The current algorithm deals well with
partial profile information and maintains a number
of layout properties relative to the CFG
structure. I’m not the best person to explain all
of this, but you will certainly need to understand
the original goals, show concrete examples of how
it “fails” and can’t be easily fixed, before you
propose replacing it.</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
<div class="">I'll make sure to include test cases with
any diffs I mail out.</div>
</div>
</div>
</div>
</blockquote>
<br class="">
</div>
<div>Ok. I’m not opposed to someone who has a lot of experience
with the current algorithm rewriting it. I’m just concerned that
it will result in a lot of arbitrary code shuffling, making it
difficult to make sense of the codegen output, and result in
worse block layout for the typical situation: many missing
branch weights, or workloads that deviate from the profile.</div>
<div><br class="">
</div>
<div>-Andy</div>
<br class="">
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">_______________________________________________
LLVM Developers mailing list
<a class="moz-txt-link-abbreviated" href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>
<a class="moz-txt-link-freetext" href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a>
</pre>
</blockquote>
<br>
</body>
</html>