<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <p>Chiming in to this discussion just a little bit late...</p>

    <p>I share Andy's concerns about whether the proposed trace

      algorithm subtly improves on the existing code.  I'm perfectly

      happy to be convinced, but the burden of proof here is high.  <br>

    </p>

    <p>Thinking about the current algorithm, I find it helpful to think

      in terms of extended basic blocks.  At it's most basic, the

      existing code (and your alternate proposal unless I'm misreading

      it) eagerly forms extended basic blocks with a single exit, but

      potentially many early exits before a single N-way terminator.  At

      this stage, we're only handling "obvious" cases where the exit

      path is substantially colder than the fall-through which continues

      within the same EBB.<br>

    </p>

    <p>The "interesting bits" of both the current code and your new

      algorithm can be phrased as merging these EBBs to form

      super-blocks.  Once we start merging EBBs, we have to allow

      multiple entries into our super-blocks.  One nice property of

      loops, triangles, and diamonds is that control flow *entirely

      within* a super-block is invisible to other super-blocks.  Note

      there's, for example, a policy choice as to whether cold loop

      blocks are placed into the loops super-block (hiding the control

      flow) or into a separate super-block (outlining the cold path far

      away).</p>

    <p>I suspect the current code could be substantially improved by

      making the abstractions of EBBs and super-blocks explicit in the

      code.  Then the heuristic pieces would be much more obvious and

      easier to evaluate in isolation. <br>

    </p>

    <p>Philip<br>

    </p>

    <div class="moz-cite-prefix">On 09/18/2017 06:05 PM, Andrew Trick

      via llvm-dev wrote:<br>

    </div>

    <blockquote type="cite"

      cite="mid:BEA03C92-43DF-485E-AE8F-55F341940C24@apple.com">

      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

      <br class="">

      <div>

        <blockquote type="cite" class="">

          <div class="">On Sep 18, 2017, at 5:17 PM, Kyle Butt <<a

              href="mailto:iteratee@google.com" class=""

              moz-do-not-send="true">iteratee@google.com</a>> wrote:</div>

          <br class="Apple-interchange-newline">

          <div class="">

            <div dir="ltr" class=""><br class="">

              <div class="gmail_extra"><br class="">

                <div class="gmail_quote">On Mon, Sep 18, 2017 at 1:16

                  PM, Andrew Trick <span dir="ltr" class=""><<a

                      href="mailto:atrick@apple.com" target="_blank"

                      class="cremed" moz-do-not-send="true">atrick@apple.com</a>></span>

                  wrote:<br class="">

                  <blockquote class="gmail_quote" style="margin:0 0 0

                    .8ex;border-left:1px #ccc solid;padding-left:1ex">

                    <div style="word-wrap:break-word" class=""><br

                        class="">

                      <div class="">

                        <blockquote type="cite" class="">

                          <div class="">

                            <div class="h5">

                              <div class="">On Sep 14, 2017, at 6:53 PM,

                                Kyle Butt via llvm-dev <<a

                                  href="mailto:llvm-dev@lists.llvm.org"

                                  target="_blank" class="cremed"

                                  moz-do-not-send="true">llvm-dev@lists.llvm.org</a>>

                                wrote:</div>

                              <br

                                class="m_-4149091655691126002Apple-interchange-newline">

                            </div>

                          </div>

                          <div class="">

                            <div class="">

                              <div class="h5">

                                <div dir="ltr" class="">

                                  <div style="font-size:12.8px" class="">I

                                    plan on rewriting the block

                                    placement algorithm to proceed by

                                    traces.</div>

                                  <div style="font-size:12.8px" class=""><br

                                      class="">

                                  </div>

                                  <div style="font-size:12.8px" class="">A

                                    trace is a chain of blocks where

                                    each block in the chain may fall

                                    through to</div>

                                  <div style="font-size:12.8px" class="">the

                                    successor in the chain.</div>

                                  <div style="font-size:12.8px" class=""><br

                                      class="">

                                  </div>

                                  <div style="font-size:12.8px" class="">The

                                    overall algorithm would be to first

                                    produce traces for a function, and

                                    then</div>

                                  <div style="font-size:12.8px" class="">order

                                    those traces to try and get cache

                                    locality.</div>

                                  <div style="font-size:12.8px" class=""><br

                                      class="">

                                  </div>

                                  <div style="font-size:12.8px" class="">Currently

                                    block placement uses a greedy single

                                    step approach to layout. It</div>

                                  <div style="font-size:12.8px" class="">produces

                                    chains working from inner to outer

                                    loops. Unlike a trace, a chain may</div>

                                  <div style="font-size:12.8px" class="">contain

                                    non-fallthrough edges. This causes

                                    problems with loop layout. The main</div>

                                  <div style="font-size:12.8px" class="">problems

                                    with loop layout are: loop rotation

                                    and cold blocks in a loop.</div>

                                  <div style="font-size:12.8px" class=""><br

                                      class="">

                                  </div>

                                  <div style="font-size:12.8px" class="">Overview

                                    of proposed solution:</div>

                                  <div style="font-size:12.8px" class=""><br

                                      class="">

                                  </div>

                                  <div style="font-size:12.8px" class="">Phase

                                    1:</div>

                                  <div style="font-size:12.8px" class="">Greedily

                                    produce a set of traces through the

                                    function. A trace is a list of</div>

                                  <div style="font-size:12.8px" class="">blocks

                                    with each block in the list falling

                                    through (possibly conditionally) to</div>

                                  <div style="font-size:12.8px" class="">the

                                    next block in the list. Loop

                                    rotation will occur naturally in

                                    this phase via</div>

                                  <div style="font-size:12.8px" class="">the

                                    triangle replacement algorithm

                                    below. Handling single trace loops

                                    requires a</div>

                                  <div style="font-size:12.8px" class="">tweak,

                                    see the detailed design.</div>

                                  <div style="font-size:12.8px" class=""><br

                                      class="">

                                  </div>

                                  <div style="font-size:12.8px" class="">Phase

                                    2:</div>

                                  <div style="font-size:12.8px" class="">After

                                    producing what we believe are the

                                    best traces, they need to be

                                    ordered.</div>

                                  <div style="font-size:12.8px" class="">They

                                    will be ordered topologically,

                                    except that traces that are cold

                                    enough (As</div>

                                  <div style="font-size:12.8px" class="">measured

                                    by their warmest block) will be

                                    floated later, This may push them

                                    out</div>

                                  <div style="font-size:12.8px" class="">of

                                    a loop or to the end of the

                                    function.</div>

                                  <div style="font-size:12.8px" class=""><br

                                      class="">

                                  </div>

                                  <div style="font-size:12.8px" class="">Detailed

                                    Design</div>

                                  <div style="font-size:12.8px" class=""><br

                                      class="">

                                  </div>

                                  <div style="font-size:12.8px" class="">Note

                                    whenever an edge is used as a

                                    number, I am referring to the edge

                                    frequency.</div>

                                  <div style="font-size:12.8px" class=""><br

                                      class="">

                                  </div>

                                  <div style="font-size:12.8px" class="">Phase

                                    1: Producing traces</div>

                                  <div style="font-size:12.8px" class="">Traces

                                    are produced according to the

                                    following algorithm:</div>

                                  <div style="font-size:12.8px" class=""> *

                                    Sort the edges according to weight,

                                    stable-sorting them according the

                                    incoming</div>

                                  <div style="font-size:12.8px" class="">block

                                    and edge ordering.</div>

                                  <div style="font-size:12.8px" class=""> *

                                    Place each block in a trace of

                                    length 1.</div>

                                  <div style="font-size:12.8px" class=""> *

                                    For each edge in order:</div>

                                  <div style="font-size:12.8px" class=""> 

                                      * If the source is at the end of a

                                    trace, and the target is at the

                                    beginning</div>

                                  <div style="font-size:12.8px" class=""> 

                                        of a trace, glue those 2 traces

                                    into 1 longer trace.</div>

                                  <div style="font-size:12.8px" class=""> 

                                      * If an edge has a target or

                                    source in the middle of another

                                    trace, consider</div>

                                  <div style="font-size:12.8px" class=""> 

                                        tail duplication. The benefit

                                    calculation is the same as the

                                    existing</div>

                                  <div style="font-size:12.8px" class=""> 

                                        code.</div>

                                  <div style="font-size:12.8px" class=""> 

                                      * If an edge has a source or

                                    target in the middle, check them to

                                    see if they</div>

                                  <div style="font-size:12.8px" class=""> 

                                        can be replaced as a triangle.

                                    (Triangle replacement described

                                    below)</div>

                                  <div style="font-size:12.8px" class=""> 

                                        * Compare the benefit of

                                    choosing the edge, along with any

                                    triangles</div>

                                  <div style="font-size:12.8px" class=""> 

                                          found, with the cost of

                                    breaking the existing edges.</div>

                                  <div style="font-size:12.8px" class=""> 

                                          * If it is a net benefit,

                                    perform the switch.</div>

                                  <div style="font-size:12.8px" class=""> *

                                    Triangle checking:</div>

                                  <div style="font-size:12.8px" class=""> 

                                      Consider a trace in 2 parts:

                                    A1->A2, and the current edge

                                    under consideration</div>

                                  <div style="font-size:12.8px" class=""> 

                                      is A1->B (the case for C->A2

                                    is mirror, and both may need to be

                                    done)</div>

                                  <div style="font-size:12.8px" class=""> 

                                      * First find the best alternative

                                    C->B</div>

                                  <div style="font-size:12.8px" class=""> 

                                      * Check for an alternative for A2:

                                    D->A2</div>

                                  <div style="font-size:12.8px" class=""> 

                                      * Find D's best Alternative:

                                    D->E</div>

                                  <div style="font-size:12.8px" class=""> 

                                      * Compare the frequencies:

                                    A1->A2 + C->B + D->E vs

                                    A1->B + D->A2</div>

                                  <div style="font-size:12.8px" class=""> 

                                      * If the 2nd sum is bigger, do the

                                    switch.</div>

                                  <div style="font-size:12.8px" class=""> 

                                    * Loop Rotation Tweak:</div>

                                  <div style="font-size:12.8px" class=""> 

                                      If A contains a backedge

                                    A2->A1, then when considering

                                    A1->B or C->A2, we</div>

                                  <div style="font-size:12.8px" class=""> 

                                      can include that backedge in the

                                    gain:</div>

                                  <div style="font-size:12.8px" class=""> 

                                      A1->A2 + C->D + E->B vs

                                    A1->B + C->A2 + A2->A</div>

                                  <div style="font-size:12.8px" class=""><br

                                      class="">

                                  </div>

                                  <div style="font-size:12.8px" class="">Phase

                                    2: Order traces.</div>

                                  <div style="font-size:12.8px" class="">First

                                    we compute the frequency of a trace

                                    by finding the max frequency of any

                                    of</div>

                                  <div style="font-size:12.8px" class="">its

                                    blocks.</div>

                                  <div style="font-size:12.8px" class="">Then

                                    we attempt to place the traces

                                    topologically. When a trace cannot

                                    be placed</div>

                                  <div style="font-size:12.8px" class="">topologically,

                                    we prefer warmer traces first.</div>

                                  <div style="font-size:12.8px" class=""><br

                                      class="">

                                  </div>

                                  <div style="font-size:12.8px" class="">Questions

                                    and comments welcome.</div>

                                </div>

                              </div>

                            </div>

                            ______________________________<wbr class="">_________________<br

                              class="">

                          </div>

                        </blockquote>

                        <br class="">

                      </div>

                      <div class="">This algorithm should be easy enough

                        to implement and experiment with out of tree. A

                        reasonable goal would be to identify cases that

                        the current, more sophisticated algorithm are

                        handling poorly. At that point, it would be much

                        more useful to fix the current algorithm’s

                        heuristics rather than introducing another less

                        sophisticated alternative.</div>

                    </div>

                  </blockquote>

                  <div class=""><br class="">

                  </div>

                  <div class="">I wrote many of the current algorithms

                    heuristics, and they feel like hacks to work around

                    specific problems.</div>

                  <div class=""><br class="">

                  </div>

                  <div class="">What I'm proposing is more, not less

                    sophisticated. Handling loop rotation as part of a

                    general lookahead problem is more sophisticated than

                    the existing method of laying out the loop and

                    praying that one of the n rotations will be a good

                    one.</div>

                  <div class="">Even the tail-duplication support I

                    added is kind of a hack. It would be better to do

                    block cloning to handle cases where tail-duplication

                    currently fails</div>

                </div>

              </div>

            </div>

          </div>

        </blockquote>

        <div><br class="">

        </div>

        The “Algo2” that you referred to is the least sophisticated

        layout algorithm I can imagine. I’ve always found it to be

        extremely unsatisfying when it comes to partial profile,

        unbiased branches, or general debugging sanity.</div>

      <div><br class="">

      </div>

      <div>When I say the existing algorithm is more sophisticated, I

        mean that it is designed to meet a couple basic requirements:</div>

      <div>- contiguous loops</div>

      <div>- topologically ordered regions with a loop</div>

      <div>With the exception that clearly cold paths are moved to the

        end of the function.</div>

      <div><br class="">

      </div>

      <div>I’m not sure how well your "triangle look-ahead” algorithm

        works, but if you can also meet those requirements then I would

        say it’s also very sophisticated. I haven’t spent any time

        evaluating the current LLVM algorithm, so won’t defend it over

        another approach that meets the same goals.</div>

      <div><br class="">

      </div>

      <div>I am very curious to see examples of situations where your

        approach yields better performance.</div>

      <div><br class="">

        <blockquote type="cite" class="">

          <div dir="ltr" class="">

            <div class="gmail_extra">

              <div class="gmail_quote">

                <blockquote class="gmail_quote" style="margin:0 0 0

                  .8ex;border-left:1px #ccc solid;padding-left:1ex">

                  <div style="word-wrap:break-word" class="">

                    <div class="">The current algorithm deals well with

                      partial profile information and maintains a number

                      of layout properties relative to the CFG

                      structure. I’m not the best person to explain all

                      of this, but you will certainly need to understand

                      the original goals, show concrete examples of how

                      it “fails” and can’t be easily fixed, before you

                      propose replacing it.</div>

                  </div>

                </blockquote>

                <div class=""><br class="">

                </div>

                <div class="">I'll make sure to include test cases with

                  any diffs I mail out.</div>

              </div>

            </div>

          </div>

        </blockquote>

        <br class="">

      </div>

      <div>Ok. I’m not opposed to someone who has a lot of experience

        with the current algorithm rewriting it. I’m just concerned that

        it will result in a lot of arbitrary code shuffling, making it

        difficult to make sense of the codegen output, and result in

        worse block layout for the typical situation: many missing

        branch weights, or workloads that deviate from the profile.</div>

      <div><br class="">

      </div>

      <div>-Andy</div>

      <br class="">

      <br>

      <fieldset class="mimeAttachmentHeader"></fieldset>

      <br>

      <pre wrap="">_______________________________________________

LLVM Developers mailing list

<a class="moz-txt-link-abbreviated" href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>

<a class="moz-txt-link-freetext" href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a>

</pre>

    </blockquote>

    <br>

  </body>

</html>