<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <p>Chiming in to this discussion just a little bit late...</p>
    <p>I share Andy's concerns about whether the proposed trace
      algorithm subtly improves on the existing code.  I'm perfectly
      happy to be convinced, but the burden of proof here is high.  <br>
    </p>
    <p>Thinking about the current algorithm, I find it helpful to think
      in terms of extended basic blocks.  At it's most basic, the
      existing code (and your alternate proposal unless I'm misreading
      it) eagerly forms extended basic blocks with a single exit, but
      potentially many early exits before a single N-way terminator.  At
      this stage, we're only handling "obvious" cases where the exit
      path is substantially colder than the fall-through which continues
      within the same EBB.<br>
    </p>
    <p>The "interesting bits" of both the current code and your new
      algorithm can be phrased as merging these EBBs to form
      super-blocks.  Once we start merging EBBs, we have to allow
      multiple entries into our super-blocks.  One nice property of
      loops, triangles, and diamonds is that control flow *entirely
      within* a super-block is invisible to other super-blocks.  Note
      there's, for example, a policy choice as to whether cold loop
      blocks are placed into the loops super-block (hiding the control
      flow) or into a separate super-block (outlining the cold path far
      away).</p>
    <p>I suspect the current code could be substantially improved by
      making the abstractions of EBBs and super-blocks explicit in the
      code.  Then the heuristic pieces would be much more obvious and
      easier to evaluate in isolation. <br>
    </p>
    <p>Philip<br>
    </p>
    <div class="moz-cite-prefix">On 09/18/2017 06:05 PM, Andrew Trick
      via llvm-dev wrote:<br>
    </div>
    <blockquote type="cite"
      cite="mid:BEA03C92-43DF-485E-AE8F-55F341940C24@apple.com">
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
      <br class="">
      <div>
        <blockquote type="cite" class="">
          <div class="">On Sep 18, 2017, at 5:17 PM, Kyle Butt <<a
              href="mailto:iteratee@google.com" class=""
              moz-do-not-send="true">iteratee@google.com</a>> wrote:</div>
          <br class="Apple-interchange-newline">
          <div class="">
            <div dir="ltr" class=""><br class="">
              <div class="gmail_extra"><br class="">
                <div class="gmail_quote">On Mon, Sep 18, 2017 at 1:16
                  PM, Andrew Trick <span dir="ltr" class=""><<a
                      href="mailto:atrick@apple.com" target="_blank"
                      class="cremed" moz-do-not-send="true">atrick@apple.com</a>></span>
                  wrote:<br class="">
                  <blockquote class="gmail_quote" style="margin:0 0 0
                    .8ex;border-left:1px #ccc solid;padding-left:1ex">
                    <div style="word-wrap:break-word" class=""><br
                        class="">
                      <div class="">
                        <blockquote type="cite" class="">
                          <div class="">
                            <div class="h5">
                              <div class="">On Sep 14, 2017, at 6:53 PM,
                                Kyle Butt via llvm-dev <<a
                                  href="mailto:llvm-dev@lists.llvm.org"
                                  target="_blank" class="cremed"
                                  moz-do-not-send="true">llvm-dev@lists.llvm.org</a>>
                                wrote:</div>
                              <br
                                class="m_-4149091655691126002Apple-interchange-newline">
                            </div>
                          </div>
                          <div class="">
                            <div class="">
                              <div class="h5">
                                <div dir="ltr" class="">
                                  <div style="font-size:12.8px" class="">I
                                    plan on rewriting the block
                                    placement algorithm to proceed by
                                    traces.</div>
                                  <div style="font-size:12.8px" class=""><br
                                      class="">
                                  </div>
                                  <div style="font-size:12.8px" class="">A
                                    trace is a chain of blocks where
                                    each block in the chain may fall
                                    through to</div>
                                  <div style="font-size:12.8px" class="">the
                                    successor in the chain.</div>
                                  <div style="font-size:12.8px" class=""><br
                                      class="">
                                  </div>
                                  <div style="font-size:12.8px" class="">The
                                    overall algorithm would be to first
                                    produce traces for a function, and
                                    then</div>
                                  <div style="font-size:12.8px" class="">order
                                    those traces to try and get cache
                                    locality.</div>
                                  <div style="font-size:12.8px" class=""><br
                                      class="">
                                  </div>
                                  <div style="font-size:12.8px" class="">Currently
                                    block placement uses a greedy single
                                    step approach to layout. It</div>
                                  <div style="font-size:12.8px" class="">produces
                                    chains working from inner to outer
                                    loops. Unlike a trace, a chain may</div>
                                  <div style="font-size:12.8px" class="">contain
                                    non-fallthrough edges. This causes
                                    problems with loop layout. The main</div>
                                  <div style="font-size:12.8px" class="">problems
                                    with loop layout are: loop rotation
                                    and cold blocks in a loop.</div>
                                  <div style="font-size:12.8px" class=""><br
                                      class="">
                                  </div>
                                  <div style="font-size:12.8px" class="">Overview
                                    of proposed solution:</div>
                                  <div style="font-size:12.8px" class=""><br
                                      class="">
                                  </div>
                                  <div style="font-size:12.8px" class="">Phase
                                    1:</div>
                                  <div style="font-size:12.8px" class="">Greedily
                                    produce a set of traces through the
                                    function. A trace is a list of</div>
                                  <div style="font-size:12.8px" class="">blocks
                                    with each block in the list falling
                                    through (possibly conditionally) to</div>
                                  <div style="font-size:12.8px" class="">the
                                    next block in the list. Loop
                                    rotation will occur naturally in
                                    this phase via</div>
                                  <div style="font-size:12.8px" class="">the
                                    triangle replacement algorithm
                                    below. Handling single trace loops
                                    requires a</div>
                                  <div style="font-size:12.8px" class="">tweak,
                                    see the detailed design.</div>
                                  <div style="font-size:12.8px" class=""><br
                                      class="">
                                  </div>
                                  <div style="font-size:12.8px" class="">Phase
                                    2:</div>
                                  <div style="font-size:12.8px" class="">After
                                    producing what we believe are the
                                    best traces, they need to be
                                    ordered.</div>
                                  <div style="font-size:12.8px" class="">They
                                    will be ordered topologically,
                                    except that traces that are cold
                                    enough (As</div>
                                  <div style="font-size:12.8px" class="">measured
                                    by their warmest block) will be
                                    floated later, This may push them
                                    out</div>
                                  <div style="font-size:12.8px" class="">of
                                    a loop or to the end of the
                                    function.</div>
                                  <div style="font-size:12.8px" class=""><br
                                      class="">
                                  </div>
                                  <div style="font-size:12.8px" class="">Detailed
                                    Design</div>
                                  <div style="font-size:12.8px" class=""><br
                                      class="">
                                  </div>
                                  <div style="font-size:12.8px" class="">Note
                                    whenever an edge is used as a
                                    number, I am referring to the edge
                                    frequency.</div>
                                  <div style="font-size:12.8px" class=""><br
                                      class="">
                                  </div>
                                  <div style="font-size:12.8px" class="">Phase
                                    1: Producing traces</div>
                                  <div style="font-size:12.8px" class="">Traces
                                    are produced according to the
                                    following algorithm:</div>
                                  <div style="font-size:12.8px" class=""> *
                                    Sort the edges according to weight,
                                    stable-sorting them according the
                                    incoming</div>
                                  <div style="font-size:12.8px" class="">block
                                    and edge ordering.</div>
                                  <div style="font-size:12.8px" class=""> *
                                    Place each block in a trace of
                                    length 1.</div>
                                  <div style="font-size:12.8px" class=""> *
                                    For each edge in order:</div>
                                  <div style="font-size:12.8px" class=""> 
                                      * If the source is at the end of a
                                    trace, and the target is at the
                                    beginning</div>
                                  <div style="font-size:12.8px" class=""> 
                                        of a trace, glue those 2 traces
                                    into 1 longer trace.</div>
                                  <div style="font-size:12.8px" class=""> 
                                      * If an edge has a target or
                                    source in the middle of another
                                    trace, consider</div>
                                  <div style="font-size:12.8px" class=""> 
                                        tail duplication. The benefit
                                    calculation is the same as the
                                    existing</div>
                                  <div style="font-size:12.8px" class=""> 
                                        code.</div>
                                  <div style="font-size:12.8px" class=""> 
                                      * If an edge has a source or
                                    target in the middle, check them to
                                    see if they</div>
                                  <div style="font-size:12.8px" class=""> 
                                        can be replaced as a triangle.
                                    (Triangle replacement described
                                    below)</div>
                                  <div style="font-size:12.8px" class=""> 
                                        * Compare the benefit of
                                    choosing the edge, along with any
                                    triangles</div>
                                  <div style="font-size:12.8px" class=""> 
                                          found, with the cost of
                                    breaking the existing edges.</div>
                                  <div style="font-size:12.8px" class=""> 
                                          * If it is a net benefit,
                                    perform the switch.</div>
                                  <div style="font-size:12.8px" class=""> *
                                    Triangle checking:</div>
                                  <div style="font-size:12.8px" class=""> 
                                      Consider a trace in 2 parts:
                                    A1->A2, and the current edge
                                    under consideration</div>
                                  <div style="font-size:12.8px" class=""> 
                                      is A1->B (the case for C->A2
                                    is mirror, and both may need to be
                                    done)</div>
                                  <div style="font-size:12.8px" class=""> 
                                      * First find the best alternative
                                    C->B</div>
                                  <div style="font-size:12.8px" class=""> 
                                      * Check for an alternative for A2:
                                    D->A2</div>
                                  <div style="font-size:12.8px" class=""> 
                                      * Find D's best Alternative:
                                    D->E</div>
                                  <div style="font-size:12.8px" class=""> 
                                      * Compare the frequencies:
                                    A1->A2 + C->B + D->E vs
                                    A1->B + D->A2</div>
                                  <div style="font-size:12.8px" class=""> 
                                      * If the 2nd sum is bigger, do the
                                    switch.</div>
                                  <div style="font-size:12.8px" class=""> 
                                    * Loop Rotation Tweak:</div>
                                  <div style="font-size:12.8px" class=""> 
                                      If A contains a backedge
                                    A2->A1, then when considering
                                    A1->B or C->A2, we</div>
                                  <div style="font-size:12.8px" class=""> 
                                      can include that backedge in the
                                    gain:</div>
                                  <div style="font-size:12.8px" class=""> 
                                      A1->A2 + C->D + E->B vs
                                    A1->B + C->A2 + A2->A</div>
                                  <div style="font-size:12.8px" class=""><br
                                      class="">
                                  </div>
                                  <div style="font-size:12.8px" class="">Phase
                                    2: Order traces.</div>
                                  <div style="font-size:12.8px" class="">First
                                    we compute the frequency of a trace
                                    by finding the max frequency of any
                                    of</div>
                                  <div style="font-size:12.8px" class="">its
                                    blocks.</div>
                                  <div style="font-size:12.8px" class="">Then
                                    we attempt to place the traces
                                    topologically. When a trace cannot
                                    be placed</div>
                                  <div style="font-size:12.8px" class="">topologically,
                                    we prefer warmer traces first.</div>
                                  <div style="font-size:12.8px" class=""><br
                                      class="">
                                  </div>
                                  <div style="font-size:12.8px" class="">Questions
                                    and comments welcome.</div>
                                </div>
                              </div>
                            </div>
                            ______________________________<wbr class="">_________________<br
                              class="">
                          </div>
                        </blockquote>
                        <br class="">
                      </div>
                      <div class="">This algorithm should be easy enough
                        to implement and experiment with out of tree. A
                        reasonable goal would be to identify cases that
                        the current, more sophisticated algorithm are
                        handling poorly. At that point, it would be much
                        more useful to fix the current algorithm’s
                        heuristics rather than introducing another less
                        sophisticated alternative.</div>
                    </div>
                  </blockquote>
                  <div class=""><br class="">
                  </div>
                  <div class="">I wrote many of the current algorithms
                    heuristics, and they feel like hacks to work around
                    specific problems.</div>
                  <div class=""><br class="">
                  </div>
                  <div class="">What I'm proposing is more, not less
                    sophisticated. Handling loop rotation as part of a
                    general lookahead problem is more sophisticated than
                    the existing method of laying out the loop and
                    praying that one of the n rotations will be a good
                    one.</div>
                  <div class="">Even the tail-duplication support I
                    added is kind of a hack. It would be better to do
                    block cloning to handle cases where tail-duplication
                    currently fails</div>
                </div>
              </div>
            </div>
          </div>
        </blockquote>
        <div><br class="">
        </div>
        The “Algo2” that you referred to is the least sophisticated
        layout algorithm I can imagine. I’ve always found it to be
        extremely unsatisfying when it comes to partial profile,
        unbiased branches, or general debugging sanity.</div>
      <div><br class="">
      </div>
      <div>When I say the existing algorithm is more sophisticated, I
        mean that it is designed to meet a couple basic requirements:</div>
      <div>- contiguous loops</div>
      <div>- topologically ordered regions with a loop</div>
      <div>With the exception that clearly cold paths are moved to the
        end of the function.</div>
      <div><br class="">
      </div>
      <div>I’m not sure how well your "triangle look-ahead” algorithm
        works, but if you can also meet those requirements then I would
        say it’s also very sophisticated. I haven’t spent any time
        evaluating the current LLVM algorithm, so won’t defend it over
        another approach that meets the same goals.</div>
      <div><br class="">
      </div>
      <div>I am very curious to see examples of situations where your
        approach yields better performance.</div>
      <div><br class="">
        <blockquote type="cite" class="">
          <div dir="ltr" class="">
            <div class="gmail_extra">
              <div class="gmail_quote">
                <blockquote class="gmail_quote" style="margin:0 0 0
                  .8ex;border-left:1px #ccc solid;padding-left:1ex">
                  <div style="word-wrap:break-word" class="">
                    <div class="">The current algorithm deals well with
                      partial profile information and maintains a number
                      of layout properties relative to the CFG
                      structure. I’m not the best person to explain all
                      of this, but you will certainly need to understand
                      the original goals, show concrete examples of how
                      it “fails” and can’t be easily fixed, before you
                      propose replacing it.</div>
                  </div>
                </blockquote>
                <div class=""><br class="">
                </div>
                <div class="">I'll make sure to include test cases with
                  any diffs I mail out.</div>
              </div>
            </div>
          </div>
        </blockquote>
        <br class="">
      </div>
      <div>Ok. I’m not opposed to someone who has a lot of experience
        with the current algorithm rewriting it. I’m just concerned that
        it will result in a lot of arbitrary code shuffling, making it
        difficult to make sense of the codegen output, and result in
        worse block layout for the typical situation: many missing
        branch weights, or workloads that deviate from the profile.</div>
      <div><br class="">
      </div>
      <div>-Andy</div>
      <br class="">
      <br>
      <fieldset class="mimeAttachmentHeader"></fieldset>
      <br>
      <pre wrap="">_______________________________________________
LLVM Developers mailing list
<a class="moz-txt-link-abbreviated" href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>
<a class="moz-txt-link-freetext" href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a>
</pre>
    </blockquote>
    <br>
  </body>
</html>