<html>
  <head>
    <meta content="text/html; charset=utf-8" http-equiv="Content-Type">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    <p><br>
    </p>
    <div class="moz-cite-prefix">On 09/22/2017 12:25 AM, Mehdi AMINI
      wrote:<br>
    </div>
    <blockquote
cite="mid:CANF-O=bDEoDV=wD-K1_qafrGdZE4J0Re0BGm1=SB3+Rg==A=cA@mail.gmail.com"
      type="cite">
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
      <div dir="ltr"><br>
        <div class="gmail_extra"><br>
          <div class="gmail_quote">2017-09-21 22:22 GMT-07:00 Hal Finkel
            <span dir="ltr"><<a moz-do-not-send="true"
                href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>></span>:<br>
            <blockquote class="gmail_quote" style="margin:0 0 0
              .8ex;border-left:1px #ccc solid;padding-left:1ex">
              <div bgcolor="#FFFFFF" text="#000000">
                <div>
                  <div class="h5">
                    <p><br>
                    </p>
                    <div class="m_-8232648019342385258moz-cite-prefix">On
                      09/22/2017 12:03 AM, Mehdi AMINI wrote:<br>
                    </div>
                    <blockquote type="cite">
                      <div dir="ltr">Hi Hal,
                        <div><br>
                        </div>
                        <div class="gmail_extra"><br>
                          <div class="gmail_quote">2017-09-21 20:59
                            GMT-07:00 Hal Finkel via llvm-dev <span
                              dir="ltr"><<a moz-do-not-send="true"
                                href="mailto:llvm-dev@lists.llvm.org"
                                target="_blank">llvm-dev@lists.llvm.org</a>></span>:<br>
                            <blockquote class="gmail_quote"
                              style="margin:0 0 0 .8ex;border-left:1px
                              #ccc solid;padding-left:1ex">
                              <div bgcolor="#FFFFFF" text="#000000"><span>
                                  <br>
                                  <div
                                    class="m_-8232648019342385258m_-2166750394864443591moz-cite-prefix">On
                                    09/12/2017 10:26 PM, Gerolf
                                    Hoflehner wrote:<br>
                                  </div>
                                </span>
                                <blockquote type="cite">
                                  <div dir="auto"
                                    style="word-wrap:break-word;line-break:after-white-space"><br>
                                    <div><br>
                                      <blockquote type="cite"><span>
                                          <div>On Sep 11, 2017, at 10:47
                                            PM, Hal Finkel via llvm-dev
                                            <<a
                                              moz-do-not-send="true"
                                              href="mailto:llvm-dev@lists.llvm.org"
                                              target="_blank">llvm-dev@lists.llvm.org</a>>
                                            wrote:</div>
                                          <br
class="m_-8232648019342385258m_-2166750394864443591Apple-interchange-newline">
                                        </span>
                                        <div>
                                          <div bgcolor="#FFFFFF"
                                            text="#000000"><span>
                                              <p><br>
                                              </p>
                                              <div
                                                class="m_-8232648019342385258m_-2166750394864443591moz-cite-prefix">On
                                                09/11/2017 12:26 PM,
                                                Adam Nemet wrote:<br>
                                              </div>
                                            </span>
                                            <blockquote type="cite"><span>
                                                Hi Hal, Tobias, Michael
                                                and others, </span>
                                              <div><b>...</b><span>
                                                  <div>
                                                    <div><br>
                                                    </div>
                                                    <div>One thing that
                                                      I’d like to see
                                                      more details on is
                                                      what this means
                                                      for the evolution
                                                      of loop
                                                      transformations in
                                                      LLVM.</div>
                                                    <div><br>
                                                    </div>
                                                    <div>Our
                                                      more-or-less
                                                      established
                                                      direction was so
                                                      far to
                                                      incrementally
                                                      improve and
                                                      generalize the
                                                      required analyses
                                                      (e.g. the
                                                      LoopVectorizer’s
                                                      dependence
                                                      analysis + loop
                                                      versioning
                                                      analysis into a
                                                      stand-alone
                                                      analysis pass
                                                      (LoopAccessAnalysis))
                                                      and then build new
                                                      transformations
                                                      (e.g.
                                                      LoopDistribution,
LoopLoadElimination, LICMLoopVersioning) on top of these.</div>
                                                    <div><br>
                                                    </div>
                                                    <div>The idea was
                                                      that
                                                      infrastructure
                                                      would be
                                                      incrementally
                                                      improved from two
                                                      directions:</div>
                                                    <div><br>
                                                    </div>
                                                    <div>- As new
                                                      transformations
                                                      are built analyses
                                                      have to be
                                                      improved (e.g.
                                                      past improvements
                                                      to LAA to support
                                                      the LoopVersioning
                                                      utility, future
                                                      improvements for
                                                      full LoopSROA
                                                      beyond just
                                                      store->load
                                                      forwarding [1] or
                                                      the improvements
                                                      to LAA for the
                                                      LoopFusion
                                                      proposal[2])</div>
                                                    <div><br>
                                                    </div>
                                                    <div>- As more
                                                      complex loops
                                                      would have to be
                                                      analyzed we either
                                                      improve LAA or
                                                      make
                                                      DependenceAnalysis
                                                      a drop-in
                                                      replacement for
                                                      the memory
                                                      analysis part in
                                                      LAA</div>
                                                  </div>
                                                </span></div>
                                            </blockquote>
                                            <span> <br>
                                              Or we could use Polly's
                                              dependence analysis, which
                                              I believe to be more
                                              powerful, more robust, and
                                              more correct than
                                              DependenceAnalysis. I
                                              believe that the difficult
                                              part here is actually the
                                              pairing with predicated
                                              SCEV or whatever mechanism
                                              we want to use generate
                                              runtime predicates (this
                                              applies to use of
                                              DependenceAnalysis too).<br>
                                            </span></div>
                                        </div>
                                      </blockquote>
                                      <span>
                                        <div><br>
                                        </div>
                                        What is a good way to measure
                                        these assertions (More powerful,
                                        more robust)? Are you saying the
                                        LLVM Dependence Analysis is
                                        incorrect or do you actually
                                        mean less conservative (or "more
                                        accurate" or something like
                                        that)?<br>
                                      </span></div>
                                  </div>
                                </blockquote>
                                <br>
                                Sebastian's email covers the issues with
                                the DependenceAnalysis pass pretty well.<br>
                                <br>
                                Regarding what's in LoopAccessAnalysis,
                                I believe it to be correct, but more
                                limited. It is not clear to me that LAA
                                is bad at what it does based on what the
                                vectorizer can handle. LAA could do
                                better in some cases with
                                non-unit-stride loops. Polly also
                                handles piecewise-affine functions,
                                which allows the modeling of loops with
                                conditionals. Extending LAA to handle
                                loop nests, moreover, seems likely to be
                                non-trivial.<br>
                                <br>
                                Regardless, measuring these differences
                                certainly seems like a good idea. I
                                think that we can do this using
                                optimization remarks. LAA already emits
                                optimization remarks for loops in which
                                it finds unsafe memory dependencies.
                                Polly also emits optimization remarks.
                                We may need to iterate some in order to
                                setup a good comparison, but we should
                                be able to collect statistics (and other
                                information) by compiling code using
                                -fsave-optimization-record (in
                                combination with some other flags), and
                                then analyzing the resulting YAML files.<span><br>
                                  <br>
                                  <blockquote type="cite">
                                    <div dir="auto"
                                      style="word-wrap:break-word;line-break:after-white-space">
                                      <div>
                                        <blockquote type="cite">
                                          <div>
                                            <div bgcolor="#FFFFFF"
                                              text="#000000"> <br>
                                              <blockquote type="cite">
                                                <div>
                                                  <div>
                                                    <div><br>
                                                    </div>
                                                    <div>While this
                                                      model may be slow
                                                      it has all the
                                                      benefits of the
                                                      incremental
                                                      development model.</div>
                                                  </div>
                                                </div>
                                              </blockquote>
                                              <br>
                                              The current model may have
                                              been slow in many areas,
                                              but I think that's mostly
                                              a question of development
                                              effort. My largest concern
                                              about the current model is
                                              that, to the extent that
                                              we're implementing classic
                                              loop transformations
                                              (e.g., fusion,
                                              distribution, interchange,
                                              skewing, tiling, and so
                                              on), we're repeating a
                                              historical design that is
                                              known to have several
                                              suboptimal properties.
                                              Chief among them is the
                                              lack of integration: many
                                              of these transformations
                                              are interconnected, and
                                              there's no good pass
                                              ordering in which to make
                                              independent decisions.
                                              Many of these
                                              transformations can be
                                              captured in a single model
                                              and we can get much better
                                              results by integrating
                                              them. There's also the
                                              matter of whether building
                                              these transformation on
                                              SCEV (or IR directly) is
                                              the best underlying
                                              infrastructure, or whether
                                              parts of Polly would be
                                              better.<br>
                                            </div>
                                          </div>
                                        </blockquote>
                                        <div><br>
                                        </div>
                                        I believe that is true. What I
                                        wonder is is there a good method
                                        to reason about it?</div>
                                    </div>
                                  </blockquote>
                                  <br>
                                </span> If I understand what you mean,
                                one way to look at it is this: This is
                                not a canonicalization problem. Picking
                                an optimal way to interchange loops may
                                depend on how the result can be skewed
                                and/or tiled, picking an optimal way to
                                distribute loops often depends on what
                                can be done afterward in each piece.
                                Optimal here generally involves
                                reasoning about the memory hierarchy
                                (e.g., cache properties), available
                                prefetching streams, register-file size,
                                and so on.<br>
                                <br>
                                I know that I've seen some good examples
                                in papers over the years that illustrate
                                the phase-ordering challenges.
                                Hopefully, someone will jump in here
                                with some good references. One classic
                                one is: William Pugh. Uniform Techniques
                                for Loop Optimization. 1991.<span><br>
                                  <br>
                                  <blockquote type="cite">
                                    <div dir="auto"
                                      style="word-wrap:break-word;line-break:after-white-space">
                                      <div> Perhaps concrete examples or
                                        perhaps opt-viewer based
                                        comparisons on large sets of
                                        benchmarks? In the big picture
                                        you could make such a modeling
                                        argument for all compiler
                                        optimizations.<br>
                                      </div>
                                    </div>
                                  </blockquote>
                                  <br>
                                </span> Certainly. However, in this case
                                there's a well-studied unified model for
                                this set of optimizations known to
                                reduce phase-ordering effects. That's
                                not true in general.<span><br>
                                  <br>
                                  <blockquote type="cite">
                                    <div dir="auto"
                                      style="word-wrap:break-word;line-break:after-white-space">
                                      <div>
                                        <blockquote type="cite">
                                          <div>
                                            <div bgcolor="#FFFFFF"
                                              text="#000000"> <br>
                                              That having been said, I
                                              think that integrating
                                              this technology into LLVM
                                              will also mean applying
                                              appropriate modularity. I
                                              think that we'll almost
                                              definitely want to make
                                              use of the dependence
                                              analysis separately as an
                                              analysis. We'll want to
                                              decide which of these
                                              transformations will be
                                              considered
                                              canonicalization (and run
                                              in the iterative pipeline)
                                              and which will be lowering
                                              (and run near the
                                              vectorizer). LoopSROA
                                              certainly sounds to me
                                              like canonicalization, but
                                              loop fusion might also
                                              fall into that category
                                              (i.e., we might want to
                                              fuse early to enable
                                              optimizations and then
                                              split late).<br>
                                              <br>
                                              <blockquote type="cite">
                                                <div>
                                                  <div>
                                                    <div><br>
                                                    </div>
                                                    <div>Then there is
                                                      the question of
                                                      use cases.  It’s
                                                      fairly obvious
                                                      that anybody
                                                      wanting to
                                                      optimize a 5-deep
                                                      highly regular
                                                      loop-nest
                                                      operating on
                                                      arrays should use
                                                      Polly.  On the
                                                      other hand it’s
                                                      way less clear
                                                      that we should use
                                                      it for singly or
                                                      doubly nested
                                                      not-so-regular
                                                      loops which are
                                                      the norm in
                                                      non-HPC workloads.</div>
                                                  </div>
                                                </div>
                                              </blockquote>
                                              <br>
                                              This is clearly a good
                                              question, but thinking
                                              about Polly as a set of
                                              components, not as a
                                              monolithic transformation
                                              component, I think that
                                              polyhedral analysis and
                                              transformations can
                                              underlie a lot of the
                                              transformations we need
                                              for non-HPC code (and,
                                              which I'll point out, we
                                              need for modern HPC code
                                              too). In practice, the
                                              loops that we can actually
                                              analyze have affine
                                              dependencies, and Polly
                                              does, or can do, a better
                                              job at generating runtime
                                              predicates and dealing
                                              with piecewise-linear
                                              expressions than our
                                              current infrastructure.<br>
                                              <br>
                                              In short, I look at Polly
                                              as two things: First, an
                                              infrastructure for dealing
                                              with loop analysis and
                                              transformation. I view
                                              this as being broadly
                                              applicable. Second, an
                                              application of that to
                                              apply cost-model-driven
                                              classic loop
                                              transformations. To some
                                              extent this is going to be
                                              more useful for HPC codes,
                                              but also applies to
                                              machine learning, signal
                                              processing, graphics, and
                                              other areas. <br>
                                            </div>
                                          </div>
                                        </blockquote>
                                        I’m wondering if it could be
                                        used for pointing out headroom
                                        for the existing LLVM ecosystem
                                        (*)<br>
                                      </div>
                                    </div>
                                  </blockquote>
                                  <br>
                                </span> Sure.<span><br>
                                  <br>
                                  <blockquote type="cite">
                                    <div dir="auto"
                                      style="word-wrap:break-word;line-break:after-white-space">
                                      <div><br>
                                        <blockquote type="cite">
                                          <div>
                                            <div bgcolor="#FFFFFF"
                                              text="#000000"> <br>
                                              <blockquote type="cite">
                                                <div>
                                                  <div>
                                                    <div><br>
                                                    </div>
                                                    <div>And this brings
                                                      me to the
                                                      maintenance
                                                      question.  Is it
                                                      reasonable to
                                                      expect people to
                                                      fix Polly when
                                                      they have a
                                                      seemingly
                                                      unrelated change
                                                      that happens to
                                                      break a Polly bot.</div>
                                                  </div>
                                                </div>
                                              </blockquote>
                                              <br>
                                              The eventual goal here is
                                              to have this technology in
                                              appropriate parts of the
                                              main pipeline, and so the
                                              question here is not
                                              really about breaking a
                                              "Polly bot", but just
                                              about a "bot" in general.
                                              I've given this question
                                              some thought and I think
                                              it sits in a reasonable
                                              place in the risk-reward
                                              space. The answer would
                                              be, yes, we'd need to
                                              treat this like any other
                                              part of the pipeline.
                                              However, I believe that
                                              Polly has as many, or
                                              more, active contributors
                                              than essentially any other
                                              individual part of the
                                              mid-level optimizer or
                                              CodeGen. As a result,
                                              there will be people
                                              around in many time zones
                                              to help with problems with
                                              Polly-related code.<br>
                                              <br>
                                              <blockquote type="cite">
                                                <div>
                                                  <div>
                                                    <div>  As far as I
                                                      know, there were
                                                      companies in the
                                                      past that tried
                                                      Polly without a
                                                      whole lot of prior
                                                      experience.  It
                                                      would be great to
                                                      hear what the
                                                      experience was
                                                      before adopting
                                                      Polly at a much
                                                      larger scale.</div>
                                                  </div>
                                                </div>
                                              </blockquote>
                                              <br>
                                              I'm also interested,
                                              although I'll caution
                                              against over-interpreting
                                              any evidence here
                                              (positive or negative).
                                              Before a few weeks ago,
                                              Polly didn't effectively
                                              run in the pipeline after
                                              inlining, and so I doubt
                                              it would have been much
                                              use outside of embedded
                                              environments (and maybe
                                              some HPC environments)
                                              with
                                              straightforwardly-presented
                                              C code. It's only now that
                                              this has been fixed that I
                                              find the possibility of
                                              integrating this in
                                              production interesting.<br>
                                            </div>
                                          </div>
                                        </blockquote>
                                        <div><br>
                                        </div>
                                        That is a good point. There are
                                        also biases independent of past
                                        experiences (for disclosure mine
                                        is (*) above). But I think it is
                                        objective to say a Polly
                                        integration is a big piece to
                                        swallow.Your pro-Polly argument
                                        lists a number of categories
                                        that I think could be reasoned
                                        about individually and partly
                                        evaluated with a data-driven
                                        approach:</div>
                                      <div>A) Architecture</div>
                                      <div>- support for autoparallelism</div>
                                      <div>- support for accelerators</div>
                                      <div>- isl- rewrite? etc</div>
                                      <div>...</div>
                                      <div>B) Modelling</div>
                                      <div>- polyhedral model</div>
                                      <div>- temporal locality</div>
                                      <div>- spatial locality </div>
                                      <div>…</div>
                                      <div>C) Analysis/Optimizations</div>
                                      <div>- Dependence Analysis</div>
                                      <div>- Transformation
                                        effective/power (loop nests,
                                        quality of transformations,
                                        #vectorizable loops etc)</div>
                                      <div><br>
                                      </div>
                                      <div>A) is mostly Polly
                                        independent (except for the isl
                                        question I guess). For B and C
                                        performance/ compile-time
                                        /opt-viewer data on a
                                        decent/wide range of benchmarks
                                        possibly at different
                                        optimization levels (O2, O3,
                                        LTO, PGO etc and combinations)
                                        should provide data-driven
                                        insight into costs/benefits. <br>
                                      </div>
                                    </div>
                                  </blockquote>
                                  <br>
                                </span> I agree. In practice, the first
                                question is: Are will willing to take on
                                Polly (+isl), in whole or in part, as a
                                build dependency? If the answer is yes,
                                the next question is: what parts should
                                be reused or refactored for use in other
                                parts of the pipeline? My argument is
                                that we should take on Polly, or most of
                                it, as a build dependency. Work on
                                better unifying the developer
                                communities as we start experimenting
                                with other kinds of integration. This
                                will, however, allow us to provide to
                                all of our users these transformations
                                through pragmas (and other kinds of
                                optional enablement). This is an
                                important first step.<br>
                              </div>
                            </blockquote>
                            <div><br>
                            </div>
                            <div>When you write that we should take
                              Polly as a build dependency: are you
                              envisioning this to be tied into LLVM to
                              the point where we can't build LLVM
                              without Polly?</div>
                            <div>I thought the approach would rather be
                              that a default build of LLVM would have
                              Polly available but that there will
                              "forever" be a `-DENABLE_POLLY=OFF`
                              option?</div>
                            <div>This seems like an important
                              distinction to me, as making it more
                              integrated but still optional for the
                              correct behavior of LLVM means that people
                              can continue to work and maintain
                              "forever" an optimization pipeline which
                              operates without Polly, while starting to
                              get more integration in a pluggable form.
                              I believe this is the direction that was
                              taken with the GSoC last year, trying to
                              get Polly as a "pluggable" dependency
                              analysis.</div>
                          </div>
                        </div>
                      </div>
                    </blockquote>
                    <br>
                  </div>
                </div>
                I don't want to make statements about "forever":</div>
            </blockquote>
            <div><br>
            </div>
            <div>Right, I forgot the asterisk to explain my quotes
              around "forever": I meant by this "not until we have
              compelling data and agreement to do otherwise". I think
              we're on agreement on this though :)</div>
          </div>
        </div>
      </div>
    </blockquote>
    <br>
    I think we are as well :-)<br>
    <br>
    <blockquote
cite="mid:CANF-O=bDEoDV=wD-K1_qafrGdZE4J0Re0BGm1=SB3+Rg==A=cA@mail.gmail.com"
      type="cite">
      <div dir="ltr">
        <div class="gmail_extra">
          <div class="gmail_quote">
            <div><br>
            </div>
            <div> </div>
            <blockquote class="gmail_quote" style="margin:0 0 0
              .8ex;border-left:1px #ccc solid;padding-left:1ex">
              <div bgcolor="#FFFFFF" text="#000000"> eventually, I'd
                like to have strong loop optimizations as an integrated
                part of LLVM. At that point, "Polly" might be little
                more than a shorthand for some parts of LLVM. Now LLVM
                currently has an issue whereby it's difficult to build
                customized versions with only a subset of needed passes,
                but that's an orthogonal issue. In the short term, it
                would certainly be separable. I believe that we'd
                certainly keep the `-DENABLE_POLLY=OFF` option as long
                as we felt it made sense.<br>
                <br>
                By build dependency, I mean: The sources are part of the
                main source tree, they're built by default, thus
                essentially all of the builders build them, and they'll
                end up in many distributions and derived products. This
                will allow us to build on what's there in whatever ways
                make sense.</div>
            </blockquote>
            <div><br>
            </div>
            <div>Thanks for clarifying!</div>
            <div><br>
            </div>
            <div>  </div>
            <blockquote class="gmail_quote" style="margin:0 0 0
              .8ex;border-left:1px #ccc solid;padding-left:1ex">
              <div bgcolor="#FFFFFF" text="#000000"><span class="">
                  <blockquote type="cite">
                    <div dir="ltr">
                      <div class="gmail_extra">
                        <div class="gmail_quote">
                          <blockquote class="gmail_quote"
                            style="margin:0 0 0 .8ex;border-left:1px
                            #ccc solid;padding-left:1ex">
                            <div bgcolor="#FFFFFF" text="#000000"> <br>
                              I'm not sure exactly how good this is, but
                              polly has LNT-submitting bots, so the
                              website can generate a comparison (e.g., <a
                                moz-do-not-send="true"
class="m_-8232648019342385258m_-2166750394864443591moz-txt-link-freetext"
href="http://lnt.llvm.org/db_default/v4/nts/71208?compare_to=71182"
                                target="_blank">http://lnt.llvm.org/db_default<wbr>/v4/nts/71208?compare_to=71182</a><wbr>).
                              Looking at this comparison shows a number
                              of potential problems but also cases where
                              Polly really helps (and, FWIW, the largest
                              two compile-time regressions are also
                              associated with very large execution
                              performance improvements). My first focus
                              would certainly be on pragma-driven
                              enablement.<br>
                            </div>
                          </blockquote>
                          <div><br>
                          </div>
                          <div>I'll add: PGO-driven enablement: spending
                            compile-time where we know it can have an
                            impact. What do you think?</div>
                        </div>
                      </div>
                    </div>
                  </blockquote>
                  <br>
                </span> Certainly (although, obviously, we still need
                reasonable confidence that it won't cause performance
                regressions to do that).<br>
              </div>
            </blockquote>
            <div><br>
            </div>
            <div>Oh right, I meant it through an opt-in flag to begin
              with (-fenable-pgo-polly or something like that).</div>
          </div>
        </div>
      </div>
    </blockquote>
    <br>
    SGTM.<br>
    <br>
     -Hal<br>
    <br>
    <blockquote
cite="mid:CANF-O=bDEoDV=wD-K1_qafrGdZE4J0Re0BGm1=SB3+Rg==A=cA@mail.gmail.com"
      type="cite">
      <div dir="ltr">
        <div class="gmail_extra">
          <div class="gmail_quote">
            <div><br>
            </div>
            <div>-- </div>
            <div>Mehdi</div>
            <div><br>
            </div>
          </div>
        </div>
      </div>
    </blockquote>
    <br>
    <pre class="moz-signature" cols="72">-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
  </body>
</html>