<html>
  <head>
    <meta content="text/html; charset=utf-8" http-equiv="Content-Type">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    <p><br>
    </p>
    <div class="moz-cite-prefix">On 09/22/2017 12:03 AM, Mehdi AMINI
      wrote:<br>
    </div>
    <blockquote
cite="mid:CANF-O=aZc5Ouoa9OELEUaHhfJsB19+dD71JWQOtJrF3eH1TXDQ@mail.gmail.com"
      type="cite">
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
      <div dir="ltr">Hi Hal,
        <div><br>
        </div>
        <div class="gmail_extra"><br>
          <div class="gmail_quote">2017-09-21 20:59 GMT-07:00 Hal Finkel
            via llvm-dev <span dir="ltr"><<a moz-do-not-send="true"
                href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>></span>:<br>
            <blockquote class="gmail_quote" style="margin:0 0 0
              .8ex;border-left:1px #ccc solid;padding-left:1ex">
              <div bgcolor="#FFFFFF" text="#000000"><span class=""> <br>
                  <div class="m_-2166750394864443591moz-cite-prefix">On
                    09/12/2017 10:26 PM, Gerolf Hoflehner wrote:<br>
                  </div>
                </span>
                <blockquote type="cite">
                  <div dir="auto"
                    style="word-wrap:break-word;line-break:after-white-space"><br>
                    <div><br>
                      <blockquote type="cite"><span class="">
                          <div>On Sep 11, 2017, at 10:47 PM, Hal Finkel
                            via llvm-dev <<a moz-do-not-send="true"
                              href="mailto:llvm-dev@lists.llvm.org"
                              target="_blank">llvm-dev@lists.llvm.org</a>>
                            wrote:</div>
                          <br
                            class="m_-2166750394864443591Apple-interchange-newline">
                        </span>
                        <div>
                          <div bgcolor="#FFFFFF" text="#000000"><span
                              class="">
                              <p><br>
                              </p>
                              <div
                                class="m_-2166750394864443591moz-cite-prefix">On
                                09/11/2017 12:26 PM, Adam Nemet wrote:<br>
                              </div>
                            </span>
                            <blockquote type="cite"><span class=""> Hi
                                Hal, Tobias, Michael and others, </span>
                              <div><b>...</b><span class="">
                                  <div>
                                    <div><br>
                                    </div>
                                    <div>One thing that I’d like to see
                                      more details on is what this means
                                      for the evolution of loop
                                      transformations in LLVM.</div>
                                    <div><br>
                                    </div>
                                    <div>Our more-or-less established
                                      direction was so far to
                                      incrementally improve and
                                      generalize the required analyses
                                      (e.g. the LoopVectorizer’s
                                      dependence analysis + loop
                                      versioning analysis into a
                                      stand-alone analysis pass
                                      (LoopAccessAnalysis)) and then
                                      build new transformations (e.g.
                                      LoopDistribution,
                                      LoopLoadElimination,
                                      LICMLoopVersioning) on top of
                                      these.</div>
                                    <div><br>
                                    </div>
                                    <div>The idea was that
                                      infrastructure would be
                                      incrementally improved from two
                                      directions:</div>
                                    <div><br>
                                    </div>
                                    <div>- As new transformations are
                                      built analyses have to be improved
                                      (e.g. past improvements to LAA to
                                      support the LoopVersioning
                                      utility, future improvements for
                                      full LoopSROA beyond just
                                      store->load forwarding [1] or
                                      the improvements to LAA for the
                                      LoopFusion proposal[2])</div>
                                    <div><br>
                                    </div>
                                    <div>- As more complex loops would
                                      have to be analyzed we either
                                      improve LAA or make
                                      DependenceAnalysis a drop-in
                                      replacement for the memory
                                      analysis part in LAA</div>
                                  </div>
                                </span></div>
                            </blockquote>
                            <span class=""> <br>
                              Or we could use Polly's dependence
                              analysis, which I believe to be more
                              powerful, more robust, and more correct
                              than DependenceAnalysis. I believe that
                              the difficult part here is actually the
                              pairing with predicated SCEV or whatever
                              mechanism we want to use generate runtime
                              predicates (this applies to use of
                              DependenceAnalysis too).<br>
                            </span></div>
                        </div>
                      </blockquote>
                      <span class="">
                        <div><br>
                        </div>
                        What is a good way to measure these assertions
                        (More powerful, more robust)? Are you saying the
                        LLVM Dependence Analysis is incorrect or do you
                        actually mean less conservative (or "more
                        accurate" or something like that)?<br>
                      </span></div>
                  </div>
                </blockquote>
                <br>
                Sebastian's email covers the issues with the
                DependenceAnalysis pass pretty well.<br>
                <br>
                Regarding what's in LoopAccessAnalysis, I believe it to
                be correct, but more limited. It is not clear to me that
                LAA is bad at what it does based on what the vectorizer
                can handle. LAA could do better in some cases with
                non-unit-stride loops. Polly also handles
                piecewise-affine functions, which allows the modeling of
                loops with conditionals. Extending LAA to handle loop
                nests, moreover, seems likely to be non-trivial.<br>
                <br>
                Regardless, measuring these differences certainly seems
                like a good idea. I think that we can do this using
                optimization remarks. LAA already emits optimization
                remarks for loops in which it finds unsafe memory
                dependencies. Polly also emits optimization remarks. We
                may need to iterate some in order to setup a good
                comparison, but we should be able to collect statistics
                (and other information) by compiling code using
                -fsave-optimization-record (in combination with some
                other flags), and then analyzing the resulting YAML
                files.<span class=""><br>
                  <br>
                  <blockquote type="cite">
                    <div dir="auto"
                      style="word-wrap:break-word;line-break:after-white-space">
                      <div>
                        <blockquote type="cite">
                          <div>
                            <div bgcolor="#FFFFFF" text="#000000"> <br>
                              <blockquote type="cite">
                                <div>
                                  <div>
                                    <div><br>
                                    </div>
                                    <div>While this model may be slow it
                                      has all the benefits of the
                                      incremental development model.</div>
                                  </div>
                                </div>
                              </blockquote>
                              <br>
                              The current model may have been slow in
                              many areas, but I think that's mostly a
                              question of development effort. My largest
                              concern about the current model is that,
                              to the extent that we're implementing
                              classic loop transformations (e.g.,
                              fusion, distribution, interchange,
                              skewing, tiling, and so on), we're
                              repeating a historical design that is
                              known to have several suboptimal
                              properties. Chief among them is the lack
                              of integration: many of these
                              transformations are interconnected, and
                              there's no good pass ordering in which to
                              make independent decisions. Many of these
                              transformations can be captured in a
                              single model and we can get much better
                              results by integrating them. There's also
                              the matter of whether building these
                              transformation on SCEV (or IR directly) is
                              the best underlying infrastructure, or
                              whether parts of Polly would be better.<br>
                            </div>
                          </div>
                        </blockquote>
                        <div><br>
                        </div>
                        I believe that is true. What I wonder is is
                        there a good method to reason about it?</div>
                    </div>
                  </blockquote>
                  <br>
                </span> If I understand what you mean, one way to look
                at it is this: This is not a canonicalization problem.
                Picking an optimal way to interchange loops may depend
                on how the result can be skewed and/or tiled, picking an
                optimal way to distribute loops often depends on what
                can be done afterward in each piece. Optimal here
                generally involves reasoning about the memory hierarchy
                (e.g., cache properties), available prefetching streams,
                register-file size, and so on.<br>
                <br>
                I know that I've seen some good examples in papers over
                the years that illustrate the phase-ordering challenges.
                Hopefully, someone will jump in here with some good
                references. One classic one is: William Pugh. Uniform
                Techniques for Loop Optimization. 1991.<span class=""><br>
                  <br>
                  <blockquote type="cite">
                    <div dir="auto"
                      style="word-wrap:break-word;line-break:after-white-space">
                      <div> Perhaps concrete examples or perhaps
                        opt-viewer based comparisons on large sets of
                        benchmarks? In the big picture you could make
                        such a modeling argument for all compiler
                        optimizations.<br>
                      </div>
                    </div>
                  </blockquote>
                  <br>
                </span> Certainly. However, in this case there's a
                well-studied unified model for this set of optimizations
                known to reduce phase-ordering effects. That's not true
                in general.<span class=""><br>
                  <br>
                  <blockquote type="cite">
                    <div dir="auto"
                      style="word-wrap:break-word;line-break:after-white-space">
                      <div>
                        <blockquote type="cite">
                          <div>
                            <div bgcolor="#FFFFFF" text="#000000"> <br>
                              That having been said, I think that
                              integrating this technology into LLVM will
                              also mean applying appropriate modularity.
                              I think that we'll almost definitely want
                              to make use of the dependence analysis
                              separately as an analysis. We'll want to
                              decide which of these transformations will
                              be considered canonicalization (and run in
                              the iterative pipeline) and which will be
                              lowering (and run near the vectorizer).
                              LoopSROA certainly sounds to me like
                              canonicalization, but loop fusion might
                              also fall into that category (i.e., we
                              might want to fuse early to enable
                              optimizations and then split late).<br>
                              <br>
                              <blockquote type="cite">
                                <div>
                                  <div>
                                    <div><br>
                                    </div>
                                    <div>Then there is the question of
                                      use cases.  It’s fairly obvious
                                      that anybody wanting to optimize a
                                      5-deep highly regular loop-nest
                                      operating on arrays should use
                                      Polly.  On the other hand it’s way
                                      less clear that we should use it
                                      for singly or doubly nested
                                      not-so-regular loops which are the
                                      norm in non-HPC workloads.</div>
                                  </div>
                                </div>
                              </blockquote>
                              <br>
                              This is clearly a good question, but
                              thinking about Polly as a set of
                              components, not as a monolithic
                              transformation component, I think that
                              polyhedral analysis and transformations
                              can underlie a lot of the transformations
                              we need for non-HPC code (and, which I'll
                              point out, we need for modern HPC code
                              too). In practice, the loops that we can
                              actually analyze have affine dependencies,
                              and Polly does, or can do, a better job at
                              generating runtime predicates and dealing
                              with piecewise-linear expressions than our
                              current infrastructure.<br>
                              <br>
                              In short, I look at Polly as two things:
                              First, an infrastructure for dealing with
                              loop analysis and transformation. I view
                              this as being broadly applicable. Second,
                              an application of that to apply
                              cost-model-driven classic loop
                              transformations. To some extent this is
                              going to be more useful for HPC codes, but
                              also applies to machine learning, signal
                              processing, graphics, and other areas. <br>
                            </div>
                          </div>
                        </blockquote>
                        I’m wondering if it could be used for pointing
                        out headroom for the existing LLVM ecosystem (*)<br>
                      </div>
                    </div>
                  </blockquote>
                  <br>
                </span> Sure.<span class=""><br>
                  <br>
                  <blockquote type="cite">
                    <div dir="auto"
                      style="word-wrap:break-word;line-break:after-white-space">
                      <div><br>
                        <blockquote type="cite">
                          <div>
                            <div bgcolor="#FFFFFF" text="#000000"> <br>
                              <blockquote type="cite">
                                <div>
                                  <div>
                                    <div><br>
                                    </div>
                                    <div>And this brings me to the
                                      maintenance question.  Is it
                                      reasonable to expect people to fix
                                      Polly when they have a seemingly
                                      unrelated change that happens to
                                      break a Polly bot.</div>
                                  </div>
                                </div>
                              </blockquote>
                              <br>
                              The eventual goal here is to have this
                              technology in appropriate parts of the
                              main pipeline, and so the question here is
                              not really about breaking a "Polly bot",
                              but just about a "bot" in general. I've
                              given this question some thought and I
                              think it sits in a reasonable place in the
                              risk-reward space. The answer would be,
                              yes, we'd need to treat this like any
                              other part of the pipeline. However, I
                              believe that Polly has as many, or more,
                              active contributors than essentially any
                              other individual part of the mid-level
                              optimizer or CodeGen. As a result, there
                              will be people around in many time zones
                              to help with problems with Polly-related
                              code.<br>
                              <br>
                              <blockquote type="cite">
                                <div>
                                  <div>
                                    <div>  As far as I know, there were
                                      companies in the past that tried
                                      Polly without a whole lot of prior
                                      experience.  It would be great to
                                      hear what the experience was
                                      before adopting Polly at a much
                                      larger scale.</div>
                                  </div>
                                </div>
                              </blockquote>
                              <br>
                              I'm also interested, although I'll caution
                              against over-interpreting any evidence
                              here (positive or negative). Before a few
                              weeks ago, Polly didn't effectively run in
                              the pipeline after inlining, and so I
                              doubt it would have been much use outside
                              of embedded environments (and maybe some
                              HPC environments) with
                              straightforwardly-presented C code. It's
                              only now that this has been fixed that I
                              find the possibility of integrating this
                              in production interesting.<br>
                            </div>
                          </div>
                        </blockquote>
                        <div><br>
                        </div>
                        That is a good point. There are also biases
                        independent of past experiences (for disclosure
                        mine is (*) above). But I think it is objective
                        to say a Polly integration is a big piece to
                        swallow.Your pro-Polly argument lists a number
                        of categories that I think could be reasoned
                        about individually and partly evaluated with a
                        data-driven approach:</div>
                      <div>A) Architecture</div>
                      <div>- support for autoparallelism</div>
                      <div>- support for accelerators</div>
                      <div>- isl- rewrite? etc</div>
                      <div>...</div>
                      <div>B) Modelling</div>
                      <div>- polyhedral model</div>
                      <div>- temporal locality</div>
                      <div>- spatial locality </div>
                      <div>…</div>
                      <div>C) Analysis/Optimizations</div>
                      <div>- Dependence Analysis</div>
                      <div>- Transformation effective/power (loop nests,
                        quality of transformations, #vectorizable loops
                        etc)</div>
                      <div><br>
                      </div>
                      <div>A) is mostly Polly independent (except for
                        the isl question I guess). For B and C
                        performance/ compile-time /opt-viewer data on a
                        decent/wide range of benchmarks possibly at
                        different optimization levels (O2, O3, LTO, PGO
                        etc and combinations) should provide data-driven
                        insight into costs/benefits. <br>
                      </div>
                    </div>
                  </blockquote>
                  <br>
                </span> I agree. In practice, the first question is: Are
                will willing to take on Polly (+isl), in whole or in
                part, as a build dependency? If the answer is yes, the
                next question is: what parts should be reused or
                refactored for use in other parts of the pipeline? My
                argument is that we should take on Polly, or most of it,
                as a build dependency. Work on better unifying the
                developer communities as we start experimenting with
                other kinds of integration. This will, however, allow us
                to provide to all of our users these transformations
                through pragmas (and other kinds of optional
                enablement). This is an important first step.<br>
              </div>
            </blockquote>
            <div><br>
            </div>
            <div>When you write that we should take Polly as a build
              dependency: are you envisioning this to be tied into LLVM
              to the point where we can't build LLVM without Polly?</div>
            <div>I thought the approach would rather be that a default
              build of LLVM would have Polly available but that there
              will "forever" be a `-DENABLE_POLLY=OFF` option?</div>
            <div>This seems like an important distinction to me, as
              making it more integrated but still optional for the
              correct behavior of LLVM means that people can continue to
              work and maintain "forever" an optimization pipeline which
              operates without Polly, while starting to get more
              integration in a pluggable form. I believe this is the
              direction that was taken with the GSoC last year, trying
              to get Polly as a "pluggable" dependency analysis.</div>
          </div>
        </div>
      </div>
    </blockquote>
    <br>
    I don't want to make statements about "forever": eventually, I'd
    like to have strong loop optimizations as an integrated part of
    LLVM. At that point, "Polly" might be little more than a shorthand
    for some parts of LLVM. Now LLVM currently has an issue whereby it's
    difficult to build customized versions with only a subset of needed
    passes, but that's an orthogonal issue. In the short term, it would
    certainly be separable. I believe that we'd certainly keep the
    `-DENABLE_POLLY=OFF` option as long as we felt it made sense.<br>
    <br>
    By build dependency, I mean: The sources are part of the main source
    tree, they're built by default, thus essentially all of the builders
    build them, and they'll end up in many distributions and derived
    products. This will allow us to build on what's there in whatever
    ways make sense.<br>
    <br>
    <blockquote
cite="mid:CANF-O=aZc5Ouoa9OELEUaHhfJsB19+dD71JWQOtJrF3eH1TXDQ@mail.gmail.com"
      type="cite">
      <div dir="ltr">
        <div class="gmail_extra">
          <div class="gmail_quote">
            <div> </div>
            <blockquote class="gmail_quote" style="margin:0 0 0
              .8ex;border-left:1px #ccc solid;padding-left:1ex">
              <div bgcolor="#FFFFFF" text="#000000"> <br>
                I'm not sure exactly how good this is, but polly has
                LNT-submitting bots, so the website can generate a
                comparison (e.g., <a moz-do-not-send="true"
                  class="m_-2166750394864443591moz-txt-link-freetext"
                  href="http://lnt.llvm.org/db_default/v4/nts/71208?compare_to=71182"
                  target="_blank">http://lnt.llvm.org/db_<wbr>default/v4/nts/71208?compare_<wbr>to=71182</a>).
                Looking at this comparison shows a number of potential
                problems but also cases where Polly really helps (and,
                FWIW, the largest two compile-time regressions are also
                associated with very large execution performance
                improvements). My first focus would certainly be on
                pragma-driven enablement.<br>
              </div>
            </blockquote>
            <div><br>
            </div>
            <div>I'll add: PGO-driven enablement: spending compile-time
              where we know it can have an impact. What do you think?</div>
          </div>
        </div>
      </div>
    </blockquote>
    <br>
    Certainly (although, obviously, we still need reasonable confidence
    that it won't cause performance regressions to do that).<br>
    <br>
     -Hal<br>
    <br>
    <blockquote
cite="mid:CANF-O=aZc5Ouoa9OELEUaHhfJsB19+dD71JWQOtJrF3eH1TXDQ@mail.gmail.com"
      type="cite">
      <div dir="ltr">
        <div class="gmail_extra">
          <div class="gmail_quote">
            <div><br>
            </div>
            <div>Best,</div>
            <div><br>
            </div>
            <div>-- </div>
            <div>Mehdi</div>
            <div><br>
            </div>
          </div>
        </div>
      </div>
    </blockquote>
    <br>
    <pre class="moz-signature" cols="72">-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
  </body>
</html>