<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">2017-09-21 22:22 GMT-07:00 Hal Finkel <span dir="ltr"><<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"><div><div class="h5">
<p><br>
</p>
<div class="m_-8232648019342385258moz-cite-prefix">On 09/22/2017 12:03 AM, Mehdi AMINI
wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">Hi Hal,
<div><br>
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">2017-09-21 20:59 GMT-07:00 Hal Finkel
via llvm-dev <span dir="ltr"><<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>></span>:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"><span> <br>
<div class="m_-8232648019342385258m_-2166750394864443591moz-cite-prefix">On
09/12/2017 10:26 PM, Gerolf Hoflehner wrote:<br>
</div>
</span>
<blockquote type="cite">
<div dir="auto" style="word-wrap:break-word;line-break:after-white-space"><br>
<div><br>
<blockquote type="cite"><span>
<div>On Sep 11, 2017, at 10:47 PM, Hal Finkel
via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>>
wrote:</div>
<br class="m_-8232648019342385258m_-2166750394864443591Apple-interchange-newline">
</span>
>>>>> On 09/11/2017 12:26 PM, Adam Nemet wrote:
<blockquote type="cite"><span> Hi
Hal, Tobias, Michael and others, </span>
<div><b>...</b><span>
<div>
<div><br>
</div>
<div>One thing that I’d like to see
more details on is what this means
for the evolution of loop
transformations in LLVM.</div>
<div><br>
</div>
<div>Our more-or-less established
direction was so far to
incrementally improve and
generalize the required analyses
(e.g. the LoopVectorizer’s
dependence analysis + loop
versioning analysis into a
stand-alone analysis pass
(LoopAccessAnalysis)) and then
build new transformations (e.g.
LoopDistribution,
LoopLoadElimination,
LICMLoopVersioning) on top of
these.</div>
<div><br>
</div>
<div>The idea was that
infrastructure would be
incrementally improved from two
directions:</div>
<div><br>
</div>
<div>- As new transformations are
built analyses have to be improved
(e.g. past improvements to LAA to
support the LoopVersioning
utility, future improvements for
full LoopSROA beyond just
store->load forwarding [1] or
the improvements to LAA for the
LoopFusion proposal[2])</div>
<div><br>
</div>
<div>- As more complex loops would
have to be analyzed we either
improve LAA or make
DependenceAnalysis a drop-in
replacement for the memory
analysis part in LAA</div>
</div>
>>>>>
>>>>> Or we could use Polly's dependence analysis, which I believe to be
>>>>> more powerful, more robust, and more correct than
>>>>> DependenceAnalysis. I believe that the difficult part here is
>>>>> actually the pairing with predicated SCEV, or whatever mechanism we
>>>>> want to use to generate runtime predicates (this applies to the use
>>>>> of DependenceAnalysis too).
>>>>
>>>> What is a good way to measure these assertions (more powerful, more
>>>> robust)? Are you saying the LLVM DependenceAnalysis is incorrect, or
>>>> do you actually mean less conservative (or "more accurate" or
>>>> something like that)?
>>>
>>> Sebastian's email covers the issues with the DependenceAnalysis pass
>>> pretty well.
>>>
>>> Regarding what's in LoopAccessAnalysis, I believe it to be correct,
>>> but more limited. LAA seems good at what it does, but its scope is
>>> largely bounded by what the vectorizer can handle. LAA could do
>>> better in some cases with non-unit-stride loops. Polly also handles
>>> piecewise-affine functions, which allows the modeling of loops with
>>> conditionals. Extending LAA to handle loop nests, moreover, seems
>>> likely to be non-trivial.
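>>>
>>> To make those shapes concrete, here is an illustrative sketch (toy
>>> code, function names invented for illustration) of the two kinds of
>>> loops just mentioned:
>>>
>>>   /* Non-unit stride: the subscript 2*i is affine but not
>>>      contiguous, a case where LAA is currently conservative. */
>>>   void scale_even(float *a, int n) {
>>>     for (int i = 0; i < n / 2; ++i)
>>>       a[2 * i] *= 2.0f;
>>>   }
>>>
>>>   /* Conditional in the body: the set of accessed elements is
>>>      piecewise-affine in i, which the polyhedral model can
>>>      represent. */
>>>   void split_update(float *a, int n, int m) {
>>>     for (int i = 0; i < n; ++i) {
>>>       if (i < m)
>>>         a[i] = 0.0f;
>>>       else
>>>         a[i] *= 3.0f;
>>>     }
>>>   }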
>>>
>>> Regardless, measuring these differences certainly seems like a good
>>> idea. I think that we can do this using optimization remarks. LAA
>>> already emits optimization remarks for loops in which it finds unsafe
>>> memory dependencies. Polly also emits optimization remarks. We may
>>> need to iterate some in order to set up a good comparison, but we
>>> should be able to collect statistics (and other information) by
>>> compiling code using -fsave-optimization-record (in combination with
>>> some other flags), and then analyzing the resulting YAML files.
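>>> For instance, LAA should report an unsafe dependence on a toy loop
>>> like the one below (a hypothetical test case; compiling it with
>>> -fsave-optimization-record puts the remark into the YAML output,
>>> where it can be counted):
>>>
>>>   /* Each iteration reads the value written by the previous one, so
>>>      the loop carries a backward dependence and cannot be vectorized
>>>      as-is; LAA flags the unsafe dependent memory operations. */
>>>   void prefix_sum(float *a, int n) {
>>>     for (int i = 1; i < n; ++i)
>>>       a[i] += a[i - 1];
>>>   }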
<blockquote type="cite">
<div dir="auto" style="word-wrap:break-word;line-break:after-white-space">
<div>
<blockquote type="cite">
<div>
<div bgcolor="#FFFFFF" text="#000000"> <br>
<blockquote type="cite">
<div>
<div>
<div><br>
</div>
<div>While this model may be slow it
has all the benefits of the
incremental development model.</div>
</div>
</div>
</blockquote>
>>>>>
>>>>> The current model may have been slow in many areas, but I think
>>>>> that's mostly a question of development effort. My largest concern
>>>>> about the current model is that, to the extent that we're
>>>>> implementing classic loop transformations (e.g., fusion,
>>>>> distribution, interchange, skewing, tiling, and so on), we're
>>>>> repeating a historical design that is known to have several
>>>>> suboptimal properties. Chief among them is the lack of integration:
>>>>> many of these transformations are interconnected, and there's no
>>>>> good pass ordering in which to make independent decisions. Many of
>>>>> these transformations can be captured in a single model, and we can
>>>>> get much better results by integrating them. There's also the
>>>>> matter of whether building these transformations on SCEV (or IR
>>>>> directly) is the best underlying infrastructure, or whether parts
>>>>> of Polly would be better.
>>>>
>>>> I believe that is true. What I wonder is whether there is a good
>>>> method to reason about it.
>>>
>>> If I understand what you mean, one way to look at it is this: this
>>> is not a canonicalization problem. Picking an optimal way to
>>> interchange loops may depend on how the result can be skewed and/or
>>> tiled; picking an optimal way to distribute loops often depends on
>>> what can be done afterward in each piece. Optimal here generally
>>> involves reasoning about the memory hierarchy (e.g., cache
>>> properties), available prefetching streams, register-file size, and
>>> so on.
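>>>
>>> A small sketch of the interchange/tiling interplay (illustrative
>>> only, toy code):
>>>
>>>   #define N 1024
>>>   /* With j outermost, both row-major arrays are walked column-wise
>>>      (stride of N doubles). Interchanging to make i outermost gives
>>>      unit-stride accesses, but whether that is the best choice can
>>>      change again once tiling the nest for cache reuse is on the
>>>      table; deciding the two transformations independently can lock
>>>      in a poor combination. */
>>>   void add_mats(double A[N][N], double B[N][N]) {
>>>     for (int j = 0; j < N; ++j)
>>>       for (int i = 0; i < N; ++i)
>>>         A[i][j] += B[i][j];
>>>   }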
>>>
>>> I know that I've seen some good examples in papers over the years
>>> that illustrate the phase-ordering challenges. Hopefully, someone
>>> will jump in here with some good references. One classic is: William
>>> Pugh, "Uniform Techniques for Loop Optimization", 1991.
<blockquote type="cite">
<div dir="auto" style="word-wrap:break-word;line-break:after-white-space">
<div> Perhaps concrete examples or perhaps
opt-viewer based comparisons on large sets of
benchmarks? In the big picture you could make
such a modeling argument for all compiler
optimizations.<br>
</div>
</div>
</blockquote>
>>>
>>> Certainly. However, in this case there's a well-studied unified
>>> model for this set of optimizations known to reduce phase-ordering
>>> effects. That's not true in general.
<blockquote type="cite">
<div dir="auto" style="word-wrap:break-word;line-break:after-white-space">
<div>
<blockquote type="cite">
<div>
<div bgcolor="#FFFFFF" text="#000000"> <br>
That having been said, I think that
integrating this technology into LLVM will
also mean applying appropriate modularity.
I think that we'll almost definitely want
to make use of the dependence analysis
separately as an analysis. We'll want to
decide which of these transformations will
be considered canonicalization (and run in
the iterative pipeline) and which will be
lowering (and run near the vectorizer).
LoopSROA certainly sounds to me like
canonicalization, but loop fusion might
also fall into that category (i.e., we
might want to fuse early to enable
optimizations and then split late).<br>
<br>
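>>>>>
>>>>> A sketch of the fuse-early/split-late idea (toy example, assuming
>>>>> the pointers don't alias, hence the restrict qualifiers):
>>>>>
>>>>>   /* Two loops over the same range. Fusing them early lets the
>>>>>      store to tmp[i] forward directly to the load, exposing
>>>>>      further simplification; a later, lowering-time distribution
>>>>>      could still split the fused body again if that vectorizes
>>>>>      better. */
>>>>>   void before(float *restrict a, float *restrict c,
>>>>>               float *restrict tmp, int n) {
>>>>>     for (int i = 0; i < n; ++i) tmp[i] = a[i] + 1.0f;
>>>>>     for (int i = 0; i < n; ++i) c[i] = tmp[i] * 2.0f;
>>>>>   }
>>>>>   void fused(float *restrict a, float *restrict c,
>>>>>              float *restrict tmp, int n) {
>>>>>     for (int i = 0; i < n; ++i) {
>>>>>       tmp[i] = a[i] + 1.0f;
>>>>>       c[i] = tmp[i] * 2.0f; /* tmp[i] forwards as a register */
>>>>>     }
>>>>>   }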
<blockquote type="cite">
<div>
<div>
<div><br>
</div>
<div>Then there is the question of
use cases. It’s fairly obvious
that anybody wanting to optimize a
5-deep highly regular loop-nest
operating on arrays should use
Polly. On the other hand it’s way
less clear that we should use it
for singly or doubly nested
not-so-regular loops which are the
norm in non-HPC workloads.</div>
</div>
</div>
</blockquote>
>>>>>
>>>>> This is clearly a good question, but thinking about Polly as a set
>>>>> of components, not as a monolithic transformation component, I
>>>>> think that polyhedral analysis and transformations can underlie a
>>>>> lot of the transformations we need for non-HPC code (and which,
>>>>> I'll point out, we need for modern HPC code too). In practice, the
>>>>> loops that we can actually analyze have affine dependencies, and
>>>>> Polly does, or can do, a better job at generating runtime
>>>>> predicates and dealing with piecewise-linear expressions than our
>>>>> current infrastructure.
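>>>>>
>>>>> The general pattern (a sketch of what loop versioning produces
>>>>> internally, not Polly's actual output) looks like this:
>>>>>
>>>>>   void saxpy(int n, float a, float *x, float *y) {
>>>>>     /* Runtime predicate: take the optimized version only when the
>>>>>        accessed ranges of x and y provably do not overlap. (Written
>>>>>        in C for illustration; the compiler emits this check in
>>>>>        IR, where the pointer comparison is well-defined.) */
>>>>>     if (x + n <= y || y + n <= x) {
>>>>>       for (int i = 0; i < n; ++i) /* independent: vectorizable */
>>>>>         y[i] += a * x[i];
>>>>>     } else {
>>>>>       for (int i = 0; i < n; ++i) /* conservative fallback */
>>>>>         y[i] += a * x[i];
>>>>>     }
>>>>>   }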
>>>>>
>>>>> In short, I look at Polly as two things: first, an infrastructure
>>>>> for dealing with loop analysis and transformation, which I view as
>>>>> being broadly applicable; and second, an application of that
>>>>> infrastructure to cost-model-driven classic loop transformations.
>>>>> To some extent the latter is going to be more useful for HPC codes,
>>>>> but it also applies to machine learning, signal processing,
>>>>> graphics, and other areas.
>>>>
>>>> I'm wondering if it could be used for pointing out headroom for the
>>>> existing LLVM ecosystem (*).
>>>
>>> Sure.
<blockquote type="cite">
<div dir="auto" style="word-wrap:break-word;line-break:after-white-space">
<div><br>
<blockquote type="cite">
<div>
<div bgcolor="#FFFFFF" text="#000000"> <br>
<blockquote type="cite">
<div>
<div>
<div><br>
</div>
<div>And this brings me to the
maintenance question. Is it
reasonable to expect people to fix
Polly when they have a seemingly
unrelated change that happens to
break a Polly bot.</div>
</div>
</div>
</blockquote>
>>>>>
>>>>> The eventual goal here is to have this technology in appropriate
>>>>> parts of the main pipeline, and so the question here is not really
>>>>> about breaking a "Polly bot", but just about a "bot" in general.
>>>>> I've given this question some thought and I think it sits in a
>>>>> reasonable place in the risk-reward space. The answer would be:
>>>>> yes, we'd need to treat this like any other part of the pipeline.
>>>>> However, I believe that Polly has as many active contributors as,
>>>>> or more than, essentially any other individual part of the
>>>>> mid-level optimizer or CodeGen. As a result, there will be people
>>>>> around in many time zones to help with problems in Polly-related
>>>>> code.
<blockquote type="cite">
<div>
<div>
<div> As far as I know, there were
companies in the past that tried
Polly without a whole lot of prior
experience. It would be great to
hear what the experience was
before adopting Polly at a much
larger scale.</div>
</div>
</div>
</blockquote>
>>>>>
>>>>> I'm also interested, although I'll caution against
>>>>> over-interpreting any evidence here (positive or negative). Before
>>>>> a few weeks ago, Polly didn't effectively run in the pipeline after
>>>>> inlining, and so I doubt it would have been much use outside of
>>>>> embedded environments (and maybe some HPC environments) with
>>>>> straightforwardly-presented C code. It's only now that this has
>>>>> been fixed that I find the possibility of integrating this in
>>>>> production interesting.
>>>>
>>>> That is a good point. There are also biases independent of past
>>>> experiences (for disclosure, mine is (*) above). But I think it is
>>>> objective to say that a Polly integration is a big piece to swallow.
>>>> Your pro-Polly argument lists a number of categories that I think
>>>> could be reasoned about individually and partly evaluated with a
>>>> data-driven approach:
>>>>
>>>> A) Architecture
>>>>    - support for autoparallelism
>>>>    - support for accelerators
>>>>    - isl rewrite? etc.
>>>>    ...
>>>> B) Modelling
>>>>    - polyhedral model
>>>>    - temporal locality
>>>>    - spatial locality
>>>>    ...
>>>> C) Analysis/Optimizations
>>>>    - Dependence Analysis
>>>>    - Transformation effectiveness/power (loop nests, quality of
>>>>      transformations, #vectorizable loops, etc.)
>>>>
>>>> A) is mostly Polly-independent (except for the isl question, I
>>>> guess). For B) and C), performance/compile-time/opt-viewer data on
>>>> a decently wide range of benchmarks, possibly at different
>>>> optimization levels (O2, O3, LTO, PGO, etc., and combinations),
>>>> should provide data-driven insight into costs/benefits.
>>>
>>> I agree. In practice, the first question is: are we willing to take
>>> on Polly (+ isl), in whole or in part, as a build dependency? If the
>>> answer is yes, the next question is: what parts should be reused or
>>> refactored for use in other parts of the pipeline? My argument is
>>> that we should take on Polly, or most of it, as a build dependency,
>>> and work on better unifying the developer communities as we start
>>> experimenting with other kinds of integration. This will, however,
>>> allow us to provide these transformations to all of our users
>>> through pragmas (and other kinds of optional enablement). This is an
>>> important first step.
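>>>
>>> For example, pragma-driven enablement could follow the shape of the
>>> loop pragmas clang already has (the distribution pragma below exists
>>> today; a Polly-driven spelling would be analogous):
>>>
>>>   void f(float *a, float *b, float *c, float *d, int n) {
>>>   #pragma clang loop distribute(enable)
>>>     for (int i = 0; i < n; ++i) {
>>>       a[i] = b[i] + 1.0f;
>>>       c[i] = d[i] * 2.0f;
>>>     }
>>>   }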
>>
>> When you write that we should take Polly as a build dependency: are
>> you envisioning this to be tied into LLVM to the point where we can't
>> build LLVM without Polly?
>> I thought the approach would rather be that a default build of LLVM
>> would have Polly available, but that there will "forever" be a
>> `-DENABLE_POLLY=OFF` option?
>> This seems like an important distinction to me, as making it more
>> integrated but still optional for the correct behavior of LLVM means
>> that people can continue to work on and maintain "forever" an
>> optimization pipeline which operates without Polly, while starting to
>> get more integration in a pluggable form. I believe this is the
>> direction that was taken with the GSoC project last year, trying to
>> get Polly in as a "pluggable" dependence analysis.
>
> I don't want to make statements about "forever":

Right, I forgot the asterisk to explain my quotes around "forever": I
meant by this "not until we have compelling data and agreement to do
otherwise". I think we're in agreement on this, though :)

> eventually, I'd like to have strong loop optimizations as an
> integrated part of LLVM. At that point, "Polly" might be little more
> than a shorthand for some parts of LLVM. Now, LLVM currently has an
> issue whereby it's difficult to build customized versions with only a
> subset of the needed passes, but that's an orthogonal issue. In the
> short term, it would certainly be separable. I believe that we'd
> certainly keep the `-DENABLE_POLLY=OFF` option as long as we felt it
> made sense.
>
> By build dependency, I mean: the sources are part of the main source
> tree, they're built by default, thus essentially all of the builders
> build them, and they'll end up in many distributions and derived
> products. This will allow us to build on what's there in whatever ways
> make sense.

Thanks for clarifying!
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> <br>
I'm not sure exactly how good this is, but polly has
LNT-submitting bots, so the website can generate a
comparison (e.g., <a class="m_-8232648019342385258m_-2166750394864443591moz-txt-link-freetext" href="http://lnt.llvm.org/db_default/v4/nts/71208?compare_to=71182" target="_blank">http://lnt.llvm.org/db_default<wbr>/v4/nts/71208?compare_to=71182</a><wbr>).
Looking at this comparison shows a number of potential
problems but also cases where Polly really helps (and,
FWIW, the largest two compile-time regressions are also
associated with very large execution performance
improvements). My first focus would certainly be on
pragma-driven enablement.<br>
</div>
</blockquote>
>>
>> I'll add: PGO-driven enablement, spending compile time where we know
>> it can have an impact. What do you think?
>
> Certainly (although, obviously, we still need reasonable confidence
> that it won't cause performance regressions to do that).

Oh right, I meant it through an opt-in flag to begin with
(-fenable-pgo-polly or something like that).

--
Mehdi