[llvm-dev] [RFC] Polly Status and Integration

Thu Sep 21 23:22:31 PDT 2017

On Fri, Sep 22, 2017, at 07:33, Hal Finkel via llvm-dev wrote:
> 
> On 09/22/2017 12:25 AM, Mehdi AMINI wrote:
> >
> >
> > 2017-09-21 22:22 GMT-07:00 Hal Finkel <hfinkel at anl.gov 
> > <mailto:hfinkel at anl.gov>>:
> >
> >
> >     On 09/22/2017 12:03 AM, Mehdi AMINI wrote:
> >>     Hi Hal,
> >>
> >>
> >>     2017-09-21 20:59 GMT-07:00 Hal Finkel via llvm-dev
> >>     <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>>:
> >>
> >>
> >>         On 09/12/2017 10:26 PM, Gerolf Hoflehner wrote:
> >>>
> >>>
> >>>>         On Sep 11, 2017, at 10:47 PM, Hal Finkel via llvm-dev
> >>>>         <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>>
> >>>>         wrote:
> >>>>
> >>>>
> >>>>         On 09/11/2017 12:26 PM, Adam Nemet wrote:
> >>>>>         Hi Hal, Tobias, Michael and others,
> >>>>>         *...*
> >>>>>
> >>>>>         One thing that I’d like to see more details on is what
> >>>>>         this means for the evolution of loop transformations in LLVM.
> >>>>>
> >>>>>         Our more-or-less established direction was so far to
> >>>>>         incrementally improve and generalize the required analyses
> >>>>>         (e.g. the LoopVectorizer’s dependence analysis + loop
> >>>>>         versioning analysis into a stand-alone analysis pass
> >>>>>         (LoopAccessAnalysis)) and then build new transformations
> >>>>>         (e.g. LoopDistribution, LoopLoadElimination,
> >>>>>         LICMLoopVersioning) on top of these.
> >>>>>
> >>>>>         The idea was that infrastructure would be incrementally
> >>>>>         improved from two directions:
> >>>>>
> >>>>>         - As new transformations are built analyses have to be
> >>>>>         improved (e.g. past improvements to LAA to support the
> >>>>>         LoopVersioning utility, future improvements for full
> >>>>>         LoopSROA beyond just store->load forwarding [1] or the
> >>>>>         improvements to LAA for the LoopFusion proposal[2])
> >>>>>
> >>>>>         - As more complex loops would have to be analyzed we
> >>>>>         either improve LAA or make DependenceAnalysis a drop-in
> >>>>>         replacement for the memory analysis part in LAA
> >>>>
> >>>>         Or we could use Polly's dependence analysis, which I
> >>>>         believe to be more powerful, more robust, and more correct
> >>>>         than DependenceAnalysis. I believe that the difficult part
> >>>>         here is actually the pairing with predicated SCEV or
> >>>>         whatever mechanism we want to use generate runtime
> >>>>         predicates (this applies to use of DependenceAnalysis too).
> >>>
> >>>         What is a good way to measure these assertions (More
> >>>         powerful, more robust)? Are you saying the LLVM Dependence
> >>>         Analysis is incorrect or do you actually mean less
> >>>         conservative (or "more accurate" or something like that)?
> >>
> >>         Sebastian's email covers the issues with the
> >>         DependenceAnalysis pass pretty well.
> >>
> >>         Regarding what's in LoopAccessAnalysis, I believe it to be
> >>         correct, but more limited. It is not clear to me that LAA is
> >>         bad at what it does based on what the vectorizer can handle.
> >>         LAA could do better in some cases with non-unit-stride loops.
> >>         Polly also handles piecewise-affine functions, which allows
> >>         the modeling of loops with conditionals. Extending LAA to
> >>         handle loop nests, moreover, seems likely to be non-trivial.
> >>
> >>         Regardless, measuring these differences certainly seems like
> >>         a good idea. I think that we can do this using optimization
> >>         remarks. LAA already emits optimization remarks for loops in
> >>         which it finds unsafe memory dependencies. Polly also emits
> >>         optimization remarks. We may need to iterate some in order to
> >>         setup a good comparison, but we should be able to collect
> >>         statistics (and other information) by compiling code using
> >>         -fsave-optimization-record (in combination with some other
> >>         flags), and then analyzing the resulting YAML files.
> >>
> >>>>
> >>>>>
> >>>>>         While this model may be slow it has all the benefits of
> >>>>>         the incremental development model.
> >>>>
> >>>>         The current model may have been slow in many areas, but I
> >>>>         think that's mostly a question of development effort. My
> >>>>         largest concern about the current model is that, to the
> >>>>         extent that we're implementing classic loop transformations
> >>>>         (e.g., fusion, distribution, interchange, skewing, tiling,
> >>>>         and so on), we're repeating a historical design that is
> >>>>         known to have several suboptimal properties. Chief among
> >>>>         them is the lack of integration: many of these
> >>>>         transformations are interconnected, and there's no good
> >>>>         pass ordering in which to make independent decisions. Many
> >>>>         of these transformations can be captured in a single model
> >>>>         and we can get much better results by integrating them.
> >>>>         There's also the matter of whether building these
> >>>>         transformation on SCEV (or IR directly) is the best
> >>>>         underlying infrastructure, or whether parts of Polly would
> >>>>         be better.
> >>>
> >>>         I believe that is true. What I wonder is is there a good
> >>>         method to reason about it?
> >>
> >>         If I understand what you mean, one way to look at it is this:
> >>         This is not a canonicalization problem. Picking an optimal
> >>         way to interchange loops may depend on how the result can be
> >>         skewed and/or tiled, picking an optimal way to distribute
> >>         loops often depends on what can be done afterward in each
> >>         piece. Optimal here generally involves reasoning about the
> >>         memory hierarchy (e.g., cache properties), available
> >>         prefetching streams, register-file size, and so on.
> >>
> >>         I know that I've seen some good examples in papers over the
> >>         years that illustrate the phase-ordering challenges.
> >>         Hopefully, someone will jump in here with some good
> >>         references. One classic one is: William Pugh. Uniform
> >>         Techniques for Loop Optimization. 1991.
> >>
> >>>         Perhaps concrete examples or perhaps opt-viewer based
> >>>         comparisons on large sets of benchmarks? In the big picture
> >>>         you could make such a modeling argument for all compiler
> >>>         optimizations.
> >>
> >>         Certainly. However, in this case there's a well-studied
> >>         unified model for this set of optimizations known to reduce
> >>         phase-ordering effects. That's not true in general.
> >>
> >>>>
> >>>>         That having been said, I think that integrating this
> >>>>         technology into LLVM will also mean applying appropriate
> >>>>         modularity. I think that we'll almost definitely want to
> >>>>         make use of the dependence analysis separately as an
> >>>>         analysis. We'll want to decide which of these
> >>>>         transformations will be considered canonicalization (and
> >>>>         run in the iterative pipeline) and which will be lowering
> >>>>         (and run near the vectorizer). LoopSROA certainly sounds to
> >>>>         me like canonicalization, but loop fusion might also fall
> >>>>         into that category (i.e., we might want to fuse early to
> >>>>         enable optimizations and then split late).
> >>>>
> >>>>>
> >>>>>         Then there is the question of use cases.  It’s fairly
> >>>>>         obvious that anybody wanting to optimize a 5-deep highly
> >>>>>         regular loop-nest operating on arrays should use Polly. 
> >>>>>         On the other hand it’s way less clear that we should use
> >>>>>         it for singly or doubly nested not-so-regular loops which
> >>>>>         are the norm in non-HPC workloads.
> >>>>
> >>>>         This is clearly a good question, but thinking about Polly
> >>>>         as a set of components, not as a monolithic transformation
> >>>>         component, I think that polyhedral analysis and
> >>>>         transformations can underlie a lot of the transformations
> >>>>         we need for non-HPC code (and, which I'll point out, we
> >>>>         need for modern HPC code too). In practice, the loops that
> >>>>         we can actually analyze have affine dependencies, and Polly
> >>>>         does, or can do, a better job at generating runtime
> >>>>         predicates and dealing with piecewise-linear expressions
> >>>>         than our current infrastructure.
> >>>>
> >>>>         In short, I look at Polly as two things: First, an
> >>>>         infrastructure for dealing with loop analysis and
> >>>>         transformation. I view this as being broadly applicable.
> >>>>         Second, an application of that to apply cost-model-driven
> >>>>         classic loop transformations. To some extent this is going
> >>>>         to be more useful for HPC codes, but also applies to
> >>>>         machine learning, signal processing, graphics, and other
> >>>>         areas.
> >>>         I’m wondering if it could be used for pointing out headroom
> >>>         for the existing LLVM ecosystem (*)
> >>
> >>         Sure.
> >>
> >>>
> >>>>
> >>>>>
> >>>>>         And this brings me to the maintenance question.  Is it
> >>>>>         reasonable to expect people to fix Polly when they have a
> >>>>>         seemingly unrelated change that happens to break a Polly bot.
> >>>>
> >>>>         The eventual goal here is to have this technology in
> >>>>         appropriate parts of the main pipeline, and so the question
> >>>>         here is not really about breaking a "Polly bot", but just
> >>>>         about a "bot" in general. I've given this question some
> >>>>         thought and I think it sits in a reasonable place in the
> >>>>         risk-reward space. The answer would be, yes, we'd need to
> >>>>         treat this like any other part of the pipeline. However, I
> >>>>         believe that Polly has as many, or more, active
> >>>>         contributors than essentially any other individual part of
> >>>>         the mid-level optimizer or CodeGen. As a result, there will
> >>>>         be people around in many time zones to help with problems
> >>>>         with Polly-related code.
> >>>>
> >>>>>          As far as I know, there were companies in the past that
> >>>>>         tried Polly without a whole lot of prior experience.  It
> >>>>>         would be great to hear what the experience was before
> >>>>>         adopting Polly at a much larger scale.
> >>>>
> >>>>         I'm also interested, although I'll caution against
> >>>>         over-interpreting any evidence here (positive or negative).
> >>>>         Before a few weeks ago, Polly didn't effectively run in the
> >>>>         pipeline after inlining, and so I doubt it would have been
> >>>>         much use outside of embedded environments (and maybe some
> >>>>         HPC environments) with straightforwardly-presented C code.
> >>>>         It's only now that this has been fixed that I find the
> >>>>         possibility of integrating this in production interesting.
> >>>
> >>>         That is a good point. There are also biases independent of
> >>>         past experiences (for disclosure mine is (*) above). But I
> >>>         think it is objective to say a Polly integration is a big
> >>>         piece to swallow.Your pro-Polly argument lists a number of
> >>>         categories that I think could be reasoned about individually
> >>>         and partly evaluated with a data-driven approach:
> >>>         A) Architecture
> >>>         - support for autoparallelism
> >>>         - support for accelerators
> >>>         - isl- rewrite? etc
> >>>         ...
> >>>         B) Modelling
> >>>         - polyhedral model
> >>>         - temporal locality
> >>>         - spatial locality
> >>>         …
> >>>         C) Analysis/Optimizations
> >>>         - Dependence Analysis
> >>>         - Transformation effective/power (loop nests, quality of
> >>>         transformations, #vectorizable loops etc)
> >>>
> >>>         A) is mostly Polly independent (except for the isl question
> >>>         I guess). For B and C performance/ compile-time /opt-viewer
> >>>         data on a decent/wide range of benchmarks possibly at
> >>>         different optimization levels (O2, O3, LTO, PGO etc and
> >>>         combinations) should provide data-driven insight into
> >>>         costs/benefits.
> >>
> >>         I agree. In practice, the first question is: Are will willing
> >>         to take on Polly (+isl), in whole or in part, as a build
> >>         dependency? If the answer is yes, the next question is: what
> >>         parts should be reused or refactored for use in other parts
> >>         of the pipeline? My argument is that we should take on Polly,
> >>         or most of it, as a build dependency. Work on better unifying
> >>         the developer communities as we start experimenting with
> >>         other kinds of integration. This will, however, allow us to
> >>         provide to all of our users these transformations through
> >>         pragmas (and other kinds of optional enablement). This is an
> >>         important first step.
> >>
> >>
> >>     When you write that we should take Polly as a build dependency:
> >>     are you envisioning this to be tied into LLVM to the point where
> >>     we can't build LLVM without Polly?
> >>     I thought the approach would rather be that a default build of
> >>     LLVM would have Polly available but that there will "forever" be
> >>     a `-DENABLE_POLLY=OFF` option?
> >>     This seems like an important distinction to me, as making it more
> >>     integrated but still optional for the correct behavior of LLVM
> >>     means that people can continue to work and maintain "forever" an
> >>     optimization pipeline which operates without Polly, while
> >>     starting to get more integration in a pluggable form. I believe
> >>     this is the direction that was taken with the GSoC last year,
> >>     trying to get Polly as a "pluggable" dependency analysis.
> >
> >     I don't want to make statements about "forever":
> >
> >
> > Right, I forgot the asterisk to explain my quotes around "forever": I 
> > meant by this "not until we have compelling data and agreement to do 
> > otherwise". I think we're on agreement on this though :)
> 
> I think we are as well :-)
> 
> >
> >     eventually, I'd like to have strong loop optimizations as an
> >     integrated part of LLVM. At that point, "Polly" might be little
> >     more than a shorthand for some parts of LLVM. Now LLVM currently
> >     has an issue whereby it's difficult to build customized versions
> >     with only a subset of needed passes, but that's an orthogonal
> >     issue. In the short term, it would certainly be separable. I
> >     believe that we'd certainly keep the `-DENABLE_POLLY=OFF` option
> >     as long as we felt it made sense.
> >
> >     By build dependency, I mean: The sources are part of the main
> >     source tree, they're built by default, thus essentially all of the
> >     builders build them, and they'll end up in many distributions and
> >     derived products. This will allow us to build on what's there in
> >     whatever ways make sense.
> >
> >
> > Thanks for clarifying!
> >
> >>
> >>         I'm not sure exactly how good this is, but polly has
> >>         LNT-submitting bots, so the website can generate a comparison
> >>         (e.g.,
> >>         http://lnt.llvm.org/db_default/v4/nts/71208?compare_to=71182
> >>         <http://lnt.llvm.org/db_default/v4/nts/71208?compare_to=71182>).
> >>         Looking at this comparison shows a number of potential
> >>         problems but also cases where Polly really helps (and, FWIW,
> >>         the largest two compile-time regressions are also associated
> >>         with very large execution performance improvements). My first
> >>         focus would certainly be on pragma-driven enablement.
> >>
> >>
> >>     I'll add: PGO-driven enablement: spending compile-time where we
> >>     know it can have an impact. What do you think?
> >
> >     Certainly (although, obviously, we still need reasonable
> >     confidence that it won't cause performance regressions to do that).
> >
> >
> > Oh right, I meant it through an opt-in flag to begin with 
> > (-fenable-pgo-polly or something like that).
> 
> SGTM.

That's certainly very desirable. Zino @ Qualcomm suggested this already
a couple of times.

Best,
Tobias

> 
>   -Hal
> 
> >
> > -- 
> > Mehdi
> >
> 
> -- 
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
> 
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev