[llvm-dev] [RFC] Polly Status and Integration
Tobias Grosser via llvm-dev
llvm-dev at lists.llvm.org
Thu Sep 21 23:22:31 PDT 2017
On Fri, Sep 22, 2017, at 07:33, Hal Finkel via llvm-dev wrote:
>
> On 09/22/2017 12:25 AM, Mehdi AMINI wrote:
> >
> >
> > 2017-09-21 22:22 GMT-07:00 Hal Finkel <hfinkel at anl.gov>:
> >
> >
> > On 09/22/2017 12:03 AM, Mehdi AMINI wrote:
> >> Hi Hal,
> >>
> >>
> >> 2017-09-21 20:59 GMT-07:00 Hal Finkel via llvm-dev
> >> <llvm-dev at lists.llvm.org>:
> >>
> >>
> >> On 09/12/2017 10:26 PM, Gerolf Hoflehner wrote:
> >>>
> >>>
> >>>> On Sep 11, 2017, at 10:47 PM, Hal Finkel via llvm-dev
> >>>> <llvm-dev at lists.llvm.org> wrote:
> >>>>
> >>>>
> >>>> On 09/11/2017 12:26 PM, Adam Nemet wrote:
> >>>>> Hi Hal, Tobias, Michael and others,
> >>>>> [...]
> >>>>>
> >>>>> One thing that I’d like to see more details on is what
> >>>>> this means for the evolution of loop transformations in LLVM.
> >>>>>
> >>>>> Our more-or-less established direction so far has been to
> >>>>> incrementally improve and generalize the required analyses
> >>>>> (e.g. the LoopVectorizer’s dependence analysis + loop
> >>>>> versioning analysis into a stand-alone analysis pass
> >>>>> (LoopAccessAnalysis)) and then build new transformations
> >>>>> (e.g. LoopDistribution, LoopLoadElimination,
> >>>>> LICMLoopVersioning) on top of these.
> >>>>>
> >>>>> The idea was that infrastructure would be incrementally
> >>>>> improved from two directions:
> >>>>>
> >>>>> - As new transformations are built, analyses have to be
> >>>>> improved (e.g. past improvements to LAA to support the
> >>>>> LoopVersioning utility, future improvements for full
> >>>>> LoopSROA beyond just store->load forwarding [1] or the
> >>>>> improvements to LAA for the LoopFusion proposal[2])
> >>>>>
> >>>>> - As more complex loops have to be analyzed, we
> >>>>> either improve LAA or make DependenceAnalysis a drop-in
> >>>>> replacement for the memory-analysis part of LAA
> >>>>
> >>>> Or we could use Polly's dependence analysis, which I
> >>>> believe to be more powerful, more robust, and more correct
> >>>> than DependenceAnalysis. I believe that the difficult part
> >>>> here is actually the pairing with predicated SCEV or
> >>>> whatever mechanism we want to use to generate runtime
> >>>> predicates (this applies to use of DependenceAnalysis too).
> >>>
> >>> What is a good way to measure these assertions (More
> >>> powerful, more robust)? Are you saying the LLVM Dependence
> >>> Analysis is incorrect or do you actually mean less
> >>> conservative (or "more accurate" or something like that)?
> >>
> >> Sebastian's email covers the issues with the
> >> DependenceAnalysis pass pretty well.
> >>
> >> Regarding what's in LoopAccessAnalysis, I believe it to be
> >> correct but more limited. Based on what the vectorizer can
> >> handle, it's not clear to me that LAA is bad at what it does,
> >> though LAA could do better in some cases with non-unit-stride
> >> loops. Polly also handles piecewise-affine functions, which
> >> allows it to model loops containing conditionals. Extending
> >> LAA to handle loop nests, moreover, seems likely to be
> >> non-trivial.
> >>
> >> Regardless, measuring these differences certainly seems like
> >> a good idea. I think that we can do this using optimization
> >> remarks. LAA already emits optimization remarks for loops in
> >> which it finds unsafe memory dependencies. Polly also emits
> >> optimization remarks. We may need to iterate some in order to
> >> setup a good comparison, but we should be able to collect
> >> statistics (and other information) by compiling code using
> >> -fsave-optimization-record (in combination with some other
> >> flags), and then analyzing the resulting YAML files.
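To make the remark-based comparison concrete, here is a minimal sketch of tallying remarks per pass and kind from a `-fsave-optimization-record` output file. It assumes only the general shape of those records (each YAML document begins with `--- !Passed`, `--- !Missed`, or `--- !Analysis` and carries a `Pass:` key); the `sample` excerpt below, including the pass names, is hypothetical:

```python
from collections import Counter

def count_remarks(text):
    """Tally (pass, kind) pairs from a -fsave-optimization-record file.

    Minimal stdlib-only parser: each YAML document in the record starts
    with '--- !Kind' (Passed/Missed/Analysis) and contains 'Key: value'
    lines; we only look at the tag line and the 'Pass:' key.
    """
    counts = Counter()
    kind = pass_name = None
    # Append a sentinel document marker so the last record is flushed.
    for line in text.splitlines() + ["--- !"]:
        if line.startswith("--- !"):
            if kind and pass_name:
                counts[(pass_name, kind)] += 1
            kind, pass_name = line[5:].strip(), None
        elif line.startswith("Pass:"):
            pass_name = line.split(":", 1)[1].strip()
    return counts

# Hypothetical excerpt in the shape clang emits; real files carry
# additional keys (DebugLoc, Args, ...) that this sketch ignores.
sample = """\
--- !Missed
Pass: loop-vectorize
Name: MissedDetails
Function: foo
--- !Passed
Pass: loop-vectorize
Name: Vectorized
Function: bar
"""

print(count_remarks(sample))
```

Aggregating such counts over a benchmark suite, once for LAA-based passes and once for Polly, would give the kind of side-by-side statistics discussed above.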
> >>
> >>>>
> >>>>>
> >>>>> While this model may be slow, it has all the benefits of
> >>>>> the incremental development model.
> >>>>
> >>>> The current model may have been slow in many areas, but I
> >>>> think that's mostly a question of development effort. My
> >>>> largest concern about the current model is that, to the
> >>>> extent that we're implementing classic loop transformations
> >>>> (e.g., fusion, distribution, interchange, skewing, tiling,
> >>>> and so on), we're repeating a historical design that is
> >>>> known to have several suboptimal properties. Chief among
> >>>> them is the lack of integration: many of these
> >>>> transformations are interconnected, and there's no good
> >>>> pass ordering in which to make independent decisions. Many
> >>>> of these transformations can be captured in a single model
> >>>> and we can get much better results by integrating them.
> >>>> There's also the matter of whether building these
> >>>> transformations on SCEV (or IR directly) is the best
> >>>> underlying infrastructure, or whether parts of Polly would
> >>>> be better.
> >>>
> >>> I believe that is true. What I wonder is: is there a good
> >>> method to reason about it?
> >>
> >> If I understand what you mean, one way to look at it is this:
> >> This is not a canonicalization problem. Picking an optimal
> >> way to interchange loops may depend on how the result can be
> >> skewed and/or tiled, picking an optimal way to distribute
> >> loops often depends on what can be done afterward in each
> >> piece. Optimal here generally involves reasoning about the
> >> memory hierarchy (e.g., cache properties), available
> >> prefetching streams, register-file size, and so on.
> >>
> >> I know that I've seen some good examples in papers over the
> >> years that illustrate the phase-ordering challenges.
> >> Hopefully, someone will jump in here with some good
> >> references. One classic one is: William Pugh. Uniform
> >> Techniques for Loop Optimization. 1991.
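A toy model can make the phase-ordering point tangible. The sketch below counts misses in a tiny simulated cache for different interchange orders of the classic nest `C[i][j] += A[i][k] * B[k][j]`; every parameter (8x8 matrices, 4-element cache lines, an 8-line fully associative LRU cache, row-major layout) is a hypothetical stand-in for a real memory hierarchy. The point is that which loop order wins depends on those parameters, which is exactly why interchange, tiling, and friends resist independent per-pass decisions:

```python
from collections import OrderedDict
import itertools

def misses(order, n=8, line=4, capacity=8):
    """Count misses in a toy fully-associative LRU cache for the nest
    C[i][j] += A[i][k] * B[k][j], with loops nested as given by 'order'
    (e.g. "ikj" means i outermost, j innermost). Arrays are row-major;
    'line' elements per cache line; 'capacity' lines in the cache."""
    cache = OrderedDict()  # key -> True, ordered oldest-first
    count = 0
    for idx in itertools.product(range(n), repeat=3):
        v = dict(zip(order, idx))
        i, j, k = v["i"], v["j"], v["k"]
        # The three array elements touched by one iteration of the body.
        for name, (r, c) in (("A", (i, k)), ("B", (k, j)), ("C", (i, j))):
            key = (name, (r * n + c) // line)  # which cache line
            if key in cache:
                cache.move_to_end(key)  # hit: refresh LRU position
            else:
                count += 1  # miss: fetch the line
                cache[key] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict least recently used
    return count

for order in ("ijk", "ikj", "jki"):
    print(order, misses(order))
```

Changing `line` or `capacity` changes which order comes out ahead, mirroring the dependence on cache properties, register-file size, and so on described above; a unified polyhedral model can weigh these interactions jointly instead of fixing one transformation before considering the next.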
> >>
> >>> Perhaps concrete examples or perhaps opt-viewer based
> >>> comparisons on large sets of benchmarks? In the big picture
> >>> you could make such a modeling argument for all compiler
> >>> optimizations.
> >>
> >> Certainly. However, in this case there's a well-studied
> >> unified model for this set of optimizations known to reduce
> >> phase-ordering effects. That's not true in general.
> >>
> >>>>
> >>>> That having been said, I think that integrating this
> >>>> technology into LLVM will also mean applying appropriate
> >>>> modularity. I think that we'll almost definitely want to
> >>>> make use of the dependence analysis separately as an
> >>>> analysis. We'll want to decide which of these
> >>>> transformations will be considered canonicalization (and
> >>>> run in the iterative pipeline) and which will be lowering
> >>>> (and run near the vectorizer). LoopSROA certainly sounds to
> >>>> me like canonicalization, but loop fusion might also fall
> >>>> into that category (i.e., we might want to fuse early to
> >>>> enable optimizations and then split late).
> >>>>
> >>>>>
> >>>>> Then there is the question of use cases. It’s fairly
> >>>>> obvious that anybody wanting to optimize a 5-deep highly
> >>>>> regular loop-nest operating on arrays should use Polly.
> >>>>> On the other hand it’s way less clear that we should use
> >>>>> it for singly or doubly nested not-so-regular loops which
> >>>>> are the norm in non-HPC workloads.
> >>>>
> >>>> This is clearly a good question, but thinking about Polly
> >>>> as a set of components, not as a monolithic transformation
> >>>> component, I think that polyhedral analysis and
> >>>> transformations can underlie a lot of the transformations
> >>>> we need for non-HPC code (and, which I'll point out, we
> >>>> need for modern HPC code too). In practice, the loops that
> >>>> we can actually analyze have affine dependencies, and Polly
> >>>> does, or can do, a better job at generating runtime
> >>>> predicates and dealing with piecewise-linear expressions
> >>>> than our current infrastructure.
> >>>>
> >>>> In short, I look at Polly as two things: First, an
> >>>> infrastructure for dealing with loop analysis and
> >>>> transformation. I view this as being broadly applicable.
> >>>> Second, an application of that to apply cost-model-driven
> >>>> classic loop transformations. To some extent this is going
> >>>> to be more useful for HPC codes, but also applies to
> >>>> machine learning, signal processing, graphics, and other
> >>>> areas.
> >>> I’m wondering if it could be used for pointing out headroom
> >>> for the existing LLVM ecosystem (*)
> >>
> >> Sure.
> >>
> >>>
> >>>>
> >>>>>
> >>>>> And this brings me to the maintenance question. Is it
> >>>>> reasonable to expect people to fix Polly when they have a
> >>>>> seemingly unrelated change that happens to break a Polly bot?
> >>>>
> >>>> The eventual goal here is to have this technology in
> >>>> appropriate parts of the main pipeline, and so the question
> >>>> here is not really about breaking a "Polly bot", but just
> >>>> about a "bot" in general. I've given this question some
> >>>> thought and I think it sits in a reasonable place in the
> >>>> risk-reward space. The answer would be, yes, we'd need to
> >>>> treat this like any other part of the pipeline. However, I
> >>>> believe that Polly has as many active contributors as, or
> >>>> more than, essentially any other individual part of the
> >>>> mid-level optimizer or CodeGen. As a result, there will
> >>>> be people around in many time zones to help with problems
> >>>> with Polly-related code.
> >>>>
> >>>>> As far as I know, there were companies in the past that
> >>>>> tried Polly without a whole lot of prior experience. It
> >>>>> would be great to hear what the experience was before
> >>>>> adopting Polly at a much larger scale.
> >>>>
> >>>> I'm also interested, although I'll caution against
> >>>> over-interpreting any evidence here (positive or negative).
> >>>> Before a few weeks ago, Polly didn't effectively run in the
> >>>> pipeline after inlining, and so I doubt it would have been
> >>>> much use outside of embedded environments (and maybe some
> >>>> HPC environments) with straightforwardly-presented C code.
> >>>> It's only now that this has been fixed that I find the
> >>>> possibility of integrating this in production interesting.
> >>>
> >>> That is a good point. There are also biases independent of
> >>> past experiences (for disclosure mine is (*) above). But I
> >>> think it is objective to say a Polly integration is a big
> >>> piece to swallow. Your pro-Polly argument lists a number of
> >>> categories that I think could be reasoned about individually
> >>> and partly evaluated with a data-driven approach:
> >>> A) Architecture
> >>> - support for autoparallelism
> >>> - support for accelerators
> >>> - isl rewrite? etc
> >>> ...
> >>> B) Modelling
> >>> - polyhedral model
> >>> - temporal locality
> >>> - spatial locality
> >>> …
> >>> C) Analysis/Optimizations
> >>> - Dependence Analysis
> >>> - Transformation effective/power (loop nests, quality of
> >>> transformations, #vectorizable loops etc)
> >>>
> >>> A) is mostly Polly-independent (except for the isl question,
> >>> I guess). For B and C, performance / compile-time / opt-viewer
> >>> data on a decent/wide range of benchmarks, possibly at
> >>> different optimization levels (O2, O3, LTO, PGO, etc. and
> >>> combinations), should provide data-driven insight into
> >>> costs/benefits.
> >>
> >> I agree. In practice, the first question is: Are we willing
> >> to take on Polly (+isl), in whole or in part, as a build
> >> dependency? If the answer is yes, the next question is: what
> >> parts should be reused or refactored for use in other parts
> >> of the pipeline? My argument is that we should take on Polly,
> >> or most of it, as a build dependency, and work on better
> >> unifying the developer communities as we start experimenting
> >> with other kinds of integration. This will, moreover, allow
> >> us to provide these transformations to all of our users
> >> through pragmas (and other kinds of optional enablement).
> >> This is an important first step.
> >>
> >>
> >> When you write that we should take Polly as a build dependency:
> >> are you envisioning this to be tied into LLVM to the point where
> >> we can't build LLVM without Polly?
> >> I thought the approach would rather be that a default build of
> >> LLVM would have Polly available but that there will "forever" be
> >> a `-DENABLE_POLLY=OFF` option?
> >> This seems like an important distinction to me, as making it more
> >> integrated but still optional for the correct behavior of LLVM
> >> means that people can continue to work and maintain "forever" an
> >> optimization pipeline which operates without Polly, while
> >> starting to get more integration in a pluggable form. I believe
> >> this is the direction that was taken with the GSoC last year,
> >> trying to get Polly as a "pluggable" dependency analysis.
> >
> > I don't want to make statements about "forever":
> >
> >
> > Right, I forgot the asterisk to explain my quotes around "forever": I
> > meant by this "not until we have compelling data and agreement to do
> > otherwise". I think we're on agreement on this though :)
>
> I think we are as well :-)
>
> >
> > eventually, I'd like to have strong loop optimizations as an
> > integrated part of LLVM. At that point, "Polly" might be little
> > more than a shorthand for some parts of LLVM. Now LLVM currently
> > has an issue whereby it's difficult to build customized versions
> > with only a subset of needed passes, but that's an orthogonal
> > issue. In the short term, it would certainly be separable. I
> > believe that we'd certainly keep the `-DENABLE_POLLY=OFF` option
> > as long as we felt it made sense.
> >
> > By build dependency, I mean: The sources are part of the main
> > source tree, they're built by default, thus essentially all of the
> > builders build them, and they'll end up in many distributions and
> > derived products. This will allow us to build on what's there in
> > whatever ways make sense.
> >
> >
> > Thanks for clarifying!
> >
> >>
> >> I'm not sure exactly how good this is, but Polly has
> >> LNT-submitting bots, so the website can generate a comparison
> >> (e.g.,
> >> http://lnt.llvm.org/db_default/v4/nts/71208?compare_to=71182).
> >> Looking at this comparison shows a number of potential
> >> problems but also cases where Polly really helps (and, FWIW,
> >> the largest two compile-time regressions are also associated
> >> with very large execution performance improvements). My first
> >> focus would certainly be on pragma-driven enablement.
> >>
> >>
> >> I'll add PGO-driven enablement: spending compile time where we
> >> know it can have an impact. What do you think?
> >
> > Certainly (although, obviously, we still need reasonable
> > confidence that it won't cause performance regressions to do that).
> >
> >
> > Oh right, I meant it through an opt-in flag to begin with
> > (-fenable-pgo-polly or something like that).
>
> SGTM.
That's certainly very desirable. Zino @ Qualcomm has already
suggested this a couple of times.
Best,
Tobias
>
> -Hal
>
> >
> > --
> > Mehdi
> >
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev