[llvm-dev] [RFC] Polly Status and Integration

Thu Sep 21 23:33:42 PDT 2017

On Fri, Sep 22, 2017, at 07:22, Hal Finkel via llvm-dev wrote:
> 
> On 09/22/2017 12:03 AM, Mehdi AMINI wrote:
> > Hi Hal,
> >
> >
> > 2017-09-21 20:59 GMT-07:00 Hal Finkel via llvm-dev 
> > <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>>:
> >
> >
> >     On 09/12/2017 10:26 PM, Gerolf Hoflehner wrote:
> >>
> >>
> >>>     On Sep 11, 2017, at 10:47 PM, Hal Finkel via llvm-dev
> >>>     <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote:
> >>>
> >>>
> >>>     On 09/11/2017 12:26 PM, Adam Nemet wrote:
> >>>>     Hi Hal, Tobias, Michael and others,
> >>>>     *...*
> >>>>
> >>>>     One thing that I’d like to see more details on is what this
> >>>>     means for the evolution of loop transformations in LLVM.
> >>>>
> >>>>     Our more-or-less established direction was so far to
> >>>>     incrementally improve and generalize the required analyses
> >>>>     (e.g. the LoopVectorizer’s dependence analysis + loop
> >>>>     versioning analysis into a stand-alone analysis pass
> >>>>     (LoopAccessAnalysis)) and then build new transformations (e.g.
> >>>>     LoopDistribution, LoopLoadElimination, LICMLoopVersioning) on
> >>>>     top of these.
> >>>>
> >>>>     The idea was that infrastructure would be incrementally
> >>>>     improved from two directions:
> >>>>
> >>>>     - As new transformations are built analyses have to be improved
> >>>>     (e.g. past improvements to LAA to support the LoopVersioning
> >>>>     utility, future improvements for full LoopSROA beyond just
> >>>>     store->load forwarding [1] or the improvements to LAA for the
> >>>>     LoopFusion proposal[2])
> >>>>
> >>>>     - As more complex loops would have to be analyzed we either
> >>>>     improve LAA or make DependenceAnalysis a drop-in replacement
> >>>>     for the memory analysis part in LAA
> >>>
> >>>     Or we could use Polly's dependence analysis, which I believe to
> >>>     be more powerful, more robust, and more correct than
> >>>     DependenceAnalysis. I believe that the difficult part here is
> >>>     actually the pairing with predicated SCEV or whatever mechanism
> >>>     we want to use generate runtime predicates (this applies to use
> >>>     of DependenceAnalysis too).
> >>
> >>     What is a good way to measure these assertions (More powerful,
> >>     more robust)? Are you saying the LLVM Dependence Analysis is
> >>     incorrect or do you actually mean less conservative (or "more
> >>     accurate" or something like that)?
> >
> >     Sebastian's email covers the issues with the DependenceAnalysis
> >     pass pretty well.
> >
> >     Regarding what's in LoopAccessAnalysis, I believe it to be
> >     correct, but more limited. It is not clear to me that LAA is bad
> >     at what it does based on what the vectorizer can handle. LAA could
> >     do better in some cases with non-unit-stride loops. Polly also
> >     handles piecewise-affine functions, which allows the modeling of
> >     loops with conditionals. Extending LAA to handle loop nests,
> >     moreover, seems likely to be non-trivial.
> >
> >     Regardless, measuring these differences certainly seems like a
> >     good idea. I think that we can do this using optimization remarks.
> >     LAA already emits optimization remarks for loops in which it finds
> >     unsafe memory dependencies. Polly also emits optimization remarks.
> >     We may need to iterate some in order to setup a good comparison,
> >     but we should be able to collect statistics (and other
> >     information) by compiling code using -fsave-optimization-record
> >     (in combination with some other flags), and then analyzing the
> >     resulting YAML files.
> >
> >>>
> >>>>
> >>>>     While this model may be slow it has all the benefits of the
> >>>>     incremental development model.
> >>>
> >>>     The current model may have been slow in many areas, but I think
> >>>     that's mostly a question of development effort. My largest
> >>>     concern about the current model is that, to the extent that
> >>>     we're implementing classic loop transformations (e.g., fusion,
> >>>     distribution, interchange, skewing, tiling, and so on), we're
> >>>     repeating a historical design that is known to have several
> >>>     suboptimal properties. Chief among them is the lack of
> >>>     integration: many of these transformations are interconnected,
> >>>     and there's no good pass ordering in which to make independent
> >>>     decisions. Many of these transformations can be captured in a
> >>>     single model and we can get much better results by integrating
> >>>     them. There's also the matter of whether building these
> >>>     transformation on SCEV (or IR directly) is the best underlying
> >>>     infrastructure, or whether parts of Polly would be better.
> >>
> >>     I believe that is true. What I wonder is is there a good method
> >>     to reason about it?
> >
> >     If I understand what you mean, one way to look at it is this: This
> >     is not a canonicalization problem. Picking an optimal way to
> >     interchange loops may depend on how the result can be skewed
> >     and/or tiled, picking an optimal way to distribute loops often
> >     depends on what can be done afterward in each piece. Optimal here
> >     generally involves reasoning about the memory hierarchy (e.g.,
> >     cache properties), available prefetching streams, register-file
> >     size, and so on.
> >
> >     I know that I've seen some good examples in papers over the years
> >     that illustrate the phase-ordering challenges. Hopefully, someone
> >     will jump in here with some good references. One classic one is:
> >     William Pugh. Uniform Techniques for Loop Optimization. 1991.
> >
> >>     Perhaps concrete examples or perhaps opt-viewer based comparisons
> >>     on large sets of benchmarks? In the big picture you could make
> >>     such a modeling argument for all compiler optimizations.
> >
> >     Certainly. However, in this case there's a well-studied unified
> >     model for this set of optimizations known to reduce phase-ordering
> >     effects. That's not true in general.
> >
> >>>
> >>>     That having been said, I think that integrating this technology
> >>>     into LLVM will also mean applying appropriate modularity. I
> >>>     think that we'll almost definitely want to make use of the
> >>>     dependence analysis separately as an analysis. We'll want to
> >>>     decide which of these transformations will be considered
> >>>     canonicalization (and run in the iterative pipeline) and which
> >>>     will be lowering (and run near the vectorizer). LoopSROA
> >>>     certainly sounds to me like canonicalization, but loop fusion
> >>>     might also fall into that category (i.e., we might want to fuse
> >>>     early to enable optimizations and then split late).
> >>>
> >>>>
> >>>>     Then there is the question of use cases.  It’s fairly obvious
> >>>>     that anybody wanting to optimize a 5-deep highly regular
> >>>>     loop-nest operating on arrays should use Polly.  On the other
> >>>>     hand it’s way less clear that we should use it for singly or
> >>>>     doubly nested not-so-regular loops which are the norm in
> >>>>     non-HPC workloads.
> >>>
> >>>     This is clearly a good question, but thinking about Polly as a
> >>>     set of components, not as a monolithic transformation component,
> >>>     I think that polyhedral analysis and transformations can
> >>>     underlie a lot of the transformations we need for non-HPC code
> >>>     (and, which I'll point out, we need for modern HPC code too). In
> >>>     practice, the loops that we can actually analyze have affine
> >>>     dependencies, and Polly does, or can do, a better job at
> >>>     generating runtime predicates and dealing with piecewise-linear
> >>>     expressions than our current infrastructure.
> >>>
> >>>     In short, I look at Polly as two things: First, an
> >>>     infrastructure for dealing with loop analysis and
> >>>     transformation. I view this as being broadly applicable. Second,
> >>>     an application of that to apply cost-model-driven classic loop
> >>>     transformations. To some extent this is going to be more useful
> >>>     for HPC codes, but also applies to machine learning, signal
> >>>     processing, graphics, and other areas.
> >>     I’m wondering if it could be used for pointing out headroom for
> >>     the existing LLVM ecosystem (*)
> >
> >     Sure.
> >
> >>
> >>>
> >>>>
> >>>>     And this brings me to the maintenance question.  Is it
> >>>>     reasonable to expect people to fix Polly when they have a
> >>>>     seemingly unrelated change that happens to break a Polly bot.
> >>>
> >>>     The eventual goal here is to have this technology in appropriate
> >>>     parts of the main pipeline, and so the question here is not
> >>>     really about breaking a "Polly bot", but just about a "bot" in
> >>>     general. I've given this question some thought and I think it
> >>>     sits in a reasonable place in the risk-reward space. The answer
> >>>     would be, yes, we'd need to treat this like any other part of
> >>>     the pipeline. However, I believe that Polly has as many, or
> >>>     more, active contributors than essentially any other individual
> >>>     part of the mid-level optimizer or CodeGen. As a result, there
> >>>     will be people around in many time zones to help with problems
> >>>     with Polly-related code.
> >>>
> >>>>      As far as I know, there were companies in the past that tried
> >>>>     Polly without a whole lot of prior experience.  It would be
> >>>>     great to hear what the experience was before adopting Polly at
> >>>>     a much larger scale.
> >>>
> >>>     I'm also interested, although I'll caution against
> >>>     over-interpreting any evidence here (positive or negative).
> >>>     Before a few weeks ago, Polly didn't effectively run in the
> >>>     pipeline after inlining, and so I doubt it would have been much
> >>>     use outside of embedded environments (and maybe some HPC
> >>>     environments) with straightforwardly-presented C code. It's only
> >>>     now that this has been fixed that I find the possibility of
> >>>     integrating this in production interesting.
> >>
> >>     That is a good point. There are also biases independent of past
> >>     experiences (for disclosure mine is (*) above). But I think it is
> >>     objective to say a Polly integration is a big piece to
> >>     swallow.Your pro-Polly argument lists a number of categories that
> >>     I think could be reasoned about individually and partly evaluated
> >>     with a data-driven approach:
> >>     A) Architecture
> >>     - support for autoparallelism
> >>     - support for accelerators
> >>     - isl- rewrite? etc
> >>     ...
> >>     B) Modelling
> >>     - polyhedral model
> >>     - temporal locality
> >>     - spatial locality
> >>     …
> >>     C) Analysis/Optimizations
> >>     - Dependence Analysis
> >>     - Transformation effective/power (loop nests, quality of
> >>     transformations, #vectorizable loops etc)
> >>
> >>     A) is mostly Polly independent (except for the isl question I
> >>     guess). For B and C performance/ compile-time /opt-viewer data on
> >>     a decent/wide range of benchmarks possibly at different
> >>     optimization levels (O2, O3, LTO, PGO etc and combinations)
> >>     should provide data-driven insight into costs/benefits.
> >
> >     I agree. In practice, the first question is: Are will willing to
> >     take on Polly (+isl), in whole or in part, as a build dependency?
> >     If the answer is yes, the next question is: what parts should be
> >     reused or refactored for use in other parts of the pipeline? My
> >     argument is that we should take on Polly, or most of it, as a
> >     build dependency. Work on better unifying the developer
> >     communities as we start experimenting with other kinds of
> >     integration. This will, however, allow us to provide to all of our
> >     users these transformations through pragmas (and other kinds of
> >     optional enablement). This is an important first step.
> >
> >
> > When you write that we should take Polly as a build dependency: are 
> > you envisioning this to be tied into LLVM to the point where we can't 
> > build LLVM without Polly?
> > I thought the approach would rather be that a default build of LLVM 
> > would have Polly available but that there will "forever" be a 
> > `-DENABLE_POLLY=OFF` option?
> > This seems like an important distinction to me, as making it more 
> > integrated but still optional for the correct behavior of LLVM means 
> > that people can continue to work and maintain "forever" an 
> > optimization pipeline which operates without Polly, while starting to 
> > get more integration in a pluggable form. I believe this is the 
> > direction that was taken with the GSoC last year, trying to get Polly 
> > as a "pluggable" dependency analysis.
> 
> I don't want to make statements about "forever": eventually, I'd like to 
> have strong loop optimizations as an integrated part of LLVM. At that 
> point, "Polly" might be little more than a shorthand for some parts of 
> LLVM. Now LLVM currently has an issue whereby it's difficult to build 
> customized versions with only a subset of needed passes, but that's an 
> orthogonal issue. In the short term, it would certainly be separable. I 
> believe that we'd certainly keep the `-DENABLE_POLLY=OFF` option as long 
> as we felt it made sense.
> 
> By build dependency, I mean: The sources are part of the main source 
> tree, they're built by default, thus essentially all of the builders 
> build them, and they'll end up in many distributions and derived 
> products. This will allow us to build on what's there in whatever ways 
> make sense.

I agree with the above. I expect Polly to be checked out by default,
built by default, with a flag to disable it at first. Over time, I
expect what was known to be "Polly" to dissolve in LLVM, being replaced
(e.g. by some of Johannes' code) and recoded. (I think I said this
already earlier, but I cannot find it). As Chris kept saying, each
component of LLVM has been rewritten twice before it became really
solid. We already rewrote various pieces of Polly and a lot of code in
isl multiple times. I expect this to continue. I also expect that the
design of Polly will change over time -- certainly incorporating some of
the ideas Johannes mentioned. At some point when Polly has sufficiently
widely tested and deeply integrated maintaining a separate flag might
not make sense, but this is certainly a discussion to be had in the
future.

Even today, Polly is already part of many major distributions:

ubuntu, debian, the LLVM releases, and AFAIK Jack also ships Polly in
some Apple/Darwin package repos. Before more actively promoting these,
we need wide testing on the core LLVM side to make sure the software
shipped matches the high quality standards LLVM adheres to.

Best,
Tobias