[llvm-dev] [RFC] Polly Status and Integration

Hal Finkel via llvm-dev llvm-dev at lists.llvm.org
Thu Sep 21 22:33:34 PDT 2017

On 09/22/2017 12:25 AM, Mehdi AMINI wrote:
> 2017-09-21 22:22 GMT-07:00 Hal Finkel <hfinkel at anl.gov>:
>     On 09/22/2017 12:03 AM, Mehdi AMINI wrote:
>>     Hi Hal,
>>     2017-09-21 20:59 GMT-07:00 Hal Finkel via llvm-dev
>>     <llvm-dev at lists.llvm.org>:
>>         On 09/12/2017 10:26 PM, Gerolf Hoflehner wrote:
>>>>         On Sep 11, 2017, at 10:47 PM, Hal Finkel via llvm-dev
>>>>         <llvm-dev at lists.llvm.org> wrote:
>>>>         On 09/11/2017 12:26 PM, Adam Nemet wrote:
>>>>>         Hi Hal, Tobias, Michael and others,
>>>>>         ...
>>>>>         One thing that I’d like to see more details on is what
>>>>>         this means for the evolution of loop transformations in LLVM.
>>>>>         Our more-or-less established direction was so far to
>>>>>         incrementally improve and generalize the required analyses
>>>>>         (e.g. the LoopVectorizer’s dependence analysis + loop
>>>>>         versioning analysis into a stand-alone analysis pass
>>>>>         (LoopAccessAnalysis)) and then build new transformations
>>>>>         (e.g. LoopDistribution, LoopLoadElimination,
>>>>>         LICMLoopVersioning) on top of these.
>>>>>         The idea was that infrastructure would be incrementally
>>>>>         improved from two directions:
>>>>>         - As new transformations are built analyses have to be
>>>>>         improved (e.g. past improvements to LAA to support the
>>>>>         LoopVersioning utility, future improvements for full
>>>>>         LoopSROA beyond just store->load forwarding [1] or the
>>>>>         improvements to LAA for the LoopFusion proposal[2])
>>>>>         - As more complex loops would have to be analyzed we
>>>>>         either improve LAA or make DependenceAnalysis a drop-in
>>>>>         replacement for the memory analysis part in LAA
>>>>         Or we could use Polly's dependence analysis, which I
>>>>         believe to be more powerful, more robust, and more correct
>>>>         than DependenceAnalysis. I believe that the difficult part
>>>>         here is actually the pairing with predicated SCEV or
>>>>         whatever mechanism we want to use to generate runtime
>>>>         predicates (this applies to use of DependenceAnalysis too).
>>>         What is a good way to measure these assertions (More
>>>         powerful, more robust)? Are you saying the LLVM Dependence
>>>         Analysis is incorrect or do you actually mean less
>>>         conservative (or "more accurate" or something like that)?
>>         Sebastian's email covers the issues with the
>>         DependenceAnalysis pass pretty well.
>>         Regarding what's in LoopAccessAnalysis, I believe it to be
>>         correct, but more limited. Judging by what the vectorizer can
>>         handle, it is not clear to me that LAA is bad at what it
>>         does, though it could do better in some cases with
>>         non-unit-stride loops.
>>         Polly also handles piecewise-affine functions, which allows
>>         the modeling of loops with conditionals. Extending LAA to
>>         handle loop nests, moreover, seems likely to be non-trivial.
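To illustrate the piecewise-affine case with a made-up loop: a store whose target depends on a conditional has no single affine access function, but its accesses are exactly the union of two affine pieces, which a polyhedral representation can carry. A toy Python sketch (the loop and array are invented for illustration, not taken from Polly):

```python
def accessed_indices(n):
    """Indices written by the (made-up) loop

        for (i = 0; i < n; ++i)
          if (i < n/2) A[2*i]   = ...;
          else         A[i + n] = ...;

    There is no single affine access function f(i), but the accesses
    are exactly the union of two affine pieces:
      f(i) = 2*i    for 0      <= i < n/2
      f(i) = i + n  for n/2    <= i < n
    """
    return [2 * i if i < n // 2 else i + n for i in range(n)]
```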
>>         Regardless, measuring these differences certainly seems like
>>         a good idea. I think that we can do this using optimization
>>         remarks. LAA already emits optimization remarks for loops in
>>         which it finds unsafe memory dependencies. Polly also emits
>>         optimization remarks. We may need to iterate some in order to
>>         setup a good comparison, but we should be able to collect
>>         statistics (and other information) by compiling code using
>>         -fsave-optimization-record (in combination with some other
>>         flags), and then analyzing the resulting YAML files.
>>>>>         While this model may be slow it has all the benefits of
>>>>>         the incremental development model.
>>>>         The current model may have been slow in many areas, but I
>>>>         think that's mostly a question of development effort. My
>>>>         largest concern about the current model is that, to the
>>>>         extent that we're implementing classic loop transformations
>>>>         (e.g., fusion, distribution, interchange, skewing, tiling,
>>>>         and so on), we're repeating a historical design that is
>>>>         known to have several suboptimal properties. Chief among
>>>>         them is the lack of integration: many of these
>>>>         transformations are interconnected, and there's no good
>>>>         pass ordering in which to make independent decisions. Many
>>>>         of these transformations can be captured in a single model
>>>>         and we can get much better results by integrating them.
>>>>         There's also the matter of whether building these
>>>>         transformation on SCEV (or IR directly) is the best
>>>>         underlying infrastructure, or whether parts of Polly would
>>>>         be better.
>>>         I believe that is true. What I wonder is: is there a good
>>>         method to reason about it?
>>         If I understand what you mean, one way to look at it is this:
>>         This is not a canonicalization problem. Picking an optimal
>>         way to interchange loops may depend on how the result can be
>>         skewed and/or tiled, picking an optimal way to distribute
>>         loops often depends on what can be done afterward in each
>>         piece. Optimal here generally involves reasoning about the
>>         memory hierarchy (e.g., cache properties), available
>>         prefetching streams, register-file size, and so on.
>>         I know that I've seen some good examples in papers over the
>>         years that illustrate the phase-ordering challenges.
>>         Hopefully, someone will jump in here with some good
>>         references. One classic one is: William Pugh. Uniform
>>         Techniques for Loop Optimization. 1991.
>>>         Perhaps concrete examples or perhaps opt-viewer based
>>>         comparisons on large sets of benchmarks? In the big picture
>>>         you could make such a modeling argument for all compiler
>>>         optimizations.
>>         Certainly. However, in this case there's a well-studied
>>         unified model for this set of optimizations known to reduce
>>         phase-ordering effects. That's not true in general.
>>>>         That having been said, I think that integrating this
>>>>         technology into LLVM will also mean applying appropriate
>>>>         modularity. I think that we'll almost definitely want to
>>>>         make use of the dependence analysis separately as an
>>>>         analysis. We'll want to decide which of these
>>>>         transformations will be considered canonicalization (and
>>>>         run in the iterative pipeline) and which will be lowering
>>>>         (and run near the vectorizer). LoopSROA certainly sounds to
>>>>         me like canonicalization, but loop fusion might also fall
>>>>         into that category (i.e., we might want to fuse early to
>>>>         enable optimizations and then split late).
>>>>>         Then there is the question of use cases.  It’s fairly
>>>>>         obvious that anybody wanting to optimize a 5-deep highly
>>>>>         regular loop-nest operating on arrays should use Polly. 
>>>>>         On the other hand it’s way less clear that we should use
>>>>>         it for singly or doubly nested not-so-regular loops which
>>>>>         are the norm in non-HPC workloads.
>>>>         This is clearly a good question, but thinking about Polly
>>>>         as a set of components, not as a monolithic transformation
>>>>         component, I think that polyhedral analysis and
>>>>         transformations can underlie a lot of the transformations
>>>>         we need for non-HPC code (and, which I'll point out, we
>>>>         need for modern HPC code too). In practice, the loops that
>>>>         we can actually analyze have affine dependencies, and Polly
>>>>         does, or can do, a better job at generating runtime
>>>>         predicates and dealing with piecewise-linear expressions
>>>>         than our current infrastructure.
>>>>         In short, I look at Polly as two things: First, an
>>>>         infrastructure for dealing with loop analysis and
>>>>         transformation. I view this as being broadly applicable.
>>>>         Second, an application of that to apply cost-model-driven
>>>>         classic loop transformations. To some extent this is going
>>>>         to be more useful for HPC codes, but also applies to
>>>>         machine learning, signal processing, graphics, and other
>>>>         areas.
>>>         I’m wondering if it could be used for pointing out headroom
>>>         for the existing LLVM ecosystem (*)
>>         Sure.
>>>>>         And this brings me to the maintenance question.  Is it
>>>>>         reasonable to expect people to fix Polly when they have a
>>>>>         seemingly unrelated change that happens to break a Polly bot.
>>>>         The eventual goal here is to have this technology in
>>>>         appropriate parts of the main pipeline, and so the question
>>>>         here is not really about breaking a "Polly bot", but just
>>>>         about a "bot" in general. I've given this question some
>>>>         thought and I think it sits in a reasonable place in the
>>>>         risk-reward space. The answer would be, yes, we'd need to
>>>>         treat this like any other part of the pipeline. However, I
>>>>         believe that Polly has as many, or more, active
>>>>         contributors than essentially any other individual part of
>>>>         the mid-level optimizer or CodeGen. As a result, there will
>>>>         be people around in many time zones to help with problems
>>>>         with Polly-related code.
>>>>>          As far as I know, there were companies in the past that
>>>>>         tried Polly without a whole lot of prior experience.  It
>>>>>         would be great to hear what the experience was before
>>>>>         adopting Polly at a much larger scale.
>>>>         I'm also interested, although I'll caution against
>>>>         over-interpreting any evidence here (positive or negative).
>>>>         Before a few weeks ago, Polly didn't effectively run in the
>>>>         pipeline after inlining, and so I doubt it would have been
>>>>         much use outside of embedded environments (and maybe some
>>>>         HPC environments) with straightforwardly-presented C code.
>>>>         It's only now that this has been fixed that I find the
>>>>         possibility of integrating this in production interesting.
>>>         That is a good point. There are also biases independent of
>>>         past experiences (for disclosure mine is (*) above). But I
>>>         think it is objective to say a Polly integration is a big
>>>         piece to swallow. Your pro-Polly argument lists a number of
>>>         categories that I think could be reasoned about individually
>>>         and partly evaluated with a data-driven approach:
>>>         A) Architecture
>>>         - support for autoparallelism
>>>         - support for accelerators
>>>         - isl rewrite? etc.
>>>         ...
>>>         B) Modelling
>>>         - polyhedral model
>>>         - temporal locality
>>>         - spatial locality
>>>         C) Analysis/Optimizations
>>>         - Dependence Analysis
>>>         - Transformation effective/power (loop nests, quality of
>>>         transformations, #vectorizable loops etc)
>>>         A) is mostly Polly independent (except for the isl question
>>>         I guess). For B and C, performance / compile-time / opt-viewer
>>>         data on a decent/wide range of benchmarks possibly at
>>>         different optimization levels (O2, O3, LTO, PGO etc and
>>>         combinations) should provide data-driven insight into
>>>         costs/benefits.
>>         I agree. In practice, the first question is: Are we willing
>>         to take on Polly (+isl), in whole or in part, as a build
>>         dependency? If the answer is yes, the next question is: what
>>         parts should be reused or refactored for use in other parts
>>         of the pipeline? My argument is that we should take on Polly,
>>         or most of it, as a build dependency, and work on better
>>         unifying the developer communities as we start experimenting
>>         with other kinds of integration. This will, however, already
>>         allow us to provide these transformations to all of our users
>>         through pragmas (and other kinds of optional enablement). This is an
>>         important first step.
>>     When you write that we should take Polly as a build dependency:
>>     are you envisioning this to be tied into LLVM to the point where
>>     we can't build LLVM without Polly?
>>     I thought the approach would rather be that a default build of
>>     LLVM would have Polly available but that there will "forever" be
>>     a `-DENABLE_POLLY=OFF` option?
>>     This seems like an important distinction to me, as making it more
>>     integrated but still optional for the correct behavior of LLVM
>>     means that people can continue to work and maintain "forever" an
>>     optimization pipeline which operates without Polly, while
>>     starting to get more integration in a pluggable form. I believe
>>     this is the direction that was taken with the GSoC last year,
>>     trying to get Polly as a "pluggable" dependency analysis.
>     I don't want to make statements about "forever":
> Right, I forgot the asterisk to explain my quotes around "forever": I 
> meant by this "not until we have compelling data and agreement to do 
> otherwise". I think we're on agreement on this though :)

I think we are as well :-)

>     eventually, I'd like to have strong loop optimizations as an
>     integrated part of LLVM. At that point, "Polly" might be little
>     more than a shorthand for some parts of LLVM. Now LLVM currently
>     has an issue whereby it's difficult to build customized versions
>     with only a subset of needed passes, but that's an orthogonal
>     issue. In the short term, it would certainly be separable. I
>     believe that we'd certainly keep the `-DENABLE_POLLY=OFF` option
>     as long as we felt it made sense.
>     By build dependency, I mean: The sources are part of the main
>     source tree, they're built by default, thus essentially all of the
>     builders build them, and they'll end up in many distributions and
>     derived products. This will allow us to build on what's there in
>     whatever ways make sense.
> Thanks for clarifying!
>>         I'm not sure exactly how good this is, but Polly has
>>         LNT-submitting bots, so the website can generate a comparison
>>         (e.g.,
>>         http://lnt.llvm.org/db_default/v4/nts/71208?compare_to=71182).
>>         Looking at this comparison shows a number of potential
>>         problems but also cases where Polly really helps (and, FWIW,
>>         the largest two compile-time regressions are also associated
>>         with very large execution performance improvements). My first
>>         focus would certainly be on pragma-driven enablement.
>>     I'll add: PGO-driven enablement: spending compile-time where we
>>     know it can have an impact. What do you think?
>     Certainly (although, obviously, we still need reasonable
>     confidence that it won't cause performance regressions to do that).
> Oh right, I meant it through an opt-in flag to begin with 
> (-fenable-pgo-polly or something like that).



> -- 
> Mehdi

Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
