[llvm-dev] [RFC] Polly Status and Integration

Thu Sep 21 22:25:45 PDT 2017

2017-09-21 22:22 GMT-07:00 Hal Finkel <hfinkel at anl.gov>:

>
> On 09/22/2017 12:03 AM, Mehdi AMINI wrote:
>
> Hi Hal,
>
>
> 2017-09-21 20:59 GMT-07:00 Hal Finkel via llvm-dev <
> llvm-dev at lists.llvm.org>:
>
>>
>> On 09/12/2017 10:26 PM, Gerolf Hoflehner wrote:
>>
>>
>>
>> On Sep 11, 2017, at 10:47 PM, Hal Finkel via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>>
>> On 09/11/2017 12:26 PM, Adam Nemet wrote:
>>
>> Hi Hal, Tobias, Michael and others,
>> *...*
>>
>> One thing that I’d like to see more details on is what this means for the
>> evolution of loop transformations in LLVM.
>>
>> Our more-or-less established direction was so far to incrementally
>> improve and generalize the required analyses (e.g. the LoopVectorizer’s
>> dependence analysis + loop versioning analysis into a stand-alone analysis
>> pass (LoopAccessAnalysis)) and then build new transformations (e.g.
>> LoopDistribution, LoopLoadElimination, LICMLoopVersioning) on top of these.
>>
>> The idea was that infrastructure would be incrementally improved from two
>> directions:
>>
>> - As new transformations are built analyses have to be improved (e.g.
>> past improvements to LAA to support the LoopVersioning utility, future
>> improvements for full LoopSROA beyond just store->load forwarding [1] or
>> the improvements to LAA for the LoopFusion proposal[2])
>>
>> - As more complex loops would have to be analyzed we either improve LAA
>> or make DependenceAnalysis a drop-in replacement for the memory analysis
>> part in LAA
>>
>>
>> Or we could use Polly's dependence analysis, which I believe to be more
>> powerful, more robust, and more correct than DependenceAnalysis. I believe
>> that the difficult part here is actually the pairing with predicated SCEV
>> or whatever mechanism we want to use generate runtime predicates (this
>> applies to use of DependenceAnalysis too).
>>
>>
>> What is a good way to measure these assertions (More powerful, more
>> robust)? Are you saying the LLVM Dependence Analysis is incorrect or do you
>> actually mean less conservative (or "more accurate" or something like that)?
>>
>>
>> Sebastian's email covers the issues with the DependenceAnalysis pass
>> pretty well.
>>
>> Regarding what's in LoopAccessAnalysis, I believe it to be correct, but
>> more limited. It is not clear to me that LAA is bad at what it does based
>> on what the vectorizer can handle. LAA could do better in some cases with
>> non-unit-stride loops. Polly also handles piecewise-affine functions, which
>> allows the modeling of loops with conditionals. Extending LAA to handle
>> loop nests, moreover, seems likely to be non-trivial.
>>
>> Regardless, measuring these differences certainly seems like a good idea.
>> I think that we can do this using optimization remarks. LAA already emits
>> optimization remarks for loops in which it finds unsafe memory
>> dependencies. Polly also emits optimization remarks. We may need to iterate
>> some in order to setup a good comparison, but we should be able to collect
>> statistics (and other information) by compiling code using
>> -fsave-optimization-record (in combination with some other flags), and then
>> analyzing the resulting YAML files.
>>
>>
>>
>> While this model may be slow it has all the benefits of the incremental
>> development model.
>>
>>
>> The current model may have been slow in many areas, but I think that's
>> mostly a question of development effort. My largest concern about the
>> current model is that, to the extent that we're implementing classic loop
>> transformations (e.g., fusion, distribution, interchange, skewing, tiling,
>> and so on), we're repeating a historical design that is known to have
>> several suboptimal properties. Chief among them is the lack of integration:
>> many of these transformations are interconnected, and there's no good pass
>> ordering in which to make independent decisions. Many of these
>> transformations can be captured in a single model and we can get much
>> better results by integrating them. There's also the matter of whether
>> building these transformation on SCEV (or IR directly) is the best
>> underlying infrastructure, or whether parts of Polly would be better.
>>
>>
>> I believe that is true. What I wonder is is there a good method to reason
>> about it?
>>
>>
>> If I understand what you mean, one way to look at it is this: This is not
>> a canonicalization problem. Picking an optimal way to interchange loops may
>> depend on how the result can be skewed and/or tiled, picking an optimal way
>> to distribute loops often depends on what can be done afterward in each
>> piece. Optimal here generally involves reasoning about the memory hierarchy
>> (e.g., cache properties), available prefetching streams, register-file
>> size, and so on.
>>
>> I know that I've seen some good examples in papers over the years that
>> illustrate the phase-ordering challenges. Hopefully, someone will jump in
>> here with some good references. One classic one is: William Pugh. Uniform
>> Techniques for Loop Optimization. 1991.
>>
>> Perhaps concrete examples or perhaps opt-viewer based comparisons on
>> large sets of benchmarks? In the big picture you could make such a modeling
>> argument for all compiler optimizations.
>>
>>
>> Certainly. However, in this case there's a well-studied unified model for
>> this set of optimizations known to reduce phase-ordering effects. That's
>> not true in general.
>>
>>
>> That having been said, I think that integrating this technology into LLVM
>> will also mean applying appropriate modularity. I think that we'll almost
>> definitely want to make use of the dependence analysis separately as an
>> analysis. We'll want to decide which of these transformations will be
>> considered canonicalization (and run in the iterative pipeline) and which
>> will be lowering (and run near the vectorizer). LoopSROA certainly sounds
>> to me like canonicalization, but loop fusion might also fall into that
>> category (i.e., we might want to fuse early to enable optimizations and
>> then split late).
>>
>>
>> Then there is the question of use cases.  It’s fairly obvious that
>> anybody wanting to optimize a 5-deep highly regular loop-nest operating on
>> arrays should use Polly.  On the other hand it’s way less clear that we
>> should use it for singly or doubly nested not-so-regular loops which are
>> the norm in non-HPC workloads.
>>
>>
>> This is clearly a good question, but thinking about Polly as a set of
>> components, not as a monolithic transformation component, I think that
>> polyhedral analysis and transformations can underlie a lot of the
>> transformations we need for non-HPC code (and, which I'll point out, we
>> need for modern HPC code too). In practice, the loops that we can actually
>> analyze have affine dependencies, and Polly does, or can do, a better job
>> at generating runtime predicates and dealing with piecewise-linear
>> expressions than our current infrastructure.
>>
>> In short, I look at Polly as two things: First, an infrastructure for
>> dealing with loop analysis and transformation. I view this as being broadly
>> applicable. Second, an application of that to apply cost-model-driven
>> classic loop transformations. To some extent this is going to be more
>> useful for HPC codes, but also applies to machine learning, signal
>> processing, graphics, and other areas.
>>
>> I’m wondering if it could be used for pointing out headroom for the
>> existing LLVM ecosystem (*)
>>
>>
>> Sure.
>>
>>
>>
>>
>> And this brings me to the maintenance question.  Is it reasonable to
>> expect people to fix Polly when they have a seemingly unrelated change that
>> happens to break a Polly bot.
>>
>>
>> The eventual goal here is to have this technology in appropriate parts of
>> the main pipeline, and so the question here is not really about breaking a
>> "Polly bot", but just about a "bot" in general. I've given this question
>> some thought and I think it sits in a reasonable place in the risk-reward
>> space. The answer would be, yes, we'd need to treat this like any other
>> part of the pipeline. However, I believe that Polly has as many, or more,
>> active contributors than essentially any other individual part of the
>> mid-level optimizer or CodeGen. As a result, there will be people around in
>> many time zones to help with problems with Polly-related code.
>>
>>  As far as I know, there were companies in the past that tried Polly
>> without a whole lot of prior experience.  It would be great to hear what
>> the experience was before adopting Polly at a much larger scale.
>>
>>
>> I'm also interested, although I'll caution against over-interpreting any
>> evidence here (positive or negative). Before a few weeks ago, Polly didn't
>> effectively run in the pipeline after inlining, and so I doubt it would
>> have been much use outside of embedded environments (and maybe some HPC
>> environments) with straightforwardly-presented C code. It's only now that
>> this has been fixed that I find the possibility of integrating this in
>> production interesting.
>>
>>
>> That is a good point. There are also biases independent of past
>> experiences (for disclosure mine is (*) above). But I think it is objective
>> to say a Polly integration is a big piece to swallow.Your pro-Polly
>> argument lists a number of categories that I think could be reasoned about
>> individually and partly evaluated with a data-driven approach:
>> A) Architecture
>> - support for autoparallelism
>> - support for accelerators
>> - isl- rewrite? etc
>> ...
>> B) Modelling
>> - polyhedral model
>> - temporal locality
>> - spatial locality
>> …
>> C) Analysis/Optimizations
>> - Dependence Analysis
>> - Transformation effective/power (loop nests, quality of transformations,
>> #vectorizable loops etc)
>>
>> A) is mostly Polly independent (except for the isl question I guess). For
>> B and C performance/ compile-time /opt-viewer data on a decent/wide range
>> of benchmarks possibly at different optimization levels (O2, O3, LTO, PGO
>> etc and combinations) should provide data-driven insight into
>> costs/benefits.
>>
>>
>> I agree. In practice, the first question is: Are will willing to take on
>> Polly (+isl), in whole or in part, as a build dependency? If the answer is
>> yes, the next question is: what parts should be reused or refactored for
>> use in other parts of the pipeline? My argument is that we should take on
>> Polly, or most of it, as a build dependency. Work on better unifying the
>> developer communities as we start experimenting with other kinds of
>> integration. This will, however, allow us to provide to all of our users
>> these transformations through pragmas (and other kinds of optional
>> enablement). This is an important first step.
>>
>
> When you write that we should take Polly as a build dependency: are you
> envisioning this to be tied into LLVM to the point where we can't build
> LLVM without Polly?
> I thought the approach would rather be that a default build of LLVM would
> have Polly available but that there will "forever" be a
> `-DENABLE_POLLY=OFF` option?
> This seems like an important distinction to me, as making it more
> integrated but still optional for the correct behavior of LLVM means that
> people can continue to work and maintain "forever" an optimization pipeline
> which operates without Polly, while starting to get more integration in a
> pluggable form. I believe this is the direction that was taken with the
> GSoC last year, trying to get Polly as a "pluggable" dependency analysis.
>
>
> I don't want to make statements about "forever":
>

Right, I forgot the asterisk to explain my quotes around "forever": I meant
by this "not until we have compelling data and agreement to do otherwise".
I think we're on agreement on this though :)

> eventually, I'd like to have strong loop optimizations as an integrated
> part of LLVM. At that point, "Polly" might be little more than a shorthand
> for some parts of LLVM. Now LLVM currently has an issue whereby it's
> difficult to build customized versions with only a subset of needed passes,
> but that's an orthogonal issue. In the short term, it would certainly be
> separable. I believe that we'd certainly keep the `-DENABLE_POLLY=OFF`
> option as long as we felt it made sense.
>
> By build dependency, I mean: The sources are part of the main source tree,
> they're built by default, thus essentially all of the builders build them,
> and they'll end up in many distributions and derived products. This will
> allow us to build on what's there in whatever ways make sense.
>

Thanks for clarifying!

>
>> I'm not sure exactly how good this is, but polly has LNT-submitting bots,
>> so the website can generate a comparison (e.g.,
>> http://lnt.llvm.org/db_default/v4/nts/71208?compare_to=71182). Looking
>> at this comparison shows a number of potential problems but also cases
>> where Polly really helps (and, FWIW, the largest two compile-time
>> regressions are also associated with very large execution performance
>> improvements). My first focus would certainly be on pragma-driven
>> enablement.
>>
>
> I'll add: PGO-driven enablement: spending compile-time where we know it
> can have an impact. What do you think?
>
>
> Certainly (although, obviously, we still need reasonable confidence that
> it won't cause performance regressions to do that).
>

Oh right, I meant it through an opt-in flag to begin with
(-fenable-pgo-polly or something like that).

-- 
Mehdi
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170921/a389064a/attachment.html>