[llvm-dev] [RFC] Polly Status and Integration
Hal Finkel via llvm-dev
llvm-dev at lists.llvm.org
Thu Sep 21 22:33:34 PDT 2017
On 09/22/2017 12:25 AM, Mehdi AMINI wrote:
>
>
> 2017-09-21 22:22 GMT-07:00 Hal Finkel <hfinkel at anl.gov>:
>
>
> On 09/22/2017 12:03 AM, Mehdi AMINI wrote:
>> Hi Hal,
>>
>>
>> 2017-09-21 20:59 GMT-07:00 Hal Finkel via llvm-dev
>> <llvm-dev at lists.llvm.org>:
>>
>>
>> On 09/12/2017 10:26 PM, Gerolf Hoflehner wrote:
>>>
>>>
>>>> On Sep 11, 2017, at 10:47 PM, Hal Finkel via llvm-dev
>>>> <llvm-dev at lists.llvm.org>
>>>> wrote:
>>>>
>>>>
>>>> On 09/11/2017 12:26 PM, Adam Nemet wrote:
>>>>> Hi Hal, Tobias, Michael and others,
>>>>> [...]
>>>>>
>>>>> One thing that I’d like to see more details on is what
>>>>> this means for the evolution of loop transformations in LLVM.
>>>>>
>>>>> Our more-or-less established direction so far has been to
>>>>> incrementally improve and generalize the required analyses
>>>>> (e.g. the LoopVectorizer’s dependence analysis + loop
>>>>> versioning analysis into a stand-alone analysis pass
>>>>> (LoopAccessAnalysis)) and then build new transformations
>>>>> (e.g. LoopDistribution, LoopLoadElimination,
>>>>> LICMLoopVersioning) on top of these.
>>>>>
>>>>> The idea was that infrastructure would be incrementally
>>>>> improved from two directions:
>>>>>
>>>>> - As new transformations are built, analyses have to be
>>>>> improved (e.g. past improvements to LAA to support the
>>>>> LoopVersioning utility, future improvements for full
>>>>> LoopSROA beyond just store->load forwarding [1] or the
>>>>> improvements to LAA for the LoopFusion proposal[2])
>>>>>
>>>>> - As more complex loops would have to be analyzed, we
>>>>> either improve LAA or make DependenceAnalysis a drop-in
>>>>> replacement for the memory analysis part in LAA
>>>>
>>>> Or we could use Polly's dependence analysis, which I
>>>> believe to be more powerful, more robust, and more correct
>>>> than DependenceAnalysis. I believe that the difficult part
>>>> here is actually the pairing with predicated SCEV or
>>>> whatever mechanism we want to use to generate runtime
>>>> predicates (this applies to use of DependenceAnalysis too).
>>>
>>> What is a good way to measure these assertions (more
>>> powerful, more robust)? Are you saying the LLVM Dependence
>>> Analysis is incorrect or do you actually mean less
>>> conservative (or "more accurate" or something like that)?
>>
>> Sebastian's email covers the issues with the
>> DependenceAnalysis pass pretty well.
>>
>> Regarding what's in LoopAccessAnalysis, I believe it to be
>> correct, but more limited. Judging by what the vectorizer can
>> handle, it is not clear to me that LAA is bad at what it does,
>> although it could do better in some cases with non-unit-stride
>> loops.
>> Polly also handles piecewise-affine functions, which allows
>> the modeling of loops with conditionals. Extending LAA to
>> handle loop nests, moreover, seems likely to be non-trivial.
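>>
>> To illustrate the piecewise-affine case with a toy loop (purely
>> illustrative, sketched in Python):
>>
>>     def copy_split(A, B, n, m):
>>         for i in range(n):
>>             if i < m:
>>                 A[i] = B[i]        # subscript of B is i      (i <  m)
>>             else:
>>                 A[i] = B[i - m]    # subscript of B is i - m  (i >= m)
>>
>> The subscript of B is not a single affine function of i, but it
>> is affine on each side of the conditional, which is exactly the
>> kind of access a piecewise-affine model can still reason about.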
>>
>> Regardless, measuring these differences certainly seems like
>> a good idea. I think that we can do this using optimization
>> remarks. LAA already emits optimization remarks for loops in
>> which it finds unsafe memory dependencies. Polly also emits
>> optimization remarks. We may need to iterate some in order to
>> setup a good comparison, but we should be able to collect
>> statistics (and other information) by compiling code using
>> -fsave-optimization-record (in combination with some other
>> flags), and then analyzing the resulting YAML files.
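>>
>> As a rough sketch of what collecting those statistics could look
>> like (a hypothetical tallying script assuming PyYAML, not an
>> existing tool), one could count remarks per pass and remark name
>> across the *.opt.yaml files that -fsave-optimization-record
>> produces:
>>
>>     import glob
>>     import sys
>>     import yaml
>>
>>     # LLVM remark files use custom YAML tags (!Passed, !Missed,
>>     # !Analysis, ...); load any such document as a plain mapping.
>>     class RemarkLoader(yaml.SafeLoader):
>>         pass
>>
>>     RemarkLoader.add_multi_constructor(
>>         '!',
>>         lambda loader, suffix, node: loader.construct_mapping(node))
>>
>>     counts = {}
>>     pattern = sys.argv[1] + '/**/*.opt.yaml'
>>     for path in glob.glob(pattern, recursive=True):
>>         with open(path) as f:
>>             for remark in yaml.load_all(f, Loader=RemarkLoader):
>>                 if remark:
>>                     key = (remark.get('Pass'), remark.get('Name'))
>>                     counts[key] = counts.get(key, 0) + 1
>>
>>     for (pass_name, name), n in sorted(counts.items(),
>>                                        key=lambda kv: -kv[1]):
>>         print('%6d  %s:%s' % (n, pass_name, name))
>>
>> Comparing such tallies between a Polly-enabled and a baseline
>> build of the same benchmarks would give a first-order view of
>> where each analysis succeeds or gives up.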
>>
>>>>
>>>>>
>>>>> While this model may be slow it has all the benefits of
>>>>> the incremental development model.
>>>>
>>>> The current model may have been slow in many areas, but I
>>>> think that's mostly a question of development effort. My
>>>> largest concern about the current model is that, to the
>>>> extent that we're implementing classic loop transformations
>>>> (e.g., fusion, distribution, interchange, skewing, tiling,
>>>> and so on), we're repeating a historical design that is
>>>> known to have several suboptimal properties. Chief among
>>>> them is the lack of integration: many of these
>>>> transformations are interconnected, and there's no good
>>>> pass ordering in which to make independent decisions. Many
>>>> of these transformations can be captured in a single model
>>>> and we can get much better results by integrating them.
>>>> There's also the matter of whether building these
>>>> transformations on SCEV (or IR directly) is the best
>>>> underlying infrastructure, or whether parts of Polly would
>>>> be better.
>>>
>>> I believe that is true. What I wonder is: is there a good
>>> method to reason about it?
>>
>> If I understand what you mean, one way to look at it is this:
>> This is not a canonicalization problem. Picking an optimal
>> way to interchange loops may depend on how the result can be
>> skewed and/or tiled, picking an optimal way to distribute
>> loops often depends on what can be done afterward in each
>> piece. Optimal here generally involves reasoning about the
>> memory hierarchy (e.g., cache properties), available
>> prefetching streams, register-file size, and so on.
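>>
>> To make that interplay concrete, here is a toy nest (purely
>> illustrative, not taken from any benchmark), sketched in Python:
>>
>>     # S1 carries a dependence on the i loop (A[i][j] uses A[i-1][j]);
>>     # S2 accesses B and C transposed (B[j][i], C[j][i]).
>>     def kernel(A, B, C, n):
>>         for i in range(1, n):
>>             for j in range(n):
>>                 A[i][j] = A[i - 1][j] + 1.0       # S1
>>                 C[j][i] = A[i][j] * B[j][i]       # S2
>>
>> Distributing S1 and S2 into separate nests is legal (the only
>> cross-statement dependence is within an iteration), and it lets
>> the S2 nest be interchanged and tiled to fix the transposed
>> accesses to B and C, but it also gives up the reuse of A[i][j]
>> between S1 and S2. Neither the distribution nor the interchange
>> decision is good in isolation; that coupling is what a unified
>> model captures.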
>>
>> I know that I've seen some good examples in papers over the
>> years that illustrate the phase-ordering challenges.
>> Hopefully, someone will jump in here with some good
>> references. One classic one is: William Pugh. Uniform
>> Techniques for Loop Optimization. 1991.
>>
>>> Perhaps concrete examples or perhaps opt-viewer based
>>> comparisons on large sets of benchmarks? In the big picture
>>> you could make such a modeling argument for all compiler
>>> optimizations.
>>
>> Certainly. However, in this case there's a well-studied
>> unified model for this set of optimizations known to reduce
>> phase-ordering effects. That's not true in general.
>>
>>>>
>>>> That having been said, I think that integrating this
>>>> technology into LLVM will also mean applying appropriate
>>>> modularity. I think that we'll almost definitely want to
>>>> make use of the dependence analysis separately as an
>>>> analysis. We'll want to decide which of these
>>>> transformations will be considered canonicalization (and
>>>> run in the iterative pipeline) and which will be lowering
>>>> (and run near the vectorizer). LoopSROA certainly sounds to
>>>> me like canonicalization, but loop fusion might also fall
>>>> into that category (i.e., we might want to fuse early to
>>>> enable optimizations and then split late).
>>>>
>>>>>
>>>>> Then there is the question of use cases. It’s fairly
>>>>> obvious that anybody wanting to optimize a 5-deep highly
>>>>> regular loop-nest operating on arrays should use Polly.
>>>>> On the other hand it’s way less clear that we should use
>>>>> it for singly or doubly nested not-so-regular loops which
>>>>> are the norm in non-HPC workloads.
>>>>
>>>> This is clearly a good question, but thinking about Polly
>>>> as a set of components, not as a monolithic transformation
>>>> component, I think that polyhedral analysis and
>>>> transformations can underlie a lot of the transformations
>>>> we need for non-HPC code (and, which I'll point out, we
>>>> need for modern HPC code too). In practice, the loops that
>>>> we can actually analyze have affine dependencies, and Polly
>>>> does, or can do, a better job at generating runtime
>>>> predicates and dealing with piecewise-linear expressions
>>>> than our current infrastructure.
>>>>
>>>> In short, I look at Polly as two things: First, an
>>>> infrastructure for dealing with loop analysis and
>>>> transformation. I view this as being broadly applicable.
>>>> Second, an application of that infrastructure to
>>>> cost-model-driven classic loop transformations. To some
>>>> extent this is going
>>>> to be more useful for HPC codes, but also applies to
>>>> machine learning, signal processing, graphics, and other
>>>> areas.
>>> I’m wondering if it could be used for pointing out headroom
>>> for the existing LLVM ecosystem (*)
>>
>> Sure.
>>
>>>
>>>>
>>>>>
>>>>> And this brings me to the maintenance question. Is it
>>>>> reasonable to expect people to fix Polly when they have a
>>>>> seemingly unrelated change that happens to break a Polly bot?
>>>>
>>>> The eventual goal here is to have this technology in
>>>> appropriate parts of the main pipeline, and so the question
>>>> here is not really about breaking a "Polly bot", but just
>>>> about a "bot" in general. I've given this question some
>>>> thought and I think it sits in a reasonable place in the
>>>> risk-reward space. The answer would be, yes, we'd need to
>>>> treat this like any other part of the pipeline. However, I
>>>> believe that Polly has as many active contributors as, or
>>>> more than, essentially any other individual part of
>>>> the mid-level optimizer or CodeGen. As a result, there will
>>>> be people around in many time zones to help with problems
>>>> with Polly-related code.
>>>>
>>>>> As far as I know, there were companies in the past that
>>>>> tried Polly without a whole lot of prior experience. It
>>>>> would be great to hear what the experience was before
>>>>> adopting Polly at a much larger scale.
>>>>
>>>> I'm also interested, although I'll caution against
>>>> over-interpreting any evidence here (positive or negative).
>>>> Before a few weeks ago, Polly didn't effectively run in the
>>>> pipeline after inlining, and so I doubt it would have been
>>>> much use outside of embedded environments (and maybe some
>>>> HPC environments) with straightforwardly-presented C code.
>>>> It's only now that this has been fixed that I find the
>>>> possibility of integrating this in production interesting.
>>>
>>> That is a good point. There are also biases independent of
>>> past experiences (for disclosure mine is (*) above). But I
>>> think it is objective to say a Polly integration is a big
>>> piece to swallow. Your pro-Polly argument lists a number of
>>> categories that I think could be reasoned about individually
>>> and partly evaluated with a data-driven approach:
>>> A) Architecture
>>> - support for autoparallelism
>>> - support for accelerators
>>> - isl rewrite? etc
>>> ...
>>> B) Modelling
>>> - polyhedral model
>>> - temporal locality
>>> - spatial locality
>>> …
>>> C) Analysis/Optimizations
>>> - Dependence Analysis
>>> - Transformation effective/power (loop nests, quality of
>>> transformations, #vectorizable loops etc)
>>>
>>> A) is mostly Polly independent (except for the isl question
>>> I guess). For B and C, performance / compile-time / opt-viewer
>>> data on a decent/wide range of benchmarks, possibly at
>>> different optimization levels (O2, O3, LTO, PGO etc and
>>> combinations) should provide data-driven insight into
>>> costs/benefits.
>>
>> I agree. In practice, the first question is: Are we willing
>> to take on Polly (+isl), in whole or in part, as a build
>> dependency? If the answer is yes, the next question is: which
>> parts should be reused or refactored for use in other parts
>> of the pipeline? My argument is that we should take on Polly,
>> or most of it, as a build dependency, and work on better
>> unifying the developer communities as we start experimenting
>> with other kinds of integration. This will, however, allow us
>> to provide these transformations to all of our users through
>> pragmas (and other kinds of optional enablement). This is an
>> important first step.
>>
>>
>> When you write that we should take Polly as a build dependency:
>> are you envisioning this to be tied into LLVM to the point where
>> we can't build LLVM without Polly?
>> I thought the approach would rather be that a default build of
>> LLVM would have Polly available but that there will "forever" be
>> a `-DENABLE_POLLY=OFF` option?
>> This seems like an important distinction to me, as making it more
>> integrated but still optional for the correct behavior of LLVM
>> means that people can continue to work on and maintain "forever" an
>> optimization pipeline which operates without Polly, while
>> starting to get more integration in a pluggable form. I believe
>> this is the direction that was taken with the GSoC last year,
>> trying to get Polly as a "pluggable" dependency analysis.
>
> I don't want to make statements about "forever":
>
>
> Right, I forgot the asterisk to explain my quotes around "forever": I
> meant by this "not until we have compelling data and agreement to do
> otherwise". I think we're on agreement on this though :)
I think we are as well :-)
>
> eventually, I'd like to have strong loop optimizations as an
> integrated part of LLVM. At that point, "Polly" might be little
> more than a shorthand for some parts of LLVM. Now LLVM currently
> has an issue whereby it's difficult to build customized versions
> with only a subset of needed passes, but that's an orthogonal
> issue. In the short term, it would certainly be separable. I
> believe that we'd certainly keep the `-DENABLE_POLLY=OFF` option
> as long as we felt it made sense.
>
> By build dependency, I mean: The sources are part of the main
> source tree, they're built by default, thus essentially all of the
> builders build them, and they'll end up in many distributions and
> derived products. This will allow us to build on what's there in
> whatever ways make sense.
>
>
> Thanks for clarifying!
>
>>
>> I'm not sure exactly how good this is, but Polly has
>> LNT-submitting bots, so the website can generate a comparison
>> (e.g.,
>> http://lnt.llvm.org/db_default/v4/nts/71208?compare_to=71182).
>> Looking at this comparison shows a number of potential
>> problems but also cases where Polly really helps (and, FWIW,
>> the largest two compile-time regressions are also associated
>> with very large execution performance improvements). My first
>> focus would certainly be on pragma-driven enablement.
>>
>>
>> I'll add: PGO-driven enablement: spending compile-time where we
>> know it can have an impact. What do you think?
>
> Certainly (although, obviously, we still need reasonable
> confidence that doing so won't cause performance regressions).
>
>
> Oh right, I meant it through an opt-in flag to begin with
> (-fenable-pgo-polly or something like that).
SGTM.
-Hal
>
> --
> Mehdi
>
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory