[llvm-dev] [RFC] Polly Status and Integration
Hal Finkel via llvm-dev
llvm-dev at lists.llvm.org
Thu Sep 21 20:59:27 PDT 2017
On 09/12/2017 10:26 PM, Gerolf Hoflehner wrote:
>> On Sep 11, 2017, at 10:47 PM, Hal Finkel via llvm-dev
>> <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote:
>> On 09/11/2017 12:26 PM, Adam Nemet wrote:
>>> Hi Hal, Tobias, Michael and others,
>>> One thing that I’d like to see more details on is what this means
>>> for the evolution of loop transformations in LLVM.
>>> Our more-or-less established direction was so far to incrementally
>>> improve and generalize the required analyses (e.g. the
>>> LoopVectorizer’s dependence analysis + loop versioning analysis into
>>> a stand-alone analysis pass (LoopAccessAnalysis)) and then build new
>>> transformations (e.g. LoopDistribution, LoopLoadElimination,
>>> LICMLoopVersioning) on top of these.
>>> The idea was that infrastructure would be incrementally improved
>>> from two directions:
>>> - As new transformations are built analyses have to be improved
>>> (e.g. past improvements to LAA to support the LoopVersioning
>>> utility, future improvements for full LoopSROA beyond just
>>> store->load forwarding  or the improvements to LAA for the
>>> LoopFusion proposal)
>>> - As more complex loops would have to be analyzed we either improve
>>> LAA or make DependenceAnalysis a drop-in replacement for the memory
>>> analysis part in LAA
>> Or we could use Polly's dependence analysis, which I believe to be
>> more powerful, more robust, and more correct than DependenceAnalysis.
>> I believe that the difficult part here is actually the pairing with
>> predicated SCEV or whatever mechanism we want to use generate runtime
>> predicates (this applies to use of DependenceAnalysis too).
> What is a good way to measure these assertions (More powerful, more
> robust)? Are you saying the LLVM Dependence Analysis is incorrect or
> do you actually mean less conservative (or "more accurate" or
> something like that)?
Sebastian's email covers the issues with the DependenceAnalysis pass
Regarding what's in LoopAccessAnalysis, I believe it to be correct, but
more limited. It is not clear to me that LAA is bad at what it does
based on what the vectorizer can handle. LAA could do better in some
cases with non-unit-stride loops. Polly also handles piecewise-affine
functions, which allows the modeling of loops with conditionals.
Extending LAA to handle loop nests, moreover, seems likely to be
Regardless, measuring these differences certainly seems like a good
idea. I think that we can do this using optimization remarks. LAA
already emits optimization remarks for loops in which it finds unsafe
memory dependencies. Polly also emits optimization remarks. We may need
to iterate some in order to setup a good comparison, but we should be
able to collect statistics (and other information) by compiling code
using -fsave-optimization-record (in combination with some other flags),
and then analyzing the resulting YAML files.
>>> While this model may be slow it has all the benefits of the
>>> incremental development model.
>> The current model may have been slow in many areas, but I think
>> that's mostly a question of development effort. My largest concern
>> about the current model is that, to the extent that we're
>> implementing classic loop transformations (e.g., fusion,
>> distribution, interchange, skewing, tiling, and so on), we're
>> repeating a historical design that is known to have several
>> suboptimal properties. Chief among them is the lack of integration:
>> many of these transformations are interconnected, and there's no good
>> pass ordering in which to make independent decisions. Many of these
>> transformations can be captured in a single model and we can get much
>> better results by integrating them. There's also the matter of
>> whether building these transformation on SCEV (or IR directly) is the
>> best underlying infrastructure, or whether parts of Polly would be
> I believe that is true. What I wonder is is there a good method to
> reason about it?
If I understand what you mean, one way to look at it is this: This is
not a canonicalization problem. Picking an optimal way to interchange
loops may depend on how the result can be skewed and/or tiled, picking
an optimal way to distribute loops often depends on what can be done
afterward in each piece. Optimal here generally involves reasoning about
the memory hierarchy (e.g., cache properties), available prefetching
streams, register-file size, and so on.
I know that I've seen some good examples in papers over the years that
illustrate the phase-ordering challenges. Hopefully, someone will jump
in here with some good references. One classic one is: William Pugh.
Uniform Techniques for Loop Optimization. 1991.
> Perhaps concrete examples or perhaps opt-viewer based comparisons on
> large sets of benchmarks? In the big picture you could make such a
> modeling argument for all compiler optimizations.
Certainly. However, in this case there's a well-studied unified model
for this set of optimizations known to reduce phase-ordering effects.
That's not true in general.
>> That having been said, I think that integrating this technology into
>> LLVM will also mean applying appropriate modularity. I think that
>> we'll almost definitely want to make use of the dependence analysis
>> separately as an analysis. We'll want to decide which of these
>> transformations will be considered canonicalization (and run in the
>> iterative pipeline) and which will be lowering (and run near the
>> vectorizer). LoopSROA certainly sounds to me like canonicalization,
>> but loop fusion might also fall into that category (i.e., we might
>> want to fuse early to enable optimizations and then split late).
>>> Then there is the question of use cases. It’s fairly obvious that
>>> anybody wanting to optimize a 5-deep highly regular loop-nest
>>> operating on arrays should use Polly. On the other hand it’s way
>>> less clear that we should use it for singly or doubly nested
>>> not-so-regular loops which are the norm in non-HPC workloads.
>> This is clearly a good question, but thinking about Polly as a set of
>> components, not as a monolithic transformation component, I think
>> that polyhedral analysis and transformations can underlie a lot of
>> the transformations we need for non-HPC code (and, which I'll point
>> out, we need for modern HPC code too). In practice, the loops that we
>> can actually analyze have affine dependencies, and Polly does, or can
>> do, a better job at generating runtime predicates and dealing with
>> piecewise-linear expressions than our current infrastructure.
>> In short, I look at Polly as two things: First, an infrastructure for
>> dealing with loop analysis and transformation. I view this as being
>> broadly applicable. Second, an application of that to apply
>> cost-model-driven classic loop transformations. To some extent this
>> is going to be more useful for HPC codes, but also applies to machine
>> learning, signal processing, graphics, and other areas.
> I’m wondering if it could be used for pointing out headroom for the
> existing LLVM ecosystem (*)
>>> And this brings me to the maintenance question. Is it reasonable to
>>> expect people to fix Polly when they have a seemingly unrelated
>>> change that happens to break a Polly bot.
>> The eventual goal here is to have this technology in appropriate
>> parts of the main pipeline, and so the question here is not really
>> about breaking a "Polly bot", but just about a "bot" in general. I've
>> given this question some thought and I think it sits in a reasonable
>> place in the risk-reward space. The answer would be, yes, we'd need
>> to treat this like any other part of the pipeline. However, I believe
>> that Polly has as many, or more, active contributors than essentially
>> any other individual part of the mid-level optimizer or CodeGen. As a
>> result, there will be people around in many time zones to help with
>> problems with Polly-related code.
>>> As far as I know, there were companies in the past that tried Polly
>>> without a whole lot of prior experience. It would be great to hear
>>> what the experience was before adopting Polly at a much larger scale.
>> I'm also interested, although I'll caution against over-interpreting
>> any evidence here (positive or negative). Before a few weeks ago,
>> Polly didn't effectively run in the pipeline after inlining, and so I
>> doubt it would have been much use outside of embedded environments
>> (and maybe some HPC environments) with straightforwardly-presented C
>> code. It's only now that this has been fixed that I find the
>> possibility of integrating this in production interesting.
> That is a good point. There are also biases independent of past
> experiences (for disclosure mine is (*) above). But I think it is
> objective to say a Polly integration is a big piece to swallow.Your
> pro-Polly argument lists a number of categories that I think could be
> reasoned about individually and partly evaluated with a data-driven
> A) Architecture
> - support for autoparallelism
> - support for accelerators
> - isl- rewrite? etc
> B) Modelling
> - polyhedral model
> - temporal locality
> - spatial locality
> C) Analysis/Optimizations
> - Dependence Analysis
> - Transformation effective/power (loop nests, quality of
> transformations, #vectorizable loops etc)
> A) is mostly Polly independent (except for the isl question I guess).
> For B and C performance/ compile-time /opt-viewer data on a
> decent/wide range of benchmarks possibly at different optimization
> levels (O2, O3, LTO, PGO etc and combinations) should provide
> data-driven insight into costs/benefits.
I agree. In practice, the first question is: Are will willing to take on
Polly (+isl), in whole or in part, as a build dependency? If the answer
is yes, the next question is: what parts should be reused or refactored
for use in other parts of the pipeline? My argument is that we should
take on Polly, or most of it, as a build dependency. Work on better
unifying the developer communities as we start experimenting with other
kinds of integration. This will, however, allow us to provide to all of
our users these transformations through pragmas (and other kinds of
optional enablement). This is an important first step.
I'm not sure exactly how good this is, but polly has LNT-submitting
bots, so the website can generate a comparison (e.g.,
at this comparison shows a number of potential problems but also cases
where Polly really helps (and, FWIW, the largest two compile-time
regressions are also associated with very large execution performance
improvements). My first focus would certainly be on pragma-driven
>> Thanks again,
>>>  http://lists.llvm.org/pipermail/llvm-dev/2015-November/092017.html
>>>  http://lists.llvm.org/pipermail/llvm-dev/2016-March/096266.html
>>>> Hal (on behalf of myself, Tobias Grosser, and Michael Kruse, with
>>>> feedback from**several other active Polly developers)
>>>> Hal Finkel
>>>> Lead, Compiler Technology and Programming Languages
>>>> Leadership Computing Facility
>>>> Argonne National Laboratory
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>
>> Hal Finkel
>> Lead, Compiler Technology and Programming Languages
>> Leadership Computing Facility
>> Argonne National Laboratory
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the llvm-dev