[llvm-dev] [RFC] Polly Status and Integration
Tobias Grosser via llvm-dev
llvm-dev at lists.llvm.org
Tue Sep 19 09:21:59 PDT 2017
Thanks for your input. Hal already answered most questions, but let me
complement his reply with a couple of data points:
On Tue, Sep 12, 2017, at 07:47, Hal Finkel wrote:
> On 09/11/2017 12:26 PM, Adam Nemet wrote:
> > Hi Hal, Tobias, Michael and others,
> >> On Sep 1, 2017, at 11:47 AM, Hal Finkel via llvm-dev
> >> <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote:
> >> Hi everyone,
> >>
> >> As you may know, stock LLVM does not provide the kind of
> >> advanced loop transformations necessary to provide good performance
> >> on many applications. LLVM's Polly project provides many of the
> >> required capabilities, including loop transformations such as
> >> fission, fusion, skewing, blocking/tiling, and interchange, all
> >> powered by state-of-the-art dependence analysis. Polly also provides
> >> automated parallelization and targeting of GPUs and other accelerators.
> >> Over the past year, Polly’s development has focused on robustness,
> >> correctness, and closer integration with LLVM. To highlight a few
> >> accomplishments:
> >> - Polly now runs, by default, in the conceptually proper place in
> >>   LLVM’s pass pipeline (just before the loop vectorizer).
> >>   Importantly, this means that its loop transformations are
> >>   performed after inlining and other canonicalization, greatly
> >>   increasing its robustness and enabling its use on C++ code
> >>   (where a loop body is often a function call before inlining).
> >> - Polly’s cost-modeling parameters, such as those describing the
> >>   target’s memory hierarchy, are being integrated with
> >>   TargetTransformInfo. This allows targets to properly override the
> >>   modeling parameters and allows reuse of these parameters by other
> >>   clients.
> >> - Polly’s method of handling signed division/remainder operations,
> >>   which worked around lack of support in ScalarEvolution, is being
> >>   replaced thanks to improvements being contributed to
> >>   ScalarEvolution itself (see D34598). Polly’s core delinearization
> >>   routines have long been a part of LLVM itself.
> >> - PolyhedralInfo, which exposes a subset of Polly’s loop analysis
> >>   for use by other clients, is now available.
> >> - Polly is now part of the LLVM release process and is being
> >>   included with LLVM by various packagers (e.g., Debian).
> >> I believe that the LLVM community would benefit from beginning the
> >> process of integrating Polly with LLVM itself and continuing its
> >> development as part of our main code base. This will:
> >> - Allow for wider adoption of LLVM within communities relying on
> >>   advanced loop transformations.
> >> - Provide for better community feedback on, and testing of, the
> >>   code developed (although the story in this regard is already
> >>   fairly solid).
> >> - Better motivate targets to provide accurate, comprehensive
> >>   modeling parameters for use by advanced loop transformations.
> >> - Perhaps most importantly, this will allow us to develop and tune
> >>   the rest of the optimizer assuming that Polly’s capabilities are
> >>   present (the underlying analysis, and eventually, the
> >>   transformations themselves).
> >> The largest issue on which community consensus is required, in order
> >> to move forward at all, is what to do with isl. isl, the Integer Set
> >> Library, provides core functionality on which Polly depends. It is a
> >> C library, and while some Polly/LLVM developers are also isl
> >> developers, it has a large user community outside of LLVM/Polly. A
> >> C++ interface was recently added, and Polly is transitioning to use
> >> the C++ interface. Nevertheless, options here include rewriting the
> >> needed functionality, forking isl and transitioning our fork toward
> >> LLVM coding conventions (and data structures) over time, and
> >> incorporating isl more-or-less as-is to avoid partitioning its
> >> development.
> >> That having been said, isl is internally modular, and regardless of
> >> the overall integration strategy, the Polly developers anticipate
> >> specializing, or even replacing, some of these components with
> >> LLVM-specific solutions. This is especially true for anything that
> >> touches performance-related heuristics and modeling. LLVM-specific,
> >> or even target-specific, loop schedulers may be developed as well.
> >> Even though some developers in the LLVM community already have a
> >> background in polyhedral-modeling techniques, the Polly developers
> >> have developed, and are still developing, extensive tutorials on this
> >> topic: http://pollylabs.org/education.html and especially
> >> http://playground.pollylabs.org.
> >> Finally, let me highlight a few ongoing development efforts in Polly
> >> that are potentially relevant to this discussion. Polly’s loop
> >> analysis is sound and technically superior to what’s in LLVM
> >> currently (i.e. in LoopAccessAnalysis and DependenceAnalysis). There
> >> are, however, two known reasons why Polly’s transformations could not
> >> yet be enabled by default:
> >> - A correctness issue: Currently, Polly assumes that 64 bits is
> >>   large enough for all new loop-induction variables and index
> >>   expressions. In rare cases, transformations could be performed
> >>   where more bits are required. Preconditions need to be generated
> >>   to prevent this (e.g., D35471).
> >> - A performance issue: Polly currently models temporal locality
> >>   (i.e., it tries to get better reuse in time), but does not model
> >>   spatial locality (i.e., it does not model cache-line reuse). As a
> >>   result, it can sometimes introduce performance regressions. Polly
> >>   Labs is currently working on integrating spatial-locality
> >>   modeling into the loop optimization model.
> >> Polly can already split apart basic blocks in order to implement loop
> >> fusion. Heuristics to choose the granularity at which to do so are
> >> still being implemented (e.g., PR12402).
> >> I believe that we can now develop a concrete plan for moving
> >> state-of-the-art loop optimizations, based on the technology in the
> >> Polly project, into LLVM. Doing so will enable LLVM to be competitive
> >> with proprietary compilers in high-performance computing, machine
> >> learning, and other important application domains. I’d like community
> >> feedback on what should be part of that plan.
> > One thing that I’d like to see more details on is what this means for
> > the evolution of loop transformations in LLVM.
> > Our more-or-less established direction was so far to incrementally
> > improve and generalize the required analyses (e.g. the
> > LoopVectorizer’s dependence analysis + loop versioning analysis into a
> > stand-alone analysis pass (LoopAccessAnalysis)) and then build new
> > transformations (e.g. LoopDistribution, LoopLoadElimination,
> > LICMLoopVersioning) on top of these.
> > The idea was that infrastructure would be incrementally improved from
> > two directions:
> > - As new transformations are built, analyses have to be improved (e.g.
> > past improvements to LAA to support the LoopVersioning utility, future
> > improvements for full LoopSROA beyond just store->load forwarding 
> > or the improvements to LAA for the LoopFusion proposal)
> > - As more complex loops would have to be analyzed we either improve
> > LAA or make DependenceAnalysis a drop-in replacement for the memory
> > analysis part in LAA
> Or we could use Polly's dependence analysis, which I believe to be more
> powerful, more robust, and more correct than DependenceAnalysis. I
> believe that the difficult part here is actually the pairing with
> predicated SCEV or whatever mechanism we want to use to generate runtime
> predicates (this applies to use of DependenceAnalysis too).
> > While this model may be slow, it has all the benefits of the
> > incremental development model.
> The current model may have been slow in many areas, but I think that's
> mostly a question of development effort. My largest concern about the
> current model is that, to the extent that we're implementing classic
> loop transformations (e.g., fusion, distribution, interchange, skewing,
> tiling, and so on), we're repeating a historical design that is known to
> have several suboptimal properties. Chief among them is the lack of
> integration: many of these transformations are interconnected, and
> there's no good pass ordering in which to make independent decisions.
> Many of these transformations can be captured in a single model and we
> can get much better results by integrating them. There's also the matter
> of whether building these transformations on SCEV (or IR directly) is the
> best underlying infrastructure, or whether parts of Polly would be a
> better foundation.
> That having been said, I think that integrating this technology into
> LLVM will also mean applying appropriate modularity. I think that we'll
> almost definitely want to make use of the dependence analysis separately
> as an analysis. We'll want to decide which of these transformations will
> be considered canonicalization (and run in the iterative pipeline) and
> which will be lowering (and run near the vectorizer). LoopSROA certainly
> sounds to me like canonicalization, but loop fusion might also fall into
> that category (i.e., we might want to fuse early to enable optimizations
> and then split late).
> > Then there is the question of use cases. It’s fairly obvious that
> > anybody wanting to optimize a 5-deep highly regular loop-nest
> > operating on arrays should use Polly. On the other hand it’s way less
> > clear that we should use it for singly or doubly nested not-so-regular
> > loops which are the norm in non-HPC workloads.
> This is clearly a good question, but thinking about Polly as a set of
> components, not as a monolithic transformation component, I think that
> polyhedral analysis and transformations can underlie a lot of the
> transformations we need for non-HPC code (and, which I'll point out, we
> need for modern HPC code too). In practice, the loops that we can
> actually analyze have affine dependencies, and Polly does, or can do, a
> better job at generating runtime predicates and dealing with
> piecewise-linear expressions than our current infrastructure.
> In short, I look at Polly as two things: First, an infrastructure for
> dealing with loop analysis and transformation. I view this as being
> broadly applicable. Second, an application of that to apply
> cost-model-driven classic loop transformations. To some extent this is
> going to be more useful for HPC codes, but also applies to machine
> learning, signal processing, graphics, and other areas.
Hal is very right here. Polly -- by itself -- is an end-to-end
implementation of a polyhedral loop optimization framework, but it
consists of several components -- which I expect will become
increasingly modular and also will be integrated at more and more places
into LLVM. I expect this integration to be gradual and driven by
community discussions and consensus. For me, the primary point of this
proposal is to enable such an integration and especially to make sure we
have maximal community input at each step -- carefully weighing the
different goals we have in the community.
> > And this brings me to the maintenance question. Is it reasonable to
> > expect people to fix Polly when they have a seemingly unrelated change
> > that happens to break a Polly bot?
> The eventual goal here is to have this technology in appropriate parts
> of the main pipeline, and so the question here is not really about
> breaking a "Polly bot", but just about a "bot" in general. I've given
> this question some thought and I think it sits in a reasonable place in
> the risk-reward space. The answer would be, yes, we'd need to treat this
> like any other part of the pipeline. However, I believe that Polly has
> as many active contributors as, or more than, essentially any other
> individual part of the mid-level optimizer or CodeGen. As a result,
> there will be people around in many time zones to help with problems
> with Polly-related code.
I personally would be even more conservative than Hal. While Polly is
by now very robust and has been shipped in production compilers, I
see this proposal more in line with an "experimental backend". Hence,
the maintenance burden will lie on Polly itself. As we did over the last
years, the Polly team will address bugs and regressions in buildbots
without putting any load on regular development. In fact, this promise is
easy to make: we see at most 1-2 bugs a month caused by LLVM
changes, and some of these are mis-optimizations in LLVM itself, which we
have a history of reporting and addressing.
> > As far as I know, there were companies in the past that tried Polly
> > without a whole lot of prior experience. It would be great to hear
> > what the experience was before adopting Polly at a much larger scale.
> I'm also interested, although I'll caution against over-interpreting any
> evidence here (positive or negative). Before a few weeks ago, Polly
> didn't effectively run in the pipeline after inlining, and so I doubt it
> would have been much use outside of embedded environments (and maybe
> some HPC environments) with straightforwardly-presented C code. It's
> only now that this has been fixed that I find the possibility of
> integrating this in production interesting.
Eli Friedman from Qualcomm has been working with us on the robustness of
Polly by regularly testing it on AOSP; Roman Gareev worked with ARM on
linear algebra kernels (Polly now gets within 10-20% of vendor
libraries, out of the box); we are currently working with
MeteoSwiss/CSCS on the COSMO weather and climate model; and together with
Johannes we have been working on some SPEC benchmarks (results still to
be published). We also see non-trivial speedups (2-4x) on SPEC for
automatic GPGPU code generation.
However, more important for an initial implementation is compile-time
cost. Polyhedral scheduling was the one piece of Polly that had an
inherently exponential cost as code sizes grew. For about half a year
now, we have had a new incremental scheduling approach which (assuming
limited fusion) bounds the scheduling cost by the maximal amount of loop
fusion we choose (see section 7.3).
I feel that at this point broader community input to our development is
very critical for the next steps -- especially as more people (e.g.,
Hal's group) want to apply polyhedral-based optimizations in LLVM. To
make sure Polly evolves in a direction well aligned with LLVM's goals,
the input of LLVM loop optimization experts like you (Adam), Gerolf, and
the larger LLVM community will be very helpful. Moving our day-to-day
discussions and development under the full LLVM umbrella will make this
possible.
I do not see this proposal as immediately replacing all classical loop
transformations, but rather expect Polly to complement existing
optimizations. To which degree will depend on our experiences over the
coming years.