[llvm-dev] [RFC] Polly Status and Integration

Hal Finkel via llvm-dev llvm-dev at lists.llvm.org
Fri Sep 22 11:21:18 PDT 2017


On 09/22/2017 08:47 AM, Johannes Doerfert wrote:
> Hi Hal,
>
> On 09/21, Hal Finkel via llvm-dev wrote:
>> On 09/12/2017 10:26 PM, Gerolf Hoflehner wrote:
>>>
>>> ...
>> Polly also handles piecewise-affine functions, which allows the
>> modeling of loops with conditionals. Extending LAA to handle loop
>> nests, moreover, seems likely to be non-trivial.
> What exactly do you mean here? A loop with a conditional does not
> necessarily cause a piecewise-affine (access) function. The ones that
> actually do cause piecewise-affine (access) functions are not handled by
> Polly (except min/max). This is not because Polly could not handle them
> but because it relies on Scalar Evolution.

Thanks for clarifying this. min/max is certainly an important special 
case, although it would also be good to handle more-general cases. 
Not to imply that SCEV is an ideal tool for this purpose (the 
replacement you propose, for example, may well be better), but it would 
seem like matching access functions that look like select(cond, AddRec, 
AddRec), and similar, would be useful and practical even within the 
current framework.
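
To make this concrete, here is a rough sketch of what such a matcher 
might look like on top of the existing SCEV machinery. To be clear, this 
is not code from Polly or LAA, and the helper name is made up; it is 
only meant to illustrate the shape of the matching:

  #include "llvm/Analysis/LoopInfo.h"
  #include "llvm/Analysis/ScalarEvolution.h"
  #include "llvm/Analysis/ScalarEvolutionExpressions.h"
  #include "llvm/IR/Instructions.h"
  using namespace llvm;

  // If PtrOrIdx is a select whose two operands are both affine AddRecs in
  // loop L, return the two pieces; otherwise return false. (Hypothetical
  // helper, for illustration only.)
  static bool matchPiecewiseAffine(Value *PtrOrIdx, const Loop *L,
                                   ScalarEvolution &SE,
                                   const SCEVAddRecExpr *&TruePiece,
                                   const SCEVAddRecExpr *&FalsePiece) {
    auto *Sel = dyn_cast<SelectInst>(PtrOrIdx);
    if (!Sel || !SE.isSCEVable(Sel->getType()))
      return false;

    TruePiece = dyn_cast<SCEVAddRecExpr>(SE.getSCEV(Sel->getTrueValue()));
    FalsePiece = dyn_cast<SCEVAddRecExpr>(SE.getSCEV(Sel->getFalseValue()));
    return TruePiece && FalsePiece &&
           TruePiece->getLoop() == L && FalsePiece->getLoop() == L &&
           TruePiece->isAffine() && FalsePiece->isAffine();
  }

Whether SCEV is the right place to hook this in is, of course, part of 
the larger question about the underlying infrastructure.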

I'll immediately point out that, OTOH, while piecewise access functions 
are likely to be converted into selects early in our current 
infrastructure, there's also an ongoing discussion about whether that's 
actually optimal (i.e., forming selects early is likely suboptimal for 
GVN and we might want to delay this conversion until later in the 
pipeline). See PR34603. And so, if we want to think about using this 
kind of analysis during canonicalization as well as during lowering, 
more may be required to handle this consistently.

>
>> Regardless, measuring these differences certainly seems like a good idea.
> It does; that's why I'm trying to get some actual performance results soon.
> If somebody does the same for Polly, e.g., by extending the GSoC'16
> interface that exists, we will have actual data to talk about.

Given that Polly now runs right before the vectorizer, can we do some 
comparisons immediately just by looking at loops which LAA can't analyze 
but Polly can analyze (or vice versa)?

>
>> I think that we can do this using optimization remarks. LAA already
>> emits optimization remarks for loops in which it finds unsafe memory
>> dependencies.  Polly also emits optimization remarks. We may need to
>> iterate some in order to setup a good comparison, but we should be
>> able to collect statistics (and other information) by compiling code
>> using -fsave-optimization-record (in combination with some other
>> flags), and then analyzing the resulting YAML files.
>>
>>>>> While this model may be slow it has all the benefits of the
>>>>> incremental development model.
>>>> The current model may have been slow in many areas, but I think that's
>>>> mostly a question of development effort. My largest concern about the
>>>> current model is that, to the extent that we're implementing classic
>>>> loop transformations (e.g., fusion, distribution, interchange, skewing,
>>>> tiling, and so on), we're repeating a historical design that is known to
>>>> have several suboptimal properties. Chief among them is the lack of
>>>> integration: many of these transformations are interconnected, and
>>>> there's no good pass ordering in which to make independent decisions.
>>>> Many of these transformations can be captured in a single model and we
>>>> can get much better results by integrating them. There's also the matter
>>>> of whether building these transformations on SCEV (or IR directly) is the
>>>> best underlying infrastructure, or whether parts of Polly would be
>>>> better.
>>> I believe that is true. What I wonder is: is there a good method to reason
>>> about it?
>> If I understand what you mean, one way to look at it is this: This is not a
>> canonicalization problem. Picking an optimal way to interchange loops may
>> depend on how the result can be skewed and/or tiled; picking an optimal way
>> to distribute loops often depends on what can be done afterward in each
>> piece. Optimal here generally involves reasoning about the memory hierarchy
>> (e.g., cache properties), available prefetching streams, register-file size,
>> and so on.
> While I totally agree with the idea here, I am not so convinced anymore.
> The polyhedral scheduling techniques that I have seen do not take more
> than two of the above properties into account. One problem is simply the
> way the scheduler works (per statement, dimension after dimension,
> starting with the outermost one) and what that means for the quality of
> performance prediction during the scheduling.

I agree that we may need to change the way that the scheduler works. 
However, there's value in having a unified transformation framework 
capable of performing the selected schedule (regardless of how we arrive 
at that schedule). Moreover, I suspect that we'll need to have a 
conversation about the scheduler design in the context of the cost 
modeling parameters we need to consider, and how those models are 
expressed. As I understand it, the current incremental technique is 
largely motivated by efficiency concerns and/or to keep the problems 
linear? FWIW, while I'm not an expert in this area, a few of my 
colleagues at the lab are experts in mixed-integer optimization, and at 
some point soon I'll chat with them about the approach here.

>   It is also problematic
> that properties might be "non-affine" which doesn't allow us to encode
> them directly.

But we'll solve this either by over-approximating or by restricting the 
problem space (e.g., to an inner loop where we can perform the 
analysis). Either way, this seems like an orthogonal concern.

>   At the moment the only "heuristic" you mentioned that is
> actually used in Polly is the use of cache properties for the hand-tuned
> optimization (see below).

Certainly true. I think this will need to change if we want to do 
purely-cost-model-driven loop optimizations.

>
>> I know that I've seen some good examples in papers over the years that
>> illustrate the phase-ordering challenges. Hopefully, someone will jump in
>> here with some good references. One classic one is: William Pugh. Uniform
>> Techniques for Loop Optimization. 1991.
>>
>>> Perhaps concrete examples or perhaps opt-viewer based comparisons on large
>>> sets of benchmarks? In the big picture you could make such a modeling
>>> argument for all compiler optimizations.
>> Certainly. However, in this case there's a well-studied unified model for
>> this set of optimizations known to reduce phase-ordering effects. That's not
>> true in general.
> Polyhedral scheduling is not restricted by phase-ordering effects.
> Though, it still has to be guided by an objective function that
> realistically models "sufficient performance indicators" to actually
> improve performance. I think we should talk more about how we could
> get such an objective function and what the intermediate steps might be,
> as this is (at least to me) absolutely not clear.

The reason that I've made such a point of talking about the cost 
modeling is that:

  1. I think this is a large current weakness of Polly
  2. I think this is a large current deficiency in the literature (i.e., 
I don't know of anyone else who (publicly) discusses how to do this well)
  3. I think that improving this is going to be a lot of work

It's pretty clear that we need to handle a bunch of different effects 
here (depending on what we're doing), not just cache effects, but also 
prefetching streams, ILP, register pressure, and more. An exception to 
this will be if we do some transformations early (i.e., during 
canonicalization), in which case, we might not require as much from the 
cost model. That having been said, some potential canonicalization 
choices (e.g., loop fusion) may be impractical until we have a good 
cost-model-based way to undo them (e.g., loop fission).
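
Purely as an illustration of the kind of information such a cost model 
would need to consume (none of these names correspond to an existing 
LLVM or Polly interface; this is hypothetical), I'm imagining something 
along the lines of:

  // Illustrative only; all names here are hypothetical.
  struct LoopOptTargetParams {
    unsigned L1DCacheBytes;     // first-level data-cache capacity
    unsigned L2CacheBytes;      // second-level cache capacity
    unsigned CacheLineBytes;    // granularity of spatial locality
    unsigned HWPrefetchStreams; // hardware prefetch streams available
    unsigned VectorRegisters;   // register-file size the scheduler sees
    unsigned IssueWidth;        // a rough proxy for exploitable ILP
  };

How (and whether) a polyhedral scheduler's objective function can 
actually consume parameters like these is exactly the open question 
discussed above.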

Moreover, I think that we'll want to think seriously about how we do 
loop multiversioning in this context. A loop structure that is optimized 
for large working sets might not be optimal for working sets that fit in 
L1, or even L2, cache.
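
As a trivial source-level illustration of what I mean (the threshold and 
tile size here are arbitrary placeholders, not values that any current 
cost model produces):

  // Two versions of the same kernel: a plain loop nest for working sets
  // that fit in cache, and a tiled version for larger ones. The dispatch
  // threshold is a made-up stand-in for a cost-model-derived value.
  void matmul(float *C, const float *A, const float *B, long N) {
    const long WorkingSetBytes = 3 * N * N * (long)sizeof(float);
    const long CacheBytes = 1 << 20; // hypothetical L2 capacity

    if (WorkingSetBytes <= CacheBytes) {
      // Small working set: everything is cache-resident anyway, so tiling
      // would only add loop overhead.
      for (long i = 0; i < N; ++i)
        for (long j = 0; j < N; ++j)
          for (long k = 0; k < N; ++k)
            C[i * N + j] += A[i * N + k] * B[k * N + j];
    } else {
      // Large working set: tile the j/k dimensions so blocks of B and C
      // stay in cache across iterations of i.
      const long T = 64;
      for (long jj = 0; jj < N; jj += T)
        for (long kk = 0; kk < N; kk += T)
          for (long i = 0; i < N; ++i)
            for (long j = jj; j < jj + T && j < N; ++j)
              for (long k = kk; k < kk + T && k < N; ++k)
                C[i * N + j] += A[i * N + k] * B[k * N + j];
    }
  }

The dispatch condition itself depends on exactly the kinds of parameters 
the cost model would need to provide.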

>
>> ...
>> I'm not sure exactly how good this is, but polly has LNT-submitting bots, so
>> the website can generate a comparison (e.g.,
>> http://lnt.llvm.org/db_default/v4/nts/71208?compare_to=71182). Looking at
>> this comparison shows a number of potential problems but also cases where
>> Polly really helps (and, FWIW, the largest two compile-time regressions are
>> also associated with very large execution performance improvements). My
>> first focus would certainly be on pragma-driven enablement.
> To be completely honest, I think this comparison oversells the positive
> runtime effects of Polly quite a bit.

Funny. I wasn't overly impressed by the comparison as it is. Mixed 
results on Polybench and positive results mostly on simple benchmark 
loops. One bright spot here is actually 
SingleSource/Benchmarks/Shootout-C++/ary3, because it shows that Polly 
can successfully transform a loop that uses std::vector (instead of just 
plain C-style arrays). The only DOE workload proxy, out of our current 
ten, that shows up here is 
MultiSource/Benchmarks/DOE-ProxyApps-C/SimpleMOC, and that's a 30% 
performance regression. Clearly I'm not advocating we turn this on by 
default under these conditions.

>   (Though I do not think you do that on purpose.) Let me explain:
>
> At least the first 5 of the 19 improved benchmarks are matrix-matrix
> multiplications. These are not automatically scheduled by Polly but
> instead pattern matched and replaced by hand-tuned, target-specific
> versions (ref. cache properties). Even if you argue this is a valid
> "automatic optimization", it is still only one improvement shown five
> times.

The first three are certainly very similar in this regard. The others 
have outer loops, or other loops, that could be relevant (although I 
imagine that the matrix-multiplication part is very relevant). 
Regardless, this point is well taken.

>   Note also how matrix-matrix multiplications perform when they
> are not matched:
>    MultiSource/Benchmarks/VersaBench/bmm/bmm
>    SingleSource/Benchmarks/Polybench/linear-algebra/kernels/trmm/trmm

For everyone's convenience, in this comparison, both of these show 
significant compile-time and runtime regressions.

>
> Last time I checked the impact of Polly on the TSVC benchmarks I
> realized that the effect was due to unrelated passes scheduled by Polly.
> This might have changed by now but I doubt it.

I was under the impression that this did change when Polly was moved 
later in the pipeline (and it looks like this is true from the code in 
Support/RegisterPasses.cpp, but I certainly could be missing something). 
OTOH, if you're right, and this is due to a late run of some 
canonicalization pass (the things registered by 
polly::registerCanonicalicationPasses), we should definitely understand 
that: some of those performance improvements are very large (~25% for 
/LoopRestructuring-flt).

> This leaves ten improved
> Polybench benchmarks and whetstone. At least my local version of Polly
> does not optimize the latter, thus the effect might be the same as for
> the three TSVC benchmarks. For the Polybench benchmarks it would be
> interesting to know what kind of transformation improved ten of them and
> what regressed five others. I am especially interested in whether there is a
> benefit from polyhedral scheduling or if it is only due to the tiling
> that is applied afterwards.

Definitely good questions.

Regardless, it's clear that the cost modeling needs work (and there are 
other potential significant improvements). In the short term, what I find 
most appealing is the framework for performing complex transformations, 
which we can enable via pragmas, along with trying to better integrate the 
community. 
The point I take is that Polly can do good things, whether by general 
scheduling or by pattern matching (and it's good that it can do both in 
the same infrastructure).

>
> --
>
> I don't want to play devil's advocate here all the time, but I think we
> should be realistic about what we have and can expect. If not, we will
> only create unfulfilled expectations that won't help anybody.

I find this feedback very valuable (and I'm sure that others do as 
well). Besides, you've made concrete alternative suggestions, so you're 
not just playing devil's advocate.

Thanks again,
Hal

>
> Thanks,
>    Johannes
>
> ...
>

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


