<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">2017-09-21 22:22 GMT-07:00 Hal Finkel <span dir="ltr"><<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"><div><div class="h5">
<p><br>
</p>
<div class="m_-8232648019342385258moz-cite-prefix">On 09/22/2017 12:03 AM, Mehdi AMINI
wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">Hi Hal,
<div><br>
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">2017-09-21 20:59 GMT-07:00 Hal Finkel
via llvm-dev <span dir="ltr"><<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>></span>:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"><span> <br>
<div class="m_-8232648019342385258m_-2166750394864443591moz-cite-prefix">On
09/12/2017 10:26 PM, Gerolf Hoflehner wrote:<br>
</div>
</span>
<blockquote type="cite">
<div dir="auto" style="word-wrap:break-word;line-break:after-white-space"><br>
<div><br>
<blockquote type="cite"><span>
<div>On Sep 11, 2017, at 10:47 PM, Hal Finkel
via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>>
wrote:</div>
<br class="m_-8232648019342385258m_-2166750394864443591Apple-interchange-newline">
</span>
>>>>> On 09/11/2017 12:26 PM, Adam Nemet wrote:
<blockquote type="cite"><span> Hi
Hal, Tobias, Michael and others, </span>
<div><b>...</b><span>
<div>
<div><br>
</div>
<div>One thing that I’d like to see
more details on is what this means
for the evolution of loop
transformations in LLVM.</div>
<div><br>
</div>
<div>Our more-or-less established
direction was so far to
incrementally improve and
generalize the required analyses
(e.g. the LoopVectorizer’s
dependence analysis + loop
versioning analysis into a
stand-alone analysis pass
(LoopAccessAnalysis)) and then
build new transformations (e.g.
LoopDistribution,
LoopLoadElimination,
LICMLoopVersioning) on top of
these.</div>
<div><br>
</div>
<div>The idea was that
infrastructure would be
incrementally improved from two
directions:</div>
<div><br>
</div>
<div>- As new transformations are
built analyses have to be improved
(e.g. past improvements to LAA to
support the LoopVersioning
utility, future improvements for
full LoopSROA beyond just
store->load forwarding [1] or
the improvements to LAA for the
LoopFusion proposal[2])</div>
<div><br>
</div>
<div>- As more complex loops would
have to be analyzed we either
improve LAA or make
DependenceAnalysis a drop-in
replacement for the memory
analysis part in LAA</div>
</div>
>>>>>
>>>>> Or we could use Polly's dependence analysis, which I believe to be
>>>>> more powerful, more robust, and more correct than
>>>>> DependenceAnalysis. I believe that the difficult part here is
>>>>> actually the pairing with predicated SCEV, or whatever mechanism we
>>>>> want to use to generate runtime predicates (this applies to the use
>>>>> of DependenceAnalysis too).
>>>>
>>>> What is a good way to measure these assertions (more powerful, more
>>>> robust)? Are you saying the LLVM DependenceAnalysis is incorrect, or
>>>> do you actually mean less conservative (or "more accurate" or
>>>> something like that)?
>>>
>>> Sebastian's email covers the issues with the DependenceAnalysis pass
>>> pretty well.
>>>
>>> Regarding what's in LoopAccessAnalysis, I believe it to be correct,
>>> but more limited. LAA seems good at what it does, but its scope is
>>> largely bounded by what the vectorizer can handle. LAA could do
>>> better in some cases with non-unit-stride loops. Polly also handles
>>> piecewise-affine functions, which allows the modeling of loops with
>>> conditionals. Extending LAA to handle loop nests, moreover, seems
>>> likely to be non-trivial.
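>>>
>>> To make those shapes concrete, here is an illustrative sketch (toy
>>> code, function names invented for illustration) of the two kinds of
>>> loops just mentioned:
>>>
>>>   /* Non-unit stride: the subscript 2*i is affine but not
>>>      contiguous, a case where LAA is currently conservative. */
>>>   void scale_even(float *a, int n) {
>>>     for (int i = 0; i < n / 2; ++i)
>>>       a[2 * i] *= 2.0f;
>>>   }
>>>
>>>   /* Conditional in the body: the set of accessed elements is
>>>      piecewise-affine in i, which the polyhedral model can
>>>      represent. */
>>>   void split_update(float *a, int n, int m) {
>>>     for (int i = 0; i < n; ++i) {
>>>       if (i < m)
>>>         a[i] = 0.0f;
>>>       else
>>>         a[i] *= 3.0f;
>>>     }
>>>   }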
>>>
>>> Regardless, measuring these differences certainly seems like a good
>>> idea. I think that we can do this using optimization remarks. LAA
>>> already emits optimization remarks for loops in which it finds unsafe
>>> memory dependencies. Polly also emits optimization remarks. We may
>>> need to iterate some in order to set up a good comparison, but we
>>> should be able to collect statistics (and other information) by
>>> compiling code using -fsave-optimization-record (in combination with
>>> some other flags), and then analyzing the resulting YAML files.
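>>> For instance, LAA should report an unsafe dependence on a toy loop
>>> like the one below (a hypothetical test case; compiling it with
>>> -fsave-optimization-record puts the remark into the YAML output,
>>> where it can be counted):
>>>
>>>   /* Each iteration reads the value written by the previous one, so
>>>      the loop carries a backward dependence and cannot be vectorized
>>>      as-is; LAA flags the unsafe dependent memory operations. */
>>>   void prefix_sum(float *a, int n) {
>>>     for (int i = 1; i < n; ++i)
>>>       a[i] += a[i - 1];
>>>   }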
<blockquote type="cite">
<div dir="auto" style="word-wrap:break-word;line-break:after-white-space">
<div>
<blockquote type="cite">
<div>
<div bgcolor="#FFFFFF" text="#000000"> <br>
<blockquote type="cite">
<div>
<div>
<div><br>
</div>
<div>While this model may be slow it
has all the benefits of the
incremental development model.</div>
</div>
</div>
</blockquote>
>>>>>
>>>>> The current model may have been slow in many areas, but I think
>>>>> that's mostly a question of development effort. My largest concern
>>>>> about the current model is that, to the extent that we're
>>>>> implementing classic loop transformations (e.g., fusion,
>>>>> distribution, interchange, skewing, tiling, and so on), we're
>>>>> repeating a historical design that is known to have several
>>>>> suboptimal properties. Chief among them is the lack of integration:
>>>>> many of these transformations are interconnected, and there's no
>>>>> good pass ordering in which to make independent decisions. Many of
>>>>> these transformations can be captured in a single model, and we can
>>>>> get much better results by integrating them. There's also the
>>>>> matter of whether building these transformations on SCEV (or IR
>>>>> directly) is the best underlying infrastructure, or whether parts
>>>>> of Polly would be better.
>>>>
>>>> I believe that is true. What I wonder is whether there is a good
>>>> method to reason about it.
>>>
>>> If I understand what you mean, one way to look at it is this: this
>>> is not a canonicalization problem. Picking an optimal way to
>>> interchange loops may depend on how the result can be skewed and/or
>>> tiled; picking an optimal way to distribute loops often depends on
>>> what can be done afterward in each piece. Optimal here generally
>>> involves reasoning about the memory hierarchy (e.g., cache
>>> properties), available prefetching streams, register-file size, and
>>> so on.
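>>>
>>> A small sketch of the interchange/tiling interplay (illustrative
>>> only, toy code):
>>>
>>>   #define N 1024
>>>   /* With j outermost, both row-major arrays are walked column-wise
>>>      (stride of N doubles). Interchanging to make i outermost gives
>>>      unit-stride accesses, but whether that is the best choice can
>>>      change again once tiling the nest for cache reuse is on the
>>>      table; deciding the two transformations independently can lock
>>>      in a poor combination. */
>>>   void add_mats(double A[N][N], double B[N][N]) {
>>>     for (int j = 0; j < N; ++j)
>>>       for (int i = 0; i < N; ++i)
>>>         A[i][j] += B[i][j];
>>>   }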
>>>
>>> I know that I've seen some good examples in papers over the years
>>> that illustrate the phase-ordering challenges. Hopefully, someone
>>> will jump in here with some good references. One classic is: William
>>> Pugh, "Uniform Techniques for Loop Optimization", 1991.
<blockquote type="cite">
<div dir="auto" style="word-wrap:break-word;line-break:after-white-space">
<div> Perhaps concrete examples or perhaps
opt-viewer based comparisons on large sets of
benchmarks? In the big picture you could make
such a modeling argument for all compiler
optimizations.<br>
</div>
</div>
</blockquote>
>>>
>>> Certainly. However, in this case there's a well-studied unified
>>> model for this set of optimizations known to reduce phase-ordering
>>> effects. That's not true in general.
<blockquote type="cite">
<div dir="auto" style="word-wrap:break-word;line-break:after-white-space">
<div>
<blockquote type="cite">
<div>
<div bgcolor="#FFFFFF" text="#000000"> <br>
That having been said, I think that
integrating this technology into LLVM will
also mean applying appropriate modularity.
I think that we'll almost definitely want
to make use of the dependence analysis
separately as an analysis. We'll want to
decide which of these transformations will
be considered canonicalization (and run in
the iterative pipeline) and which will be
lowering (and run near the vectorizer).
LoopSROA certainly sounds to me like
canonicalization, but loop fusion might
also fall into that category (i.e., we
might want to fuse early to enable
optimizations and then split late).<br>
<br>
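>>>>>
>>>>> A sketch of the fuse-early/split-late idea (toy example, assuming
>>>>> the pointers don't alias, hence the restrict qualifiers):
>>>>>
>>>>>   /* Two loops over the same range. Fusing them early lets the
>>>>>      store to tmp[i] forward directly to the load, exposing
>>>>>      further simplification; a later, lowering-time distribution
>>>>>      could still split the fused body again if that vectorizes
>>>>>      better. */
>>>>>   void before(float *restrict a, float *restrict c,
>>>>>               float *restrict tmp, int n) {
>>>>>     for (int i = 0; i < n; ++i) tmp[i] = a[i] + 1.0f;
>>>>>     for (int i = 0; i < n; ++i) c[i] = tmp[i] * 2.0f;
>>>>>   }
>>>>>   void fused(float *restrict a, float *restrict c,
>>>>>              float *restrict tmp, int n) {
>>>>>     for (int i = 0; i < n; ++i) {
>>>>>       tmp[i] = a[i] + 1.0f;
>>>>>       c[i] = tmp[i] * 2.0f; /* tmp[i] forwards as a register */
>>>>>     }
>>>>>   }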
<blockquote type="cite">
<div>
<div>
<div><br>
</div>
<div>Then there is the question of
use cases. It’s fairly obvious
that anybody wanting to optimize a
5-deep highly regular loop-nest
operating on arrays should use
Polly. On the other hand it’s way
less clear that we should use it
for singly or doubly nested
not-so-regular loops which are the
norm in non-HPC workloads.</div>
</div>
</div>
</blockquote>
>>>>>
>>>>> This is clearly a good question, but thinking about Polly as a set
>>>>> of components, not as a monolithic transformation component, I
>>>>> think that polyhedral analysis and transformations can underlie a
>>>>> lot of the transformations we need for non-HPC code (and which,
>>>>> I'll point out, we need for modern HPC code too). In practice, the
>>>>> loops that we can actually analyze have affine dependencies, and
>>>>> Polly does, or can do, a better job at generating runtime
>>>>> predicates and dealing with piecewise-linear expressions than our
>>>>> current infrastructure.
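>>>>>
>>>>> The general pattern (a sketch of what loop versioning produces
>>>>> internally, not Polly's actual output) looks like this:
>>>>>
>>>>>   void saxpy(int n, float a, float *x, float *y) {
>>>>>     /* Runtime predicate: take the optimized version only when the
>>>>>        accessed ranges of x and y provably do not overlap. (Written
>>>>>        in C for illustration; the compiler emits this check in
>>>>>        IR, where the pointer comparison is well-defined.) */
>>>>>     if (x + n <= y || y + n <= x) {
>>>>>       for (int i = 0; i < n; ++i) /* independent: vectorizable */
>>>>>         y[i] += a * x[i];
>>>>>     } else {
>>>>>       for (int i = 0; i < n; ++i) /* conservative fallback */
>>>>>         y[i] += a * x[i];
>>>>>     }
>>>>>   }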
>>>>>
>>>>> In short, I look at Polly as two things: first, an infrastructure
>>>>> for dealing with loop analysis and transformation, which I view as
>>>>> being broadly applicable; and second, an application of that
>>>>> infrastructure to cost-model-driven classic loop transformations.
>>>>> To some extent the latter is going to be more useful for HPC codes,
>>>>> but it also applies to machine learning, signal processing,
>>>>> graphics, and other areas.
>>>>
>>>> I'm wondering if it could be used for pointing out headroom for the
>>>> existing LLVM ecosystem (*).
>>>
>>> Sure.
<blockquote type="cite">
<div dir="auto" style="word-wrap:break-word;line-break:after-white-space">
<div><br>
<blockquote type="cite">
<div>
<div bgcolor="#FFFFFF" text="#000000"> <br>
<blockquote type="cite">
<div>
<div>
<div><br>
</div>
<div>And this brings me to the
maintenance question. Is it
reasonable to expect people to fix
Polly when they have a seemingly
unrelated change that happens to
break a Polly bot.</div>
</div>
</div>
</blockquote>
>>>>>
>>>>> The eventual goal here is to have this technology in appropriate
>>>>> parts of the main pipeline, and so the question here is not really
>>>>> about breaking a "Polly bot", but just about a "bot" in general.
>>>>> I've given this question some thought and I think it sits in a
>>>>> reasonable place in the risk-reward space. The answer would be:
>>>>> yes, we'd need to treat this like any other part of the pipeline.
>>>>> However, I believe that Polly has as many active contributors as,
>>>>> or more than, essentially any other individual part of the
>>>>> mid-level optimizer or CodeGen. As a result, there will be people
>>>>> around in many time zones to help with problems in Polly-related
>>>>> code.
<blockquote type="cite">
<div>
<div>
<div> As far as I know, there were
companies in the past that tried
Polly without a whole lot of prior
experience. It would be great to
hear what the experience was
before adopting Polly at a much
larger scale.</div>
</div>
</div>
</blockquote>
>>>>>
>>>>> I'm also interested, although I'll caution against
>>>>> over-interpreting any evidence here (positive or negative). Before
>>>>> a few weeks ago, Polly didn't effectively run in the pipeline after
>>>>> inlining, and so I doubt it would have been much use outside of
>>>>> embedded environments (and maybe some HPC environments) with
>>>>> straightforwardly-presented C code. It's only now that this has
>>>>> been fixed that I find the possibility of integrating this in
>>>>> production interesting.
>>>>
>>>> That is a good point. There are also biases independent of past
>>>> experiences (for disclosure, mine is (*) above). But I think it is
>>>> objective to say that a Polly integration is a big piece to swallow.
>>>> Your pro-Polly argument lists a number of categories that I think
>>>> could be reasoned about individually and partly evaluated with a
>>>> data-driven approach:
>>>>
>>>> A) Architecture
>>>>    - support for autoparallelism
>>>>    - support for accelerators
>>>>    - isl rewrite? etc.
>>>>    ...
>>>> B) Modelling
>>>>    - polyhedral model
>>>>    - temporal locality
>>>>    - spatial locality
>>>>    ...
>>>> C) Analysis/Optimizations
>>>>    - Dependence Analysis
>>>>    - Transformation effectiveness/power (loop nests, quality of
>>>>      transformations, #vectorizable loops, etc.)
>>>>
>>>> A) is mostly Polly-independent (except for the isl question, I
>>>> guess). For B) and C), performance/compile-time/opt-viewer data on
>>>> a decently wide range of benchmarks, possibly at different
>>>> optimization levels (O2, O3, LTO, PGO, etc., and combinations),
>>>> should provide data-driven insight into costs/benefits.
>>>
>>> I agree. In practice, the first question is: are we willing to take
>>> on Polly (+ isl), in whole or in part, as a build dependency? If the
>>> answer is yes, the next question is: what parts should be reused or
>>> refactored for use in other parts of the pipeline? My argument is
>>> that we should take on Polly, or most of it, as a build dependency,
>>> and work on better unifying the developer communities as we start
>>> experimenting with other kinds of integration. This will, however,
>>> allow us to provide these transformations to all of our users
>>> through pragmas (and other kinds of optional enablement). This is an
>>> important first step.
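>>>
>>> For example, pragma-driven enablement could follow the shape of the
>>> loop pragmas clang already has (the distribution pragma below exists
>>> today; a Polly-driven spelling would be analogous):
>>>
>>>   void f(float *a, float *b, float *c, float *d, int n) {
>>>   #pragma clang loop distribute(enable)
>>>     for (int i = 0; i < n; ++i) {
>>>       a[i] = b[i] + 1.0f;
>>>       c[i] = d[i] * 2.0f;
>>>     }
>>>   }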
>>
>> When you write that we should take Polly as a build dependency: are
>> you envisioning this to be tied into LLVM to the point where we can't
>> build LLVM without Polly?
>> I thought the approach would rather be that a default build of LLVM
>> would have Polly available, but that there will "forever" be a
>> `-DENABLE_POLLY=OFF` option?
>> This seems like an important distinction to me, as making it more
>> integrated but still optional for the correct behavior of LLVM means
>> that people can continue to work on and maintain "forever" an
>> optimization pipeline which operates without Polly, while starting to
>> get more integration in a pluggable form. I believe this is the
>> direction that was taken with the GSoC project last year, trying to
>> get Polly in as a "pluggable" dependence analysis.
>
> I don't want to make statements about "forever":

Right, I forgot the asterisk to explain my quotes around "forever": I
meant by this "not until we have compelling data and agreement to do
otherwise". I think we're in agreement on this, though :)

> eventually, I'd like to have strong loop optimizations as an
> integrated part of LLVM. At that point, "Polly" might be little more
> than a shorthand for some parts of LLVM. Now, LLVM currently has an
> issue whereby it's difficult to build customized versions with only a
> subset of the needed passes, but that's an orthogonal issue. In the
> short term, it would certainly be separable. I believe that we'd
> certainly keep the `-DENABLE_POLLY=OFF` option as long as we felt it
> made sense.
>
> By build dependency, I mean: the sources are part of the main source
> tree, they're built by default, thus essentially all of the builders
> build them, and they'll end up in many distributions and derived
> products. This will allow us to build on what's there in whatever ways
> make sense.

Thanks for clarifying!
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> <br>
I'm not sure exactly how good this is, but polly has
LNT-submitting bots, so the website can generate a
comparison (e.g., <a class="m_-8232648019342385258m_-2166750394864443591moz-txt-link-freetext" href="http://lnt.llvm.org/db_default/v4/nts/71208?compare_to=71182" target="_blank">http://lnt.llvm.org/db_default<wbr>/v4/nts/71208?compare_to=71182</a><wbr>).
Looking at this comparison shows a number of potential
problems but also cases where Polly really helps (and,
FWIW, the largest two compile-time regressions are also
associated with very large execution performance
improvements). My first focus would certainly be on
pragma-driven enablement.<br>
</div>
</blockquote>
>>
>> I'll add: PGO-driven enablement, spending compile time where we know
>> it can have an impact. What do you think?
>
> Certainly (although, obviously, we still need reasonable confidence
> that it won't cause performance regressions to do that).

Oh right, I meant it through an opt-in flag to begin with
(-fenable-pgo-polly or something like that).

--
Mehdi