<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<p><br>
</p>
<div class="moz-cite-prefix">On 09/22/2017 12:25 AM, Mehdi AMINI
wrote:<br>
</div>
<blockquote
cite="mid:CANF-O=bDEoDV=wD-K1_qafrGdZE4J0Re0BGm1=SB3+Rg==A=cA@mail.gmail.com"
type="cite">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<div dir="ltr"><br>
<div class="gmail_extra"><br>
<div class="gmail_quote">2017-09-21 22:22 GMT-07:00 Hal Finkel
<span dir="ltr"><<a moz-do-not-send="true"
href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>></span>:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000">
<div>
<div class="h5">
<p><br>
</p>
<div class="m_-8232648019342385258moz-cite-prefix">On
09/22/2017 12:03 AM, Mehdi AMINI wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">Hi Hal,
<div><br>
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">2017-09-21 20:59
GMT-07:00 Hal Finkel via llvm-dev <span
dir="ltr"><<a moz-do-not-send="true"
href="mailto:llvm-dev@lists.llvm.org"
target="_blank">llvm-dev@lists.llvm.org</a>></span>:<br>
<blockquote class="gmail_quote"
style="margin:0 0 0 .8ex;border-left:1px
#ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"><span>
<br>
<div
class="m_-8232648019342385258m_-2166750394864443591moz-cite-prefix">On
09/12/2017 10:26 PM, Gerolf
Hoflehner wrote:<br>
</div>
</span>
<blockquote type="cite">
<div dir="auto"
style="word-wrap:break-word;line-break:after-white-space"><br>
<div><br>
<blockquote type="cite"><span>
<div>On Sep 11, 2017, at 10:47
PM, Hal Finkel via llvm-dev
<<a
moz-do-not-send="true"
href="mailto:llvm-dev@lists.llvm.org"
target="_blank">llvm-dev@lists.llvm.org</a>>
wrote:</div>
<br
class="m_-8232648019342385258m_-2166750394864443591Apple-interchange-newline">
</span>
<div>
<div bgcolor="#FFFFFF"
text="#000000"><span>
<p><br>
</p>
<div
class="m_-8232648019342385258m_-2166750394864443591moz-cite-prefix">On
09/11/2017 12:26 PM,
Adam Nemet wrote:<br>
</div>
</span>
<blockquote type="cite"><span>
Hi Hal, Tobias, Michael
and others, </span>
<div><b>...</b><span>
<div>
<div><br>
</div>
<div>One thing that
I’d like to see
more details on is
what this means
for the evolution
of loop
transformations in
LLVM.</div>
<div><br>
</div>
<div>Our
more-or-less
established
direction was so
far to
incrementally
improve and
generalize the
required analyses
(e.g. the
LoopVectorizer’s
dependence
analysis + loop
versioning
analysis into a
stand-alone
analysis pass
(LoopAccessAnalysis))
and then build new
transformations
(e.g.
LoopDistribution,
LoopLoadElimination, LICMLoopVersioning) on top of these.</div>
<div><br>
</div>
<div>The idea was
that
infrastructure
would be
incrementally
improved from two
directions:</div>
<div><br>
</div>
<div>- As new
transformations
are built analyses
have to be
improved (e.g.
past improvements
to LAA to support
the LoopVersioning
utility, future
improvements for
full LoopSROA
beyond just
store->load
forwarding [1] or
the improvements
to LAA for the
LoopFusion
proposal[2])</div>
<div><br>
</div>
<div>- As more
complex loops
would have to be
analyzed we either
improve LAA or
make
DependenceAnalysis
a drop-in
replacement for
the memory
analysis part in
LAA</div>
</div>
</span></div>
</blockquote>
<span> <br>
Or we could use Polly's
dependence analysis, which
I believe to be more
powerful, more robust, and
more correct than
DependenceAnalysis. I
believe that the difficult
part here is actually the
pairing with predicated
SCEV or whatever mechanism
we want to use to generate
runtime predicates (this
applies to use of
DependenceAnalysis too).<br>
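To make the runtime-predicate idea concrete, here is a toy sketch (illustrative only; the function name is invented and this is not how LLVM implements it): a copy loop over two regions of one buffer is "versioned" on a runtime overlap check, taking a block-copy fast path when the regions provably don't alias and a safe element-by-element path otherwise.

```python
# Toy sketch of loop versioning under a runtime predicate (illustrative
# only; names are invented, and this is not LLVM's implementation).
# dst/src are offsets into one shared buffer, so they may overlap.

def copy_versioned(buf, dst, src, n):
    # Runtime predicate: the two n-element regions do not overlap.
    no_alias = dst + n <= src or src + n <= dst
    if no_alias:
        # "Fast" version: copy as one block (stands in for the
        # vectorized loop a compiler would emit under the predicate).
        buf[dst:dst + n] = buf[src:src + n]
    else:
        # Fallback version: element by element, preserving the original
        # sequential semantics even when the regions overlap.
        for i in range(n):
            buf[dst + i] = buf[src + i]
    return buf
```

The interesting part, as noted above, is generating such predicates automatically (e.g., via predicated SCEV) rather than the transformation itself.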
</span></div>
</div>
</blockquote>
<span>
<div><br>
</div>
What is a good way to measure
these assertions (more powerful,
more robust)? Are you saying the
LLVM Dependence Analysis is
incorrect or do you actually
mean less conservative (or "more
accurate" or something like
that)?<br>
</span></div>
</div>
</blockquote>
<br>
Sebastian's email covers the issues with
the DependenceAnalysis pass pretty well.<br>
<br>
Regarding what's in LoopAccessAnalysis,
I believe it to be correct, but more
limited. Judging by what the vectorizer
can handle, it is not clear to me that
LAA is bad at what it does, though it
could do better in some cases with
non-unit-stride loops. Polly also
handles piecewise-affine functions,
which allows the modeling of loops with
conditionals. Extending LAA to handle
loop nests, moreover, seems likely to be
non-trivial.<br>
<br>
Regardless, measuring these differences
certainly seems like a good idea. I
think that we can do this using
optimization remarks. LAA already emits
optimization remarks for loops in which
it finds unsafe memory dependencies.
Polly also emits optimization remarks.
We may need to iterate some in order to
setup a good comparison, but we should
be able to collect statistics (and other
information) by compiling code using
-fsave-optimization-record (in
combination with some other flags), and
then analyzing the resulting YAML files.<span><br>
<br>
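As a sketch of what that comparison could look like: `-fsave-optimization-record` writes one YAML document per remark, each starting with a tag line like `--- !Missed` or `--- !Passed` and carrying a `Pass:` field. The little tally below (an assumption-laden sketch, not a polished tool; real remark files carry more structure) counts remark kinds for one pass without needing a YAML library:

```python
# Minimal sketch (assumed remark layout): count remark kinds emitted by
# -fsave-optimization-record for a given pass, without a YAML dependency.
# Each remark is a YAML document starting with a line like "--- !Missed"
# and carrying a "Pass: <name>" field.

def tally_remarks(text, pass_name="loop-vectorize"):
    counts = {}
    kind = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("--- !"):
            # New remark document; remember its kind (Passed/Missed/...).
            kind = line[len("--- !"):]
        elif line.startswith("Pass:") and kind is not None:
            if line.split(":", 1)[1].strip() == pass_name:
                counts[kind] = counts.get(kind, 0) + 1
            kind = None  # count each remark at most once
    return counts
```

Running something like this over the `.opt.yaml` files from an LAA build and a Polly build of the same benchmarks would give a first-order comparison of what each one proves safe.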
<blockquote type="cite">
<div dir="auto"
style="word-wrap:break-word;line-break:after-white-space">
<div>
<blockquote type="cite">
<div>
<div bgcolor="#FFFFFF"
text="#000000"> <br>
<blockquote type="cite">
<div>
<div>
<div><br>
</div>
<div>While this
model may be slow
it has all the
benefits of the
incremental
development model.</div>
</div>
</div>
</blockquote>
<br>
The current model may have
been slow in many areas,
but I think that's mostly
a question of development
effort. My largest concern
about the current model is
that, to the extent that
we're implementing classic
loop transformations
(e.g., fusion,
distribution, interchange,
skewing, tiling, and so
on), we're repeating a
historical design that is
known to have several
suboptimal properties.
Chief among them is the
lack of integration: many
of these transformations
are interconnected, and
there's no good pass
ordering in which to make
independent decisions.
Many of these
transformations can be
captured in a single model
and we can get much better
results by integrating
them. There's also the
matter of whether building
these transformation on
SCEV (or IR directly) is
the best underlying
infrastructure, or whether
parts of Polly would be
better.<br>
</div>
</div>
</blockquote>
<div><br>
</div>
I believe that is true. What I
wonder is: is there a good method
to reason about it?</div>
</div>
</blockquote>
<br>
</span> If I understand what you mean,
one way to look at it is this: This is
not a canonicalization problem. Picking
an optimal way to interchange loops may
depend on how the result can be skewed
and/or tiled; picking an optimal way to
distribute loops often depends on what
can be done afterward in each piece.
Optimal here generally involves
reasoning about the memory hierarchy
(e.g., cache properties), available
prefetching streams, register-file size,
and so on.<br>
<br>
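As a toy illustration of why this is not a canonicalization question: both loop orders below are legal for matrix multiply (the dependences permit the interchange, so they compute identical results), which makes the choice between them purely a cost-model question, and the best answer can change again once tiling is layered on top. (Python is used here just to keep the example self-contained; the locality argument is about the equivalent compiled loops.)

```python
# Two legal schedules for the same computation: interchange preserves
# semantics, so picking between them is purely a cost decision.

def matmul_ijk(A, B, n):
    # Classic i-j-k order: innermost loop walks a column of B.
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_ikj(A, B, n):
    # Interchanged j/k order: same dependences respected, but the
    # innermost loop now walks a row of B, so locality differs.
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            for j in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C
```

Which order (and which tile sizes) wins depends on the cache and register parameters mentioned above, which is exactly why the decisions resist being made one pass at a time.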
I know that I've seen some good examples
in papers over the years that illustrate
the phase-ordering challenges.
Hopefully, someone will jump in here
with some good references. One classic
one is: William Pugh. Uniform Techniques
for Loop Optimization. 1991.<span><br>
<br>
<blockquote type="cite">
<div dir="auto"
style="word-wrap:break-word;line-break:after-white-space">
<div> Perhaps concrete examples or
perhaps opt-viewer based
comparisons on large sets of
benchmarks? In the big picture
you could make such a modeling
argument for all compiler
optimizations.<br>
</div>
</div>
</blockquote>
<br>
</span> Certainly. However, in this case
there's a well-studied unified model for
this set of optimizations known to
reduce phase-ordering effects. That's
not true in general.<span><br>
<br>
<blockquote type="cite">
<div dir="auto"
style="word-wrap:break-word;line-break:after-white-space">
<div>
<blockquote type="cite">
<div>
<div bgcolor="#FFFFFF"
text="#000000"> <br>
That having been said, I
think that integrating
this technology into LLVM
will also mean applying
appropriate modularity. I
think that we'll almost
definitely want to make
use of the dependence
analysis separately as an
analysis. We'll want to
decide which of these
transformations will be
considered
canonicalization (and run
in the iterative pipeline)
and which will be lowering
(and run near the
vectorizer). LoopSROA
certainly sounds to me
like canonicalization, but
loop fusion might also
fall into that category
(i.e., we might want to
fuse early to enable
optimizations and then
split late).<br>
<br>
<blockquote type="cite">
<div>
<div>
<div><br>
</div>
<div>Then there is
the question of
use cases. It’s
fairly obvious
that anybody
wanting to
optimize a 5-deep
highly regular
loop-nest
operating on
arrays should use
Polly. On the
other hand it’s
way less clear
that we should use
it for singly or
doubly nested
not-so-regular
loops which are
the norm in
non-HPC workloads.</div>
</div>
</div>
</blockquote>
<br>
This is clearly a good
question, but thinking
about Polly as a set of
components, not as a
monolithic transformation
component, I think that
polyhedral analysis and
transformations can
underlie a lot of the
transformations we need
for non-HPC code (and,
which I'll point out, we
need for modern HPC code
too). In practice, the
loops that we can actually
analyze have affine
dependencies, and Polly
does, or can do, a better
job at generating runtime
predicates and dealing
with piecewise-linear
expressions than our
current infrastructure.<br>
<br>
In short, I look at Polly
as two things: First, an
infrastructure for dealing
with loop analysis and
transformation. I view
this as being broadly
applicable. Second, an
application of that to
apply cost-model-driven
classic loop
transformations. To some
extent this is going to be
more useful for HPC codes,
but also applies to
machine learning, signal
processing, graphics, and
other areas. <br>
</div>
</div>
</blockquote>
I’m wondering if it could be
used to point out headroom
for the existing LLVM ecosystem
(*)<br>
</div>
</div>
</blockquote>
<br>
</span> Sure.<span><br>
<br>
<blockquote type="cite">
<div dir="auto"
style="word-wrap:break-word;line-break:after-white-space">
<div><br>
<blockquote type="cite">
<div>
<div bgcolor="#FFFFFF"
text="#000000"> <br>
<blockquote type="cite">
<div>
<div>
<div><br>
</div>
<div>And this brings
me to the
maintenance
question. Is it
reasonable to
expect people to
fix Polly when
they have a
seemingly
unrelated change
that happens to
break a Polly bot.</div>
</div>
</div>
</blockquote>
<br>
The eventual goal here is
to have this technology in
appropriate parts of the
main pipeline, and so the
question here is not
really about breaking a
"Polly bot", but just
about a "bot" in general.
I've given this question
some thought and I think
it sits in a reasonable
place in the risk-reward
space. The answer would
be, yes, we'd need to
treat this like any other
part of the pipeline.
However, I believe that
Polly has as many, or
more, active contributors
than essentially any other
individual part of the
mid-level optimizer or
CodeGen. As a result,
there will be people
around in many time zones
to help with problems with
Polly-related code.<br>
<br>
<blockquote type="cite">
<div>
<div>
<div> As far as I
know, there were
companies in the
past that tried
Polly without a
whole lot of prior
experience. It
would be great to
hear what the
experience was
before adopting
Polly at a much
larger scale.</div>
</div>
</div>
</blockquote>
<br>
I'm also interested,
although I'll caution
against over-interpreting
any evidence here
(positive or negative).
Before a few weeks ago,
Polly didn't effectively
run in the pipeline after
inlining, and so I doubt
it would have been much
use outside of embedded
environments (and maybe
some HPC environments)
with
straightforwardly-presented
C code. It's only now that
this has been fixed that I
find the possibility of
integrating this in
production interesting.<br>
</div>
</div>
</blockquote>
<div><br>
</div>
That is a good point. There are
also biases independent of past
experiences (for disclosure mine
is (*) above). But I think it is
objective to say a Polly
integration is a big piece to
swallow. Your pro-Polly argument
lists a number of categories
that I think could be reasoned
about individually and partly
evaluated with a data-driven
approach:</div>
<div>A) Architecture</div>
<div>- support for autoparallelism</div>
<div>- support for accelerators</div>
<div>- isl rewrite? etc</div>
<div>...</div>
<div>B) Modelling</div>
<div>- polyhedral model</div>
<div>- temporal locality</div>
<div>- spatial locality </div>
<div>…</div>
<div>C) Analysis/Optimizations</div>
<div>- Dependence Analysis</div>
<div>- Transformation
effective/power (loop nests,
quality of transformations,
#vectorizable loops etc)</div>
<div><br>
</div>
<div>A) is mostly Polly
independent (except for the isl
question I guess). For B and C,
performance / compile-time /
opt-viewer data on a
decent/wide range of benchmarks
possibly at different
optimization levels (O2, O3,
LTO, PGO etc and combinations)
should provide data-driven
insight into costs/benefits. <br>
</div>
</div>
</blockquote>
<br>
</span> I agree. In practice, the first
question is: Are we willing to take on
Polly (+isl), in whole or in part, as a
build dependency? If the answer is yes,
the next question is: what parts should
be reused or refactored for use in other
parts of the pipeline? My argument is
that we should take on Polly, or most of
it, as a build dependency, and work on
better unifying the developer
communities as we start experimenting
with other kinds of integration. This
will, however, allow us to provide these
transformations to all of our users
through pragmas (and other kinds of
optional enablement). This is an
important first step.<br>
</div>
</blockquote>
<div><br>
</div>
<div>When you write that we should take
Polly as a build dependency: are you
envisioning this to be tied into LLVM to
the point where we can't build LLVM
without Polly?</div>
<div>I thought the approach would rather be
that a default build of LLVM would have
Polly available but that there will
"forever" be a `-DENABLE_POLLY=OFF`
option?</div>
<div>This seems like an important
distinction to me, as making it more
integrated but still optional for the
correct behavior of LLVM means that people
can continue to work and maintain
"forever" an optimization pipeline which
operates without Polly, while starting to
get more integration in a pluggable form.
I believe this is the direction that was
taken with the GSoC last year, trying to
get Polly as a "pluggable" dependency
analysis.</div>
</div>
</div>
</div>
</blockquote>
<br>
</div>
</div>
I don't want to make statements about "forever":</div>
</blockquote>
<div><br>
</div>
<div>Right, I forgot the asterisk to explain my quotes
around "forever": I meant by this "not until we have
compelling data and agreement to do otherwise". I think
we're on agreement on this though :)</div>
</div>
</div>
</div>
</blockquote>
<br>
I think we are as well :-)<br>
<br>
<blockquote
cite="mid:CANF-O=bDEoDV=wD-K1_qafrGdZE4J0Re0BGm1=SB3+Rg==A=cA@mail.gmail.com"
type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div><br>
</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> eventually, I'd
like to have strong loop optimizations as an integrated
part of LLVM. At that point, "Polly" might be little
more than a shorthand for some parts of LLVM. Now LLVM
currently has an issue whereby it's difficult to build
customized versions with only a subset of needed passes,
but that's an orthogonal issue. In the short term, it
would certainly be separable. I believe that we'd
certainly keep the `-DENABLE_POLLY=OFF` option as long
as we felt it made sense.<br>
<br>
By build dependency, I mean: The sources are part of the
main source tree, they're built by default, thus
essentially all of the builders build them, and they'll
end up in many distributions and derived products. This
will allow us to build on what's there in whatever ways
make sense.</div>
</blockquote>
<div><br>
</div>
<div>Thanks for clarifying!</div>
<div><br>
</div>
<div> </div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"><span class="">
<blockquote type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<blockquote class="gmail_quote"
style="margin:0 0 0 .8ex;border-left:1px
#ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000"> <br>
I'm not sure exactly how good this is, but
Polly has LNT-submitting bots, so the
website can generate a comparison (e.g., <a
moz-do-not-send="true"
class="m_-8232648019342385258m_-2166750394864443591moz-txt-link-freetext"
href="http://lnt.llvm.org/db_default/v4/nts/71208?compare_to=71182"
target="_blank">http://lnt.llvm.org/db_default/v4/nts/71208?compare_to=71182</a>).
Looking at this comparison shows a number
of potential problems but also cases where
Polly really helps (and, FWIW, the largest
two compile-time regressions are also
associated with very large execution
performance improvements). My first focus
would certainly be on pragma-driven
enablement.<br>
</div>
</blockquote>
<div><br>
</div>
<div>I'll add PGO-driven enablement: spending
compile-time where we know it can have an
impact. What do you think?</div>
</div>
</div>
</div>
</blockquote>
<br>
</span> Certainly (although, obviously, we still need
reasonable confidence that doing so won't cause
performance regressions).<br>
</div>
</blockquote>
<div><br>
</div>
<div>Oh right, I meant it through an opt-in flag to begin
with (-fenable-pgo-polly or something like that).</div>
</div>
</div>
</div>
</blockquote>
<br>
SGTM.<br>
<br>
-Hal<br>
<br>
<blockquote
cite="mid:CANF-O=bDEoDV=wD-K1_qafrGdZE4J0Re0BGm1=SB3+Rg==A=cA@mail.gmail.com"
type="cite">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div><br>
</div>
<div>-- </div>
<div>Mehdi</div>
<div><br>
</div>
</div>
</div>
</div>
</blockquote>
<br>
<pre class="moz-signature" cols="72">--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory</pre>
</body>
</html>