[llvm-dev] [RFC] Polly Status and Integration

Tue Sep 19 17:33:23 PDT 2017

Hi Hal, Tobias, Michael, and others,

I'd like to add my view (and a proposal) to this discussion and I
apologize directly for doing this so late*. I also want to apologize
because this email is long, contains various technical details and also
argumentations that might need more justification. However, I am happy
to provide further information (and/or examples) to explain my views if
required.

tldr;
We should introduce polyhedral analysis and optimization capabilities
into LLVM step by step. Along the way we should revisit design decisions
made by Polly and adjust them according to the use cases in the rest of
LLVM. Finally, we should keep Polly as a stand-alone tool to allow
research into, and prototyping of, complex loop transformations.

* There was a paper deadline end of last week and I simply had to prioritize.

-------------------------------------------------------------------------------

LLVM performs "simple" loop transformations and targets almost
exclusively inner-most loops. Polly offers a pipeline to perform
"complex" polyhedral-based loop optimizations on a sequence of loop
nests. It seems natural to integrate Polly into the LLVM pipeline as it
apparently complements it well. However, I do not believe that it is
wise to integrate Polly into LLVM for a variety of reasons, some of
which I will explain in more detail now. Afterwards I will propose a
different way to achieve (most of) the things Hal mentioned in his mail
in an incremental way. Nevertheless, I want to stress that I am still an
advocate of Polly and I honestly believe it to be a valuable part of the
LLVM environment. Though I strongly feel it should be continued as a
research tool and test bed to allow research into, and prototyping of,
(polyhedral-model-based) analysis and optimizations.

---

Polly is a deep pipeline of passes that were developed for the sole
purpose of applying scheduling transformations at the very end. This
"main purpose" of Polly makes it hard to effectively reuse intermediate
parts, e.g. the dependence analysis or the code generation framework, at
least to the degree we might want to. In a different thread [1] Hal
asked for the possibility to reuse Polly for tuning via pragmas e.g.,
unrolling/interleaving/tiling/interchange. While this is interesting and
generally possible it won't take long before the restrictions Polly
places on the input code (to enable scheduling optimizations in the end)
become a serious problem. The point here is that Polly has too many
features and dependences for such an "easy" use case. (Note that I do
not question the legality of a user requested transformation here.) If
you are not looking for "end-to-end polyhedral loop optimizations" you
want a system that answers a very specific question (e.g.,
isVectorizable) or performs a very specific transformation (e.g, tiling)
on a maximal set of programs. In contrast, Polly is (and I strongly
believe it should be) developed as a system that performs complex
optimizations, based on specialized analysis results, on a constraint
type of programs.

One might argue that Polly will become more modular when it is integrated
into LLVM and when the intermediate results are used in more and more
places. However, I'd say this is the "wrong direction" with regards to:
  1) incremental design and development,
  2) maintainability of individual parts, and
  3) the development of a suitable cost model (which Polly does not ship).

Instead of starting with a full, hard to understand, scheduling pipeline
we should start with the use cases at hand and design specific analysis
(and also transformation) passes "from scratch". The long term goal
should be a full scheduling pipeline but the parts need to be designed
in a modular (LLVM-way) from the very beginning. This allows us to:
  - provide __immediate benefit__ to other developers,
  - allow active participation in (and understanding of) the design of
    each part, and
  - develop the intermediate parts with the requirements of the whole
    LLVM project in mind.

Let me give two examples to make my point:

-- Example 1 --
In addition to the applicability issues there are other problems that
arise from the full pipeline design. Let's assume we want to apply a
pragma driven optimization to a loop that can be represented (and
optimized) by Polly. Depending on the pragma we only need to perform a
very specific task, e.g., adjust the iteration variable (loop inversion)
or introduce new loops and adjust existing iteration variables (tiling).
Polly will however represent (and actually analyze) the whole loop nest
including the exact control flow and all accesses. In addition it
will compute alias checks, constraints under which the representation
is not truthful (e.g., an access function might overflow) and a lot of
other things that are not needed for the task at hand.

-- Example 2 --
We can also imagine we want to determine if a loop can be vectorized,
thus if there are (short) loop carried dependences. No matter which
technique is used, the complexity of a dependence analysis will
certainly increase with the number (and complexity) of the analyzed
accesses. Since Polly is designed with fine-grained scheduling
optimizations in mind, the dependences it will compute are
disproportionate with regards to the question asked. Consequently, it
will either take more time to determine if there are loop carried
dependences or a time-out in the computation will cause a conservatively
negative result.

---

While it is generally possible to disable some "features" of Polly or to
"work around them", it requires new independent code paths in an already
highly interleaved project. The explicit (and implicit) interactions
between the different parts of Polly are plentiful and subtle. I don't
think there are 5 people that know about more than half of them. (I also
doubt there is no one that knows them all.)

---

I want to propose an alternative plan for which I would appreciate
feedback. The feedback is especially useful since I will present this,
or more accurately the concrete polyhedral analyses described below, in
the student research competition at LLVM Dev meeting in October.

The idea is simple. We rewrite parts of Polly with two goals in mind:
  1) to provide analysis results and transformation options to LLVM, and
  2) allow polyhedral loop transformation.

A rewrite allows to revisit various design decisions that are buried
deep in the Polly code and that hinder its application (in any context).
The interested reader can find a short list at the end.

I think, we should incrementally introduce polyhedral analysis
capabilities in order to provide immediate benefits and to allow
developers to get involved with the design and usage from an early point
in time. Decoupled analyses also allow easier integration, interfaces
that come "more natural" to non-polyhedral people, and better use of the
capabilities if polyhedral scheduling is not the goal.

To demonstrate this incremental process I started to write a polyhedral
value analysis**. Think of it as Scalar Evolution but with piece-wise
defined polyhedral values as the abstract domain. This analysis is
demand driven, works on the granularity of a LLVM value, is flow
sensitive, and especially designed to be applicable on partially affine
programs. It already supports most of what the polyhedral model does
allow and includes a lot of things we prototyped in Polly. It shall be
used when Scalar Evolution is not able to provide a sufficient result.
On top I implemented a polyhedral memory analysis*** that is (currently)
capable of summarizing the memory effects of a code region, e.g., a
function. The next step is a demand-driven dependence analysis**** that
utilizes the (often) very structured nature of a CFG. Afterwards, I
would like to see a transformation framework that allows classical but
also scheduling based transformations.

While these analyses still depend on isl as a back-end, they never
directly use it. Instead there is a C++ interface in-between that
encapsulates the features needed. Once our use cases are clear we can
re-implement this interface.

** For comparison, the value analysis covers parts of the ScopDetection,
all of SCEVValidator and SCEVAffinator as well as parts of ScopInfo and
ScopBuilder in Polly.
*** The memory analysis is concerned with the construction and
functionality Polly implements in the MemoryAccess class.
**** The dependence analysis is similar to Polly's DependenceInfo class
but is supposed to allow more fine-grained control.

---

These are a few design decisions that I think should be revisited when
we think about polyhedral-based analysis and optimizations in LLVM:

  - We should not rely on Scalar Evolution. Instead we want to build
    polyhedral values straight from the underlying IR. While
    Scalar Evolution did certainly simplify Polly's development it
    induced various problems:
      - The lack of piece-wise affine values (other than min/max)
        limits applicability and precision. (Though conditional SCEVs
        are a small step in this direction.)
      - The flow insensitive value analysis (Scalar Evolution) causes a
        lot of "assumptions", thus versioning, for the otherwise flow
        sensitive Polly.
      - There are caching issues when parts of the program are
        transformed or when Polly is used for inter-procedural
        analysis/optimization.

  - We should not depend on single-entry-single-exit (SESE) regions.
    Instead we analyze the code optimistically to the extend possible
    (or required by the user). Afterwards we can easily specialize the
    results to the code regions we are interested in. While SESE regions
    always caused compile time issues (due to the use of the post
    dominance tree) they also limit the analysis/transformations that
    can be performed. As an example think of accesses that are
    non-affine with regards to the smallest surrounding SESE region but
    not in a smaller code region (e.g., a conditional). If the goal is
    not loop scheduling but, for example, instruction reordering, it
    is still valuable to determine the dependences between these
    accesses in the smaller code region.

  - We should iteratively and selective construct polyhedral
    representations. The same holds for analysis results, e.g.
    dependences. Let's reconsider example 2 from above. We want to visit
    one memory access at a time in order to build the polyhedral
    representation and compute the new dependences. We also want to
    immediately stop if we identify a loop-carried dependence. Finally,
    we do not want to compute dependences in outer loop iterations or
    dependences that are completely contained in one loop iteration.

  - We should encapsulate the code generation part completely from the
    transformation part. Additionally we might want to start with
    classical loop transformations instead of full polyhedral
    scheduling. For a lot of classical loop transformations code
    generation only needs to understand/change loop induction variables
    and exit conditions. Duplication + rewriting of the loop body is
    often not necessary. Pragma driven optimizations could be easily
    done and we could also use heuristics similar to the ones we have
    today (e.g., for loop distribution). The transformations we allow
    should than grow with the scheduling decisions we allow and these
    should be limited by a yet to be determined cost model.

---

I am aware that not all Polly developers will agree with my views and
that some are simply trade-offs that do not offer a "correct way".
Nevertheless, I hope that the simple examples I used to point out
various problems are enough to consider the route I proposed.

Cheers,
  Johannes

[1] https://groups.google.com/d/msg/polly-dev/ACSRlFqtpDE/t5EDlIwtAgAJ

On 09/01, Hal Finkel via llvm-dev wrote:
> **
> 
> *Hi everyone,As you may know, stock LLVM does not provide the kind of
> advanced loop transformations necessary to provide good performance on many
> applications. LLVM's Polly project provides many of the required
> capabilities, including loop transformations such as fission, fusion,
> skewing, blocking/tiling, and interchange, all powered by state-of-the-art
> dependence analysis. Polly also provides automated parallelization and
> targeting of GPUs and other**accelerators.*
> 
> *
> 
> Over the past year, Polly’s development has focused on robustness,
> correctness, and closer integration with LLVM. To highlight a few
> accomplishments:
> 
> 
>  *
> 
>    Polly now runs, by default, in the conceptually-proper place in
>    LLVM’s pass pipeline (just before the loop vectorizer). Importantly,
>    this means that its loop transformations are performed after
>    inlining and other canonicalization, greatly increasing its
>    robustness, and enabling its use on C++ code (where [] is often a
>    function call before inlining).
> 
>  *
> 
>    Polly’s cost-modeling parameters, such as those describing the
>    target’s memory hierarchy, are being integrated with
>    TargetTransformInfo. This allows targets to properly override the
>    modeling parameters and allows reuse of these parameters by other
>    clients.
> 
>  *
> 
>    Polly’s method of handling signed division/remainder operations,
>    which worked around lack of support in ScalarEvolution, is being
>    replaced thanks to improvements being contributed to ScalarEvolution
>    itself (see D34598). Polly’s core delinearization routines have long
>    been a part of LLVM itself.
> 
>  *
> 
>    PolyhedralInfo, which exposes a subset of Polly’s loop analysis for
>    use by other clients, is now available.
> 
>  *
> 
>    Polly is now part of the LLVM release process and is being included
>    with LLVM by various packagers (e.g., Debian).
> 
> 
> I believe that the LLVM community would benefit from beginning the process
> of integrating Polly with LLVM itself and continuing its development as part
> of our main code base. This will:
> 
>  *
> 
>    Allow for wider adoption of LLVM within communities relying on
>    advanced loop transformations.
> 
>  *
> 
>    Provide for better community feedback on, and testing of, the code
>    developed (although the story in this regard is already fairly solid).
> 
>  *
> 
>    Better motivate targets to provide accurate, comprehensive, modeling
>    parameters for use by advanced loop transformations.
> 
>  *
> 
>    Perhaps most importantly, this will allow us to develop and tune the
>    rest of the optimizer assuming that Polly’s capabilities are present
>    (the underlying analysis, and eventually, the transformations
>    themselves).
> 
> 
> The largest issue on which community consensus is required, in order to move
> forward at all, is what to do with isl. isl, the Integer Set Library,
> provides core functionality on which Polly depends. It is a C library, and
> while some Polly/LLVM developers are also isl developers, it has a large
> user community outside of LLVM/Polly. A C++ interface was recently added,
> and Polly is transitioning to use the C++ interface. Nevertheless, options
> here include rewriting the needed functionality, forking isl and
> transitioning our fork toward LLVM coding conventions (and data structures)
> over time, and incorporating isl more-or-less as-is to avoid partitioning
> its development.
> 
> 
> That having been said, isl is internally modular, and regardless of the
> overall integration strategy, the Polly developers anticipate specializing,
> or even replacing, some of these components with LLVM-specific solutions.
> This is especially true for anything that touches performance-related
> heuristics and modeling. LLVM-specific, or even target-specific, loop
> schedulers may be developed as well.
> 
> 
> Even though some developers in the LLVM community already have a background
> in polyhedral-modeling techniques, the Polly developers have developed, and
> are still developing, extensive tutorials on this topic
> http://pollylabs.org/education.htmland especially
> http://playground.pollylabs.org.
> 
> 
> Finally, let me highlight a few ongoing development efforts in Polly that
> are potentially relevant to this discussion. Polly’s loop analysis is sound
> and technically superior to what’s in LLVM currently (i.e. in
> LoopAccessAnalysis and DependenceAnalysis). There are, however, two known
> reasons why Polly’s transformations could not yet be enabled by default:
> 
>  *
> 
>    A correctness issue: Currently, Polly assumes that 64 bits is large
>    enough for all new loop-induction variables and index expressions.
>    In rare cases, transformations could be performed where more bits
>    are required. Preconditions need to be generated preventing this
>    (e.g., D35471).
> 
>  *
> 
>    A performance issue: Polly currently models temporal locality (i.e.,
>    it tries to get better reuse in time), but does not model spatial
>    locality (i.e., it does not model cache-line reuse). As a result, it
>    can sometimes introduce performance regressions. Polly Labs is
>    currently working on integrating spatial locality modeling into the
>    loop optimization model.
> 
> Polly can already split apart basic blocks in order to implement loop
> fusion. Heuristics to choose at which granularity are still being
> implemented (e.g., PR12402).
> 
> I believe that we can now develop a concrete plan for moving
> state-of-the-art loop optimizations, based on the technology in the Polly
> project, into LLVM. Doing so will enable LLVM to be competitive with
> proprietary compilers in high-performance computing, machine learning, and
> other important application domains. I'd like community feedback on
> what**should be part of that plan.
> 
> 
> Sincerely,
> 
> Hal (on behalf of myself, Tobias Grosser, and Michael Kruse, with feedback
> from**several other active Polly developers)
> 
> 
> We thank the numerous people who have contributed to the Polly
> infrastructure:Alexandre Isoard, Andreas Simbuerger, Andy Gibbs, Annanay
> Agarwal, ArminGroesslinger, Ajith Pandel, Baranidharan Mohan, Benjamin
> Kramer, BillWendling, Chandler Carruth, Craig Topper, Chris Jenneisch,
> ChristianBielert, Daniel Dunbar, Daniel Jasper, David Blaikie, David
> Peixotto,Dmitry N. Mikushin, Duncan P. N. Exon Smith, Eli Friedman,
> EugeneZelenko, George Burgess IV, Hans Wennborg, Hongbin Zheng, Huihui
> Zhang,Jakub Kuderski, Johannes Doerfert, Justin Bogner, Karthik Senthil,
> LoganChien, Lawrence Hu, Mandeep Singh Grang, Matt Arsenault,
> MatthewSimpson, Mehdi Amini, Micah Villmow, Michael Kruse, Matthias
> Reisinger,Maximilian Falkenstein, Nakamura Takumi, Nandini Singhal,
> NicolasBonfante, Patrik Hägglund, Paul Robinson, Philip Pfaffe, Philipp
> Schaad,Peter Conn, Pratik Bhatu, Rafael Espindola, Raghesh Aloor,
> ReidKleckner, Roal Jordans, Richard Membarth, Roman Gareev,
> SaleemAbdulrasool, Sameer Sahasrabuddhe, Sanjoy Das, Sameer AbuAsal,
> SamNovak, Sebastian Pop, Siddharth Bhat, Singapuram Sanjay
> Srivallabh,Sumanth Gundapaneni, Sunil Srivastava, Sylvestre Ledru, Star Tan,
> TanyaLattner, Tim Shen, Tarun Ranjendran, Theodoros Theodoridis, Utpal
> Bora,Wei Mi, Weiming Zhao, and Yabin Hu.*
> 
> -- 
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
> 

> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

-- 

Johannes Doerfert
Researcher / PhD Student

Compiler Design Lab (Prof. Hack)
Saarland Informatics Campus, Germany
Building E1.3, Room 4.31

Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de
Fax. +49 (0)681 302-3065  : http://www.cdl.uni-saarland.de/people/doerfert
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 228 bytes
Desc: Digital signature
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170920/2d76e352/attachment.sig>