[llvm-dev] [RFC] Polly Status and Integration

Wed Sep 20 02:33:46 PDT 2017

Hi,
I have not been following the polyhedral developments in both GCC and LLVM very closely the last few years, but I was just wondering if there are any lessons learned we should look at (or actually already aware of) in GCC's implementation and integration of Graphite. For example, do we know if it is (still)  enabled/used/etc.? And in the context of possibly rewritten bits and pieces of Polly, thus making it modular, and making its exact analysis available to "traditional" optimisations passes such as e.g. the vectoriser, is there data that that shows that it really improves performance? Is there e.g. data available that shows the performance uplift of the vectoriser using the exact polyhedral data dependence analysis over the traditional data dependence analysis? I am a big fan of the polyhedral model and these are not leading questions, but I was just wondering if that could help in deciding to make it modular and slowly integrating it, or just to have it as a separate pipeline.

Cheers,
Sjoerd.

-----Original Message-----
From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of Johannes Doerfert via llvm-dev
Sent: 20 September 2017 01:33
To: Hal Finkel
Cc: LLVM Dev; Tobias Grosser; michaelkruse at meinersbur.de
Subject: Re: [llvm-dev] [RFC] Polly Status and Integration

Hi Hal, Tobias, Michael, and others,

I'd like to add my view (and a proposal) to this discussion and I apologize directly for doing this so late*. I also want to apologize because this email is long, contains various technical details and also argumentations that might need more justification. However, I am happy to provide further information (and/or examples) to explain my views if required.

tldr;
We should introduce polyhedral analysis and optimization capabilities into LLVM step by step. Along the way we should revisit design decisions made by Polly and adjust them according to the use cases in the rest of LLVM. Finally, we should keep Polly as a stand-alone tool to allow research into, and prototyping of, complex loop transformations.

* There was a paper deadline end of last week and I simply had to prioritize.

-------------------------------------------------------------------------------

LLVM performs "simple" loop transformations and targets almost exclusively inner-most loops. Polly offers a pipeline to perform "complex" polyhedral-based loop optimizations on a sequence of loop nests. It seems natural to integrate Polly into the LLVM pipeline as it apparently complements it well. However, I do not believe that it is wise to integrate Polly into LLVM for a variety of reasons, some of which I will explain in more detail now. Afterwards I will propose a different way to achieve (most of) the things Hal mentioned in his mail in an incremental way. Nevertheless, I want to stress that I am still an advocate of Polly and I honestly believe it to be a valuable part of the LLVM environment. Though I strongly feel it should be continued as a research tool and test bed to allow research into, and prototyping of,
(polyhedral-model-based) analysis and optimizations.

---

Polly is a deep pipeline of passes that were developed for the sole purpose of applying scheduling transformations at the very end. This "main purpose" of Polly makes it hard to effectively reuse intermediate parts, e.g. the dependence analysis or the code generation framework, at least to the degree we might want to. In a different thread [1] Hal asked for the possibility to reuse Polly for tuning via pragmas e.g., unrolling/interleaving/tiling/interchange. While this is interesting and generally possible it won't take long before the restrictions Polly places on the input code (to enable scheduling optimizations in the end) become a serious problem. The point here is that Polly has too many features and dependences for such an "easy" use case. (Note that I do not question the legality of a user requested transformation here.) If you are not looking for "end-to-end polyhedral loop optimizations" you want a system that answers a very specific question (e.g.,
isVectorizable) or performs a very specific transformation (e.g, tiling) on a maximal set of programs. In contrast, Polly is (and I strongly believe it should be) developed as a system that performs complex optimizations, based on specialized analysis results, on a constraint type of programs.

One might argue that Polly will become more modular when it is integrated into LLVM and when the intermediate results are used in more and more places. However, I'd say this is the "wrong direction" with regards to:
  1) incremental design and development,
  2) maintainability of individual parts, and
  3) the development of a suitable cost model (which Polly does not ship).

Instead of starting with a full, hard to understand, scheduling pipeline we should start with the use cases at hand and design specific analysis (and also transformation) passes "from scratch". The long term goal should be a full scheduling pipeline but the parts need to be designed in a modular (LLVM-way) from the very beginning. This allows us to:
  - provide __immediate benefit__ to other developers,
  - allow active participation in (and understanding of) the design of
    each part, and
  - develop the intermediate parts with the requirements of the whole
    LLVM project in mind.

Let me give two examples to make my point:

-- Example 1 --
In addition to the applicability issues there are other problems that arise from the full pipeline design. Let's assume we want to apply a pragma driven optimization to a loop that can be represented (and
optimized) by Polly. Depending on the pragma we only need to perform a very specific task, e.g., adjust the iteration variable (loop inversion) or introduce new loops and adjust existing iteration variables (tiling).
Polly will however represent (and actually analyze) the whole loop nest including the exact control flow and all accesses. In addition it will compute alias checks, constraints under which the representation is not truthful (e.g., an access function might overflow) and a lot of other things that are not needed for the task at hand.

-- Example 2 --
We can also imagine we want to determine if a loop can be vectorized, thus if there are (short) loop carried dependences. No matter which technique is used, the complexity of a dependence analysis will certainly increase with the number (and complexity) of the analyzed accesses. Since Polly is designed with fine-grained scheduling optimizations in mind, the dependences it will compute are disproportionate with regards to the question asked. Consequently, it will either take more time to determine if there are loop carried dependences or a time-out in the computation will cause a conservatively negative result.

---

While it is generally possible to disable some "features" of Polly or to "work around them", it requires new independent code paths in an already highly interleaved project. The explicit (and implicit) interactions between the different parts of Polly are plentiful and subtle. I don't think there are 5 people that know about more than half of them. (I also doubt there is no one that knows them all.)

---

I want to propose an alternative plan for which I would appreciate feedback. The feedback is especially useful since I will present this, or more accurately the concrete polyhedral analyses described below, in the student research competition at LLVM Dev meeting in October.

The idea is simple. We rewrite parts of Polly with two goals in mind:
  1) to provide analysis results and transformation options to LLVM, and
  2) allow polyhedral loop transformation.

A rewrite allows to revisit various design decisions that are buried deep in the Polly code and that hinder its application (in any context).
The interested reader can find a short list at the end.

I think, we should incrementally introduce polyhedral analysis capabilities in order to provide immediate benefits and to allow developers to get involved with the design and usage from an early point in time. Decoupled analyses also allow easier integration, interfaces that come "more natural" to non-polyhedral people, and better use of the capabilities if polyhedral scheduling is not the goal.

To demonstrate this incremental process I started to write a polyhedral value analysis**. Think of it as Scalar Evolution but with piece-wise defined polyhedral values as the abstract domain. This analysis is demand driven, works on the granularity of a LLVM value, is flow sensitive, and especially designed to be applicable on partially affine programs. It already supports most of what the polyhedral model does allow and includes a lot of things we prototyped in Polly. It shall be used when Scalar Evolution is not able to provide a sufficient result.
On top I implemented a polyhedral memory analysis*** that is (currently) capable of summarizing the memory effects of a code region, e.g., a function. The next step is a demand-driven dependence analysis**** that utilizes the (often) very structured nature of a CFG. Afterwards, I would like to see a transformation framework that allows classical but also scheduling based transformations.

While these analyses still depend on isl as a back-end, they never directly use it. Instead there is a C++ interface in-between that encapsulates the features needed. Once our use cases are clear we can re-implement this interface.

** For comparison, the value analysis covers parts of the ScopDetection, all of SCEVValidator and SCEVAffinator as well as parts of ScopInfo and ScopBuilder in Polly.
*** The memory analysis is concerned with the construction and functionality Polly implements in the MemoryAccess class.
**** The dependence analysis is similar to Polly's DependenceInfo class but is supposed to allow more fine-grained control.

---

These are a few design decisions that I think should be revisited when we think about polyhedral-based analysis and optimizations in LLVM:

  - We should not rely on Scalar Evolution. Instead we want to build
    polyhedral values straight from the underlying IR. While
    Scalar Evolution did certainly simplify Polly's development it
    induced various problems:
      - The lack of piece-wise affine values (other than min/max)
        limits applicability and precision. (Though conditional SCEVs
        are a small step in this direction.)
      - The flow insensitive value analysis (Scalar Evolution) causes a
        lot of "assumptions", thus versioning, for the otherwise flow
        sensitive Polly.
      - There are caching issues when parts of the program are
        transformed or when Polly is used for inter-procedural
        analysis/optimization.

  - We should not depend on single-entry-single-exit (SESE) regions.
    Instead we analyze the code optimistically to the extend possible
    (or required by the user). Afterwards we can easily specialize the
    results to the code regions we are interested in. While SESE regions
    always caused compile time issues (due to the use of the post
    dominance tree) they also limit the analysis/transformations that
    can be performed. As an example think of accesses that are
    non-affine with regards to the smallest surrounding SESE region but
    not in a smaller code region (e.g., a conditional). If the goal is
    not loop scheduling but, for example, instruction reordering, it
    is still valuable to determine the dependences between these
    accesses in the smaller code region.

  - We should iteratively and selective construct polyhedral
    representations. The same holds for analysis results, e.g.
    dependences. Let's reconsider example 2 from above. We want to visit
    one memory access at a time in order to build the polyhedral
    representation and compute the new dependences. We also want to
    immediately stop if we identify a loop-carried dependence. Finally,
    we do not want to compute dependences in outer loop iterations or
    dependences that are completely contained in one loop iteration.

  - We should encapsulate the code generation part completely from the
    transformation part. Additionally we might want to start with
    classical loop transformations instead of full polyhedral
    scheduling. For a lot of classical loop transformations code
    generation only needs to understand/change loop induction variables
    and exit conditions. Duplication + rewriting of the loop body is
    often not necessary. Pragma driven optimizations could be easily
    done and we could also use heuristics similar to the ones we have
    today (e.g., for loop distribution). The transformations we allow
    should than grow with the scheduling decisions we allow and these
    should be limited by a yet to be determined cost model.

---

I am aware that not all Polly developers will agree with my views and that some are simply trade-offs that do not offer a "correct way".
Nevertheless, I hope that the simple examples I used to point out various problems are enough to consider the route I proposed.

Cheers,
  Johannes

[1] https://groups.google.com/d/msg/polly-dev/ACSRlFqtpDE/t5EDlIwtAgAJ

On 09/01, Hal Finkel via llvm-dev wrote:
> **
>
> *Hi everyone,As you may know, stock LLVM does not provide the kind of
> advanced loop transformations necessary to provide good performance on
> many applications. LLVM's Polly project provides many of the required
> capabilities, including loop transformations such as fission, fusion,
> skewing, blocking/tiling, and interchange, all powered by
> state-of-the-art dependence analysis. Polly also provides automated
> parallelization and targeting of GPUs and other**accelerators.*
>
> *
>
> Over the past year, Polly’s development has focused on robustness,
> correctness, and closer integration with LLVM. To highlight a few
> accomplishments:
>
>
>  *
>
>    Polly now runs, by default, in the conceptually-proper place in
>    LLVM’s pass pipeline (just before the loop vectorizer). Importantly,
>    this means that its loop transformations are performed after
>    inlining and other canonicalization, greatly increasing its
>    robustness, and enabling its use on C++ code (where [] is often a
>    function call before inlining).
>
>  *
>
>    Polly’s cost-modeling parameters, such as those describing the
>    target’s memory hierarchy, are being integrated with
>    TargetTransformInfo. This allows targets to properly override the
>    modeling parameters and allows reuse of these parameters by other
>    clients.
>
>  *
>
>    Polly’s method of handling signed division/remainder operations,
>    which worked around lack of support in ScalarEvolution, is being
>    replaced thanks to improvements being contributed to ScalarEvolution
>    itself (see D34598). Polly’s core delinearization routines have long
>    been a part of LLVM itself.
>
>  *
>
>    PolyhedralInfo, which exposes a subset of Polly’s loop analysis for
>    use by other clients, is now available.
>
>  *
>
>    Polly is now part of the LLVM release process and is being included
>    with LLVM by various packagers (e.g., Debian).
>
>
> I believe that the LLVM community would benefit from beginning the
> process of integrating Polly with LLVM itself and continuing its
> development as part of our main code base. This will:
>
>  *
>
>    Allow for wider adoption of LLVM within communities relying on
>    advanced loop transformations.
>
>  *
>
>    Provide for better community feedback on, and testing of, the code
>    developed (although the story in this regard is already fairly solid).
>
>  *
>
>    Better motivate targets to provide accurate, comprehensive, modeling
>    parameters for use by advanced loop transformations.
>
>  *
>
>    Perhaps most importantly, this will allow us to develop and tune the
>    rest of the optimizer assuming that Polly’s capabilities are present
>    (the underlying analysis, and eventually, the transformations
>    themselves).
>
>
> The largest issue on which community consensus is required, in order
> to move forward at all, is what to do with isl. isl, the Integer Set
> Library, provides core functionality on which Polly depends. It is a C
> library, and while some Polly/LLVM developers are also isl developers,
> it has a large user community outside of LLVM/Polly. A C++ interface
> was recently added, and Polly is transitioning to use the C++
> interface. Nevertheless, options here include rewriting the needed
> functionality, forking isl and transitioning our fork toward LLVM
> coding conventions (and data structures) over time, and incorporating
> isl more-or-less as-is to avoid partitioning its development.
>
>
> That having been said, isl is internally modular, and regardless of
> the overall integration strategy, the Polly developers anticipate
> specializing, or even replacing, some of these components with LLVM-specific solutions.
> This is especially true for anything that touches performance-related
> heuristics and modeling. LLVM-specific, or even target-specific, loop
> schedulers may be developed as well.
>
>
> Even though some developers in the LLVM community already have a
> background in polyhedral-modeling techniques, the Polly developers
> have developed, and are still developing, extensive tutorials on this
> topic http://pollylabs.org/education.htmland especially
> http://playground.pollylabs.org.
>
>
> Finally, let me highlight a few ongoing development efforts in Polly
> that are potentially relevant to this discussion. Polly’s loop
> analysis is sound and technically superior to what’s in LLVM currently
> (i.e. in LoopAccessAnalysis and DependenceAnalysis). There are,
> however, two known reasons why Polly’s transformations could not yet be enabled by default:
>
>  *
>
>    A correctness issue: Currently, Polly assumes that 64 bits is large
>    enough for all new loop-induction variables and index expressions.
>    In rare cases, transformations could be performed where more bits
>    are required. Preconditions need to be generated preventing this
>    (e.g., D35471).
>
>  *
>
>    A performance issue: Polly currently models temporal locality (i.e.,
>    it tries to get better reuse in time), but does not model spatial
>    locality (i.e., it does not model cache-line reuse). As a result, it
>    can sometimes introduce performance regressions. Polly Labs is
>    currently working on integrating spatial locality modeling into the
>    loop optimization model.
>
> Polly can already split apart basic blocks in order to implement loop
> fusion. Heuristics to choose at which granularity are still being
> implemented (e.g., PR12402).
>
> I believe that we can now develop a concrete plan for moving
> state-of-the-art loop optimizations, based on the technology in the
> Polly project, into LLVM. Doing so will enable LLVM to be competitive
> with proprietary compilers in high-performance computing, machine
> learning, and other important application domains. I'd like community
> feedback on what**should be part of that plan.
>
>
> Sincerely,
>
> Hal (on behalf of myself, Tobias Grosser, and Michael Kruse, with
> feedback from**several other active Polly developers)
>
>
> We thank the numerous people who have contributed to the Polly
> infrastructure:Alexandre Isoard, Andreas Simbuerger, Andy Gibbs,
> Annanay Agarwal, ArminGroesslinger, Ajith Pandel, Baranidharan Mohan,
> Benjamin Kramer, BillWendling, Chandler Carruth, Craig Topper, Chris
> Jenneisch, ChristianBielert, Daniel Dunbar, Daniel Jasper, David
> Blaikie, David Peixotto,Dmitry N. Mikushin, Duncan P. N. Exon Smith,
> Eli Friedman, EugeneZelenko, George Burgess IV, Hans Wennborg, Hongbin
> Zheng, Huihui Zhang,Jakub Kuderski, Johannes Doerfert, Justin Bogner,
> Karthik Senthil, LoganChien, Lawrence Hu, Mandeep Singh Grang, Matt
> Arsenault, MatthewSimpson, Mehdi Amini, Micah Villmow, Michael Kruse,
> Matthias Reisinger,Maximilian Falkenstein, Nakamura Takumi, Nandini
> Singhal, NicolasBonfante, Patrik Hägglund, Paul Robinson, Philip
> Pfaffe, Philipp Schaad,Peter Conn, Pratik Bhatu, Rafael Espindola,
> Raghesh Aloor, ReidKleckner, Roal Jordans, Richard Membarth, Roman
> Gareev, SaleemAbdulrasool, Sameer Sahasrabuddhe, Sanjoy Das, Sameer
> AbuAsal, SamNovak, Sebastian Pop, Siddharth Bhat, Singapuram Sanjay
> Srivallabh,Sumanth Gundapaneni, Sunil Srivastava, Sylvestre Ledru,
> Star Tan, TanyaLattner, Tim Shen, Tarun Ranjendran, Theodoros
> Theodoridis, Utpal Bora,Wei Mi, Weiming Zhao, and Yabin Hu.*
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages Leadership
> Computing Facility Argonne National Laboratory
>

> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

--

Johannes Doerfert
Researcher / PhD Student

Compiler Design Lab (Prof. Hack)
Saarland Informatics Campus, Germany
Building E1.3, Room 4.31

Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de Fax. +49 (0)681 302-3065  : http://www.cdl.uni-saarland.de/people/doerfert
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.