[llvm-dev] Re: [RFC][PIR] Parallel LLVM IR -- Stage 0 -- IR extension

Johannes Doerfert via llvm-dev llvm-dev at lists.llvm.org
Wed Mar 8 09:51:06 PST 2017


On 03/08, Mehdi Amini wrote:
> 
> > On Mar 8, 2017, at 5:36 AM, Johannes Doerfert <doerfert at cs.uni-saarland.de> wrote:
> > 
> > Ping.
> > 
> > PS.
> > 
> >  Are there actually people interested in this?
> > 
> >  We will continue working anyway but it might not make sense to put it
> >  on reviews and announce it on the ML if nobody cares.
> 
> I expect this to take many months, unless it is on the critical path
> of some core devs, because it seems to me to be a huge feature to
> consider and quite a time sink (which is why even people that care
> haven’t been able to invest enough time into this).
>
> My impression is that this is major enough that it is unlikely to land
> in LLVM 5.0, or even before the next US dev meeting. Ideally we’d
> have enough discussions and review before then that we could have a
> BoF there to make sure we can move forward with the right
> representation. I hope the EuroLLVM meeting will be the opportunity
> for you to get some good feedback on this!
> 
> Just my 2 cents :)
Thanks Mehdi :)

I get your points and I generally agree. My intention was to get some
kind of feedback to determine if (*) and possibly when people are going
to invest that time. Since I already have one answer in that regard, all
is good!

Cheers,
  Johannes

(*) The RFC and patches had basically gone uncommented for quite some time.

PS.
  Maybe we should have scheduled a BoF at EuroLLVM for this general
  topic. Are interested people coming?

> > On 02/24, Johannes Doerfert via llvm-dev wrote:
> >> To (re)start a discussion here we published two patches to the
> >> phabricator:
> >> 
> >> 1) A documentation draft (including LangRef) for the PIR instructions
> >>   and the parallel regions [0].
> >> 
> >> 2) A draft implementation for a ParallelRegionInfo pass that identifies
> >>   and maintains parallel regions in the CFG [1]. It also serves as a
> >>   verifier pass for now.
> >> 
> >> Please take a look and comment on the patches either on Phabricator or
> >> here on the list.
> >> 
> >> Thanks!
> >> 
> >> [0] https://reviews.llvm.org/D30353
> >> [1] https://reviews.llvm.org/D30354
> >> 
> >> 
> >> On 01/28, Johannes Doerfert via llvm-dev wrote:
> >>> Dear all,
> >>> 
> >>> This RFC proposes three new LLVM IR instructions to express high-level
> >>> parallel constructs in a simple, low-level fashion. For this first stage
> >>> we prepared two commits that add the proposed instructions and a pass to
> >>> lower them to obtain sequential IR. Both patches have been uploaded for
> >>> review [1, 2]. The latter patch is very simple and the former consists
> >>> of almost only mechanical changes needed to add new instructions.
> >>> 
> >>> The rest of this email contains (1) an introduction of the IR extension,
> >>> (2) the reasoning behind this approach, (3) a comparison to other ideas
> >>> proposed so far, (4) a validation of the feasibility and potential
> >>> impact, and (5) an outlook on the next steps.
> >>> 
> >>> (1) IR extension:
> >>> Parallel IR adds three new terminator instructions that define the
> >>> beginning and the end of parallel regions in the CFG. A parallel region
> >>> is a connected subgraph of the CFG that is potentially executed by two
> >>> threads in parallel. It can only be entered through a fork instruction and
> >>> extends until a join instruction is reached. Therefore, parallel regions
> >>> are single-entry-multiple-exit regions. Parallel regions can be nested
> >>> and if they are, they form a parallel region tree similar to the loop
> >>> tree maintained by the natural loop info pass. Each parallel region
> >>> defines two independent “sibling” tasks, namely the forked and
> >>> continuation task.
> >>> 
> >>> The new instructions are defined as follows:
> >>> 
> >>> 1. fork: marks the beginning of a parallel region. Every fork has two
> >>>         successor blocks which represent two parallel tasks. We call
> >>>         these two “sibling” tasks the forked and continuation tasks.
> >>>         Nested forking is supported, meaning that another fork can be
> >>>         reached prior to the join.
> >>> 
> >>> 2. halt: marks the end of a forked task. The "sibling" continuation block
> >>>         (see fork above) is the operand of the halt terminator. This
> >>>         represents the idea of asymmetric parallelism as introduced by
> >>>         [1]. One advantage of asymmetric parallelism is that sequential
> >>>         semantics of the program are clear from its CFG (ref. [1]).
> >>>         Note that the edge from a forked block to a continuation block
> >>>         (the one introduced by the halt) represents the control flow
> >>>         when the two successors of a fork execute sequentially, not
> >>>         when they execute in parallel. In the latter case there is no
> >>>         “control transfer” happening via this edge but only
> >>>         synchronization between the tasks.
> >>> 
> >>> 3. join: marks a synchronization point and the end of a parallel region.
> >>>         Once a join terminator is reached by a thread, execution stops
> >>>         in that thread until all tasks spawned by that thread finish
> >>>         their work, i.e., reach their respective halt instructions. A
> >>>         join shall only be reached by the continuation task of a fork;
> >>>         the forked task shall reach a halt with the continuation as a
> >>>         successor.
> >>> 
> >>> 
> >>> Here is an example of a parallel OpenMP loop and its idiomatic lowering
> >>> to Parallel IR. We set up a wiki [0] with additional examples.
> >>> 
> >>> #pragma omp parallel for
> >>> for (int i = 0; i < n; ++i) {
> >>>   A[i] = C[i];
> >>> }
> >>> 
> >>> 
> >>> preheader:
> >>>  br label %header
> >>> 
> >>> header:
> >>>  %i = phi i32 [ 0, %preheader ], [ %inc, %latch ]
> >>>  %done = icmp sge i32 %i, %n
> >>>  br i1 %done, label %exit, label %body
> >>> 
> >>> body:
> >>>  fork label %task, label %latch
> >>> 
> >>> task:
> >>>  %cptr = getelementptr i32, i32* %C, i32 %i
> >>>  %cval = load i32, i32* %cptr
> >>>  %aptr = getelementptr i32, i32* %A, i32 %i
> >>>  store i32 %cval, i32* %aptr
> >>>  halt label %latch
> >>> 
> >>> latch:
> >>>  %inc = add i32 %i, 1
> >>>  br label %header
> >>> 
> >>> exit:
> >>>  join label %afterloop
> >>> 
> >>> afterloop:
> >>> ...
> >>> 
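> >>> A minimal straight-line region may make the structure easier to see. The
> >>> following sketch uses the proposed syntax on two independent stores (%X
> >>> and %Y are placeholder pointers, not taken from the example above):
> >>> 
> >>> entry:
> >>>  fork label %forked, label %cont
> >>> 
> >>> forked:
> >>>  store i32 1, i32* %X
> >>>  halt label %cont    ; names the sibling continuation block
> >>> 
> >>> cont:
> >>>  store i32 2, i32* %Y
> >>>  join label %after   ; waits for the forked task to halt
> >>> 
> >>> after:
> >>> ...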
> >>> 
> >>> 
> >>> (2) Reasoning:
> >>> The proposed approach is crafted such that the semantics of the parallel
> >>> program are represented correctly in almost native, low-level IR right
> >>> after the front-end and preserved at every point until the final lowering
> >>> to sequential IR or parallel runtime library calls. To this end, asymmetric
> >>> parallelism is employed, a concept that uses control flow and the common
> >>> concept of dominance to represent parts of the parallel semantics. In
> >>> this model the parallel tasks do not dominate each other and only one
> >>> parallel task dominates the code after the parallel region. As a
> >>> consequence, various transformations that would break assumptions we
> >>> make about parallel regions cannot happen (see [3,4]). While the
> >>> explicitly modeled control flow together with dominance prevents various
> >>> code motion problems, the use of terminators helps to minimize the
> >>> changes needed to educate passes about parallel regions. Only a fraction
> >>> of analysis and transformation passes deal with terminators explicitly.
> >>> Most passes either test for known terminators (like branches), rely on
> >>> dominance information, or work on a basic block level. To even further
> >>> reduce changes to the existing passes, high-level concepts are broken
> >>> down to already available low-level concepts instead of introducing new,
> >>> semantically rich instructions/intrinsics (see the last paragraph of [5]
> >>> and section 4 in the PIR white paper [6] for examples). Finally, this
> >>> scheme allows a pass to simply reason about the sequential semantics of
> >>> a parallel region, transform it back to sequential code if needed or
> >>> deemed beneficial, and employ existing tooling solutions to debug and
> >>> analyze the code [7].
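> >>> 
> >>> As a small illustration of the dominance argument (a sketch; %P is a
> >>> placeholder): a value defined in the forked task cannot be used in the
> >>> continuation or after the region, because the forked block dominates
> >>> neither, so plain SSA rules already reject such code motion:
> >>> 
> >>> entry:
> >>>  fork label %forked, label %cont
> >>> 
> >>> forked:
> >>>  %v = load i32, i32* %P
> >>>  halt label %cont
> >>> 
> >>> cont:
> >>>  ; %u = add i32 %v, 1  -- invalid: %forked does not dominate %cont,
> >>>  ;                        so this use of %v is rejected outright
> >>>  join label %after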
> >>> 
> >>> (3) Comparison:
> >>> The BoF discussion sheet [8] and the recent “[RFC] on IR-level region
> >>> annotations” [9] both list pros and cons of different proposed schemes
> >>> and implementations. We summarize and comment on the schemes listed in
> >>> the recent RFC here:
> >>>  (a) Metadata: It seems a consensus has been reached that metadata is
> >>>        not the solution but only a means to enhance a different solution.
> >>>  (c) One Intrinsic per directive/clause: This approach basically embeds
> >>>        a high-level (parallel) language in LLVM IR using intrinsics. It
> >>>        seems there is little to no support for this approach at the
> >>>        moment.  
> >>>  (d) Parallel loop/region annotations: Here, intrinsics enclosing a
> >>>        parallel loop/region are used to represent parallelism (a
> >>>        hypothetical sketch is given after this list).
> >>>        High-level knowledge is represented as attached metadata or in
> >>>        separate intrinsics. For more details please see the original
> >>>        RFC [9]. In the discussion several potential drawbacks have been
> >>>        mentioned:
> >>>        - The annotations might be too general [10].  
> >>>        - The IR is not semantically correct (or ready for optimization)
> >>>          after the front-end and needs an additional “prepare phase for
> >>>          pre-privatization" [11].
> >>>        - The currently available “potential side effect for intrinsic
> >>>          calls” seems not to suffice for the proposed intrinsics as they
> >>>          do not have "call semantics" [12]. 
> >>>  (b) Parallel instructions (this approach): The table in the region RFC
> >>>        [9] lists two drawbacks with this approach, both of which have
> >>>        already been called into question [5]. The first drawback is the
> >>>        effort needed to implement this scheme which is discussed in
> >>>        more detail in section (4) of this mail. The second drawback is
> >>>        the need for additional representation of high-level information
> >>>        that is not part of the semantics of the new fork-join
> >>>        instructions. As mentioned above, the choice to keep the new
> >>>        instructions as simple as possible is deliberate. This parallel
> >>>        IR is intended to be extensible, and in particular, compatible
> >>>        with representations of high-level parallel concepts that might
> >>>        be developed in the future. For the time being, the parallel IR
> >>>        is compatible with the approach taken today of lowering high-level
> >>>        parallel linguistics, such as reductions and private memory, to
> >>>        existing IR constructs, such as parallel-runtime calls, atomicrmw
> >>>        instructions, and well-placed allocas [5,6] (a reduction sketch
> >>>        is given after this list). Although
> >>>        other extensions to the IR might allow LLVM to compile these
> >>>        higher-level constructs more effectively, we see no reason the
> >>>        parallel IR would conflict with any such extensions. (On the
> >>>        contrary, the parallel IR would seem to help compiler analyses
> >>>        of higher-level parallel constructs by exposing logical
> >>>        parallelism.)  
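> >>> 
> >>> For contrast, scheme (d) would express a region with ordinary calls,
> >>> roughly as sketched below. The intrinsic names and tags are hypothetical
> >>> placeholders in the spirit of [9], not actual definitions:
> >>> 
> >>> entry:
> >>>  %t = call token @llvm.region.entry() [ "parallel"() ]
> >>>  ; ... region body as plain IR ...
> >>>  call void @llvm.region.exit(token %t) [ "end.parallel"() ]
> >>>  br label %after
> >>> 
> >>> Since these are plain calls, passes would have to be taught that they do
> >>> not have "call semantics" [12], whereas fork/halt/join reuse the existing
> >>> terminator and dominance rules.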
> >>> 
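> >>> To illustrate the lowering mentioned for (b): a sum reduction inside a
> >>> parallel task can already be expressed with existing constructs, e.g.,
> >>> an atomicrmw on a shared accumulator. In this sketch, %sum is a
> >>> hypothetical shared i32 slot and %cptr is as in the loop example above:
> >>> 
> >>> task:
> >>>  %cval = load i32, i32* %cptr
> >>>  ; atomically accumulate into the shared reduction variable
> >>>  %old = atomicrmw add i32* %sum, i32 %cval seq_cst
> >>>  halt label %latch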
> >>> 
> >>> (4) Feasibility and Impact:
> >>> The Tapir and PIR prototypes demonstrate the feasibility of this
> >>> approach. The Tapir prototype [13] has recently proven its robustness as
> >>> the standard compiler in the MIT class on parallel programming. It was
> >>> implemented in ~5k LOC. However, >1k LOC implement explicit parallel
> >>> optimizations, ~1k add the new instructions (and are thus mechanical), and
> >>> ~2k lower the parallelism (needed for basically any scheme). Only the
> >>> remaining ~1k LOC are required to make it work with existing analysis and
> >>> transformation passes. While Tapir added explicit optimization passes
> >>> for parallel regions/loops, the representation allows for a variety of
> >>> classic optimizations (CSE, GVN, LICM, loop unrolling, TRE) to work with
> >>> little to no modifications. Potential speedups compared to a classic
> >>> “early-outlining” approach can also be seen in the Tapir paper [13].
> >>> For the PIR prototype [14] we modified only three transformation passes
> >>> (<20 LOC) [15] before we could run the O3 pipeline successfully on a
> >>> parallel matrix multiplication. 
> >>> 
> >>> Together, these prototypes show how few passes actually inspect new
> >>> (or “unknown”) terminators. The default assumption passes have to make,
> >>> namely that control might be transferred to any successor at runtime,
> >>> has, in terms of potential compiler transformations, a similar effect to
> >>> the parallel semantics we want to model, namely that control is
> >>> transferred to all successors. 
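> >>> 
> >>> For instance (a sketch, not taken from the prototypes), hoisting a pure
> >>> computation common to both tasks into the block that ends in the fork is
> >>> valid under both interpretations, since that block dominates both
> >>> successors (%P and %Q are placeholder pointers):
> >>> 
> >>> body:
> >>>  %a = add i32 %x, %y   ; hoisted: computed once, dominates both tasks
> >>>  fork label %forked, label %cont
> >>> 
> >>> forked:
> >>>  store i32 %a, i32* %P ; previously recomputed %a here
> >>>  halt label %cont
> >>> 
> >>> cont:
> >>>  store i32 %a, i32* %Q ; ... and here
> >>>  join label %after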
> >>> 
> >>> 
> >>> (5) Outlook:
> >>> This first stage will only introduce and test the new instructions and
> >>> the sequentialization pass. Afterwards we intend to pursue additions in
> >>> different, partially overlapping but largely orthogonal directions. We
> >>> welcome comments as well as developers for each of them:
> >>> - Analysis and optimization:
> >>>    * A “parallel region info” pass to keep track of parallel regions
> >>>      and their nesting. The information can be made accessible in a
> >>>      “parallel region tree” similar to the loop tree maintained by the
> >>>      loop info pass (a nested example is sketched after this list).
> >>>      [stage 1, immediate next goal]
> >>>    * Extensions of the verifier that allow checking parallel IR for
> >>>      “well-formedness”. [stage 1, immediate next goal]
> >>>    * Documentation of the PIR instructions in the language reference.
> >>>      [stage 1, immediate next goal]
> >>>    * A cost analysis for parallel tasks that can be queried by
> >>>      optimizations. The cost model needs to take the hardware, the
> >>>      runtime library and the parallel tasks into account.
> >>>    * Vectorizer enhancements to enable the vectorization of parallel
> >>>      loops and tasks.
> >>>    * Parallelization-centric optimizations: 
> >>>      a) Parallel tasks can be balanced, merged or split as well as created
> >>>         from and lowered to sequential code.
> >>>      b) Barriers can be eliminated.
> >>>      c) Parallel loops can be statically scheduled or created from
> >>>         parallel recursive calls [13].
> >>>    * Analysis to extract high-level information (reductions, private
> >>>      memory, ...) from the low-level representation.
> >>> - Front-end:
> >>>    * Lowering of simple OpenMP and Cilk++ annotations to PIR, including
> >>>      parallel sections and parallel loops with limited support for
> >>>      clauses at first (examples can be found here [1]). [milestone 1]
> >>>    * Generation of PIR code through automatic parallelization. A patched
> >>>      version of Polly exists that emits parallel loops using PIR instead
> >>>      of OpenMP runtime calls or llvm.parallel.loop metadata. [milestone 1]
> >>>    * Representation of more involved high-level features like the
> >>>      assignment of computation units.
> >>> - Back-end:
> >>>    * Lowering of PIR regions to calls to the OpenMP (GOMP) and Cilk++
> >>>      runtime library. [milestone 1]
> >>>    * A simple parallel library, e.g., based on pthreads, to be shipped
> >>>      with LLVM as a fallback implementation for parallel regions.
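> >>> 
> >>> As referenced in the region-info item above, here is a sketch of one
> >>> region nested in the forked task of another; the parallel region tree
> >>> would record the inner region as a child of the outer one:
> >>> 
> >>> entry:
> >>>  fork label %outer.task, label %outer.cont
> >>> 
> >>> outer.task:
> >>>  fork label %inner.task, label %inner.cont
> >>> 
> >>> inner.task:
> >>>  ; ... inner forked work ...
> >>>  halt label %inner.cont
> >>> 
> >>> inner.cont:
> >>>  ; ... inner continuation work ...
> >>>  join label %outer.task.end
> >>> 
> >>> outer.task.end:
> >>>  halt label %outer.cont
> >>> 
> >>> outer.cont:
> >>>  join label %after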
> >>> 
> >>> 
> >>> Thank you all for your time and hopefully constructive input on this proposal!
> >>> 
> >>> Cheers,
> >>>  Johannes, on behalf of the PIR team
> >>> 
> >>> 
> >>> Disclaimer:
> >>> This RFC, the patches, the wiki, etc. are a joint effort by Tao B.
> >>> Schardl (MIT), Charles E. Leiserson (MIT), Kareem Ergawy (Saarland
> >>> University), Simon Moll (Saarland University) and myself. However, ideas
> >>> and feedback came from many people, including the members of the
> >>> LLVM-HPC IR Extensions working group (Hal Finkel, Xinmin Tian, ...), the
> >>> participants in the BoF at the US Developers’ meeting, everybody that
> >>> commented on the BoF discussion sheet [16] and the recent RFC on
> >>> IR-level region annotations [9] (Mehdi Amini, Sanjoy Das, Daniel Berlin,
> >>> ...).
> >>> 
> >>> 
> >>> 
> >>> [0] https://github.com/Parallel-IR/llvm-pir/wiki
> >>> [1] https://reviews.llvm.org/D29250
> >>> [2] https://reviews.llvm.org/D29251
> >>> [3] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109302.html
> >>> [4] http://lists.llvm.org/pipermail/llvm-dev/2015-March/083348.html
> >>> [5] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109264.html
> >>> [6] http://compilers.cs.uni-saarland.de/people/doerfert/parallelcfg.pdf
> >>> [7] http://supertech.csail.mit.edu/papers/spbags.pdf  & www.cse.wustl.edu/~angelee/papers/cilkprof.pdf
> >>> [8] https://goo.gl/Blp2Xr
> >>> [9] http://lists.llvm.org/pipermail/llvm-dev/2017-January/108906.html
> >>> [10] http://lists.llvm.org/pipermail/llvm-dev/2017-January/108997.html
> >>> [11] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109377.html
> >>> [12] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109351.html
> >>> [13] http://wsmoses.com/tapir.pdf
> >>> [14] https://github.com/jdoerfert/llvm-pir/tree/feature/fork-join
> >>> [15] https://github.com/jdoerfert/llvm-pir/commit/854259881d24d71f9f1f17e52547758c7be0618a
> >>> [16] https://goo.gl/wKps3c
> >>> 
> >>> 

-- 

Johannes Doerfert
Researcher / PhD Student

Compiler Design Lab (Prof. Hack)
Saarland Informatics Campus, Germany
Building E1.3, Room 4.31

Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de
Fax. +49 (0)681 302-3065  : http://www.cdl.uni-saarland.de/people/doerfert