[llvm-dev] [RFC][PIR] Parallel LLVM IR -- Stage 0 -- IR extension

Tue Feb 7 13:25:27 PST 2017

Hi Sanjoy,

On 02/06, Sanjoy Das wrote:
> Sorry for the delayed response!
No worries.

> I have some basic questions inline:
Answers inlined.

> On Sat, Jan 28, 2017 at 6:07 AM, Johannes Doerfert via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
> > This RFC proposes three new LLVM IR instructions to express high-level
> > parallel constructs in a simple, low-level fashion. For this first stage
> > we prepared two commits that add the proposed instructions and a pass to
> > lower them to obtain sequential IR. Both patches have be uploaded for
> > review [1, 2]. The latter patch is very simple and the former consists
> > of almost only mechanical changes needed to add new instructions.
> >
> > The rest of this email contains (1) an introduction of the IR extension
> > (2) the reasoning behind this approach, (3) a comparison to other ideas
> > proposed so far, (4) a validation of the feasibility and potential
> > impact, and (5) an outlook on the next steps.
> >
> > (1) IR extension:
> > Parallel IR adds three new terminator instructions that define the
> > beginning and the end of parallel regions in the CFG. A parallel region
> > is a connected subgraph of the CFG that is potentially executed by two
> > threads in parallel. It can only be entered with a fork instruction and
> > spreads till a join instruction is reached. Therefor parallel regions
> > are single-entry-multiple-exit regions. Parallel regions can be nested
> > and if they are, they form a parallel region tree similar to the loop
> > tree maintained by the natural loop info pass. Each parallel region
> > defines two independent “sibling” tasks, namely the forked and
> > continuation task.
> >
> > The new instructions are defined as follows:
> >
> > 1. fork: marks the beginning of parallel region. Every fork has two
> >          successor blocks which represent two parallel tasks. We call
> >          these two “sibling” tasks the forked and continuation tasks.
> >          Nested forking is supported, meaning that another fork can be
> >          reach prior to the join.
> >
> > 2. halt: marks the end of a forked task. The "sibling" continuation block
> >          (see fork above) is the operand of the halt terminator. This
> >          represents the idea of asymmetric parallelism as introduced by
> >          [1]. One advantage of asymmetric parallelism is that sequential
> >          semantics of the program are clear from its CFG (ref. [1]).
> >          Note that the edge from a forked block to a continuation block
> >          (the one introduced by the halt) represents the control flow
> >          when the two successors of a fork execute sequentially, not
> >          when they execute in parallel. In the latter case there is no
> >          “control transfer” happening via this edge but only
> >          synchronization between the tasks.
> >
> > 3. join: marks a synchronization point and the end of a parallel region.
> >          Once a join terminator is reached by a thread, execution stops
> >          in that thread until all tasks spawned by that thread finish
> >          their work, thus reach their respective halt instruction.  A
> >          join shall only be reached by the continuation task of a fork,
> >          the forked task shall reach a halt with the continuation as a
> >          successor.
> >
> >
> > Here is an example of a parallel OpenMP loop and its idiomatic lowering
> > to Parallel IR. We set up a wiki [0] with additional examples.
> >
> > #pragma omp parallel
> > for(int i = 0; i < n; ++i) {
> >  A[i] = C[i];
> > }
> >
> >
> > preheader:
> >   br label %header
> >
> > header:
> >   %i = phi [ i32 0, %preheader ], [ %inc, %latch ]
> >   %done = icmp ge %i, %n
> >   br i1 %done, label %exit, label %body
> >
> > body:
> >   fork label %task, label %latch
> >
> > task:
> >   %aptr = getelementptr i32, i32* %A, i32 0, i32 %i
> >   %aval = load i32* %aptr
> >   %cptr = getelementptr i32, i32* %C, i32 0, i32 %i
> >   store i32 %aval, i32* %aptr
> >   halt label %latch
> >
> > latch:
> 
> Can we have a PHI node in this block?  If yes, how is the incoming
> value for %task computed when %task and %latch are running in
> parallel?
You _cannot_ place PHI nodes here. As you noticed, it is not obvious
which value should be forwarded to the PHI as _both_ predecessors are
executed.

> >   %inc = add i32, i32 %i, i32 1
> >   br label %header
> >
> > exit:
> >   join label %afterloop
> >
> > afterloop:
> > ...
> 
> Looks like there are no edges to %exit from "inside the loop"?  What
> is the control flow here?

Right, %exit is _not_ part of the loop but it is the synchronization point
_after_ the loop. The loop spawns one iteration at a time and after all
have been started it will wait at the join till all have finished.

> > (2) Reasoning:
> > The proposed approach is crafted such that the semantics of the parallel
> > program is represented correctly in almost native, low-level IR right
> > after front-end and preserved at any point till the final lowering to
> > sequential IR or parallel runtime library calls. To this end, asymmetric
> > parallelism is employed, a concept that uses control flow and the common
> > concept of dominance to represent parts of the parallel semantics. In
> > this model the parallel tasks do not dominate each other and only one
> > parallel task dominates the code after the parallel region. As a
> 
> Can you give an example to show what you mean by "only one parallel
> task dominates the code after the parallel region"?
I mean that %a below is not dominating %cont but only %b is.

> What about cases like these (in quasi-llvm syntax):
> 
>  body:
>    fork label %a, label %b
>  a:
>    x = alloca
>    use(x) // but not escape
>    halt label %b
>  b:
>    y = alloca
>    use(y) // but not escape
>    br label %cont
>  cont:
>    ...
> 
> =>
> 
>    common_alloca = alloca
>  body:
>    fork label %a, label %b
>  a:
>    use(common_alloca) // but not escape
>    halt label %b
>  b:
>    use(common_alloca) // but not escape
>    br label %cont
>  cont:
>    ...
> 
> 
> As far as I can tell, nothing in the IR tells LLVM that %a and %b may
> "interfere" with each other (by running in parallel).
Correct, this is something we have to teach LLVM. The fork does have an
effect to the memory _and_ to the stack. Since allocas (in loops) that
could escape have to be treated differently from other allocas we should
have code paths and checks in the pipeline we can leverage here.

> > consequence, various transformations that would break assumptions we
> > make about parallel regions cannot happen (see [3,4]). While the
> > explicitly modeled control flow together with dominance prevents various
> > code motion problems, the use of terminators helps to minimize the
> > changes needed to educate passes about parallel regions. Only a fraction
> > of analysis and transformation passes deal with terminators explicitly.
> > Most passes either test for known terminators (like branches), rely on
> > dominance information, or work on a basic block level. To even further
> > reduce changes to the existing passes, high-level concepts are broken
> > down to already available low-level concepts instead of introducing new,
> > semantically rich instructions/intrinsics (see the last paragraph of [5]
> > and section 4 in the PIR white paper [6] for examples). Finally, this
> > scheme allows a pass to simply reason about the sequential semantics of
> > a parallel region, transform it back to one if needed or deemed
> > beneficial and employ existing tooling solutions to debug and analyze
> > the code [7].
> >
> > (3) Comparison:
> > The BoF discussion sheet [8] and the recent “[RFC] on IR-level region
> > annotations” [9] both list pros and cons of different proposed schemes
> > and implementations. We summarize and comment the discussion on the ones
> > listen in the recent RFC here:
> >   (a) Metadata: It seems a consensus has been reached that metadata is
> >         not the solution but only a means to enhance a different solution.
> >   (c) One Intrinsic per directive/clause: This approach basically embeds
> >         a high-level (parallel) language in LLVM IR using intrinsics. It
> >         seems there is little to no support for this approach at the
> >         moment.
> >   (d) Parallel loop/region annotations: Here, intrinsics enclosing a
> >         parallel loop/region are used to represent parallelism.
> >         High-level knowledge is represented as attached metadata or in
> >         separate intrinsics. For more details please see the original
> >         RFC [9]. In the discussion several potential drawbacks have been
> >         mentioned:
> >         - The annotations might be too general [10].
> >         - The IR is not semantically correct (or ready for optimization)
> >           after the front-end and needs an additional “prepare phase for
> >           pre-privatization" [11].
> >         - The currently available “potential side effect for intrinsic
> >           calls” seem not to suffice for the proposed intrinsics as they
> >           do not have "call semantics" [12].
> >   (b) Parallel instructions (this approach): The table in the region RFC
> >         [9] lists two drawbacks with this approach, both of which have
> >         already been called into question [5]. The first drawback is the
> >         effort needed to implement this scheme which is discussed in
> >         more detail in section (4) of this mail. The second drawback is
> >         the need for additional representation of high-level information
> >         that is not part of the semantics of the new fork-join
> >         instructions. As mentioned above, the choice to keep the new
> >         instructions as simple as possible is deliberate. This parallel
> >         IR is intended to be extensible, and in particular, compatible
> >         with representations of high-level parallel concepts that might
> >         be developed in the future. For the time being, the parallel IR
> >         is compatible with approach taken today of lowering high-level
> >         parallel linguistics, such as reductions and private memory, to
> >         existing IR constructs, such as parallel-runtime calls,
> >         atomicrmw instructions, and well placed alloca’s [5,6]. Although
> >         other extensions to the IR might allow LLVM to compile these
> >         higher-level constructs more effectively, we see no reason the
> >         parallel IR would conflict with any such extensions. (On the
> >         contrary, the parallel IR would seem to help compiler analyses
> >         of higher-level parallel constructs by exposing logical
> >         parallelism.)
> >
> >
> > (4) Feasibility and Impact:
> > The Tapir and PIR prototypes demonstrate the feasibility of this
> > approach. The Tapir prototype [13] has recently proven its robustness as
> > the standard compiler in the MIT class on parallel programming. It was
> > implemented in ~ 5k LOC. However, >1k are explicit parallel
> > optimization, 1k is used to add new instructions (thus mechanical) and
> > 2k are used to lower the parallelism (basically needed for any scheme).
> > Only the rest is required to make it work with existing analysis and
> > transformation passes. While Tapir added explicit optimization passes
> > for parallel regions/loops, the representation allows for a variety of
> > classic optimizations (CSE, GVN, LICM, loop unrolling, TRE) to work with
> > little to no modifications. Potential speedups compared to a classic
> > “early-outlining” approaches can also be seen in the Tapir paper [13].
> > For the PIR prototype [14] we modified only three transformation passes
> > (<20 LOC) [15] before we could run the O3 pipeline successfully on a
> > parallel matrix multiplication.
> >
> > Together, these prototypes show how little passes actually inspect new
> > (or “unknown”) terminators. The default assumption passes have to make,
> > namely that control might be transferred to any successors at runtime,
> > has, in terms of potential compiler transformations, a similar effect as
> > the parallel semantics we want to model, namely that control is
> > transferred to all successors.
> >
> >
> > (5) Outlook:
> > This first stage will only introduce and test the new instructions and
> > the sequentialization pass. Afterwards we intend to start additions in
> > different, partially overlapping but often orthogonal directions. We do
> > welcome comments as well as developers for each of them:
> >  - Analysis and optimization:
> >     * A “parallel region info” pass to keep track of parallel regions
> >       and their nesting. The information can be made accessible in a
> >       “parallel region tree” similar to the loop tree maintained by the
> >       loop info pass. [stage 1, immediate next goal]
> >     * Extension of the verifier that allow to check parallel IR for
> >       “well-formedness”. [stage 1, immediate next goal]
> >     * Documentation of the PIR instructions in the language reference.
> >       [stage 1, immediate next goal]
> >     * A cost analysis for parallel tasks that can be queried by
> >       optimizations. The cost model needs to take the hardware, the
> >       runtime library and the parallel tasks into account.
> >     * Vectorizer enhancements to enable the vectorization of parallel
> >     * loops and tasks.
> >     * Parallelization centric optimizations:
> >       a) Parallel tasks can be balanced, merged or split as well as created
> >          from and lowered to sequential code.
> >       b) Barriers can be eliminated.
> >       c) Parallel loops can be statically scheduled or created from
> >          parallel recursive calls [13]
> >     * Analysis to extract high-level information (reductions, private
> >       memory, ...) from the low-level representation.
> >  - Front-end:
> >     * Lowering of simple OpenMP and Cilk++ annotations to PIR, including
> >       parallel sections and parallel loops with limited support for
> >       clauses (at first) (examples can be found here [1]). [milestone 1]
> >     * Generation of PIR code through automatic parallelization. A
> >       patched version of Polly exists that emits parallel loops using PIR instead of
> >       OpenMP runtime calls or llvm.parallel.loop metadata. [milestone 1]
> >     * Representation of more evolved high-level features like assignment
> >       of computation units.
> >  - Back-end:
> >     * Lowering of PIR regions to calls to the OpenMP (GOMP) and Cilk++
> >       runtime library. [milestone 1]
> >     * A simple parallel library, e.g., based on pthreads, to be shipped
> >       with LLVM as a fallback implementation for parallel regions.
> >
> >
> > Thank you all for your time and hopefully constructive input on this proposal!
> >
> > Cheers,
> >   Johannes, on behalf of the PIR team
> >
> >
> > Disclaimer:
> > This RFC, the patches, the wiki, etc. are a joint effort by Tao B.
> > Schardl (MIT), Charles E. Leiserson (MIT), Kareem Ergawy (Saarland
> > University), Simon Moll (Saarland University) and myself. However, ideas
> > and feedback came from many people, including the members of the
> > LLVM-HPC IR Extensions working group (Hal Finkel, Xinmin Tian, ...), the
> > participants in the BoF at the US Developers’ meeting, everybody that
> > commented on the BoF discussion sheet [16] and the recent RFC on
> > IR-level region annotations [9] (Mehdi Amini, Sanjoy Das, Daniel Berlin,
> > ...).
> >
> >
> >
> > [0] https://github.com/Parallel-IR/llvm-pir/wiki
> > [1] https://reviews.llvm.org/D29250
> > [2] https://reviews.llvm.org/D29251
> > [3] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109302.html
> > [4] http://lists.llvm.org/pipermail/llvm-dev/2015-March/083348.html
> > [5] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109264.html
> > [6] http://compilers.cs.uni-saarland.de/people/doerfert/parallelcfg.pdf
> > [7] http://supertech.csail.mit.edu/papers/spbags.pdf  & www.cse.wustl.edu/~angelee/papers/cilkprof.pdf
> > [8] https://goo.gl/Blp2Xr
> > [9] http://lists.llvm.org/pipermail/llvm-dev/2017-January/108906.html
> > [10] http://lists.llvm.org/pipermail/llvm-dev/2017-January/108997.html
> > [11] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109377.html
> > [12] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109351.html
> > [13] http://wsmoses.com/tapir.pdf
> > [14] https://github.com/jdoerfert/llvm-pir/tree/feature/fork-join
> > [15] https://github.com/jdoerfert/llvm-pir/commit/854259881d24d71f9f1f17e52547758c7be0618a
> > [16] https://goo.gl/wKps3c
> >
> >
> > --
> >
> > Johannes Doerfert
> > Researcher / PhD Student
> >
> > Compiler Design Lab (Prof. Hack)
> > Saarland Informatics Campus, Germany
> > Building E1.3, Room 4.31
> >
> > Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de
> > Fax. +49 (0)681 302-3065  : http://www.cdl.uni-saarland.de/people/doerfert
> >
> > _______________________________________________
> > LLVM Developers mailing list
> > llvm-dev at lists.llvm.org
> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >

-- 

Johannes Doerfert
Researcher / PhD Student

Compiler Design Lab (Prof. Hack)
Saarland Informatics Campus, Germany
Building E1.3, Room 4.31

Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de
Fax. +49 (0)681 302-3065  : http://www.cdl.uni-saarland.de/people/doerfert
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 228 bytes
Desc: Digital signature
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170207/3e5ee744/attachment.sig>