[llvm-dev] [RFC][PIR] Parallel LLVM IR -- Stage 0 -- IR extension

Mon Feb 6 23:55:23 PST 2017

Hi Johannes,

Sorry for the delayed response!  I have some basic questions inline:

On Sat, Jan 28, 2017 at 6:07 AM, Johannes Doerfert via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> This RFC proposes three new LLVM IR instructions to express high-level
> parallel constructs in a simple, low-level fashion. For this first stage
> we prepared two commits that add the proposed instructions and a pass to
> lower them to obtain sequential IR. Both patches have be uploaded for
> review [1, 2]. The latter patch is very simple and the former consists
> of almost only mechanical changes needed to add new instructions.
>
> The rest of this email contains (1) an introduction of the IR extension
> (2) the reasoning behind this approach, (3) a comparison to other ideas
> proposed so far, (4) a validation of the feasibility and potential
> impact, and (5) an outlook on the next steps.
>
> (1) IR extension:
> Parallel IR adds three new terminator instructions that define the
> beginning and the end of parallel regions in the CFG. A parallel region
> is a connected subgraph of the CFG that is potentially executed by two
> threads in parallel. It can only be entered with a fork instruction and
> spreads till a join instruction is reached. Therefor parallel regions
> are single-entry-multiple-exit regions. Parallel regions can be nested
> and if they are, they form a parallel region tree similar to the loop
> tree maintained by the natural loop info pass. Each parallel region
> defines two independent “sibling” tasks, namely the forked and
> continuation task.
>
> The new instructions are defined as follows:
>
> 1. fork: marks the beginning of parallel region. Every fork has two
>          successor blocks which represent two parallel tasks. We call
>          these two “sibling” tasks the forked and continuation tasks.
>          Nested forking is supported, meaning that another fork can be
>          reach prior to the join.
>
> 2. halt: marks the end of a forked task. The "sibling" continuation block
>          (see fork above) is the operand of the halt terminator. This
>          represents the idea of asymmetric parallelism as introduced by
>          [1]. One advantage of asymmetric parallelism is that sequential
>          semantics of the program are clear from its CFG (ref. [1]).
>          Note that the edge from a forked block to a continuation block
>          (the one introduced by the halt) represents the control flow
>          when the two successors of a fork execute sequentially, not
>          when they execute in parallel. In the latter case there is no
>          “control transfer” happening via this edge but only
>          synchronization between the tasks.
>
> 3. join: marks a synchronization point and the end of a parallel region.
>          Once a join terminator is reached by a thread, execution stops
>          in that thread until all tasks spawned by that thread finish
>          their work, thus reach their respective halt instruction.  A
>          join shall only be reached by the continuation task of a fork,
>          the forked task shall reach a halt with the continuation as a
>          successor.
>
>
> Here is an example of a parallel OpenMP loop and its idiomatic lowering
> to Parallel IR. We set up a wiki [0] with additional examples.
>
> #pragma omp parallel
> for(int i = 0; i < n; ++i) {
>  A[i] = C[i];
> }
>
>
> preheader:
>   br label %header
>
> header:
>   %i = phi [ i32 0, %preheader ], [ %inc, %latch ]
>   %done = icmp ge %i, %n
>   br i1 %done, label %exit, label %body
>
> body:
>   fork label %task, label %latch
>
> task:
>   %aptr = getelementptr i32, i32* %A, i32 0, i32 %i
>   %aval = load i32* %aptr
>   %cptr = getelementptr i32, i32* %C, i32 0, i32 %i
>   store i32 %aval, i32* %aptr
>   halt label %latch
>
> latch:

Can we have a PHI node in this block?  If yes, how is the incoming
value for %task computed when %task and %latch are running in
parallel?

>   %inc = add i32, i32 %i, i32 1
>   br label %header
>
> exit:
>   join label %afterloop
>
> afterloop:
> ...

Looks like there are no edges to %exit from "inside the loop"?  What
is the control flow here?

> (2) Reasoning:
> The proposed approach is crafted such that the semantics of the parallel
> program is represented correctly in almost native, low-level IR right
> after front-end and preserved at any point till the final lowering to
> sequential IR or parallel runtime library calls. To this end, asymmetric
> parallelism is employed, a concept that uses control flow and the common
> concept of dominance to represent parts of the parallel semantics. In
> this model the parallel tasks do not dominate each other and only one
> parallel task dominates the code after the parallel region. As a

Can you give an example to show what you mean by "only one parallel
task dominates the code after the parallel region"?

What about cases like these (in quasi-llvm syntax):

 body:
   fork label %a, label %b
 a:
   x = alloca
   use(x) // but not escape
   halt label %b
 b:
   y = alloca
   use(y) // but not escape
   br label %cont
 cant:
   ...

=>

   common_alloca = alloca
 body:
   fork label %a, label %b
 a:
   use(common_alloca) // but not escape
   halt label %b
 b:
   use(common_alloca) // but not escape
   br label %cont
 cant:
   ...

As far as I can tell, nothing in the IR tells LLVM that %a and %b may
"interfere" with each other (by running in parallel).

> consequence, various transformations that would break assumptions we
> make about parallel regions cannot happen (see [3,4]). While the
> explicitly modeled control flow together with dominance prevents various
> code motion problems, the use of terminators helps to minimize the
> changes needed to educate passes about parallel regions. Only a fraction
> of analysis and transformation passes deal with terminators explicitly.
> Most passes either test for known terminators (like branches), rely on
> dominance information, or work on a basic block level. To even further
> reduce changes to the existing passes, high-level concepts are broken
> down to already available low-level concepts instead of introducing new,
> semantically rich instructions/intrinsics (see the last paragraph of [5]
> and section 4 in the PIR white paper [6] for examples). Finally, this
> scheme allows a pass to simply reason about the sequential semantics of
> a parallel region, transform it back to one if needed or deemed
> beneficial and employ existing tooling solutions to debug and analyze
> the code [7].
>
> (3) Comparison:
> The BoF discussion sheet [8] and the recent “[RFC] on IR-level region
> annotations” [9] both list pros and cons of different proposed schemes
> and implementations. We summarize and comment the discussion on the ones
> listen in the recent RFC here:
>   (a) Metadata: It seems a consensus has been reached that metadata is
>         not the solution but only a means to enhance a different solution.
>   (c) One Intrinsic per directive/clause: This approach basically embeds
>         a high-level (parallel) language in LLVM IR using intrinsics. It
>         seems there is little to no support for this approach at the
>         moment.
>   (d) Parallel loop/region annotations: Here, intrinsics enclosing a
>         parallel loop/region are used to represent parallelism.
>         High-level knowledge is represented as attached metadata or in
>         separate intrinsics. For more details please see the original
>         RFC [9]. In the discussion several potential drawbacks have been
>         mentioned:
>         - The annotations might be too general [10].
>         - The IR is not semantically correct (or ready for optimization)
>           after the front-end and needs an additional “prepare phase for
>           pre-privatization" [11].
>         - The currently available “potential side effect for intrinsic
>           calls” seem not to suffice for the proposed intrinsics as they
>           do not have "call semantics" [12].
>   (b) Parallel instructions (this approach): The table in the region RFC
>         [9] lists two drawbacks with this approach, both of which have
>         already been called into question [5]. The first drawback is the
>         effort needed to implement this scheme which is discussed in
>         more detail in section (4) of this mail. The second drawback is
>         the need for additional representation of high-level information
>         that is not part of the semantics of the new fork-join
>         instructions. As mentioned above, the choice to keep the new
>         instructions as simple as possible is deliberate. This parallel
>         IR is intended to be extensible, and in particular, compatible
>         with representations of high-level parallel concepts that might
>         be developed in the future. For the time being, the parallel IR
>         is compatible with approach taken today of lowering high-level
>         parallel linguistics, such as reductions and private memory, to
>         existing IR constructs, such as parallel-runtime calls,
>         atomicrmw instructions, and well placed alloca’s [5,6]. Although
>         other extensions to the IR might allow LLVM to compile these
>         higher-level constructs more effectively, we see no reason the
>         parallel IR would conflict with any such extensions. (On the
>         contrary, the parallel IR would seem to help compiler analyses
>         of higher-level parallel constructs by exposing logical
>         parallelism.)
>
>
> (4) Feasibility and Impact:
> The Tapir and PIR prototypes demonstrate the feasibility of this
> approach. The Tapir prototype [13] has recently proven its robustness as
> the standard compiler in the MIT class on parallel programming. It was
> implemented in ~ 5k LOC. However, >1k are explicit parallel
> optimization, 1k is used to add new instructions (thus mechanical) and
> 2k are used to lower the parallelism (basically needed for any scheme).
> Only the rest is required to make it work with existing analysis and
> transformation passes. While Tapir added explicit optimization passes
> for parallel regions/loops, the representation allows for a variety of
> classic optimizations (CSE, GVN, LICM, loop unrolling, TRE) to work with
> little to no modifications. Potential speedups compared to a classic
> “early-outlining” approaches can also be seen in the Tapir paper [13].
> For the PIR prototype [14] we modified only three transformation passes
> (<20 LOC) [15] before we could run the O3 pipeline successfully on a
> parallel matrix multiplication.
>
> Together, these prototypes show how little passes actually inspect new
> (or “unknown”) terminators. The default assumption passes have to make,
> namely that control might be transferred to any successors at runtime,
> has, in terms of potential compiler transformations, a similar effect as
> the parallel semantics we want to model, namely that control is
> transferred to all successors.
>
>
> (5) Outlook:
> This first stage will only introduce and test the new instructions and
> the sequentialization pass. Afterwards we intend to start additions in
> different, partially overlapping but often orthogonal directions. We do
> welcome comments as well as developers for each of them:
>  - Analysis and optimization:
>     * A “parallel region info” pass to keep track of parallel regions
>       and their nesting. The information can be made accessible in a
>       “parallel region tree” similar to the loop tree maintained by the
>       loop info pass. [stage 1, immediate next goal]
>     * Extension of the verifier that allow to check parallel IR for
>       “well-formedness”. [stage 1, immediate next goal]
>     * Documentation of the PIR instructions in the language reference.
>       [stage 1, immediate next goal]
>     * A cost analysis for parallel tasks that can be queried by
>       optimizations. The cost model needs to take the hardware, the
>       runtime library and the parallel tasks into account.
>     * Vectorizer enhancements to enable the vectorization of parallel
>     * loops and tasks.
>     * Parallelization centric optimizations:
>       a) Parallel tasks can be balanced, merged or split as well as created
>          from and lowered to sequential code.
>       b) Barriers can be eliminated.
>       c) Parallel loops can be statically scheduled or created from
>          parallel recursive calls [13]
>     * Analysis to extract high-level information (reductions, private
>       memory, ...) from the low-level representation.
>  - Front-end:
>     * Lowering of simple OpenMP and Cilk++ annotations to PIR, including
>       parallel sections and parallel loops with limited support for
>       clauses (at first) (examples can be found here [1]). [milestone 1]
>     * Generation of PIR code through automatic parallelization. A
>       patched version of Polly exists that emits parallel loops using PIR instead of
>       OpenMP runtime calls or llvm.parallel.loop metadata. [milestone 1]
>     * Representation of more evolved high-level features like assignment
>       of computation units.
>  - Back-end:
>     * Lowering of PIR regions to calls to the OpenMP (GOMP) and Cilk++
>       runtime library. [milestone 1]
>     * A simple parallel library, e.g., based on pthreads, to be shipped
>       with LLVM as a fallback implementation for parallel regions.
>
>
> Thank you all for your time and hopefully constructive input on this proposal!
>
> Cheers,
>   Johannes, on behalf of the PIR team
>
>
> Disclaimer:
> This RFC, the patches, the wiki, etc. are a joint effort by Tao B.
> Schardl (MIT), Charles E. Leiserson (MIT), Kareem Ergawy (Saarland
> University), Simon Moll (Saarland University) and myself. However, ideas
> and feedback came from many people, including the members of the
> LLVM-HPC IR Extensions working group (Hal Finkel, Xinmin Tian, ...), the
> participants in the BoF at the US Developers’ meeting, everybody that
> commented on the BoF discussion sheet [16] and the recent RFC on
> IR-level region annotations [9] (Mehdi Amini, Sanjoy Das, Daniel Berlin,
> ...).
>
>
>
> [0] https://github.com/Parallel-IR/llvm-pir/wiki
> [1] https://reviews.llvm.org/D29250
> [2] https://reviews.llvm.org/D29251
> [3] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109302.html
> [4] http://lists.llvm.org/pipermail/llvm-dev/2015-March/083348.html
> [5] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109264.html
> [6] http://compilers.cs.uni-saarland.de/people/doerfert/parallelcfg.pdf
> [7] http://supertech.csail.mit.edu/papers/spbags.pdf  & www.cse.wustl.edu/~angelee/papers/cilkprof.pdf
> [8] https://goo.gl/Blp2Xr
> [9] http://lists.llvm.org/pipermail/llvm-dev/2017-January/108906.html
> [10] http://lists.llvm.org/pipermail/llvm-dev/2017-January/108997.html
> [11] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109377.html
> [12] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109351.html
> [13] http://wsmoses.com/tapir.pdf
> [14] https://github.com/jdoerfert/llvm-pir/tree/feature/fork-join
> [15] https://github.com/jdoerfert/llvm-pir/commit/854259881d24d71f9f1f17e52547758c7be0618a
> [16] https://goo.gl/wKps3c
>
>
> --
>
> Johannes Doerfert
> Researcher / PhD Student
>
> Compiler Design Lab (Prof. Hack)
> Saarland Informatics Campus, Germany
> Building E1.3, Room 4.31
>
> Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de
> Fax. +49 (0)681 302-3065  : http://www.cdl.uni-saarland.de/people/doerfert
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>