[llvm-dev] (no subject)

Hal Finkel via llvm-dev llvm-dev at lists.llvm.org
Wed Mar 8 05:51:25 PST 2017


On 03/08/2017 07:36 AM, Johannes Doerfert wrote:
> Subject: Re: [llvm-dev] [RFC][PIR] Parallel LLVM IR -- Stage 0 -- IR extension
>
> Ping.
>
>
>
>
> PS.
>
>    Are there actually people interested in this?
>
>    We will continue working anyway but it might not make sense to put it
>    on reviews and announce it on the ML if nobody cares.

I certainly care, and will also respond to the RFC, this week or next.

  -Hal

>
>
>
> On 02/24, Johannes Doerfert via llvm-dev wrote:
>> To (re)start a discussion here we published two patches on
>> Phabricator:
>>
>> 1) A documentation draft (including LangRef) for the PIR instructions
>>     and the parallel regions [0].
>>
>> 2) A draft implementation for a ParallelRegionInfo pass that identifies
>>     and maintains parallel regions in the CFG [1]. It also serves as a
>>     verifier pass for now.
>>
>> Please take a look and comment on the patches either on Phabricator or
>> here on the list.
>>
>> Thanks!
>>
>> [0] https://reviews.llvm.org/D30353
>> [1] https://reviews.llvm.org/D30354
>>
>>
>> On 01/28, Johannes Doerfert via llvm-dev wrote:
>>> Dear all,
>>>
>>> This RFC proposes three new LLVM IR instructions to express high-level
>>> parallel constructs in a simple, low-level fashion. For this first stage
>>> we prepared two commits that add the proposed instructions and a pass to
>>> lower them to obtain sequential IR. Both patches have been uploaded for
>>> review [1, 2]. The latter patch is very simple and the former consists
>>> of almost only mechanical changes needed to add new instructions.
>>>
>>> The rest of this email contains (1) an introduction of the IR extension
>>> (2) the reasoning behind this approach, (3) a comparison to other ideas
>>> proposed so far, (4) a validation of the feasibility and potential
>>> impact, and (5) an outlook on the next steps.
>>>
>>> (1) IR extension:
>>> Parallel IR adds three new terminator instructions that define the
>>> beginning and the end of parallel regions in the CFG. A parallel region
>>> is a connected subgraph of the CFG that is potentially executed by two
>>> threads in parallel. It can only be entered with a fork instruction and
>>> extends until a join instruction is reached. Therefore, parallel regions
>>> are single-entry-multiple-exit regions. Parallel regions can be nested
>>> and if they are, they form a parallel region tree similar to the loop
>>> tree maintained by the natural loop info pass. Each parallel region
>>> defines two independent “sibling” tasks, namely the forked and
>>> continuation task.
>>>
>>> The new instructions are defined as follows:
>>>
>>> 1. fork: marks the beginning of a parallel region. Every fork has two
>>>           successor blocks which represent two parallel tasks. We call
>>>           these two “sibling” tasks the forked and continuation tasks.
>>>           Nested forking is supported, meaning that another fork can be
>>>           reached before the join.
>>>
>>> 2. halt: marks the end of a forked task. The "sibling" continuation block
>>>           (see fork above) is the operand of the halt terminator. This
>>>           represents the idea of asymmetric parallelism as introduced by
>>>           [1]. One advantage of asymmetric parallelism is that sequential
>>>           semantics of the program are clear from its CFG (ref. [1]).
>>>           Note that the edge from a forked block to a continuation block
>>>           (the one introduced by the halt) represents the control flow
>>>           when the two successors of a fork execute sequentially, not
>>>           when they execute in parallel. In the latter case there is no
>>>           “control transfer” happening via this edge but only
>>>           synchronization between the tasks.
>>>
>>> 3. join: marks a synchronization point and the end of a parallel region.
>>>           Once a join terminator is reached by a thread, execution stops
>>>           in that thread until all tasks spawned by that thread finish
>>>           their work, i.e., reach their respective halt instructions. A
>>>           join shall only be reached by the continuation task of a fork,
>>>           the forked task shall reach a halt with the continuation as a
>>>           successor.
>>>
>>>
>>> Here is an example of a parallel OpenMP loop and its idiomatic lowering
>>> to Parallel IR. We set up a wiki [0] with additional examples.
>>>
>>> #pragma omp parallel
>>> for(int i = 0; i < n; ++i) {
>>>   A[i] = C[i];
>>> }
>>>
>>>
>>> preheader:
>>>    br label %header
>>>
>>> header:
>>>    %i = phi i32 [ 0, %preheader ], [ %inc, %latch ]
>>>    %done = icmp sge i32 %i, %n
>>>    br i1 %done, label %exit, label %body
>>>
>>> body:
>>>    fork label %task, label %latch
>>>
>>> task:
>>>    %aptr = getelementptr i32, i32* %A, i32 %i
>>>    %cptr = getelementptr i32, i32* %C, i32 %i
>>>    %cval = load i32, i32* %cptr
>>>    store i32 %cval, i32* %aptr
>>>    halt label %latch
>>>
>>> latch:
>>>    %inc = add i32 %i, 1
>>>    br label %header
>>>
>>> exit:
>>>    join label %afterloop
>>>
>>> afterloop:
>>> ...
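>>>
>>> As an additional illustration (the wiki [0] contains more and larger
>>> examples), here is a minimal sketch of a nested parallel region; the
>>> labels are ours and purely illustrative:
>>>
>>> outer:
>>>    fork label %outer.task, label %outer.cont
>>>
>>> outer.task:
>>>    fork label %inner.task, label %inner.cont
>>>
>>> inner.task:
>>>    ; innermost parallel work
>>>    halt label %inner.cont
>>>
>>> inner.cont:
>>>    ; continuation of the inner fork, ends the inner region
>>>    join label %outer.task.end
>>>
>>> outer.task.end:
>>>    ; the outer forked task ends with a halt to its sibling continuation
>>>    halt label %outer.cont
>>>
>>> outer.cont:
>>>    ; continuation of the outer fork, ends the outer region
>>>    join label %after
>>>
>>> The two regions form a small parallel region tree, with the inner
>>> region a child of the outer one.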
>>>
>>>
>>>
>>> (2) Reasoning:
>>> The proposed approach is crafted such that the semantics of the parallel
>>> program are represented correctly in almost native, low-level IR right
>>> after the front-end and preserved at every point until the final
>>> lowering to sequential IR or parallel runtime library calls. To this
>>> end, asymmetric parallelism is employed, a concept that uses control
>>> flow and the standard notion of dominance to represent parts of the
>>> parallel semantics. In
>>> this model the parallel tasks do not dominate each other and only one
>>> parallel task dominates the code after the parallel region. As a
>>> consequence, various transformations that would break assumptions we
>>> make about parallel regions cannot happen (see [3,4]). While the
>>> explicitly modeled control flow together with dominance prevents various
>>> code motion problems, the use of terminators helps to minimize the
>>> changes needed to educate passes about parallel regions. Only a fraction
>>> of analysis and transformation passes deal with terminators explicitly.
>>> Most passes either test for known terminators (like branches), rely on
>>> dominance information, or work on a basic block level. To even further
>>> reduce changes to the existing passes, high-level concepts are broken
>>> down to already available low-level concepts instead of introducing new,
>>> semantically rich instructions/intrinsics (see the last paragraph of [5]
>>> and section 4 in the PIR white paper [6] for examples). Finally, this
>>> scheme allows a pass to simply reason about the sequential semantics of
>>> a parallel region, transform it back to sequential code if needed or
>>> deemed beneficial, and employ existing tooling solutions to debug and
>>> analyze the code [7].
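>>>
>>> To make the last point concrete, here is a minimal sketch of what the
>>> sequentialization amounts to (our illustration of the idea, not the
>>> patch itself): since the halt edge already encodes the sequential
>>> control flow, each new terminator is replaced by an unconditional
>>> branch, a fork to its forked successor and a halt/join to its single
>>> successor:
>>>
>>>    fork label %task, label %latch   -->   br label %task
>>>    halt label %latch                -->   br label %latch
>>>    join label %afterloop            -->   br label %afterloop
>>>
>>> The result is a plain sequential CFG that existing debugging and
>>> analysis tooling understands.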
>>>
>>> (3) Comparison:
>>> The BoF discussion sheet [8] and the recent “[RFC] on IR-level region
>>> annotations” [9] both list pros and cons of different proposed schemes
>>> and implementations. We summarize and comment on the discussion of the
>>> schemes listed in the recent RFC here:
>>>    (a) Metadata: It seems a consensus has been reached that metadata is
>>>          not the solution but only a means to enhance a different solution.
>>>    (c) One Intrinsic per directive/clause: This approach basically embeds
>>>          a high-level (parallel) language in LLVM IR using intrinsics. It
>>>          seems there is little to no support for this approach at the
>>>          moment.
>>>    (d) Parallel loop/region annotations: Here, intrinsics enclosing a
>>>          parallel loop/region are used to represent parallelism.
>>>          High-level knowledge is represented as attached metadata or in
>>>          separate intrinsics. For more details please see the original
>>>          RFC [9]. In the discussion several potential drawbacks have been
>>>          mentioned:
>>>          - The annotations might be too general [10].
>>>          - The IR is not semantically correct (or ready for optimization)
>>>            after the front-end and needs an additional “prepare phase for
>>>            pre-privatization" [11].
>>>          - The currently available “potential side effect for intrinsic
>>>            calls” seems not to suffice for the proposed intrinsics as they
>>>            do not have "call semantics" [12].
>>>    (b) Parallel instructions (this approach): The table in the region RFC
>>>          [9] lists two drawbacks with this approach, both of which have
>>>          already been called into question [5]. The first drawback is the
>>>          effort needed to implement this scheme which is discussed in
>>>          more detail in section (4) of this mail. The second drawback is
>>>          the need for additional representation of high-level information
>>>          that is not part of the semantics of the new fork-join
>>>          instructions. As mentioned above, the choice to keep the new
>>>          instructions as simple as possible is deliberate. This parallel
>>>          IR is intended to be extensible, and in particular, compatible
>>>          with representations of high-level parallel concepts that might
>>>          be developed in the future. For the time being, the parallel IR
>>>          is compatible with the approach taken today of lowering
>>>          high-level parallel language constructs, such as reductions
>>>          and private memory, to existing IR constructs, such as
>>>          parallel-runtime calls, atomicrmw instructions, and
>>>          well-placed allocas [5,6] (see the sketch after this list). Although
>>>          other extensions to the IR might allow LLVM to compile these
>>>          higher-level constructs more effectively, we see no reason the
>>>          parallel IR would conflict with any such extensions. (On the
>>>          contrary, the parallel IR would seem to help compiler analyses
>>>          of higher-level parallel constructs by exposing logical
>>>          parallelism.)
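>>>
>>> As a small sketch of this lowering (our illustration; %sum, %cval and
>>> %priv are made-up names): a '+' reduction inside a forked task can be
>>> expressed with an atomicrmw on the shared accumulator, and private
>>> memory becomes an alloca local to the task:
>>>
>>> task:
>>>    %priv = alloca i32                                 ; task-private memory
>>>    %cval = load i32, i32* %cptr
>>>    %old = atomicrmw add i32* %sum, i32 %cval seq_cst  ; '+' reduction
>>>    halt label %latch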
>>>
>>>
>>> (4) Feasibility and Impact:
>>> The Tapir and PIR prototypes demonstrate the feasibility of this
>>> approach. The Tapir prototype [13] has recently proven its robustness as
>>> the standard compiler in the MIT class on parallel programming. It was
>>> implemented in ~5k LOC. However, >1k of those are explicit parallel
>>> optimizations, 1k is used to add the new instructions (thus
>>> mechanical), and 2k is used to lower the parallelism (basically needed
>>> for any scheme). Only the remainder (roughly 1k LOC) is required to
>>> make it work with existing analysis and
>>> transformation passes. While Tapir added explicit optimization passes
>>> for parallel regions/loops, the representation allows for a variety of
>>> classic optimizations (CSE, GVN, LICM, loop unrolling, TRE) to work with
>>> little to no modifications. Potential speedups compared to classic
>>> “early-outlining” approaches can also be seen in the Tapir paper [13].
>>> For the PIR prototype [14] we modified only three transformation passes
>>> (<20 LOC) [15] before we could run the O3 pipeline successfully on a
>>> parallel matrix multiplication.
>>>
>>> Together, these prototypes show how few passes actually inspect new
>>> (or “unknown”) terminators. The default assumption passes have to make,
>>> namely that control might be transferred to any successors at runtime,
>>> has, in terms of potential compiler transformations, an effect similar to
>>> the parallel semantics we want to model, namely that control is
>>> transferred to all successors.
>>>
>>>
>>> (5) Outlook:
>>> This first stage will only introduce and test the new instructions and
>>> the sequentialization pass. Afterwards we intend to start additions in
>>> different, partially overlapping but often orthogonal directions. We
>>> welcome comments as well as developers for each of them:
>>>   - Analysis and optimization:
>>>      * A “parallel region info” pass to keep track of parallel regions
>>>        and their nesting. The information can be made accessible in a
>>>        “parallel region tree” similar to the loop tree maintained by the
>>>        loop info pass. [stage 1, immediate next goal]
>>>      * Extensions of the verifier that allow checking parallel IR for
>>>        “well-formedness”. [stage 1, immediate next goal]
>>>      * Documentation of the PIR instructions in the language reference.
>>>        [stage 1, immediate next goal]
>>>      * A cost analysis for parallel tasks that can be queried by
>>>        optimizations. The cost model needs to take the hardware, the
>>>        runtime library and the parallel tasks into account.
>>>      * Vectorizer enhancements to enable the vectorization of parallel
>>>        loops and tasks.
>>>      * Parallelization centric optimizations:
>>>        a) Parallel tasks can be balanced, merged or split as well as created
>>>           from and lowered to sequential code.
>>>        b) Barriers can be eliminated.
>>>        c) Parallel loops can be statically scheduled or created from
>>>           parallel recursive calls [13].
>>>      * Analysis to extract high-level information (reductions, private
>>>        memory, ...) from the low-level representation.
>>>   - Front-end:
>>>      * Lowering of simple OpenMP and Cilk++ annotations to PIR, including
>>>        parallel sections and parallel loops with limited support for
>>>        clauses (at first) (examples can be found here [1]). [milestone 1]
>>>      * Generation of PIR code through automatic parallelization. A
>>>        patched version of Polly exists that emits parallel loops using
>>>        PIR instead of OpenMP runtime calls or llvm.parallel.loop
>>>        metadata. [milestone 1]
>>>      * Representation of more advanced high-level features like the
>>>        assignment of computation units.
>>>   - Back-end:
>>>      * Lowering of PIR regions to calls to the OpenMP (GOMP) and Cilk++
>>>        runtime library. [milestone 1]
>>>      * A simple parallel library, e.g., based on pthreads, to be shipped
>>>        with LLVM as a fallback implementation for parallel regions.
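>>>
>>> To sketch this back-end direction (our illustration; @pir_spawn,
>>> @pir_sync, @task.outlined and %payload are hypothetical placeholder
>>> names, not an existing API): the forked task is outlined into a
>>> function, the fork becomes a call into the runtime, and the join
>>> becomes a synchronization call:
>>>
>>> declare void @pir_spawn(void (i8*)*, i8*)  ; could wrap pthread_create
>>> declare void @pir_sync()                   ; could wrap pthread_join
>>>
>>> body:
>>>    call void @pir_spawn(void (i8*)* @task.outlined, i8* %payload)
>>>    br label %latch
>>>
>>> exit:
>>>    call void @pir_sync()
>>>    br label %afterloop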
>>>
>>>
>>> Thank you all for your time and hopefully constructive input on this proposal!
>>>
>>> Cheers,
>>>    Johannes, on behalf of the PIR team
>>>
>>>
>>> Disclaimer:
>>> This RFC, the patches, the wiki, etc. are a joint effort by Tao B.
>>> Schardl (MIT), Charles E. Leiserson (MIT), Kareem Ergawy (Saarland
>>> University), Simon Moll (Saarland University) and myself. However, ideas
>>> and feedback came from many people, including the members of the
>>> LLVM-HPC IR Extensions working group (Hal Finkel, Xinmin Tian, ...), the
>>> participants in the BoF at the US Developers’ Meeting, everybody who
>>> commented on the BoF discussion sheet [16] and the recent RFC on
>>> IR-level region annotations [9] (Mehdi Amini, Sanjoy Das, Daniel Berlin,
>>> ...).
>>>   
>>>
>>>
>>> [0] https://github.com/Parallel-IR/llvm-pir/wiki
>>> [1] https://reviews.llvm.org/D29250
>>> [2] https://reviews.llvm.org/D29251
>>> [3] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109302.html
>>> [4] http://lists.llvm.org/pipermail/llvm-dev/2015-March/083348.html
>>> [5] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109264.html
>>> [6] http://compilers.cs.uni-saarland.de/people/doerfert/parallelcfg.pdf
>>> [7] http://supertech.csail.mit.edu/papers/spbags.pdf  & www.cse.wustl.edu/~angelee/papers/cilkprof.pdf
>>> [8] https://goo.gl/Blp2Xr
>>> [9] http://lists.llvm.org/pipermail/llvm-dev/2017-January/108906.html
>>> [10] http://lists.llvm.org/pipermail/llvm-dev/2017-January/108997.html
>>> [11] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109377.html
>>> [12] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109351.html
>>> [13] http://wsmoses.com/tapir.pdf
>>> [14] https://github.com/jdoerfert/llvm-pir/tree/feature/fork-join
>>> [15] https://github.com/jdoerfert/llvm-pir/commit/854259881d24d71f9f1f17e52547758c7be0618a
>>> [16] https://goo.gl/wKps3c
>>>
>>>
>>> -- 
>>>
>>> Johannes Doerfert
>>> Researcher / PhD Student
>>>
>>> Compiler Design Lab (Prof. Hack)
>>> Saarland Informatics Campus, Germany
>>> Building E1.3, Room 4.31
>>>
>>> Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de
>>> Fax. +49 (0)681 302-3065  : http://www.cdl.uni-saarland.de/people/doerfert
>>
>>
>
>
>

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


