[llvm-dev] [RFC][PIR] Parallel LLVM IR -- Stage 0 -- IR

Hal Finkel via llvm-dev llvm-dev at lists.llvm.org
Wed Mar 8 08:52:16 PST 2017


(adding the subject back to this thread)


On 03/08/2017 10:50 AM, Tian, Xinmin wrote:
> A quick update: we have been looking through all LLVM passes to identify the impact of "IR-region annotation" and the interaction issues with the rest of LoopOpt and ScalarOpt, e.g., the interaction with vectorization when you have schedule(simd:guided, 64), and which common properties of IR-region annotations the optimizer needs to know about. Our implementation works at O0, O1, O2 and O3. So far, the changes we made in the existing LLVM opt passes are < 200 LOC. The Clang FE builds executables end-to-end, linked against our OMP library, and we have also updated our implementation to use token/tag operands on the intrinsics, based on feedback from Google (David) and Xilinx (Hongbin). I still owe Mehdi some answers.
>
> One piece of feedback on the PIR RFC: putting "fork" into the loop body does not work for the loop vectorizer; it shuts down vectorization. This is the same issue as with MIT's scheme.
>
> Thanks,
> Xinmin
>
>
> -----Original Message-----
> From: Hal Finkel [mailto:hfinkel at anl.gov]
> Sent: Wednesday, March 8, 2017 5:51 AM
> To: Johannes Doerfert <doerfert at cs.uni-saarland.de>; LLVM-Dev <llvm-dev at lists.llvm.org>
> Cc: Sanjoy Das <sanjoy at playingwithpointers.com>; Mehdi Amini <mehdi.amini at apple.com>; Chandler Carruth <chandlerc at gmail.com>; Adve, Vikram Sadanand <vadve at illinois.edu>; Tian, Xinmin <xinmin.tian at intel.com>; TB Schardl <neboat at mit.edu>; acjacob at us.ibm.com
> Subject: Re:
>
>
> On 03/08/2017 07:36 AM, Johannes Doerfert wrote:
>>
>> Ping.
>>
>>
>>
>>
>> PS.
>>
>>     Are there actually people interested in this?
>>
>>     We will continue working anyway, but it might not make sense to put it
>>     up for review and announce it on the ML if nobody cares.
> I certainly care, and will also respond to the RFC, this week or next.
>
>    -Hal
>
>>
>>
>> On 02/24, Johannes Doerfert via llvm-dev wrote:
>>> To (re)start the discussion here, we published two patches on
>>> Phabricator:
>>>
>>> 1) A documentation draft (including LangRef) for the PIR instructions
>>>      and the parallel regions [0].
>>>
>>> 2) A draft implementation for a ParallelRegionInfo pass that identifies
>>>      and maintains parallel regions in the CFG [1]. It also serves as a
>>>      verifier pass for now.
>>>
>>> Please take a look and comment on the patches either on Phabricator
>>> or here on the list.
>>>
>>> Thanks!
>>>
>>> [0] https://reviews.llvm.org/D30353
>>> [1] https://reviews.llvm.org/D30354
>>>
>>>
>>> On 01/28, Johannes Doerfert via llvm-dev wrote:
>>>> Dear all,
>>>>
>>>> This RFC proposes three new LLVM IR instructions to express
>>>> high-level parallel constructs in a simple, low-level fashion. For
>>>> this first stage we prepared two commits that add the proposed
>>>> instructions and a pass to lower them to obtain sequential IR. Both
>>>> patches have been uploaded for review [1, 2]. The latter patch is very
>>>> simple and the former consists almost entirely of the mechanical
>>>> changes needed to add the new instructions.
>>>>
>>>> The rest of this email contains (1) an introduction of the IR
>>>> extension,
>>>> (2) the reasoning behind this approach, (3) a comparison to other
>>>> ideas proposed so far, (4) a validation of the feasibility and
>>>> potential impact, and (5) an outlook on the next steps.
>>>>
>>>> (1) IR extension:
>>>> Parallel IR adds three new terminator instructions that define the
>>>> beginning and the end of parallel regions in the CFG. A parallel
>>>> region is a connected subgraph of the CFG that is potentially
>>>> executed by two threads in parallel. It can only be entered with a
>>>> fork instruction and extends until a join instruction is reached.
>>>> Therefore, parallel regions are single-entry-multiple-exit regions.
>>>> Parallel regions can be nested; if they are, they form a parallel
>>>> region tree similar to the loop tree maintained by the natural loop
>>>> info pass. Each parallel region defines two independent “sibling”
>>>> tasks, namely the forked and continuation task.
>>>>
>>>> The new instructions are defined as follows:
>>>>
>>>> 1. fork: marks the beginning of a parallel region. Every fork has two
>>>>            successor blocks which represent two parallel tasks. We call
>>>>            these two “sibling” tasks the forked and continuation tasks.
>>>>            Nested forking is supported, meaning that another fork can be
>>>>            reached before the join (a minimal nested example follows
>>>>            after these definitions).
>>>>
>>>> 2. halt: marks the end of a forked task. The "sibling" continuation block
>>>>            (see fork above) is the operand of the halt terminator. This
>>>>            represents the idea of asymmetric parallelism as introduced by
>>>>            [1]. One advantage of asymmetric parallelism is that sequential
>>>>            semantics of the program are clear from its CFG (ref. [1]).
>>>>            Note that the edge from a forked block to a continuation block
>>>>            (the one introduced by the halt) represents the control flow
>>>>            when the two successors of a fork execute sequentially, not
>>>>            when they execute in parallel. In the latter case there is no
>>>>            “control transfer” happening via this edge but only
>>>>            synchronization between the tasks.
>>>>
>>>> 3. join: marks a synchronization point and the end of a parallel region.
>>>>            Once a join terminator is reached by a thread, execution stops
>>>>            in that thread until all tasks spawned by that thread finish
>>>>            their work, i.e., reach their respective halt instructions. A
>>>>            join shall only be reached by the continuation task of a fork;
>>>>            the forked task shall reach a halt with the continuation as
>>>>            its successor.
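>>>>
>>>> As noted in the definition of fork above, parallel regions can be
>>>> nested. The following is a minimal, purely illustrative sketch of two
>>>> nested regions (the block names are hypothetical):
>>>>
>>>> entry:
>>>>     fork label %outer.forked, label %outer.cont
>>>>
>>>> outer.forked:
>>>>     ; the forked task of the outer region forks again
>>>>     fork label %inner.forked, label %inner.cont
>>>>
>>>> inner.forked:
>>>>     ; ... nested task work ...
>>>>     halt label %inner.cont
>>>>
>>>> inner.cont:
>>>>     ; wait for the nested task, ending the inner region
>>>>     join label %outer.halt
>>>>
>>>> outer.halt:
>>>>     halt label %outer.cont
>>>>
>>>> outer.cont:
>>>>     ; ... continuation task of the outer region ...
>>>>     join label %after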
>>>>
>>>>
>>>> Here is an example of a parallel OpenMP loop and its idiomatic
>>>> lowering to Parallel IR. We set up a wiki [0] with additional examples.
>>>>
>>>> #pragma omp parallel for
>>>> for(int i = 0; i < n; ++i) {
>>>>    A[i] = C[i];
>>>> }
>>>>
>>>>
>>>> preheader:
>>>>     br label %header
>>>>
>>>> header:
>>>>     %i = phi i32 [ 0, %preheader ], [ %inc, %latch ]
>>>>     %done = icmp sge i32 %i, %n
>>>>     br i1 %done, label %exit, label %body
>>>>
>>>> body:
>>>>     fork label %task, label %latch
>>>>
>>>> task:
>>>>     %aptr = getelementptr i32, i32* %A, i32 %i
>>>>     %cptr = getelementptr i32, i32* %C, i32 %i
>>>>     %cval = load i32, i32* %cptr
>>>>     store i32 %cval, i32* %aptr
>>>>     halt label %latch
>>>>
>>>> latch:
>>>>     %inc = add i32 %i, 1
>>>>     br label %header
>>>>
>>>> exit:
>>>>     join label %afterloop
>>>>
>>>> afterloop:
>>>> ...
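>>>>
>>>> For reference, the sequentialization pass mentioned above ([2])
>>>> conceptually only has to rewrite the three new terminators into plain
>>>> branches. A sketch for the example (our reading of the lowering, not
>>>> verbatim pass output):
>>>>
>>>> body:
>>>>     br label %task        ; fork: run the forked task first
>>>>
>>>> task:
>>>>     ; ... loop body as above ...
>>>>     br label %latch       ; halt: fall through to the continuation
>>>>
>>>> exit:
>>>>     br label %afterloop   ; join: no outstanding tasks remain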
>>>>
>>>>
>>>>
>>>> (2) Reasoning:
>>>> The proposed approach is crafted such that the semantics of the
>>>> parallel program are represented correctly in almost native,
>>>> low-level IR right after the front end and preserved at every point
>>>> until the final lowering to sequential IR or parallel runtime library
>>>> calls. To this end, asymmetric parallelism is employed, a concept
>>>> that uses control flow and dominance to
>>>> represent parts of the parallel semantics. In this model the
>>>> parallel tasks do not dominate each other and only one parallel task
>>>> dominates the code after the parallel region. As a consequence,
>>>> various transformations that would break assumptions we make about
>>>> parallel regions cannot happen (see [3,4]). While the explicitly
>>>> modeled control flow together with dominance prevents various code
>>>> motion problems, the use of terminators helps to minimize the
>>>> changes needed to educate passes about parallel regions. Only a fraction of analysis and transformation passes deal with terminators explicitly.
>>>> Most passes either test for known terminators (like branches), rely
>>>> on dominance information, or work on a basic block level. To even
>>>> further reduce changes to the existing passes, high-level concepts
>>>> are broken down to already available low-level concepts instead of
>>>> introducing new, semantically rich instructions/intrinsics (see the
>>>> last paragraph of [5] and section 4 in the PIR white paper [6] for
>>>> examples). Finally, this scheme allows a pass to simply reason about
>>>> the sequential semantics of a parallel region, transform the region
>>>> back to sequential code if needed or deemed beneficial, and employ
>>>> existing tooling to debug and analyze the code [7].
>>>>
>>>> (3) Comparison:
>>>> The BoF discussion sheet [8] and the recent “[RFC] on IR-level
>>>> region annotations” [9] both list pros and cons of different
>>>> proposed schemes and implementations. We summarize and comment on the
>>>> discussion of the ones listed in the recent RFC here:
>>>>     (a) Metadata: It seems a consensus has been reached that metadata is
>>>>           not the solution but only a means to enhance a different solution.
>>>>     (c) One Intrinsic per directive/clause: This approach basically embeds
>>>>           a high-level (parallel) language in LLVM IR using intrinsics. It
>>>>           seems there is little to no support for this approach at the
>>>>           moment.
>>>>     (d) Parallel loop/region annotations: Here, intrinsics enclosing a
>>>>           parallel loop/region are used to represent parallelism.
>>>>           High-level knowledge is represented as attached metadata or in
>>>>           separate intrinsics. For more details please see the original
>>>>           RFC [9]. In the discussion several potential drawbacks have been
>>>>           mentioned:
>>>>           - The annotations might be too general [10].
>>>>           - The IR is not semantically correct (or ready for optimization)
>>>>             after the front-end and needs an additional “prepare phase for
>>>>             pre-privatization" [11].
>>>>           - The currently available “potential side effects for intrinsic
>>>>             calls” seem not to suffice for the proposed intrinsics, as
>>>>             they do not have "call semantics" [12].
>>>>     (b) Parallel instructions (this approach): The table in the region RFC
>>>>           [9] lists two drawbacks with this approach, both of which have
>>>>           already been called into question [5]. The first drawback is the
>>>>           effort needed to implement this scheme which is discussed in
>>>>           more detail in section (4) of this mail. The second drawback is
>>>>           the need for additional representation of high-level information
>>>>           that is not part of the semantics of the new fork-join
>>>>           instructions. As mentioned above, the choice to keep the new
>>>>           instructions as simple as possible is deliberate. This parallel
>>>>           IR is intended to be extensible, and in particular, compatible
>>>>           with representations of high-level parallel concepts that might
>>>>           be developed in the future. For the time being, the parallel IR
>>>>           is compatible with the approach taken today of lowering
>>>>           high-level parallel language features, such as reductions and
>>>>           private memory, to existing IR constructs, such as
>>>>           parallel-runtime calls, atomicrmw instructions, and well-placed
>>>>           allocas [5,6] (a short sketch follows below). Although
>>>>           other extensions to the IR might allow LLVM to compile these
>>>>           higher-level constructs more effectively, we see no reason the
>>>>           parallel IR would conflict with any such extensions. (On the
>>>>           contrary, the parallel IR would seem to help compiler analyses
>>>>           of higher-level parallel constructs by exposing logical
>>>>           parallelism.)
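>>>>
>>>>           To make the last point concrete, here is a minimal sketch of
>>>>           how a sum reduction inside a forked task could be expressed
>>>>           with atomicrmw today (the shared %sum and the surrounding
>>>>           blocks are illustrative assumptions, not a prescribed
>>>>           lowering):
>>>>
>>>>           task:
>>>>               %cptr = getelementptr i32, i32* %C, i32 %i
>>>>               %cval = load i32, i32* %cptr
>>>>               ; update the shared reduction variable atomically
>>>>               %old = atomicrmw add i32* %sum, i32 %cval seq_cst
>>>>               halt label %latch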
>>>>
>>>>
>>>> (4) Feasibility and Impact:
>>>> The Tapir and PIR prototypes demonstrate the feasibility of this
>>>> approach. The Tapir prototype [13] has recently proven its
>>>> robustness as the standard compiler in the MIT class on parallel
>>>> programming. It was implemented in ~5k LOC. Of these, >1k implement
>>>> explicit parallel optimizations, ~1k add the new instructions (thus
>>>> mechanical changes), and ~2k lower the parallelism (needed for
>>>> essentially any scheme). Only the remainder (~1k LOC) is required to
>>>> make it work with existing analysis and transformation passes. While
>>>> Tapir added explicit optimization
>>>> passes for parallel regions/loops, the representation allows for a
>>>> variety of classic optimizations (CSE, GVN, LICM, loop unrolling,
>>>> TRE) to work with little to no modifications. Potential speedups
>>>> compared to classic “early-outlining” approaches can also be seen in
>>>> the Tapir paper [13].
>>>> For the PIR prototype [14] we modified only three transformation
>>>> passes
>>>> (<20 LOC) [15] before we could run the O3 pipeline successfully on a
>>>> parallel matrix multiplication.
>>>>
>>>> Together, these prototypes show how few passes actually inspect
>>>> new (or “unknown”) terminators. The default assumption passes have
>>>> to make, namely that control might be transferred to any successor
>>>> at runtime, has, in terms of potential compiler transformations, an
>>>> effect similar to the parallel semantics we want to model, namely
>>>> that control is transferred to all successors.
>>>>
>>>>
>>>> (5) Outlook:
>>>> This first stage will only introduce and test the new instructions
>>>> and the sequentialization pass. Afterwards we intend to start
>>>> additions in different, partially overlapping but often orthogonal
>>>> directions. We do welcome comments as well as developers for each of them:
>>>>    - Analysis and optimization:
>>>>       * A “parallel region info” pass to keep track of parallel regions
>>>>         and their nesting. The information can be made accessible in a
>>>>         “parallel region tree” similar to the loop tree maintained by the
>>>>         loop info pass. [stage 1, immediate next goal]
>>>>       * An extension of the verifier that allows checking parallel IR
>>>>         for “well-formedness”. [stage 1, immediate next goal]
>>>>       * Documentation of the PIR instructions in the language reference.
>>>>         [stage 1, immediate next goal]
>>>>       * A cost analysis for parallel tasks that can be queried by
>>>>         optimizations. The cost model needs to take the hardware, the
>>>>         runtime library and the parallel tasks into account.
>>>>       * Vectorizer enhancements to enable the vectorization of parallel
>>>>         loops and tasks.
>>>>       * Parallelization centric optimizations:
>>>>         a) Parallel tasks can be balanced, merged or split as well as created
>>>>            from and lowered to sequential code.
>>>>         b) Barriers can be eliminated.
>>>>         c) Parallel loops can be statically scheduled or created from
>>>>            parallel recursive calls [13].
>>>>       * Analysis to extract high-level information (reductions, private
>>>>         memory, ...) from the low-level representation.
>>>>    - Front-end:
>>>>       * Lowering of simple OpenMP and Cilk++ annotations to PIR, including
>>>>         parallel sections and parallel loops with limited support for
>>>>         clauses (at first) (examples can be found here [1]). [milestone 1]
>>>>       * Generation of PIR code through automatic parallelization. A
>>>>         patched version of Polly exists that emits parallel loops using PIR instead of
>>>>         OpenMP runtime calls or llvm.parallel.loop metadata. [milestone 1]
>>>>       * Representation of more evolved high-level features like assignment
>>>>         of computation units.
>>>>    - Back-end:
>>>>       * Lowering of PIR regions to calls to the OpenMP (GOMP) and Cilk++
>>>>         runtime libraries (see the sketch after this list). [milestone 1]
>>>>       * A simple parallel library, e.g., based on pthreads, to be shipped
>>>>         with LLVM as a fallback implementation for parallel regions.
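>>>>
>>>>    As a rough illustration of the first back-end item above, a parallel
>>>>    region could be outlined and handed to the classic libgomp entry
>>>>    points roughly as follows. The outlined function and the argument
>>>>    packing are assumptions made for this sketch; only
>>>>    GOMP_parallel_start/GOMP_parallel_end are existing runtime API:
>>>>
>>>>    declare void @GOMP_parallel_start(void (i8*)*, i8*, i32)
>>>>    declare void @GOMP_parallel_end()
>>>>
>>>>    ; hypothetical function produced by outlining the forked task
>>>>    define internal void @region.outlined(i8* %args) {
>>>>      ; ... former forked-task body, live-ins unpacked from %args ...
>>>>      ret void
>>>>    }
>>>>
>>>>    ; hypothetical caller, replacing the former fork/join pair
>>>>    ; (%args packs the live-in values; i32 0 requests the default team size)
>>>>    define void @parallel.region(i8* %args) {
>>>>      call void @GOMP_parallel_start(void (i8*)* @region.outlined, i8* %args, i32 0)
>>>>      call void @region.outlined(i8* %args)  ; the spawning thread joins the work
>>>>      call void @GOMP_parallel_end()         ; former join: team barrier/teardown
>>>>      ret void
>>>>    }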
>>>>
>>>>
>>>> Thank you all for your time and hopefully constructive input on this proposal!
>>>>
>>>> Cheers,
>>>>     Johannes, on behalf of the PIR team
>>>>
>>>>
>>>> Disclaimer:
>>>> This RFC, the patches, the wiki, etc. are a joint effort by Tao B.
>>>> Schardl (MIT), Charles E. Leiserson (MIT), Kareem Ergawy (Saarland
>>>> University), Simon Moll (Saarland University) and myself. However,
>>>> ideas and feedback came from many people, including the members of
>>>> the LLVM-HPC IR Extensions working group (Hal Finkel, Xinmin Tian,
>>>> ...), the participants in the BoF at the US Developers’ meeting,
>>>> everybody that commented on the BoF discussion sheet [16] and the
>>>> recent RFC on IR-level region annotations [9] (Mehdi Amini, Sanjoy
>>>> Das, Daniel Berlin, ...).
>>>>    
>>>>
>>>>
>>>> [0] https://github.com/Parallel-IR/llvm-pir/wiki
>>>> [1] https://reviews.llvm.org/D29250
>>>> [2] https://reviews.llvm.org/D29251
>>>> [3] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109302.html
>>>> [4] http://lists.llvm.org/pipermail/llvm-dev/2015-March/083348.html
>>>> [5] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109264.html
>>>> [6] http://compilers.cs.uni-saarland.de/people/doerfert/parallelcfg.pdf
>>>> [7] http://supertech.csail.mit.edu/papers/spbags.pdf &
>>>>     www.cse.wustl.edu/~angelee/papers/cilkprof.pdf
>>>> [8] https://goo.gl/Blp2Xr
>>>> [9] http://lists.llvm.org/pipermail/llvm-dev/2017-January/108906.html
>>>> [10] http://lists.llvm.org/pipermail/llvm-dev/2017-January/108997.html
>>>> [11] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109377.html
>>>> [12] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109351.html
>>>> [13] http://wsmoses.com/tapir.pdf
>>>> [14] https://github.com/jdoerfert/llvm-pir/tree/feature/fork-join
>>>> [15] https://github.com/jdoerfert/llvm-pir/commit/854259881d24d71f9f1f17e52547758c7be0618a
>>>> [16] https://goo.gl/wKps3c
>>>>
>>>>
>>> --
>>>
>>> Johannes Doerfert
>>> Researcher / PhD Student
>>>
>>> Compiler Design Lab (Prof. Hack)
>>> Saarland Informatics Campus, Germany
>>> Building E1.3, Room 4.31
>>>
>>> Tel. +49 (0)681 302-57521 : doerfert at cs.uni-saarland.de
>>> Fax. +49 (0)681 302-3065  : http://www.cdl.uni-saarland.de/people/doerfert
>>

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


