<div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div dir="ltr"><div>Hey LLVM Dev,</div><div><br></div><div>We in the LLVMPar working group have been working hard on this RFC, which we hope to discuss on the list and during the upcoming dev meeting.  It's long, and the associated Google doc is even longer, but we hope you'll take a look and let us know your thoughts and feedback.</div><div><br></div><div>Cheers,</div><div>The LLVMPar working group</div><div><br></div><div>--</div><div><br></div><div><div>    RFC: An Extension Mechanism for Parallel Compilers Based on LLVM</div><div><br></div><div>              Vikram Adve, Hal Finkel, Maria Kotsifakou,</div><div>            Tao (TB) Schardl, George Stelle and Xinmin Tian</div><div><br></div><div>INTRODUCTION</div><div><br></div><div>This RFC proposes a lightweight extension mechanism (three new intrinsics)</div><div>to enable full-fledged parallel compilers to be built "on top of" LLVM IR</div><div>and be able to invoke LLVM analyses, optimizations, and back-end code</div><div>generators, while minimizing (though not entirely avoiding) the need to make</div><div>changes to the existing LLVM infrastructure. </div><div><br></div><div>The context for this RFC is that there is no high-quality, open-source</div><div>parallel compiler infrastructure that is comparable to LLVM in</div><div>retargetability, robustness and community support.  LLVM IR itself has only</div><div>limited support for expressing parallelism, mainly consisting of vector</div><div>instructions and concurrency mechanisms (atomics, fences, and a memory</div><div>model) that support multithreaded programs. Today's parallel systems and</div><div>languages are extremely diverse: parallel systems include multicore CPUs,</div><div>SIMD vector units, GPUs, FPGAs, and a variety of programmable </div><div>domain-specific accelerators, while parallel languages span a wide range of</div><div>general-purpose choices -- e.g,. OpenMP, OpenACC, OpenCL, CUDA, Cilk -- and</div><div>domain-specific choices -- e.g., TensorFlow, MXNet, Halide. This diversity</div><div>makes it challenging to extend LLVM directly into a retargetable parallel IR.</div><div><br></div><div>Instead, the extension mechanism proposed in this RFC provides the</div><div>foundation to build parallel compilers as a separate extension of the LLVM</div><div>IR. The section Design Goals describes five high-level design goals that any</div><div>such extension mechanism must satisfy.  The section Correctness Properties</div><div>describes several key parallel correctness properties that must be</div><div>enforceable in the face of LLVM's existing analyses and transformations.</div><div>The Proposed Extension Mechanism section describes a simple extension</div><div>mechanism, based on three intrinsic functions plus the use of operand</div><div>bundles, for demarcating code regions and locations in a program where a</div><div>parallel IR would "hook" into the LLVM IR. The Soundness section argues why</div><div>the extension mechanism preserves the correctness properties described</div><div>previously.  Finally the Required Changes to LLVM section summarizes the</div><div>changes to existing LLVM passes that will be needed to ensure these</div><div>soundness properties are enforced.</div><div><br></div><div>An online Google doc (<a href="https://bit.ly/LLVMParRFC">https://bit.ly/LLVMParRFC</a>) containing this RFC</div><div>includes an additional Appendix (too long to post here) that describes</div><div>how three different parallel compilers -- an OpenMP compiler [3],</div><div>Tapir [4] and HPVM [6] -- can be expressed using this extension</div><div>mechanism. Intel's OpenMP compiler uses this approach for a</div><div>full-fledged implementation of OpenMP 4.5.</div><div><br></div><div>We plan to discuss this RFC at the upcoming US LLVM Developers'</div><div>Meeting during the BoF, "Ideal versus Reality: Optimal Parallelism and</div><div>Offloading Support in LLVM."</div><div><br></div><div>DESIGN GOALS</div><div><br></div><div>The LLVM extension proposed here enables a separate parallel IR (or parallel</div><div>compiler) to be defined that "builds on" the existing LLVM IR by using LLVM</div><div>IR for sequential and optionally vector operations, while using separate new</div><div>constructs for expressing parallelism.  We outline five high-level goals in</div><div>extending LLVM to support parallelism in this manner:</div><div><br></div><div>1. The extension should be generally useful for handling modern parallel</div><div>   languages and parallel hardware.</div><div><br></div><div>2. The extension should have a minimal impact on existing analyses,</div><div>    transformations, code generators, and mainline client tools like</div><div>    front-ends, linkers, sanitizers and debuggers.</div><div><br></div><div>3. The extension should correctly express any restrictions required to</div><div>    preserve parallel semantics (of some parallel IR) in the face of LLVM's</div><div>    analyses and code transformations.</div><div><br></div><div>4. The extension should not inhibit effective compiler analysis and</div><div>    transformations of LLVM code fragments used by parallel, vector and</div><div>    offloading code.</div><div><br></div><div>5. The extension should support effective and efficient debugging of</div><div>    analyses and code transformations applied to parallel codes.</div><div><br></div><div>Although this document focuses on parallel compilers built on top of the</div><div>proposed extension mechanism, the mechanism itself is not limited to</div><div>parallelism: it could be used for introducing other kinds of information</div><div>into LLVM IR, e.g., for prefetching, vectorization, etc. (In fact, for this</div><div>reason, we considered and decided not to use "llvm.par." as the prefix for</div><div>the intrinsics.)</div><div><br></div><div><br></div><div>CORRECTNESS PROPERTIES</div><div><br></div><div>Code using the extension mechanism essentially needs to behave like ordinary</div><div>LLVM, while allowing specific parallel IRs layered on LLVM to be expressed</div><div>correctly.  This section informally lists and explains the requirements the</div><div>extension mechanism must satisfy in order to achieve its design goals.  The</div><div>sections Soundness and Required Changes to LLVM describe how the proposed</div><div>extension mechanism and associated changes to LLVM can support these</div><div>properties.</div><div><br></div><div>We use the following terminology:</div><div><br></div><div>o "Parallel code" : LLVM code using the extension mechanism.</div><div><br></div><div>o "Parallel IR" : An IR that uses the extension mechanism to embed new</div><div>  (presumably parallel) IR constructs and semantics into LLVM code.</div><div><br></div><div>o "Region" : A single-entry, single-exit section of LLVM code.</div><div><br></div><div>o "Marker" : An operation used by the extension mechanism to embed operations</div><div>  with opaque external semantics into LLVM IR.  A marker may be used at the</div><div>  entry or exit of a region, or to embed an operation within a region.</div><div><br></div><div>A KEY ASSUMPTION UNDERLYING THIS RFC is that new parallel IRs defined with</div><div>this mechanism focus on parallel speedup but not on arbitrary concurrency,</div><div>i.e., that it is correct to execute any parallel program sequentially using</div><div>a single thread.  This is generally true when the primary goal of</div><div>parallelism is to improve performance, but may not be true when the program</div><div>must enable "simultaneous" execution of multiple communicating events, e.g.,</div><div>many multi-threaded interactive programs and programs using threads for</div><div>asynchronous I/O.  Both parallel and concurrent programs can be correctly</div><div>compiled by the existing LLVM infrastructure (through support for</div><div>synchronization, atomics, and a concurrent memory model), but the</div><div>infrastructure lacks any information about parallel semantics of the code</div><div>being compiled.  The goal of this RFC is to enable better parallel program</div><div>optimizations, which are not well supported in the current infrastructure</div><div>because they require being aware of and reasoning about parallel semantics.</div><div>Examples of such analyses and optimizations include "may-happen-in-parallel"</div><div>analysis, identifying redundant critical sections, eliding barriers, or</div><div>transforming loop nests to systematically improve outer-loop parallelization</div><div>or inner-loop vectorization or both.</div><div><br></div><div>The extension mechanism defined below uses "regions" to mark parallel tasks,</div><div>which means that a number of restrictions are required on LLVM code</div><div>transformations that move code across region boundaries.  These restrictions</div><div>must be enforceable within the mainline LLVM infrastructure, with no (or</div><div>very few) modifications to standard LLVM transformation passes, e.g., by</div><div>requiring the parallel IR to add pointer arguments to prevent code motion of</div><div>aliasing loads and stores.</div><div><br></div><div>We categorize the restrictions that the mechanism must enforce into (a)</div><div>structural properties, (b) control flow properties, (c) dataflow properties,</div><div>and (d) synchronization and memory model properties.</div><div><br></div><div>(A) Structural properties:</div><div><br></div><div>   1. The extension mechanism must be legal LLVM, i.e., must use standard</div><div>      LLVM operations (e.g., intrinsics, values, instructions, operand</div><div>      bundles, etc.)  that are recognized by the LLVM infrastructure.</div><div>   </div><div>   2. Code containing the extension mechanism must be correctly compiled by</div><div>      LLVM passes and supported LLVM back-ends, preferably with minimal</div><div>      changes to the existing infrastructure (although some initial changes</div><div>      are unavoidable).</div><div>   </div><div>(B) Control flow properties:</div><div><br></div><div>   1. The mechanism marks the beginning and end of regions, which are</div><div>      single-entry single-exit sections of LLVM code.</div><div>   </div><div>   2. If a region may be executed multiple times, e.g., for a parallel loop,</div><div>      this must be explicit in the LLVM IR.</div><div>   </div><div>   3. It must be POSSIBLE to prevent code motion of memory</div><div>      and synchronization operations across any of the three intrinsic..</div><div>   </div><div>   4. Critical edges due to conditional branches fed by the</div><div>      llvm.directive.marker() intrinsic should not be split.</div><div><br></div><div>(C) Dataflow and data dependence properties:</div><div><br></div><div>   1. SSA values defined in a region cannot be used outside the region.</div><div>   </div><div>   2. It must be POSSIBLE to restrict memory operations -- including loads,</div><div>      stores, and allocations -- from being moved across any of the three</div><div>      intrinsics by LLVM passes. In particular, a parallel IR should</div><div>      be able to choose whether or not to enforce this restriction for any</div><div>      particular memory operation at any particular region boundary.</div><div>   </div><div>      Rationale: The restriction -- and the freedom to relax it -- may both</div><div>      be important for several reasons, e.g.,</div><div>         * To avoid moving alloca's outside of parallel tasks.</div><div>         * To avoid introducing new loop-carried dependencies.</div><div>         * To avoid introducing global memory accesses in parallel without</div><div>           proper synchronization.</div><div>         * Some memory optimizations may be important for performance and</div><div>           should not be inhibited, e.g., hoisting a load of a</div><div>           loop-invariant value out of a parallel loop.</div><div>   </div><div>   3. It must be POSSIBLE to prevent new (SSA) pointer variables --</div><div>      including allocations, function arguments, return values, loads, GEPs,</div><div>      and casts to pointers -- from being introduced into a function</div><div>      containing an extension operation, unless those pointer variables are</div><div>      added as arguments to the appropriate extension operations.</div><div><br></div><div>(D) Synchronization and memory model properties:</div><div><br></div><div>    1. LLVM synchronization and concurrency constructs, e.g., load atomic,</div><div>       store atomic, atomicrmw, cmpxchg, and fence, can be used within</div><div>       parallel regions.</div><div><br></div><div>    2. It must be POSSIBLE to restrict synchronization operations --</div><div>       including atomics, fences, and volatile memory accesses -- from being</div><div>       moved across any of the three intrinsics by LLVM passes.</div><div><br></div><div><br></div><div>PROPOSED EXTENSION MECHANISM</div><div><br></div><div>Our proposed extension mechanism makes use of intrinsics and tokens,</div><div>OperandBundles, and two translators that parallel compilers can define for</div><div>their specific needs and workflows.  This infrastructure is flexible enough</div><div>to support representation of explicit parallel IRs, and also of source-level</div><div>directives for parallelization, vectorization, and offloading of both loops</div><div>and more-general code regions.</div><div><br></div><div>We avoid using LLVM metadata to express the extension mechanisms.  Instead,</div><div>based on the work done in [1][2][3][4] and on feedback from the LLVM</div><div>development community, we leverage the LLVM token and OperandBundle</div><div>constructs together with three new LLVM intrinsic functions.</div><div><br></div><div>The updated LLVM IR proposal is summarized below.</div><div><br></div><div>-------LLVM Intrinsic Functions-------</div><div><br></div><div>Essentially, the LLVM OperandBundles, the LLVM token type, and three new</div><div>LLVM directive intrinsics form the foundation of the proposed extension</div><div>mechanism.</div><div><br></div><div>The three newly introduced LLVM intrinsic functions are the following:</div><div><br></div><div>    token @llvm.directive.region.entry()[]</div><div>    token @llvm.directive.region.entry()[]</div><div>    i1 @llvm.directive.marker()[]</div><div><br></div><div>More concretely, these intrinsics are defined using the following declarations:</div><div><br></div><div>    // Directive and Qualifier Intrinsic Functions</div><div>    def int_directive_region_entry : Intrinsic<[llvm_token_ty],[], []>;</div><div>    def int_directive_region_exit : Intrinsic<[], [llvm_token_ty], []>;</div><div>    def int_directive_marker : Intrinsic <[llvm_i1_ty], [], []>;</div><div><br></div><div>As described in Section SOUNDNESS, several correctness properties are</div><div>maintained using OperandBundles on calls to these intrinsics.  In</div><div>LLVM, an OperandBundle has a tag name (a string to identify the</div><div>bundle) and an operand list consisting of zero or more operands. For</div><div>example, here are two OperandBundles:</div><div><br></div><div>    "TagName01"(i32 *%x, f32 *%y, 7)</div><div>    "AnotherTagName"()</div><div><br></div><div>The tag name of the first bundle is "TagName01", and it has an operand list</div><div>consisting of three operands, %x, %y, and 7. The second bundle has a tag</div><div>name "AnotherTagName" but no operands (it has an empty operand list).</div><div><br></div><div>The above new intrinsics allow:</div><div>* Annotating a code region marked with directives / pragmas / explicit</div><div>  parallel function calls.</div><div>* Annotating values associated with the region (or loops), that is, those</div><div>  values associated with directives / pragmas.</div><div>* Providing information on LLVM IR transformations needed for the annotated</div><div>  code regions (or loops).</div><div>* Introducing parallel IR constructs for (one of) a variety of different</div><div>  parallel IRs, e.g., Tapir or HPVM.</div><div>* Most LLVM scalar and vector analyses and optimizations to be applied to</div><div>  parallel code without modifications to the passes, and without requiring</div><div>  parallel "tasks" to be outlined into separate, isolated functions.</div><div><br></div><div>These intrinsics can be used both by frontends and also by transformation</div><div>passes (e.g. automated parallelization).</div><div><br></div><div>The names used here are open to discussion.</div><div><br></div><div>--------Three Example Uses---------</div><div><br></div><div>Below, we show three very brief examples using three IRs: OpenMP [5],</div><div>Tapir [4] and HPVM [6].  Somewhat larger code examples are shown in the</div><div>Appendix of the accompanying Google Doc.</div><div><br></div><div><br></div><div>----Tapir IR----</div><div><br></div><div>; This simple Tapir loop uniformly scales each element of a vector of</div><div>; integers in parallel.</div><div><a href="http://pfor.detach.lr.ph">pfor.detach.lr.ph</a>:</div><div> %wide.trip.count = zext i32 %n to i64</div><div> br label %pfor.detach</div><div><br></div><div>pfor.detach:                          ; preds = %pfor.inc, %<a href="http://pfor.detach.lr.ph">pfor.detach.lr.ph</a></div><div> %indvars.iv = phi i64 [ 0, %<a href="http://pfor.detach.lr.ph">pfor.detach.lr.ph</a> ], [ %indvars.iv.next, %pfor.inc ]</div><div> detach label %pfor.body, label %pfor.inc</div><div><br></div><div>pfor.body:                            ; preds = %pfor.detach</div><div> %arrayidx = getelementptr inbounds i32, i32* %x, i64 %indvars.iv</div><div> %0 = load i32, i32* %arrayidx, align 4</div><div> %mul3 = mul nsw i32 %0, %a</div><div> store i32 %mul3, i32* %arrayidx, align 4</div><div> reattach label %pfor.inc</div><div> </div><div>pfor.inc:                             ; preds = %pfor.body, %pfor.detach</div><div> %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1</div><div> %exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count</div><div> br i1 %exitcond, label %pfor.cond.cleanup, label %pfor.detach</div><div><br></div><div>pfor.cond.cleanup:                    ; preds = %pfor.inc</div><div> sync label %sync.continue</div><div><br></div><div>---Tapir using LLVMPar intrinsics-----</div><div><br></div><div>; This simple parallel loop uniformly scales each element of a vector of</div><div>; integers.</div><div><a href="http://pfor.detach.lr.ph">pfor.detach.lr.ph</a>:                    ; preds = %entry</div><div>  %wide.trip.count = zext i32 %n to i64</div><div>  br label %pfor.detach</div><div><br></div><div>pfor.detach:                          ; preds = %pfor.inc, %<a href="http://pfor.detach.lr.ph">pfor.detach.lr.ph</a></div><div>  %indvars.iv = phi i64 [ 0, %<a href="http://pfor.detach.lr.ph">pfor.detach.lr.ph</a> ], [ %indvars.iv.next, %pfor.inc ]</div><div>  %c = call i1 @llvm.directive.marker()["detach_task"]</div><div>  br i1 %c, label %pfor.body, label %pfor.inc</div><div><br></div><div>pfor.body:                            ; preds = %pfor.detach</div><div>  %arrayidx = getelementptr inbounds i32, i32* %x, i64 %indvars.iv</div><div>  %0 = load i32, i32* %arrayidx, align 4</div><div>  %mul3 = mul nsw i32 %0, %a</div><div>  store i32 %mul3, i32* %arrayidx, align 4</div><div>  call i1 @llvm.directive.marker()["reattach_task"]</div><div>  br label %pfor.inc</div><div> </div><div>pfor.inc:                             ; preds = %pfor.body, %pfor.detach</div><div>  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1</div><div>  %exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count</div><div>  br i1 %exitcond, label %pfor.cond.cleanup, label %pfor.detach</div><div><br></div><div>pfor.cond.cleanup:                    ; preds = %pfor.inc, %entry</div><div>  call i1 @llvm.directive.marker()["local_barrier"]</div><div>  br label %sync.continue  </div><div><br></div><div><br></div><div>Comment: If necessary, one can prevent hoisting of the getelementptr</div><div>instruction %arrayidx or the load instruction %0 in the above example</div><div>using the intrinsics @llvm.directive.region.entry,</div><div>@llvm.directive.region.exit, and @llvm.launder.invariant.group</div><div>intrinsics appropriately within pfor.body.</div><div><br></div><div><br></div><div>----HPVM----</div><div><br></div><div>; The function vector_add() performs point to point addition of incoming </div><div>; arguments, A and B, replicated at run-time across N parallel instances.</div><div>; We omit dataflow edges showing incoming/outgoing values.</div><div>; </div><div>     %node = call i8* @llvm.hpvm.createNode1D(</div><div>     <span style="white-space:pre">  </span>i8* bitcast %retStruct (i32*, i32, i32*, i32, i32*, i32) @vector_add</div><div>     <span style="white-space:pre">  </span> <span style="white-space:pre">    </span>    to i8*,</div><div>     <span style="white-space:pre"> </span>i32 %N)</div><div><br></div><div><br></div><div>----HPVM using LLVMPar intrinsics----</div><div><br></div><div>   ...           ; code using A, B, C, N</div><div>   ; The HPVM node function @vector_add is now inlined</div><div>   %region = call token @llvm.directive.region.entry()[</div><div>      "HPVM_create_node"(%N), </div><div>      "dataflow_values"(i32* %A, i32 %bytesA, i32* %B, i32 %bytesB, </div><div>      <span style="white-space:pre">                     </span>i32* %C, i32 %bytesC),</div><div>      "attributes"(i32 0, i32 -1, i32 0, i32 -1, i32 1, i32 -1) ]</div><div>    ; 0 = ‘in', 1 = ‘out', 2 = ‘inout', -1 for non pointer arguments</div><div>  </div><div>; Loop structure corresponding to %N instances of vector_add()</div><div>%header: ...</div><div>    ; parallel loop with trip count %N, index variable %loop_index</div><div>    %loop_index = phi i64 [ 0, %preheader ], [ %loop_index.next, %latch ]    </div><div>    %c = call i1 @llvm.directive.marker()["detach_task"] </div><div>    br i1 %c, label %body, label %latch</div><div><br></div><div>%body:</div><div>    ; Loop index, instead of HPVM intrinsic calls to generate index</div><div>    %ptrA = getelementptr i32, i32* %A, i32 %loop_index</div><div>    %ptrB = getelementptr i32, i32* %B, i32 %loop_index</div><div>    %ptrC = getelementptr i32, i32* %C, i32 %loop_index</div><div><br></div><div>    %a = load i32, i32* %ptrA</div><div>    %b = load i32, i32* %ptrB</div><div>    %c = add i32, i32 %a, i32 %b   </div><div>    store i32 %c, i32* %ptrC </div><div><br></div><div>    %ignore = call i1 @llvm.directive.marker()["reattach_task"]</div><div>    br label %latch</div><div><br></div><div>%latch:   </div><div>    %loop_index.next = add nuw nsw i64 %loop_index, 1</div><div>    %exitcond = icmp eq i64 %loop_index.next, %N</div><div>    br i1 %exitcond, label %loop.end, label %header</div><div><br></div><div>%loop.end:</div><div>    call void @llvm.directive.region.exit(token %region)[</div><div>    <span style="white-space:pre">    </span> "HPVM_create_node"(), "dataflow_values" () ]</div><div><br></div><div>    ...        ; code using A, B, C, N</div><div><br></div><div>________________</div><div><br></div><div><br></div><div>PROPOSED WORKFLOWS</div><div><br></div><div>The proposed intrinsics can be used in a variety of compiler designs.  We</div><div>illustrate two possible designs, the first used by the Intel OpenMP compiler</div><div>and the second to be used by a planned port of Tapir+HPVM to these</div><div>intrinsics.  For each of these workflows, the extension mechanism supports<br>two translators:</div><div><br></div><div>PREPARE: This is used to produce LLVM IR containing the intrinsics. This</div><div>    must ensure that the initial generated IR with intrinsics satisfies the</div><div>    properties listed above.</div><div><br></div><div>EXTRACT: This is used to convert LLVM IR containing the intrinsics into</div><div>    some parallel IR, e.g., WRegion for Intel's OpenMP compiler, or Tapir+HPVM</div><div>    for the second compiler.</div><div><br></div><div>These two are referred to as translators, not passes, because their order is</div><div>not interchangeable with other LLVM passes.</div><div><br></div><div>The first workflow, sketched below, uses the front-end (e.g., Clang for</div><div>OpenMP) to generate LLVM IR containing the intrinsics directly. The PREPARE</div><div>translator runs at the end of the front-end, to ensure the required</div><div>properties.  Standard LLVM passes can operate on this IR, before converting</div><div>the EXTRACT translator generates the parallel IR.  In principle, this</div><div>parallel IR could later be converted back into LLVM IR with intrinsics, but</div><div>this may or may not be possible (e.g., the WRegion IR is not converted back</div><div>and it is considered difficult to do so).</div><div><br></div><div>         Front-end</div><div>            +               LLVM</div><div>         PREPARE            Passes             EXTRACT          Backend</div><div>Parallel -------> LLVM IR + ------> Optimized ------> Parallel ------> Target</div><div>source            extensions        LLVM IR +           IR             Code</div><div>program                             extensions </div><div>                      ^                                  |</div><div>                      |                                  |</div><div>                      |__________________________________|</div><div>                                PREPARE (optional)</div><div><br></div><div>                        Workflow 1: E.g., OpenMP Compiler</div><div><br></div><div><br></div><div>The second workflow first generates a parallel compiler's IR (e.g., Tapir or</div><div>HPVM) directly from some front-end.  The PREPARE phase converts the parallel</div><div>compiler IR into LLVM IR with the intrinsics, while ensuring that this</div><div>initial IR adheres to the required correctness properties.  Standard LLVM</div><div>passes can then be run on this IR. Eventually, the EXTRACT translator</div><div>converts the IR back to the parallel compiler IR. In this workflow, it</div><div>should be possible to convert back and forth between Parallel IR and</div><div>Extended LLVM IR multiple times using the PREPARE and EXTRACT phases each</div><div>time.</div><div>  </div><div>                                  LLVM</div><div>         FE       PREPARE         Passes          EXTRACT         Backend</div><div>Parallel --> Para- ---> LLVM IR + ----> Optimized -----> Parallel -----> Target</div><div>source       llel      extensions       LLVM IR +          IR            Code</div><div>program      IR                                         extensions </div><div>                           ^                                |</div><div>                           |                                |</div><div>                           |________________________________|</div><div>                                    PREPARE (optional)</div><div><br></div><div>                        Workflow 2: E.g., Tapir, HPVM Compilers</div><div><br></div><div><br></div><div>The key difference between these two workflows is that the first one may be</div><div>used to embed source-level information (e.g., OpenMP source directives)</div><div>directly within the LLVM IR, whereas in the second one, the Parallel IR</div><div>is embedded within LLVM IR rather than some source-level information.</div><div><br></div><div><br></div><div>SOUNDNESS</div><div><br></div><div>This section examines what facilities the proposed extension mechanism</div><div>provides to maintain the necessary correctness properties for parallel IR's.</div><div>The section examines each correctness property in turn.  We see that some</div><div>changes to mainline LLVM passes are needed to use these facilities and</div><div>maintain the correctness properties.  The section "Required Changes to LLVM"</div><div>describes what changes to LLVM passes are needed to use these facilities to</div><div>maintain the correctness properties.</div><div><br></div><div><br></div><div>(A) Structural Properties</div><div><br></div><div>    1. Follows directly from the fact that the extension mechanism is LLVM</div><div>       intrinsics.</div><div><br></div><div>    2. Given the correct use of the tools described in the rest of the</div><div>       section, follows directly from (i).</div><div><br></div><div>(B) Control Flow Properties</div><div><br></div><div>    1. The tokens returned by a region entry must be passed to the matching</div><div>       region exit, which enforces single-entry regions by ensuring that a</div><div>       region entry dominates all region exits.  If a particular parallel IR</div><div>       needs it, the PREPARE and EXTRACT translators are responsible for</div><div>       enforcing the single exit property.</div><div><br></div><div>    2. This is achieved using Tapir-style use of looping with each iteration</div><div>       spawning a parallel execution of the body. See the Tapir loop</div><div>       examples for an example of how this can be implemented.</div><div><br></div><div>    3. To prevent code motion of memory operations across region boundaries,</div><div>       one can use pointer arguments and the OperandBundles described in the</div><div>       previous section.  Only pointers to local allocations and internal</div><div>       globals need to be passed to relevant intrinsics.</div><div><br></div><div>    4. Would require changes to the SplitCriticalEdge utility function in</div><div>       LLVM.[b]</div><div><br></div><div>(C) Dataflow and data dependence properties</div><div><br></div><div>    1. There are two parts to ensuring this property. The first is that</div><div>       correct usage of the marker intrinsic gives the implementer control</div><div>       over dominator relations, allowing the implementer to prevent direct</div><div>       references of SSA values defined in a parallel region. Second, these</div><div>       values can still be referenced by phi nodes, and so changes must be</div><div>       made to LLVM to prevent such phi node creation.[b]</div><div><br></div><div>    2. Prevention of particular memory operations being moved across</div><div>       intrinsics is achieved by appropriate use of OperandBundles.</div><div><br></div><div>    3. Would likely require changes to LLVM, as described in section</div><div>       Required Changes to LLVM.[b]</div><div><br></div><div>(D) Synchronization and memory model properties</div><div><br></div><div>    1. Follows directly from A(i).</div><div><br></div><div>    2. Mechanism is identical to C(ii) for memory operations specific to</div><div>       particular locations, because most synchronization operations take a</div><div>       pointer argument.  For memory-wide operations (which don't take such</div><div>       an argument), such as fences, as long as any pointer is passed to an</div><div>       OperandBundle, LLVM will be prohibited from moving a fence across the</div><div>       intrinsic calls, as desired.</div><div><br></div><div>REQUIRED CHANGES TO LLVM</div><div><br></div><div>The soundness properties above rely almost entirely on standard correctness</div><div>requirements for compiler transformations, e.g., transformations, including</div><div>IPO passes, will not move memory or synchronization operations across calls</div><div>to opaque external functions that can access an aliasing object; opaque</div><div>external functions could access any memory object reachable from any</div><div>(non-internal) global pointer or argument pointer.</div><div><br></div><div>Some properties, however, require additional changes to a few LLVM</div><div>passes.  We present a high-level list of such changes known to us so</div><div>far.  This list is a work-in-progress, based on our experience</div><div>developing parallel compilers.  We invite feedback on the content of</div><div>this list.  Patches will be submitted to the list, after discussion on</div><div>this RFC.</div><div><br></div><div>* New allocations and new internal globals must add pointer arguments to</div><div>  intrinsics: To enforce properties B(iii) and C(iii), passes that introduce</div><div>  new local allocations and new internal globals will need to be modified to</div><div>  add pointer arguments to the relevant intrinsics.  There aren't too many</div><div>  such passes today: Inliner, Internalize, Reg2Mem, SROA.</div><div><br></div><div>   * Note: Some optional passes not used by current parallel compilers may</div><div>     be omitted, such as StatePointRewriting (if GC is not needed), the</div><div>     Sanitizers, etc.</div><div><br></div><div>* Inliner must not move allocas to function entry: When a callee that</div><div>  contains an alloca is inlined at a call site, the alloca must be kept at</div><div>  the call-site and not moved to function entry. (Technically, this is only</div><div>  necessary for call sites within parallel regions in the caller function,</div><div>  but we can change it for all call sites so that the inliner doesn't have</div><div>  to look for enclosing regions.)</div><div><br></div><div>   * Note: Because of this change, passes that aim to optimize allocas must</div><div>     be modified to look for these inlined allocas.</div><div><br></div><div>* PHI-node construction: Enforcing property C(i) requires</div><div>  modifications to how PHI nodes are constructed in LLVM.  From our</div><div>  experience, the bulk of the necessary changes lie in Mem2Reg pass</div><div>  and SSAUpdater.  But a couple passes, such as GVN, update PHI nodes</div><div>  themselves in simple ways, and some changes might be required to</div><div>  these passes.</div><div><br></div><div>* SplitCriticalEdge utility: To enforce B(iv), modifications are</div><div>  needed to prevent splitting critical edges due to branches</div><div>  conditioned on @llvm.directive.marker values.  This change can be</div><div>  contained to the SplitCriticalEdge utility function, which controls</div><div>  critical-edge splitting.</div><div><br></div><div><br></div><div>AUTHORS AND ACKNOWLEDGEMENTS</div><div><br></div><div>The initial design of the three intrinsics was developed by Xinmin Tian and</div><div>Hal Finkel, and prototyped in the Intel OpenMP compiler.</div><div><br></div><div>The discussions and writing for this document, including the correctness</div><div>requirements, soundness arguments and examples, were led by Vikram Adve, Hal</div><div>Finkel, Maria Kotsifakou, Tao (TB) Schardl, George Stelle and Xinmin Tian.</div><div><br></div><div>Joel Denny, Johannes Doerfert, Seyong Lee, William Moses, Hashim Sharif, Jeff</div><div>Vetter and Wael Yahia provided valuable input during the discussions and</div><div>significant comments on the RFC.</div><div><br></div><div><br></div><div>REFERENCES</div><div><br></div><div>1. H. Finkel and X. Tian "[llvm-dev] RFC: A Proposal for adding an</div><div>   experimental IR-level region-annotation infrastructure", Jan. 11, 2017.</div><div>   <a href="http://lists.llvm.org/pipermail/llvm-dev/2017-January/108906.html">http://lists.llvm.org/pipermail/llvm-dev/2017-January/108906.html</a>.</div><div><br></div><div>2. H. Saito, <a href="http://et.al">et.al</a>., "Extending LoopVectorizer towards supporting OpenMP 4.5</div><div>   SIMD and outer loop auto-vectorization", LLVM Developer’s Conference,</div><div>   Nov. 2016</div><div><br></div><div>3. X. Tian, H. Saito, E. Su, A. Gaba, M. Masten, E. Garcia, A. Zaks, "LLVM</div><div>   Framework and IR Extensions for Parallelization, SIMD Vectorization and</div><div>   Offloading".  LLVM-HPC@SC 2016: 21-31.</div><div><br></div><div>4. T.B. Schardl, W.S. Moses, C.E. Leiserson, "Tapir: Embedding Fork-Join</div><div>   Parallelism into LLVM's Intermediate Representation", In PPoPP, 2017.</div><div>   <a href="https://doi.org/10.1145/3018743.3018758">https://doi.org/10.1145/3018743.3018758</a></div><div><br></div><div>5. Intel Corporation, "LLVM Intrinsic function and Tag name string interface</div><div>   specification for directive representation", January 22, 2018.</div><div><br></div><div>6. M. Kotsifakou, P. Srivastava, M.D. Sinclair, R. Komuravelli, V. Adve,</div><div>   S. Adve.  "HPVM: Heterogeneous Parallel Virtual Machine."  In PPoPP,</div><div>   2018. <a href="https://doi.org/10.1145/3178487.3178493">https://doi.org/10.1145/3178487.3178493</a></div><div><br></div><div><br></div><div>APPENDIX</div><div><br></div><div>See Appendix in the Google doc at <a href="https://bit.ly/LLVMParRFC">https://bit.ly/LLVMParRFC</a></div><div><br></div><div>This Appendix includes three substantial examples of using this RFC for</div><div>OpenMP, Tapir and HPVM compilers.</div></div><div><br></div></div></div></div></div></div>