[llvm-dev] RFC: An Extension Mechanism for Parallel Compilers Based on LLVM

TB Schardl via llvm-dev llvm-dev at lists.llvm.org
Mon Oct 15 21:44:39 PDT 2018


Hey LLVM Dev,

We in the LLVMPar working group have been working hard on this RFC, which
we hope to discuss on the list and during the upcoming dev meeting.  It's
long, and the associated Google doc is even longer, but we hope you'll take
a look and let us know your thoughts and feedback.

Cheers,
The LLVMPar working group

--

    RFC: An Extension Mechanism for Parallel Compilers Based on LLVM

              Vikram Adve, Hal Finkel, Maria Kotsifakou,
            Tao (TB) Schardl, George Stelle and Xinmin Tian

INTRODUCTION

This RFC proposes a lightweight extension mechanism (three new intrinsics)
to enable full-fledged parallel compilers to be built "on top of" LLVM IR
and be able to invoke LLVM analyses, optimizations, and back-end code
generators, while minimizing (though not entirely avoiding) the need to make
changes to the existing LLVM infrastructure.

The context for this RFC is that there is no high-quality, open-source
parallel compiler infrastructure that is comparable to LLVM in
retargetability, robustness and community support.  LLVM IR itself has only
limited support for expressing parallelism, mainly consisting of vector
instructions and concurrency mechanisms (atomics, fences, and a memory
model) that support multithreaded programs. Today's parallel systems and
languages are extremely diverse: parallel systems include multicore CPUs,
SIMD vector units, GPUs, FPGAs, and a variety of programmable
domain-specific accelerators, while parallel languages span a wide range of
general-purpose choices -- e.g., OpenMP, OpenACC, OpenCL, CUDA, Cilk -- and
domain-specific choices -- e.g., TensorFlow, MXNet, Halide. This diversity
makes it challenging to extend LLVM directly into a retargetable parallel
IR.

Instead, the extension mechanism proposed in this RFC provides the
foundation to build parallel compilers as a separate extension of the LLVM
IR. The section Design Goals describes five high-level design goals that any
such extension mechanism must satisfy.  The section Correctness Properties
describes several key parallel correctness properties that must be
enforceable in the face of LLVM's existing analyses and transformations.
The Proposed Extension Mechanism section describes a simple extension
mechanism, based on three intrinsic functions plus the use of operand
bundles, for demarcating code regions and locations in a program where a
parallel IR would "hook" into the LLVM IR. The Soundness section argues why
the extension mechanism preserves the correctness properties described
previously.  Finally the Required Changes to LLVM section summarizes the
changes to existing LLVM passes that will be needed to ensure these
soundness properties are enforced.

An online Google doc (https://bit.ly/LLVMParRFC) containing this RFC
includes an additional Appendix (too long to post here) that describes
how three different parallel compilers -- an OpenMP compiler [3],
Tapir [4] and HPVM [6] -- can be expressed using this extension
mechanism. Intel's OpenMP compiler uses this approach for a
full-fledged implementation of OpenMP 4.5.

We plan to discuss this RFC at the upcoming US LLVM Developers'
Meeting during the BoF, "Ideal versus Reality: Optimal Parallelism and
Offloading Support in LLVM."

DESIGN GOALS

The LLVM extension proposed here enables a separate parallel IR (or parallel
compiler) to be defined that "builds on" the existing LLVM IR by using LLVM
IR for sequential and optionally vector operations, while using separate new
constructs for expressing parallelism.  We outline five high-level goals in
extending LLVM to support parallelism in this manner:

1. The extension should be generally useful for handling modern parallel
   languages and parallel hardware.

2. The extension should have a minimal impact on existing analyses,
    transformations, code generators, and mainline client tools like
    front-ends, linkers, sanitizers and debuggers.

3. The extension should correctly express any restrictions required to
    preserve parallel semantics (of some parallel IR) in the face of LLVM's
    analyses and code transformations.

4. The extension should not inhibit effective compiler analysis and
    transformations of LLVM code fragments used by parallel, vector and
    offloading code.

5. The extension should support effective and efficient debugging of
    analyses and code transformations applied to parallel codes.

Although this document focuses on parallel compilers built on top of the
proposed extension mechanism, the mechanism itself is not limited to
parallelism: it could be used for introducing other kinds of information
into LLVM IR, e.g., for prefetching, vectorization, etc. (In fact, for this
reason, we considered but decided against using "llvm.par." as the prefix
for the intrinsics.)


CORRECTNESS PROPERTIES

Code using the extension mechanism essentially needs to behave like ordinary
LLVM, while allowing specific parallel IRs layered on LLVM to be expressed
correctly.  This section informally lists and explains the requirements the
extension mechanism must satisfy in order to achieve its design goals.  The
sections Soundness and Required Changes to LLVM describe how the proposed
extension mechanism and associated changes to LLVM can support these
properties.

We use the following terminology:

o "Parallel code" : LLVM code using the extension mechanism.

o "Parallel IR" : An IR that uses the extension mechanism to embed new
  (presumably parallel) IR constructs and semantics into LLVM code.

o "Region" : A single-entry, single-exit section of LLVM code.

o "Marker" : An operation used by the extension mechanism to embed
operations
  with opaque external semantics into LLVM IR.  A marker may be used at the
  entry or exit of a region, or to embed an operation within a region.

A KEY ASSUMPTION UNDERLYING THIS RFC is that new parallel IRs defined with
this mechanism focus on parallel speedup but not on arbitrary concurrency,
i.e., that it is correct to execute any parallel program sequentially using
a single thread.  This is generally true when the primary goal of
parallelism is to improve performance, but may not be true when the program
must enable "simultaneous" execution of multiple communicating events, e.g.,
many multi-threaded interactive programs and programs using threads for
asynchronous I/O.  Both parallel and concurrent programs can be correctly
compiled by the existing LLVM infrastructure (through support for
synchronization, atomics, and a concurrent memory model), but the
infrastructure lacks any information about parallel semantics of the code
being compiled.  The goal of this RFC is to enable better parallel program
optimizations, which are not well supported in the current infrastructure
because they require being aware of and reasoning about parallel semantics.
Examples of such analyses and optimizations include "may-happen-in-parallel"
analysis, identifying redundant critical sections, eliding barriers, or
transforming loop nests to systematically improve outer-loop parallelization
or inner-loop vectorization or both.

The extension mechanism defined below uses "regions" to mark parallel tasks,
which means that a number of restrictions are required on LLVM code
transformations that move code across region boundaries.  These restrictions
must be enforceable within the mainline LLVM infrastructure, with no (or
very few) modifications to standard LLVM transformation passes, e.g., by
requiring the parallel IR to add pointer arguments to prevent code motion of
aliasing loads and stores.
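
For illustration, the following minimal fragment sketches this
pointer-argument idiom; the bundle tag names ("task", "shared", "end.task")
are placeholders, not part of this proposal.  Because the region intrinsics
then appear to LLVM as calls that may access %p, existing correctness rules
for calls already prevent moving the store across the region boundary.

  %t = call token @llvm.directive.region.entry() [ "task"(), "shared"(i32* %p) ]
  store i32 42, i32* %p, align 4   ; cannot be hoisted or sunk past the intrinsics
  call void @llvm.directive.region.exit(token %t) [ "end.task"(i32* %p) ]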

We categorize the restrictions that the mechanism must enforce into (a)
structural properties, (b) control flow properties, (c) dataflow properties,
and (d) synchronization and memory model properties.

(A) Structural properties:

   1. The extension mechanism must be legal LLVM, i.e., must use standard
      LLVM operations (e.g., intrinsics, values, instructions, operand
      bundles, etc.)  that are recognized by the LLVM infrastructure.

   2. Code containing the extension mechanism must be correctly compiled by
      LLVM passes and supported LLVM back-ends, preferably with minimal
      changes to the existing infrastructure (although some initial changes
      are unavoidable).

(B) Control flow properties:

   1. The mechanism marks the beginning and end of regions, which are
      single-entry single-exit sections of LLVM code.

   2. If a region may be executed multiple times, e.g., for a parallel loop,
      this must be explicit in the LLVM IR.

   3. It must be POSSIBLE to prevent code motion of memory and
      synchronization operations across any of the three intrinsics.

   4. Critical edges due to conditional branches fed by the
      llvm.directive.marker() intrinsic should not be split.

(C) Dataflow and data dependence properties:

   1. SSA values defined in a region cannot be used outside the region.

   2. It must be POSSIBLE to restrict memory operations -- including loads,
      stores, and allocations -- from being moved across any of the three
      intrinsics by LLVM passes. In particular, a parallel IR should
      be able to choose whether or not to enforce this restriction for any
      particular memory operation at any particular region boundary.

      Rationale: The restriction -- and the freedom to relax it -- may both
      be important for several reasons, e.g.,
         * To avoid moving alloca's outside of parallel tasks.
         * To avoid introducing new loop-carried dependencies.
         * To avoid introducing global memory accesses in parallel without
           proper synchronization.
         * Some memory optimizations may be important for performance and
           should not be inhibited, e.g., hoisting a load of a
           loop-invariant value out of a parallel loop.

   3. It must be POSSIBLE to prevent new (SSA) pointer variables --
      including allocations, function arguments, return values, loads, GEPs,
      and casts to pointers -- from being introduced into a function
      containing an extension operation, unless those pointer variables are
      added as arguments to the appropriate extension operations.

(D) Synchronization and memory model properties:

    1. LLVM synchronization and concurrency constructs, e.g., load atomic,
       store atomic, atomicrmw, cmpxchg, and fence, can be used within
       parallel regions.

    2. It must be POSSIBLE to restrict synchronization operations --
       including atomics, fences, and volatile memory accesses -- from being
       moved across any of the three intrinsics by LLVM passes.


PROPOSED EXTENSION MECHANISM

Our proposed extension mechanism makes use of intrinsics and tokens,
OperandBundles, and two translators that parallel compilers can define for
their specific needs and workflows.  This infrastructure is flexible enough
to support representation of explicit parallel IRs, and also of source-level
directives for parallelization, vectorization, and offloading of both loops
and more-general code regions.

We avoid using LLVM metadata to express the extension mechanisms.  Instead,
based on the work done in [1][2][3][4] and on feedback from the LLVM
development community, we leverage the LLVM token and OperandBundle
constructs together with three new LLVM intrinsic functions.

The updated LLVM IR proposal is summarized below.

-------LLVM Intrinsic Functions-------

Essentially, the LLVM OperandBundles, the LLVM token type, and three new
LLVM directive intrinsics form the foundation of the proposed extension
mechanism.

The three newly introduced LLVM intrinsic functions are the following:

    token @llvm.directive.region.entry()[]
    void @llvm.directive.region.exit(token)[]
    i1 @llvm.directive.marker()[]

More concretely, these intrinsics are defined using the following
declarations:

    // Directive and Qualifier Intrinsic Functions
    def int_directive_region_entry : Intrinsic<[llvm_token_ty],[], []>;
    def int_directive_region_exit : Intrinsic<[], [llvm_token_ty], []>;
    def int_directive_marker : Intrinsic <[llvm_i1_ty], [], []>;

As described in Section SOUNDNESS, several correctness properties are
maintained using OperandBundles on calls to these intrinsics.  In
LLVM, an OperandBundle has a tag name (a string to identify the
bundle) and an operand list consisting of zero or more operands. For
example, here are two OperandBundles:

    "TagName01"(i32 *%x, f32 *%y, 7)
    "AnotherTagName"()

The tag name of the first bundle is "TagName01", and it has an operand list
consisting of three operands, %x, %y, and 7. The second bundle has a tag
name "AnotherTagName" but no operands (it has an empty operand list).

The above new intrinsics allow:
* Annotating a code region marked with directives / pragmas / explicit
  parallel function calls.
* Annotating values associated with the region (or loops), that is, those
  values associated with directives / pragmas.
* Providing information on LLVM IR transformations needed for the annotated
  code regions (or loops).
* Introducing parallel IR constructs for (one of) a variety of different
  parallel IRs, e.g., Tapir or HPVM.
* Applying most LLVM scalar and vector analyses and optimizations to
  parallel code without modifications to the passes, and without requiring
  parallel "tasks" to be outlined into separate, isolated functions.

These intrinsics can be used both by front-ends and by transformation
passes (e.g., automated parallelization).

The names used here are open to discussion.

--------Three Example Uses---------

Below, we show three very brief examples using three IRs: OpenMP [5],
Tapir [4] and HPVM [6].  Somewhat larger code examples are shown in the
Appendix of the accompanying Google Doc.


----Tapir IR----

; This simple Tapir loop uniformly scales each element of a vector of
; integers in parallel.
pfor.detach.lr.ph:
 %wide.trip.count = zext i32 %n to i64
 br label %pfor.detach

pfor.detach:                          ; preds = %pfor.inc, %pfor.detach.lr.ph
 %indvars.iv = phi i64 [ 0, %pfor.detach.lr.ph ], [ %indvars.iv.next, %pfor.inc ]
 detach label %pfor.body, label %pfor.inc

pfor.body:                            ; preds = %pfor.detach
 %arrayidx = getelementptr inbounds i32, i32* %x, i64 %indvars.iv
 %0 = load i32, i32* %arrayidx, align 4
 %mul3 = mul nsw i32 %0, %a
 store i32 %mul3, i32* %arrayidx, align 4
 reattach label %pfor.inc

pfor.inc:                             ; preds = %pfor.body, %pfor.detach
 %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
 %exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
 br i1 %exitcond, label %pfor.cond.cleanup, label %pfor.detach

pfor.cond.cleanup:                    ; preds = %pfor.inc
 sync label %sync.continue

---Tapir using LLVMPar intrinsics-----

; This simple parallel loop uniformly scales each element of a vector of
; integers.
pfor.detach.lr.ph:                    ; preds = %entry
  %wide.trip.count = zext i32 %n to i64
  br label %pfor.detach

pfor.detach:                          ; preds = %pfor.inc, %pfor.detach.lr.ph
  %indvars.iv = phi i64 [ 0, %pfor.detach.lr.ph ], [ %indvars.iv.next, %pfor.inc ]
  %c = call i1 @llvm.directive.marker()["detach_task"]
  br i1 %c, label %pfor.body, label %pfor.inc

pfor.body:                            ; preds = %pfor.detach
  %arrayidx = getelementptr inbounds i32, i32* %x, i64 %indvars.iv
  %0 = load i32, i32* %arrayidx, align 4
  %mul3 = mul nsw i32 %0, %a
  store i32 %mul3, i32* %arrayidx, align 4
  call i1 @llvm.directive.marker()["reattach_task"]
  br label %pfor.inc

pfor.inc:                             ; preds = %pfor.body, %pfor.detach
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
  br i1 %exitcond, label %pfor.cond.cleanup, label %pfor.detach

pfor.cond.cleanup:                    ; preds = %pfor.inc, %entry
  call i1 @llvm.directive.marker()["local_barrier"]
  br label %sync.continue


Comment: If necessary, one can prevent hoisting of the getelementptr
instruction %arrayidx or the load instruction %0 in the above example
by using the @llvm.directive.region.entry, @llvm.directive.region.exit,
and @llvm.launder.invariant.group intrinsics appropriately within
pfor.body.
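
For concreteness, one possible (purely illustrative) form of pfor.body with
such a region is sketched below; the "task"/"shared" bundle tags are
placeholders, and listing i32* %x on the intrinsics is what keeps the load
and store from being moved across the task boundary (pinning the
getelementptr itself would additionally require laundering the pointer,
which is omitted here):

pfor.body:                            ; preds = %pfor.detach
  %task = call token @llvm.directive.region.entry() [ "task"(), "shared"(i32* %x) ]
  %arrayidx = getelementptr inbounds i32, i32* %x, i64 %indvars.iv
  %0 = load i32, i32* %arrayidx, align 4
  %mul3 = mul nsw i32 %0, %a
  store i32 %mul3, i32* %arrayidx, align 4
  call void @llvm.directive.region.exit(token %task) [ "end.task"(i32* %x) ]
  call i1 @llvm.directive.marker()["reattach_task"]
  br label %pfor.inc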


----HPVM----

; The function vector_add() performs pointwise addition of its incoming
; arguments, A and B, replicated at run-time across N parallel instances.
; We omit dataflow edges showing incoming/outgoing values.
;
     %node = call i8* @llvm.hpvm.createNode1D(
      i8* bitcast (%retStruct (i32*, i32, i32*, i32, i32*, i32)* @vector_add
          to i8*),
      i32 %N)


----HPVM using LLVMPar intrinsics----

   ...           ; code using A, B, C, N
   ; The HPVM node function @vector_add is now inlined
   %region = call token @llvm.directive.region.entry()[
      "HPVM_create_node"(%N),
      "dataflow_values"(i32* %A, i32 %bytesA, i32* %B, i32 %bytesB,
      i32* %C, i32 %bytesC),
      "attributes"(i32 0, i32 -1, i32 0, i32 -1, i32 1, i32 -1) ]
    ; 0 = 'in', 1 = 'out', 2 = 'inout', -1 for non-pointer arguments

; Loop structure corresponding to %N instances of vector_add()
header: ...
    ; parallel loop with trip count %N, index variable %loop_index
    %loop_index = phi i64 [ 0, %preheader ], [ %loop_index.next, %latch ]

    %c = call i1 @llvm.directive.marker()["detach_task"]
    br i1 %c, label %body, label %latch

body:
    ; Loop index, instead of HPVM intrinsic calls to generate index
    %ptrA = getelementptr i32, i32* %A, i32 %loop_index
    %ptrB = getelementptr i32, i32* %B, i32 %loop_index
    %ptrC = getelementptr i32, i32* %C, i32 %loop_index

    %a = load i32, i32* %ptrA
    %b = load i32, i32* %ptrB
    %sum = add i32 %a, %b
    store i32 %sum, i32* %ptrC

    %ignore = call i1 @llvm.directive.marker()["reattach_task"]
    br label %latch

latch:
    %loop_index.next = add nuw nsw i64 %loop_index, 1
    %exitcond = icmp eq i64 %loop_index.next, %N
    br i1 %exitcond, label %loop.end, label %header

loop.end:
    call void @llvm.directive.region.exit(token %region)[
    "HPVM_create_node"(), "dataflow_values"() ]

    ...        ; code using A, B, C, N

________________


PROPOSED WORKFLOWS

The proposed intrinsics can be used in a variety of compiler designs.  We
illustrate two possible designs, the first used by the Intel OpenMP compiler
and the second to be used by a planned port of Tapir+HPVM to these
intrinsics.  For each of these workflows, the extension mechanism supports
two translators:

PREPARE: This is used to produce LLVM IR containing the intrinsics. This
    must ensure that the initial generated IR with intrinsics satisfies the
    properties listed above.

EXTRACT: This is used to convert LLVM IR containing the intrinsics into
    some parallel IR, e.g., WRegion for Intel's OpenMP compiler, or
    Tapir+HPVM for the second compiler.

These two are referred to as translators, not passes, because their order is
not interchangeable with other LLVM passes.

The first workflow, sketched below, uses the front-end (e.g., Clang for
OpenMP) to generate LLVM IR containing the intrinsics directly. The PREPARE
translator runs at the end of the front-end, to ensure the required
properties.  Standard LLVM passes can operate on this IR before the EXTRACT
translator generates the parallel IR.  In principle, this
parallel IR could later be converted back into LLVM IR with intrinsics, but
this may or may not be possible (e.g., the WRegion IR is not converted back
and it is considered difficult to do so).

         Front-end
            +               LLVM
         PREPARE            Passes             EXTRACT          Backend
Parallel -------> LLVM IR + ------> Optimized ------> Parallel ------> Target
source            extensions        LLVM IR +           IR             Code
program                             extensions
                      ^                                  |
                      |                                  |
                      |__________________________________|
                                PREPARE (optional)

                        Workflow 1: E.g., OpenMP Compiler


The second workflow first generates a parallel compiler's IR (e.g., Tapir or
HPVM) directly from some front-end.  The PREPARE phase converts the parallel
compiler IR into LLVM IR with the intrinsics, while ensuring that this
initial IR adheres to the required correctness properties.  Standard LLVM
passes can then be run on this IR. Eventually, the EXTRACT translator
converts the IR back to the parallel compiler IR. In this workflow, it
should be possible to convert back and forth between Parallel IR and
Extended LLVM IR multiple times using the PREPARE and EXTRACT phases each
time.

                                  LLVM
         FE       PREPARE         Passes          EXTRACT         Backend
Parallel --> Para- ---> LLVM IR + ----> Optimized -----> Parallel -----> Target
source       llel      extensions       LLVM IR +          IR             Code
program      IR                                         extensions
                           ^                                |
                           |                                |
                           |________________________________|
                                    PREPARE (optional)

                        Workflow 2: E.g., Tapir, HPVM Compilers


The key difference between these two workflows is that the first may be used
to embed source-level information (e.g., OpenMP source directives) directly
within the LLVM IR, whereas in the second it is the parallel IR itself,
rather than source-level information, that is embedded within LLVM IR.


SOUNDNESS

This section examines what facilities the proposed extension mechanism
provides to maintain the necessary correctness properties for parallel IRs,
considering each property in turn.  Some changes to mainline LLVM passes are
needed to use these facilities; the section "Required Changes to LLVM"
summarizes those changes.


(A) Structural Properties

    1. Follows directly from the fact that the extension mechanism consists
       of standard LLVM intrinsics.

    2. Given correct use of the facilities described in the rest of this
       section, follows directly from A.1.

(B) Control Flow Properties

    1. The token returned by a region entry must be passed to the matching
       region exit, which enforces single-entry regions by ensuring that the
       region entry dominates all region exits.  If a particular parallel IR
       needs it, the PREPARE and EXTRACT translators are responsible for
       enforcing the single-exit property.

    2. This is achieved with a Tapir-style loop in which each iteration
       spawns a parallel execution of the body.  See the Tapir loop examples
       above for an illustration of how this can be implemented.

    3. To prevent code motion of memory operations across region boundaries,
       one can use pointer arguments and the OperandBundles described in the
       previous section.  Only pointers to local allocations and internal
       globals need to be passed to the relevant intrinsics.

    4. Would require changes to the SplitCriticalEdge utility function in
       LLVM.

(C) Dataflow and data dependence properties

    1. There are two parts to ensuring this property.  First, correct usage
       of the marker intrinsic gives the implementer control over dominator
       relations, making it possible to prevent direct references to SSA
       values defined in a parallel region.  Second, these values can still
       be referenced by phi nodes, so changes must be made to LLVM to
       prevent such phi-node creation.

    2. Prevention of particular memory operations being moved across
       intrinsics is achieved by appropriate use of OperandBundles.

    3. Would likely require changes to LLVM, as described in the section
       Required Changes to LLVM.

(D) Synchronization and memory model properties

    1. Follows directly from A.1.

    2. The mechanism is identical to C.2 for synchronization operations on
       particular memory locations, because most synchronization operations
       take a pointer argument.  For memory-wide operations that do not take
       such an argument, such as fences, passing any pointer in an
       OperandBundle on the intrinsic calls prevents LLVM from moving a
       fence across those calls, as desired (see the sketch below).
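
For example (an illustrative sketch only; the bundle tags are placeholders),
because the region intrinsics below carry a pointer operand, LLVM must
assume they may access memory, so the fence cannot be reordered across them:

  %t = call token @llvm.directive.region.entry() [ "task"(), "shared"(i32* %p) ]
  store atomic i32 1, i32* %p monotonic, align 4
  fence acq_rel
  call void @llvm.directive.region.exit(token %t) [ "end.task"(i32* %p) ]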

REQUIRED CHANGES TO LLVM

The soundness properties above rely almost entirely on standard correctness
requirements for compiler transformations, e.g., transformations, including
IPO passes, will not move memory or synchronization operations across calls
to opaque external functions that can access an aliasing object; opaque
external functions could access any memory object reachable from any
(non-internal) global pointer or argument pointer.

Some properties, however, require additional changes to a few LLVM
passes.  We present a high-level list of such changes known to us so
far.  This list is a work-in-progress, based on our experience
developing parallel compilers.  We invite feedback on the content of
this list.  Patches will be submitted to the list, after discussion on
this RFC.

* New allocations and new internal globals must add pointer arguments to
  intrinsics: To enforce properties B.3 and C.3, passes that introduce new
  local allocations and new internal globals will need to be modified to add
  pointer arguments to the relevant intrinsics.  There aren't many such
  passes today: Inliner, Internalize, Reg2Mem, SROA.

   * Note: Some optional passes not used by current parallel compilers may
     be omitted, such as StatePointRewriting (if GC is not needed), the
     Sanitizers, etc.

* Inliner must not move allocas to function entry: When a callee that
  contains an alloca is inlined at a call site, the alloca must be kept at
  the call site and not moved to the function entry block, as sketched at
  the end of this list. (Technically, this is only necessary for call sites
  within parallel regions in the caller function, but we can change it for
  all call sites so that the inliner doesn't have to look for enclosing
  regions.)

   * Note: Because of this change, passes that aim to optimize allocas must
     be modified to look for these inlined allocas.

* PHI-node construction: Enforcing property C.1 requires modifications to
  how PHI nodes are constructed in LLVM.  In our experience, the bulk of the
  necessary changes lie in the Mem2Reg pass and SSAUpdater.  A few passes,
  such as GVN, update PHI nodes themselves in simple ways, and some changes
  might be required to these passes as well.

* SplitCriticalEdge utility: To enforce B.4, modifications are needed to
  prevent splitting of critical edges created by branches conditioned on
  @llvm.directive.marker values.  This change can be confined to the
  SplitCriticalEdge utility function, which controls critical-edge
  splitting.


AUTHORS AND ACKNOWLEDGEMENTS

The initial design of the three intrinsics was developed by Xinmin Tian and
Hal Finkel, and prototyped in the Intel OpenMP compiler.

The discussions and writing for this document, including the correctness
requirements, soundness arguments and examples, were led by Vikram Adve, Hal
Finkel, Maria Kotsifakou, Tao (TB) Schardl, George Stelle and Xinmin Tian.

Joel Denny, Johannes Doerfert, Seyong Lee, William Moses, Hashim Sharif,
Jeff Vetter and Wael Yahia provided valuable input during the discussions
and significant comments on the RFC.


REFERENCES

1. H. Finkel and X. Tian "[llvm-dev] RFC: A Proposal for adding an
   experimental IR-level region-annotation infrastructure", Jan. 11, 2017.
   http://lists.llvm.org/pipermail/llvm-dev/2017-January/108906.html.

2. H. Saito et al., "Extending LoopVectorizer towards supporting OpenMP 4.5
   SIMD and outer loop auto-vectorization", LLVM Developer’s Conference,
   Nov. 2016.

3. X. Tian, H. Saito, E. Su, A. Gaba, M. Masten, E. Garcia, A. Zaks, "LLVM
   Framework and IR Extensions for Parallelization, SIMD Vectorization and
   Offloading".  LLVM-HPC at SC 2016: 21-31.

4. T.B. Schardl, W.S. Moses, C.E. Leiserson, "Tapir: Embedding Fork-Join
   Parallelism into LLVM's Intermediate Representation", In PPoPP, 2017.
   https://doi.org/10.1145/3018743.3018758

5. Intel Corporation, "LLVM Intrinsic function and Tag name string interface
   specification for directive representation", January 22, 2018.

6. M. Kotsifakou, P. Srivastava, M.D. Sinclair, R. Komuravelli, V. Adve,
   S. Adve.  "HPVM: Heterogeneous Parallel Virtual Machine."  In PPoPP,
   2018. https://doi.org/10.1145/3178487.3178493


APPENDIX

See Appendix in the Google doc at https://bit.ly/LLVMParRFC

This Appendix includes three substantial examples of using this RFC for
OpenMP, Tapir and HPVM compilers.