[PATCH] D60275: [RFC] IR extensions for annotating directive region entry and exit with a set of OpBundle name definitions for OpenMP

Thu Apr 4 10:19:51 PDT 2019

xtian created this revision.
xtian added reviewers: hfinkel, jdoerfert, ABataev, vadve.
xtian added a project: LLVM.
Herald added subscribers: llvm-commits, jfb, guansong, Prazek, mgorny.

This is the first patch, a starting point for the two RFCs listed below.

[llvm-dev] [RFC] IR-level Region Annotations 
[llvm-dev] [RFC] An Extension Mechanism for Parallel Compilers Based on LLVM

The updated LLVM IR proposal is summarized below.

-------LLVM Intrinsic Functions-------

Essentially, the LLVM OperandBundles, the LLVM token type, and three new
LLVM directive intrinsics form the foundation of the proposed extension
mechanism.

The three newly introduced LLVM intrinsic functions are the following:

  token @llvm.directive.region.entry()[]
  void @llvm.directive.region.exit(token)[]
  i1 @llvm.directive.marker()[]

More concretely, these intrinsics are defined using the following
declarations:

  // Directive and Qualifier Intrinsic Functions
  def int_directive_region_entry : Intrinsic<[llvm_token_ty],[], []>;
  def int_directive_region_exit : Intrinsic<[], [llvm_token_ty], []>;
  def int_directive_marker : Intrinsic <[llvm_i1_ty], [], []>;

As described in Section SOUNDNESS, several correctness properties are
maintained using OperandBundles on calls to these intrinsics.  In
LLVM, an OperandBundle has a tag name (a string to identify the
bundle) and an operand list consisting of zero or more operands. For
example, here are two OperandBundles:

  "TagName01"(i32* %x, float* %y, i32 7)
  "AnotherTagName"()

The tag name of the first bundle is "TagName01", and it has an operand list
consisting of three operands, %x, %y, and 7. The second bundle has a tag
name "AnotherTagName" but no operands (it has an empty operand list).
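
As an illustration of the tag-plus-operand-list shape just described, here is a minimal C++ sketch that models a bundle and prints it in textual form. It does not use LLVM's own classes; the names BundleSketch and render are hypothetical, and operands are stored as already-printed IR strings purely for simplicity.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for an operand bundle: a tag name plus zero or
// more operands, each kept here as already-printed IR text.
struct BundleSketch {
  std::string tag;
  std::vector<std::string> operands;
};

// Render the bundle in the textual form used above: "Tag"(op, op, ...)
std::string render(const BundleSketch &b) {
  std::ostringstream out;
  out << '"' << b.tag << "\"(";
  for (std::size_t i = 0; i < b.operands.size(); ++i) {
    if (i != 0)
      out << ", ";
    out << b.operands[i];
  }
  out << ')';
  return out.str();
}
```

A bundle with an empty operand list renders as its quoted tag followed by `()`, matching the second example above.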

The above new intrinsics allow:

- Annotating a code region marked with directives / pragmas / explicit parallel function calls.
- Annotating values associated with the region (or loops), that is, those values associated with directives / pragmas.
- Providing information on LLVM IR transformations needed for the annotated code regions (or loops).
- Introducing parallel IR constructs for (one of) a variety of different parallel IRs, e.g., Tapir or HPVM.
- Most LLVM scalar and vector analyses and optimizations to be applied to parallel code without modifications to the passes, and without requiring parallel "tasks" to be outlined into separate, isolated functions.

These intrinsics can be used both by frontends and by transformation
passes (e.g., automatic parallelization).

The names used here are open to discussion.
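
The region intrinsics are tied together by the token value: region.entry produces it and region.exit consumes it. The C++ sketch below checks that pairing over a flat list of events. It assumes regions are meant to be properly nested, which this summary does not state explicitly, and the names RegionEvent and regionsWellNested are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical event record: one call to region.entry or region.exit,
// identified by the name of the token it defines or consumes.
struct RegionEvent {
  bool isEntry;
  std::string token;
};

// Returns true when every exit consumes the token of the innermost open
// entry and no region is left open at the end.
// Assumption: proper nesting is required.
bool regionsWellNested(const std::vector<RegionEvent> &events) {
  std::vector<std::string> open;
  for (const RegionEvent &e : events) {
    if (e.isEntry) {
      open.push_back(e.token);
    } else {
      if (open.empty() || open.back() != e.token)
        return false;
      open.pop_back();
    }
  }
  return open.empty();
}
```

A verifier inside LLVM would work on the IR itself rather than on a flat event list; the sketch only shows the pairing rule that the token makes checkable.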

--------Three Example Uses---------

Below, we show three very brief examples using three IRs: OpenMP [5],
Tapir [4] and HPVM [6].  Somewhat larger code examples are shown in the
Appendix of the accompanying Google Doc.

----Tapir IR----

; This simple Tapir loop uniformly scales each element of a vector of
; integers in parallel.
pfor.detach.lr.ph:
 %wide.trip.count = zext i32 %n to i64
 br label %pfor.detach

pfor.detach:                          ; preds = %pfor.inc, %pfor.detach.lr.ph
 %indvars.iv = phi i64 [ 0, %pfor.detach.lr.ph ], [ %indvars.iv.next, %pfor.inc ]
 detach label %pfor.body, label %pfor.inc

pfor.body:                            ; preds = %pfor.detach
 %arrayidx = getelementptr inbounds i32, i32* %x, i64 %indvars.iv
 %0 = load i32, i32* %arrayidx, align 4
 %mul3 = mul nsw i32 %0, %a
 store i32 %mul3, i32* %arrayidx, align 4
 reattach label %pfor.inc

pfor.inc:                             ; preds = %pfor.body, %pfor.detach
 %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
 %exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
 br i1 %exitcond, label %pfor.cond.cleanup, label %pfor.detach

pfor.cond.cleanup:                    ; preds = %pfor.inc
 sync label %sync.continue

----Tapir using LLVMPar intrinsics----

; This simple parallel loop uniformly scales each element of a vector of
; integers.
pfor.detach.lr.ph:                    ; preds = %entry
  %wide.trip.count = zext i32 %n to i64
  br label %pfor.detach

pfor.detach:                          ; preds = %pfor.inc, %pfor.detach.lr.ph
  %indvars.iv = phi i64 [ 0, %pfor.detach.lr.ph ], [ %indvars.iv.next, %pfor.inc ]
  %c = call i1 @llvm.directive.marker()["detach_task"]
  br i1 %c, label %pfor.body, label %pfor.inc

pfor.body:                            ; preds = %pfor.detach
  %arrayidx = getelementptr inbounds i32, i32* %x, i64 %indvars.iv
  %0 = load i32, i32* %arrayidx, align 4
  %mul3 = mul nsw i32 %0, %a
  store i32 %mul3, i32* %arrayidx, align 4
  call i1 @llvm.directive.marker()["reattach_task"]
  br label %pfor.inc

pfor.inc:                             ; preds = %pfor.body, %pfor.detach
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
  br i1 %exitcond, label %pfor.cond.cleanup, label %pfor.detach

pfor.cond.cleanup:                    ; preds = %pfor.inc, %entry
  call i1 @llvm.directive.marker()["local_barrier"]
  br label %sync.continue

Comment: If necessary, one can prevent hoisting of the getelementptr
instruction %arrayidx or the load instruction %0 in the above example
by using the @llvm.directive.region.entry,
@llvm.directive.region.exit, and @llvm.launder.invariant.group
intrinsics appropriately within pfor.body.
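
The correspondence between the two Tapir listings above can be written as a small lookup: each Tapir terminator is replaced by a call to @llvm.directive.marker carrying one bundle tag. The C++ sketch below records only that mapping. The tag strings are copied from the listing, while the enum TapirOp and the function markerTagFor are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <string>

// The three Tapir terminators that appear in the first listing.
enum class TapirOp { Detach, Reattach, Sync };

// Bundle tag on the llvm.directive.marker call that stands in for each
// terminator in the second listing.
std::string markerTagFor(TapirOp op) {
  switch (op) {
  case TapirOp::Detach:
    return "detach_task";
  case TapirOp::Reattach:
    return "reattach_task";
  case TapirOp::Sync:
    return "local_barrier";
  }
  return ""; // not reached for the enumerators above
}
```

Note that the control flow changes as well: in the second listing the detach becomes a conditional branch on the marker's i1 result, while reattach and sync become unconditional branches.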

----HPVM----

; The function vector_add() performs point to point addition of incoming
; arguments, A and B, replicated at run-time across N parallel instances.
; We omit dataflow edges showing incoming/outgoing values.
;

  %node = call i8* @llvm.hpvm.createNode1D(
   i8* bitcast (%retStruct (i32*, i32, i32*, i32, i32*, i32)* @vector_add
       to i8*),
   i32 %N)

----HPVM using LLVMPar intrinsics----

  ...           ; code using A, B, C, N
  ; The HPVM node function @vector_add is now inlined
  %region = call token @llvm.directive.region.entry()[
     "HPVM_create_node"(%N),
     "dataflow_values"(i32* %A, i32 %bytesA, i32* %B, i32 %bytesB,
     i32* %C, i32 %bytesC),
     "attributes"(i32 0, i32 -1, i32 0, i32 -1, i32 1, i32 -1) ]
   ; 0 = 'in', 1 = 'out', 2 = 'inout', -1 for non-pointer arguments

; Loop structure corresponding to %N instances of vector_add()
header: ...

  ; parallel loop with trip count %N, index variable %loop_index
  %loop_index = phi i64 [ 0, %preheader ], [ %loop_index.next, %latch ]

  %c = call i1 @llvm.directive.marker()["detach_task"]
  br i1 %c, label %body, label %latch

body:
  ; Loop index, instead of HPVM intrinsic calls to generate index
  %ptrA = getelementptr i32, i32* %A, i64 %loop_index
  %ptrB = getelementptr i32, i32* %B, i64 %loop_index
  %ptrC = getelementptr i32, i32* %C, i64 %loop_index

  %a = load i32, i32* %ptrA
  %b = load i32, i32* %ptrB
  %sum = add i32 %a, %b
  store i32 %sum, i32* %ptrC

  %ignore = call i1 @llvm.directive.marker()["reattach_task"]
  br label %latch

latch:
  %loop_index.next = add nuw nsw i64 %loop_index, 1
  %exitcond = icmp eq i64 %loop_index.next, %N
  br i1 %exitcond, label %loop.end, label %header

loop.end:

  call void @llvm.directive.region.exit(token %region)[
  "HPVM_create_node"(), "dataflow_values" () ]

  ...        ; code using A, B, C, N
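
The "attributes" bundle in the HPVM listing above pairs one integer code with each operand of "dataflow_values", in the same order. The following C++ sketch decodes those codes using the legend given in the listing's comment (0 = in, 1 = out, 2 = inout, -1 for non-pointer arguments). The function names describeAttribute and annotate are hypothetical, and returning an empty result on a length mismatch is an assumption of this sketch, not something the proposal specifies.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Decode one operand of the "attributes" bundle per the listing's legend.
std::string describeAttribute(int code) {
  switch (code) {
  case 0:
    return "in";
  case 1:
    return "out";
  case 2:
    return "inout";
  case -1:
    return "non-pointer";
  default:
    return "unknown";
  }
}

// Pair each dataflow value with its decoded attribute, position by position.
// Assumption: a length mismatch yields an empty result.
std::vector<std::string> annotate(const std::vector<std::string> &values,
                                  const std::vector<int> &codes) {
  std::vector<std::string> result;
  if (values.size() != codes.size())
    return result;
  for (std::size_t i = 0; i < values.size(); ++i)
    result.push_back(values[i] + ": " + describeAttribute(codes[i]));
  return result;
}
```

Applied to the operands in the listing, %A and %B decode as inputs, %C as the output, and the three byte counts as non-pointer arguments.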

________________

Repository:
  rL LLVM

https://reviews.llvm.org/D60275

Files:
  llvm/include/llvm/IR/CMakeLists.txt
  llvm/include/llvm/IR/GlobalValue.h
  llvm/include/llvm/IR/Intel_Directives.td
  llvm/include/llvm/IR/Intrinsics.td
  llvm/include/llvm/IR/Module.h
