[cfe-dev] [RFC] Parallelization metadata and intrinsics in LLVM (for OpenMP, etc.)

Hal Finkel hfinkel at anl.gov
Fri Aug 10 13:06:17 PDT 2012


Hello,

I'd like to see support in clang/LLVM for multi-core parallelism,
especially support for OpenMP. I think that the best way to do this is
by designing an LLVM-based API (metadata and intrinsics) for
expressing parallelism constructs, and having clang lower OpenMP code
to that API. This will allow maximal preservation of optimization
capabilities including target-specific lowering. What follows outlines
a set of metadata and intrinsics which should allow support for the
full OpenMP specification, and I'd like to know what the community
thinks about this.

As a general note: My intent here is to make the metadata safe in the
traditional sense: it can be removed by optimization passes that don't
understand it, and while this might result in the loss of the
parallelization, the removal will not be otherwise unsafe. I believe
that many existing passes will require minor modification in order to
preserve the metadata as appropriate, but I think these changes are
relatively small. In addition, the authors of passes that preserve
parallelization by dealing with parallelization metadata will need to
explicitly think about how to handle it; hopefully, this will yield
fewer bugs.

In the following I will outline the API and explain how OpenMP will be
lowered. My idea is to follow OpenMP's semantics, so if these differ
from the OpenMP spec, then I'd like to correct that. If there are other
parallelism models that we would like to support, then I think those
can be incorporated as well (maybe something with lightweight tasks
such as Cilk).

---- Parallel Regions ----

Inside a parallel region, a team of threads executes the sequence of
instructions.

A parallel region is specified by a function. This function may be
executed by one or more threads in parallel. In terms of OpenMP:
private() variables become variables local to the function.
firstprivate() variables become parameters to the function. shared()
variables become pass-by-pointer parameters. If the shared variable is
not a global, then we allocate a local copy, using alloca followed by a
store, and pass the new pointer to the function. For copyin()
variables, we pass a copy of the variable to the function, and the
function then uses that copy to update the thread's version of the
(TLS) variable. The function should have private (or internal) linkage
for optimization purposes.
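
For instance, a shared() local variable x might be materialized and
passed as something like the following (all names here are
illustrative):

  %x = alloca i32
  store i32 %x.val, i32* %x
  call void @parreg0(i32* %x)

where @parreg0 is the parallel-region function, which accesses the
shared variable through its i32* parameter.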

To mark this function as a parallel region, a module-level 'parallel'
metadata entry is created. The call site(s) of this function are marked
with this metadata. The metadata has the following entries:
 - The string "region"
 - A reference to the parallel-region function
 - If applicable, a list of metadata references specifying
special-handling child regions (parallel loops and serialized/critical
regions)

If the special-handling region metadata is no longer referenced by code
within the parallel region, then the region has become invalid, and
will be removed (meaning all parallelization metadata will be removed)
by the ParallelizationCleanup pass. The same is true for all other
cross-referenced metadata below.

Note that parallel regions can be nested.

As a quick example, something like:
int main() {
  int a;
#pragma omp parallel firstprivate(a) 
  do_something(a);
  ...
}

becomes something like:

define private void @parreg(i32 %a) {
entry:
  call void @do_something(i32 %a)
  ret void
}

define i32 @main() {
entry:
  ...
  call void @parreg(i32 %a) !parallel !0
  ...
}

!0 = metadata !{ metadata !"region", @parreg }

-- Reductions --

To handle reductions, the variable is first converted into an output
pass-by-pointer parameter to the function. The pointer refers to an
array of values, one for each thread that will execute the region.
After the region completes, a loop must be created to actually perform
the requested reduction. Inside the parallel region, each thread
accesses its value using its thread id as the index. See the nthreads
and tidx intrinsics below.
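
For example, a reduction(+:s) clause might be lowered as something like
the following sketch (the exact intrinsic call syntax is illustrative):

  ; in the serial code, before entering the region:
  %nt = call i32 @llvm.parallel.nthreads(metadata !0)
  %s.arr = alloca i32, i32 %nt
  call void @parreg0(i32* %s.arr) !parallel !0
  ; ... followed by a loop over 0 .. %nt-1 summing the
  ; elements of %s.arr into %s ...

and, inside the region, each thread writes its partial result into its
own slot:

  %tidx = call i32 @llvm.parallel.tidx(metadata !0)
  %slot = getelementptr i32* %s.arr, i32 %tidx
  store i32 %s.partial, i32* %slot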

-- Special handling regions --

- Serial Regions -

Serial regions within parallel blocks (called 'single' in OpenMP) are
executed only by one thread. As with parallel regions themselves, they
are lowered as functions; the call site(s) of these functions are
tagged with 'parallel' metadata. This metadata has entries:
  - The string "serial"
  - A reference to the single-region function
  - A metadata reference to the parent parallel-region or loop metadata
  - Optionally, a type: "master" or "any" (the default)

For regions with "master" only the master thread may execute the
region.
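
For example, a 'single' region inside parallel region !0 might be
represented as something like (the numbering here is illustrative):

  call void @serreg() !parallel !1

!0 = metadata !{ metadata !"region", @parreg, metadata !1 }
!1 = metadata !{ metadata !"serial", @serreg, metadata !0, metadata !"any" }

Note that !0 lists !1 as a special-handling child region, and !1 refers
back to !0 as its parent.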

- Critical Regions -

Critical regions are like serial regions, but they are executed by all
threads with mutual exclusion. These are identified by 'parallel'
metadata with entries:
  - The string "critical"
  - A reference to the critical-region function
  - A metadata reference to the parent parallel-region, loop or task
metadata
  - Optionally, a global name string used for non-local synchronization
(all regions with the same name string are mutually exclusive)
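
For example, a critical region named "lock1" within region !0 might be
tagged with something like (numbering illustrative):

!2 = metadata !{ metadata !"critical", @critreg, metadata !0,
metadata !"lock1" }

and all critical regions carrying the name "lock1", anywhere in the
program, are then mutually exclusive with one another.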

- Loops -

Parallel loops are indicated by tagging all backedge branches with
'parallel' metadata. This metadata has the following entries:
  - The string "loop"
  - A metadata reference to the parent parallel-region metadata
  - Optionally, a string specifying the scheduling mode: "static",
"dynamic", "guided", "runtime", or "auto" (the default)
  - Optionally, an integer specifying the number of loop levels over
which to parallelize (the default is 1)
  - If applicable, a list of metadata references specifying ordered and
serial/critical regions within the loop.

Note that what makes this metadata safe is the cross referencing
between the parent region metadata, the loop metadata and the metadata
references on the instructions. If any of these are removed or become
inconsistent, then the whole parallel region must be removed. The
ParallelizationCleanup pass will check this prior to lowering.
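
As a sketch, a statically-scheduled parallel loop within region !0
might look like (numbering illustrative):

  ; the backedge branch of the loop, tagged with the loop metadata:
  br i1 %exitcond, label %exit, label %loop.body !parallel !3

!0 = metadata !{ metadata !"region", @parreg, metadata !3 }
!3 = metadata !{ metadata !"loop", metadata !0, metadata !"static" }

If a pass drops the !parallel tag from the backedge branch, then !3 is
no longer referenced from code, and the whole region's metadata must be
removed.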

To lower lastprivate() OpenMP variables, first we allocate a copy of
the variable outside the loop. At the end of the loop body we insert a
check to determine if the current iteration is the last one (over all
threads), and if so, we update the common copy with the local version.
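
As a sketch, the end-of-body check for a lastprivate(x) variable might
look like the following (how %is.last is computed depends on the
schedule):

  %is.last = icmp eq i64 %iv.next, %trip.count
  br i1 %is.last, label %x.update, label %latch

x.update:
  %x.val = load i32* %x.priv
  store i32 %x.val, i32* %x.common
  br label %latch
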
Note that OpenMP loops with private, firstprivate, etc. clauses that
cannot be made part of the parent parallel region will also need to be
placed into their own functions to handle the relevant scoping issues.

Ordered regions (those which must execute in the original iteration
order) are lowered as functions, much in the same way as serial
regions. The call site(s) are tagged with 'parallel' metadata. This
metadata has entries:
  - The string "ordered"
  - A reference to the function specifying the ordered region
  - A metadata reference to the parent parallel loop

Serial regions and loops that don't have the 'nowait' OpenMP clause must
be followed by a barrier intrinsic.

- Tasks -

Explicit tasks are also lowered as functions similar to other special
handling regions. Their call site(s) are marked with 'parallel'
metadata. Depending on the implementation, they may not actually start
executing until the main thread executes a taskwait intrinsic or
reaches the end of the parallel region. The task metadata has:
  - The string "task"
  - A reference to the function specifying the task
  - A metadata reference to the parent region, task, loop, etc.
  - Optionally, an affinity mode: "untied" or "tied" (the default). In
tied mode, once a task starts executing in a particular thread, it must
continue to execute in that thread until completion. An untied task can
be passed in between threads.
  - If applicable, a list of metadata references specifying ordered and
serial/critical regions within the task.
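
For example, an untied task within region !0 might be tagged with
something like (numbering illustrative):

  call void @taskfn() !parallel !4

!4 = metadata !{ metadata !"task", @taskfn, metadata !0, metadata !"untied" }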

-- Intrinsics --

Because metadata does not count as a variable use, and some runtime
controls take general expressions, supporting these requires
intrinsics. Many of these intrinsics are tied to their parent parallel
regions by taking a metadata parameter specifying the parallel region,
loop, etc.

void @llvm.parallel.if(i1, !) - Takes a boolean expression controlling
whether the referenced region (or task) is executed in parallel (the
true case) or in serial (the false case). For a task, this controls the
choice between queued or immediate in-place execution.
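
For example, #pragma omp parallel if(n > 1) might lower to something
like (the exact call syntax is a sketch):

  %cond = icmp sgt i32 %n, 1
  call void @llvm.parallel.if(i1 %cond, metadata !0)
  call void @parreg0() !parallel !0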

void @llvm.parallel.final(i1, !) - Takes a boolean expression
controlling whether the referenced task is considered final. A final
task can have no subtasks (or, for that matter, nested parallel
regions).

void @llvm.parallel.setnt(i32, !) - Specify the number of threads used
to execute the parallel region.

i32 @llvm.parallel.nthreads(!) - Determine the total number of threads
that will be used to execute the referenced parallel region (this is
used to set up the array for reductions).

i32 @llvm.parallel.tidx(!) - Obtain the current thread index; this is
not the global thread id, or even the application-specific thread id.
These indices run only from 0 through one less than the total number of
threads active in the referenced region (this is used to access
elements in a reduction array).

void @llvm.parallel.chunksz(i32 or i64, !) - Specify the size of the
chunks used to decompose a parallel loop. The metadata reference is to
the metadata which tags the loop backedges.

void @llvm.parallel.barrier() - A barrier for all threads in the
current parallel region.

void @llvm.parallel.taskwait() - Wait for all child tasks of the
current task (or all top-level tasks).

void @llvm.parallel.taskyield() - Optionally yield execution to other
tasks.

---- Parallel Sections ----

OpenMP parallel sections are lowered as parallel loops. The loop
executes a fixed number of times (once per section), and within the
loop body a switch statement selects the correct section (in order)
based on the iteration number.
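
For example, with two sections the loop body might contain something
like:

  switch i32 %iv, label %sections.end [
    i32 0, label %section0
    i32 1, label %section1
  ]

with the loop's backedge branch tagged with 'parallel' loop metadata as
described above.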

---- Thread-Local Data ----

#pragma omp threadprivate(<variable-list>), which applies only to
global variables, is handled by declaring global variables with the
existing thread_local attribute.
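
For example, #pragma omp threadprivate(x) applied to a global int x
becomes:

@x = thread_local global i32 0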

---- Atomic Operations ----

OpenMP atomic operations are encoded using existing LLVM atomic
intrinsics.
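
For example, #pragma omp atomic applied to x++ might be encoded as:

  atomicrmw add i32* @x, i32 1 monotonic

(a stronger ordering may be needed depending on the exact OpenMP
semantics required).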

---- Flush ----

In general, an OpenMP flush operation, regardless of the contents of
the variable list, can be lowered as: fence seq_cst.

---- Passes ----

-- Early Passes --

ParallelRegionWidening - This is an early pass that tries to combine
consecutive parallel regions. Non-parallel "in between" regions can be
converted into serialized blocks. This can be done so long as any
reductions can be delayed until the end of the last region, and any
converted serial regions do not have external function calls or inline
assembly regions (both of which could be sensitive to the real number
of active threads). This not only reduces thread-startup overhead, but
will also allow other optimizations, such as loop fusion.

-- Late Passes (Lowering) --

The parallelization lowering will be done by IR level passes in CodeGen
prior to SelectionDAG conversion. Currently, this means after
loop-strength reduction. Like loop-strength reduction, these IR level
passes will get a TLI object pointer and will have target-specific
override capabilities.

ParallelizationCleanup - This pass will be scheduled prior to the other
parallelization lowering passes (and anywhere else we decide). Its job
is to remove parallelization metadata that had been rendered
inconsistent by earlier optimization passes. When a parallelization
region is removed, any parallelization intrinsics that can be removed
are then also removed.

ParallelizationLowering - This pass will actually lower parallelization
constructs into a combination of runtime-library calls and, optionally,
target-specific intrinsics. I think that an initial generic
implementation will target libgomp.

* I would like to see support for OpenMP 3.1 [1] plus an extension for
  user-defined reductions (UDRs) [2].

[1] OpenMP Specification 3.1. July, 2011.
    http://www.openmp.org/mp-documents/OpenMP3.1.pdf

[2] A. Duran, et al. "A proposal for User-Defined Reductions in
OpenMP". IWOMP, 2010.
http://www.ccs.tsukuba.ac.jp/workshop/IWOMP2010/slides/Alex-udrs.pdf

Thanks again,
Hal

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory


