[cfe-dev] [LLVMdev] [RFC] Parallelization metadata and intrinsics in LLVM (for OpenMP, etc.)

Wed Sep 26 06:36:29 PDT 2012

Hi,

Sorry for the hiatus, busy time at my university. :)

After a false start and some (hopefully cogent) thought, I am now of
the opinion that it will be better to have llvm natively support a
somewhat different notion of parallel computation and have the
frontend lower OpenMP directives (and possibly other such things) into
the same.

In short, I propose a intrinsic based approach which hinges on the
concept of a "parallel map".  The immediate effect of using intrinsics
is that we no longer have to worry about missing metadata.  Moreover,
we are still free to lower the intrinsics in a variety of ways --
including vectorizing them or lowering them to calls to an actual
openmp backend.

I have also tried to make this representation more orthogonal and
general; mirroring a significant subset of openmp directives inside
llvm's IR doesn't feel right.  For instance, in the following
proposal, the OpenMP TASK directive is lowered using the more general
parallel_map construct.  A pass lowering the intrinsics into an OpenMP
backend may "reverse engineer" the mentioned pattern into tasks when
possible, but, in principle, I think the directives we introduce into
llvm's IR are best left as mutually exclusive as possible.  Keeping
the intrinsics simple and orthogonal should also help in asserting and
verifying correctness.

I plan to first implement a null lowering pass which simply lowers intrinsics
to something semantically correct but with no multithreading.  Once that is
done, I'll try to lower the intrinsics into something more interesting, perhaps
libgomp or maybe even a custom runtime.

The proposal
------------

I propose introducing four new intrinsics to llvm:

1. void @llvm.parallel_map(i32 limit, void (i32, i8 *) fn, i8* priv)

Semantics:

Executes `limit` copies of fn, _possibly_ in parallel.  The map index
(i32, ranging from 0 to (limit - 1) both inclusive) and `priv` are
passed to each of the invocations.  The only invariant is that the 0th
iteration is always executed in the invoking thread.

It is legal to have calls to parallel_map inside a function being
parallel_map'ed over.

2. void @llvm.sync_region(void (i32, i8 *) fn, i8 type)

Semantics:

It is only legal to call sync_region from within the dynamic extent of
a parallel_map.  It ensures that `limit` copies (the `limit` is the
limit from the parallel_map) of fn are executed with mutual exclusion.

`type` can either be 0 (`Any`) signifying that the synchronized
regions can be run in any order or 1 (`Ordered`) signifying that the
synchronized regions must be run in increasing order of the index.

3. i32 @llvm.get_num_threads()

Semantics:

Returns the number of threads in the thread pool.

4. i32 @llvm.set_num_threads(i32)

Set the number of threads in the thread pool.

It should be possible to lower all OpenMP directives to the above four
intrinsics in the frontend (please read this in conjunction with [1]):

Parallel regions can be lowered as a parallel_map with
@llvm.num_threads as the limit.

```
#pragma PARALLEL
  block
```

desugars to

```
@llvm.parallel_map(num_threads, block_closure, shared_closure)
...

void block_closure(i32 tid, i8* shared_vars) {
 block
}
```

Reductions are handled by a parallel_map followed by a regular
reduction loop (exactly as in Hal's proposal).

Serial blocks reduce to a block conditional on the index inside the
function being parallelly mapped.

We lower critical and ordered regions into calls to `sync_region`.

Tasks are lowered (recursively) this way:

```
TASK
  block
more_code_may_contain_more_tasks
TASK_WAIT
```

desugars to

```
@llvm.parallel_map(2, task_closure, task_priv)

void task_closure(i32 index, i8* private) {
  if (index == 0) {
    more_code_may_contain_more_tasks
  } else {
    block
  }
}
```

Parallel loops are basically `parallel_map`s.

Thoughts?

[1] http://lists.cs.uiuc.edu/pipermail/llvmdev/2012-August/052472.html
-- 
Sanjoy Das
http://playingwithpointers.com