[cfe-dev] [LLVMdev] [RFC] Parallelization metadata and intrinsics in LLVM (for OpenMP, etc.)

Hal Finkel hfinkel at anl.gov
Wed Sep 26 10:54:49 PDT 2012


On Wed, 26 Sep 2012 19:06:29 +0530
Sanjoy Das <sanjoy at playingwithpointers.com> wrote:

> Hi,
> 
> Sorry for the hiatus, busy time at my university. :)
> 
> After a false start and some (hopefully cogent) thought, I am now of
> the opinion that it will be better to have llvm natively support a
> somewhat different notion of parallel computation and have the
> frontend lower OpenMP directives (and possibly other such
> constructs) into that representation.
> 
> In short, I propose an intrinsic-based approach which hinges on the
> concept of a "parallel map".  The immediate effect of using intrinsics
> is that we no longer have to worry about missing metadata.  Moreover,
> we are still free to lower the intrinsics in a variety of ways --
> including vectorizing them or lowering them to calls to an actual
> openmp backend.
> 
> I have also tried to make this representation more orthogonal and
> general; mirroring a significant subset of openmp directives inside
> llvm's IR doesn't feel right.  For instance, in the following
> proposal, the OpenMP TASK directive is lowered using the more general
> parallel_map construct.  A pass lowering the intrinsics into an OpenMP
> backend may "reverse engineer" this pattern back into tasks when
> possible, but, in principle, I think the constructs we introduce into
> llvm's IR are best kept as non-overlapping as possible.  Keeping
> the intrinsics simple and orthogonal should also help in asserting and
> verifying correctness.
> 
> I plan to first implement a null lowering pass which simply lowers
> intrinsics to something semantically correct but with no
> multithreading.  Once that is done, I'll try to lower the intrinsics
> into something more interesting, perhaps libgomp or maybe even a
> custom runtime.
> 
> 
> The proposal
> ------------
> 
> I propose introducing four new intrinsics to llvm:
> 
> 1. void @llvm.parallel_map(i32 limit, void (i32, i8 *) fn, i8* priv)
> 
> Semantics:
> 
> Executes `limit` copies of `fn`, _possibly_ in parallel.  The map
> index (an i32 ranging from 0 to limit - 1, inclusive) and `priv` are
> passed to each invocation.  The only invariant is that the 0th
> iteration always executes in the invoking thread.
> 
> It is legal to have calls to parallel_map inside a function being
> parallel_map'ed over.
> 
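> As a concrete sketch, the planned null lowering could expand a
> parallel_map call into a plain serial loop (pseudocode in the same
> style as the examples below; the name is invented for illustration):
> 
> ```
> void lowered_parallel_map(i32 limit, void (i32, i8 *) fn, i8* priv) {
>   // Run every copy serially in the invoking thread; this trivially
>   // satisfies the invariant that index 0 runs in the invoking
>   // thread.
>   for (i32 i = 0; i < limit; i++)
>     fn(i, priv);
> }
> ```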
> 
> 2. void @llvm.sync_region(void (i32, i8 *) fn, i8 type)
> 
> Semantics:
> 
> It is only legal to call sync_region from within the dynamic extent
> of a parallel_map.  It ensures that the `limit` copies of `fn` (where
> `limit` comes from the enclosing parallel_map) are executed with
> mutual exclusion.
> 
> `type` can be either 0 (`Any`), signifying that the synchronized
> regions may run in any order, or 1 (`Ordered`), signifying that they
> must run in increasing order of the index.
> 
> 3. i32 @llvm.get_num_threads()
> 
> Semantics:
> 
> Returns the number of threads in the thread pool.
> 
> 4. i32 @llvm.set_num_threads(i32)
> 
> Semantics:
> 
> Sets the number of threads in the thread pool.
> 
> 
> 
> It should be possible to lower all OpenMP directives to the above four
> intrinsics in the frontend (please read this in conjunction with [1]):
> 
> 
> Parallel regions can be lowered as a parallel_map with the result of
> @llvm.get_num_threads() as the limit.
> 
> ```
> #pragma PARALLEL
>   block
> ```
> 
> desugars to
> 
> ```
> @llvm.parallel_map(num_threads, block_closure, shared_closure)
> ...
> 
> void block_closure(i32 tid, i8* shared_vars) {
>  block
> }
> ```
> 
> Reductions are handled by a parallel_map followed by a regular
> reduction loop (exactly as in Hal's proposal).
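> 
> For example, a sum reduction could desugar to (a sketch; `partial` is
> a hypothetical shared array with one slot per index, reached through
> the priv pointer):
> 
> ```
> @llvm.parallel_map(num_threads, sum_closure, priv)
> 
> // regular serial reduction loop, after parallel_map returns
> for (i32 i = 0; i < num_threads; i++)
>   result += partial[i]
> 
> void sum_closure(i32 tid, i8* p) {
>   // each invocation writes only its own slot, so the parallel
>   // phase needs no synchronization
>   partial[tid] = local_sum_for(tid)
> }
> ```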
> 
> Serial blocks reduce to a block conditional on the index inside the
> function being mapped in parallel.
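> 
> For instance (a sketch; `serial_block` stands for the body of the
> serial region):
> 
> ```
> void block_closure(i32 tid, i8* shared_vars) {
>   // index 0 always runs in the invoking thread, so the serial
>   // block executes exactly once
>   if (tid == 0) {
>     serial_block
>   }
> }
> ```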
> 
> We lower critical and ordered regions into calls to `sync_region`.
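> 
> For instance, an ORDERED region in a loop body could desugar to the
> following sketch (same closure convention as above; a CRITICAL region
> would be identical except with type 0, `Any`):
> 
> ```
> void body_closure(i32 index, i8* priv) {
>   code_before_ordered_region
>   // each invocation's copy of ordered_closure runs with mutual
>   // exclusion, in increasing index order (type 1 == Ordered)
>   @llvm.sync_region(ordered_closure, 1)
>   code_after_ordered_region
> }
> 
> void ordered_closure(i32 index, i8* priv) {
>   ordered_block
> }
> ```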
> 
> Tasks are lowered (recursively) this way:
> 
> ```
> TASK
>   block
> more_code_may_contain_more_tasks
> TASK_WAIT
> ```
> 
> desugars to
> 
> ```
> @llvm.parallel_map(2, task_closure, task_priv)
> 
> void task_closure(i32 index, i8* private) {
>   if (index == 0) {
>     more_code_may_contain_more_tasks
>   } else {
>     block
>   }
> }
> ```
> 
> Parallel loops are basically `parallel_map`s.
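> 
> For example (a sketch; scheduling and chunking details are elided,
> with one map index per iteration):
> 
> ```
> #pragma PARALLEL FOR
> for (i = 0; i < n; i++)
>   body(i)
> ```
> 
> desugars to
> 
> ```
> @llvm.parallel_map(n, loop_closure, loop_priv)
> 
> void loop_closure(i32 i, i8* priv) {
>   body(i)
> }
> ```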
> 
> Thoughts?

Most of this sounds alright, but a few things worry me. My approach was
designed to require minimal changes to the rest of the infrastructure.
In your case you'll need to:

 - Teach LoopInfo, ScalarEvolution, etc. how to 'see' the
   loops spanning the parallel_map calls (see the sketch after this
   list). You'll need to teach LICM, etc. how to do their
   transformations in the presence of parallel_map. This may require
   that these passes be promoted to module-level passes, and may have
   other undesirable consequences.

 - Teach the inliner and other associated passes to
   understand the parallel_map intrinsic. Especially when using the
   null implementation (or an implementation that does not actually
   support task dispatch), small tasks should be considered for
   inlining.
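
To illustrate the first point, consider this sketch (the names are
invented for illustration):

```
void body_closure(i32 i, i8* priv) {
  x = expensive_invariant_computation()  // same value for every i
  use(x, i)
}

@llvm.parallel_map(n, body_closure, priv)
```

As things stand, LICM sees no loop here at all: the iteration
structure is implicit in the parallel_map call, and the invariant
computation is hidden inside a separate function.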

In short, I think that your approach will require a lot more work to
get to a production-quality state than mine. Your approach has the
advantage of a simpler API, but I want to make sure that you've thought
through the various changes to the existing passes that will be
necessary.

Thanks again,
Hal

> 
> [1] http://lists.cs.uiuc.edu/pipermail/llvmdev/2012-August/052472.html



-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory


