[cfe-dev] [LLVMdev] [RFC] Parallelization metadata and intrinsics in LLVM (for OpenMP, etc.)

Pekka Jääskeläinen pekka.jaaskelainen at tut.fi
Mon Aug 13 02:38:02 PDT 2012


Hi,

On 08/10/2012 11:06 PM, Hal Finkel wrote:
 > I'd like to see support in clang/LLVM for multi-core parallelism,
 > especially support for OpenMP. I think that the best way to do this is
 > by designing an LLVM-based API (metadata and intrinsics) for
 > expressing parallelism constructs, and having clang lower OpenMP code
 > to that API. This will allow maximal preservation of optimization
 > capabilities including target-specific lowering. What follows outlines
 > a set of metadata and intrinsics which should allow support for the
 > full OpenMP specification, and I'd like to know what the community
 > thinks about this.

Something like this would also be useful for OpenCL C work-group
parallelization. At the moment in pocl we do this in a hackish way,
with "overkill" OpenCL C-specific metadata that is fed to a modified
version of your bb-vectorizer for autovectorization and to a custom
alias analyzer for the AA benefits.

I'd like to point out that multithreading is just one option for mapping
the "parallel regions/loops" of parallel programs to parallel hardware.
Within a single core, vectorization/DLP (SIMD/vector extensions) and
static ILP (basically VLIW) are the other interesting ones. In order
to exploit all of the parallel resources, one could try to combine the
mappings across all of these intelligently.

Also, alias analysis could be one user of this metadata: it should
be easy to write an AA pass that exploits the parallelism
information. Parallel regions by definition have no (defined)
dependencies on each other between synchronization points, which
should be useful information for optimization purposes even when
parallel hardware is not targeted.
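
As a rough sketch of the idea (the !parallel form below is
hypothetical, just following the proposal's naming): in

   define void @copy(float* %dst.base, float* %src.base, i64 %n) {
   entry:
     br label %loop
   loop:
     %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
     %src = getelementptr float* %src.base, i64 %i
     %dst = getelementptr float* %dst.base, i64 %i
     %v = load float* %src
     store float %v, float* %dst
     %i.next = add i64 %i, 1
     %c = icmp ult i64 %i.next, %n
     ; the tagged backedge asserts the iterations are independent
     br i1 %c, label %loop, label %exit, !parallel !0
   exit:
     ret void
   }
   !0 = metadata !{metadata !"loop"}

a metadata-aware AA could rule out loop-carried dependencies between
the store through %dst and the load through %src, because the tagged
backedge asserts that the iterations do not depend on each other.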

> - Loops -
>
> Parallel loops are indicated by tagging all backedge branches with
> 'parallel' metadata. This metadata has the following entries:
>    - The string "loop"
>    - A metadata reference to the parent parallel-region metadata
>    - Optionally, a string specifying the scheduling mode: "static",
> "dynamic", "guided", "runtime", or "auto" (the default)
>    - Optionally, an integer specifying the number of loop levels over
> which to parallelize (the default is 1)
>    - If applicable, a list of metadata references specifying ordered and
> serial/critical regions within the loop.
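
For concreteness, a tagged backedge under the quoted scheme might look
roughly like the following; the node layout is just my reading of the
proposal, not settled syntax:

   loop:
     ...
     br i1 %done, label %exit, label %loop, !parallel !1

   !0 = metadata !{metadata !"region"}             ; parent parallel region
   !1 = metadata !{metadata !"loop", metadata !0,  ; "loop" + parent reference
                   metadata !"dynamic",            ; scheduling mode
                   i32 1}                          ; loop levels to parallelize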

IMHO the generic metadata used to mark parallelism (basically, to denote
the independence of iterations in this case) should be separated from
OpenMP-specific entries such as the scheduling mode. After all, there
are, and will be, more parallel programming languages/standards than
just OpenMP that could generate this new metadata and get the mapping to
parallel hardware (via thread library calls or autovectorization, for
example) automagically.
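
For example, the backedge could carry a generic node stating only the
independence of the iterations, plus a separate node with the OpenMP-only
lowering hints that other frontends (and non-OpenMP-aware passes) could
simply ignore; again a hypothetical sketch:

   br i1 %done, label %exit, label %loop, !parallel !0, !parallel.openmp !1

   !0 = metadata !{metadata !"loop"}                          ; generic: iterations independent
   !1 = metadata !{metadata !"schedule", metadata !"dynamic"} ; OpenMP-specific hint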

> -- Late Passes (Lowering) --
>
> The parallelization lowering will be done by IR level passes in CodeGen
> prior to SelectionDAG conversion. Currently, this means after
> loop-strength reduction. Like loop-strength reduction, these IR level
> passes will get a TLI object pointer and will have target-specific
> override capabilities.
>
> ParallelizationCleanup - This pass will be scheduled prior to the other
> parallelization lowering passes (and anywhere else we decide). Its job
> is to remove parallelization metadata that had been rendered
> inconsistent by earlier optimization passes. When a parallelization
> region is removed, any parallelization intrinsics that can be removed
> are then also removed.
>
> ParallelizationLowering - This pass will actually lower parallelization
> constructs into a combination of runtime-library calls and, optionally,
> target-specific intrinsics. I think that an initial generic
> implementation will target libgomp.
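
For reference, the lowering of a parallel region against libgomp might
come out roughly as below. GOMP_parallel_start/GOMP_parallel_end are
real libgomp entry points; the outlined function and the %captures
argument are just hypothetical names for this sketch:

   declare void @GOMP_parallel_start(void (i8*)*, i8*, i32)
   declare void @GOMP_parallel_end()

   ; region body outlined into a function taking the captured state
   define internal void @region.outlined(i8* %captures) { ... }

     ; i32 0 = let the runtime pick the number of threads
     call void @GOMP_parallel_start(void (i8*)* @region.outlined,
                                    i8* %captures, i32 0)
     call void @region.outlined(i8* %captures) ; master thread runs the body too
     call void @GOMP_parallel_end()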

A vectorization pass could trivially vectorize parallel loops here,
without emitting any runtime calls.
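
For instance, on a SIMD-capable target the same independence
information could let a vectorizer turn the scalar body into something
like the following (a fragment only; the width of 4 is arbitrary):

   %va = load <4 x float>* %pa, align 16
   %vb = load <4 x float>* %pb, align 16
   %vc = fadd <4 x float> %va, %vb
   store <4 x float> %vc, <4 x float>* %pc, align 16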

BR,
-- 
Pekka
