[cfe-dev] [LLVMdev] [RFC] Parallelization metadata and intrinsics in LLVM (for OpenMP, etc.)
Hal Finkel
hfinkel at anl.gov
Mon Aug 13 12:54:52 PDT 2012
On Mon, 13 Aug 2012 12:38:02 +0300
Pekka Jääskeläinen <pekka.jaaskelainen at tut.fi> wrote:
> Hi,
>
> On 08/10/2012 11:06 PM, Hal Finkel wrote:
> > I'd like to see support in clang/LLVM for multi-core parallelism,
> > especially support for OpenMP. I think that the best way to do
> > this is by designing an LLVM-based API (metadata and intrinsics)
> > for expressing parallelism constructs, and having clang lower
> > OpenMP code to that API. This will allow maximal preservation of
> > optimization capabilities including target-specific lowering. What
> > follows outlines a set of metadata and intrinsics which should
> > allow support for the full OpenMP specification, and I'd like to
> > know what the community thinks about this.
>
> Something like this would be useful also for OpenCL C
> work-group parallelization. At the moment in pocl we do this in a
> hackish way with an "overkill" OpenCL C-specific metadata that is fed
> to a modified bb-vectorizer of yours for autovectorization, and to a
> custom alias analyzer for AA benefits.
I had thought about uses for shared-memory OpenCL implementations, but
I don't know enough about the use cases to make a specific proposal. Is
your metadata documented anywhere?
>
> I'd like to point out that multithreading is just one option for
> mapping the "parallel regions/loops" in parallel programs to parallel
> hardware. Within a single core, vectorization/DLP (SIMD/vector
> extensions) and static ILP (basically VLIW) are the other interesting
> ones. In order to exploit all the parallel resources, one could try
> to combine the mapping intelligently across all of those.
I agree, and this is specifically why I don't want to support OpenMP by
lowering it into runtime calls in the frontend. I want to allow for
other optimizations (vectorization, etc.) in combination
with (or instead of) multi-threading. I think that my current proposal
allows for that.
>
> Also, one user of this metadata could be alias analysis: it should
> be easy to write an AA that exploits the parallelism
> information. Parallel regions by definition have no (defined)
> dependencies on each other between synchronization points, which
> should be useful information for optimization purposes even if
> parallel hardware is not targeted.
I really like this idea! -- and it sounds like you may already have
something like this in POCL?
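To illustrate (a sketch only; this uses the backedge-metadata form from
my proposal, and the !parallel spelling is not final): given a loop like
the one below, an AA could report that memory accesses from distinct
iterations do not conflict, without proving anything about the indexing
of %a:

  define void @scale(double* %a, i64 %n) {
  entry:
    br label %loop
  loop:
    %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
    %p = getelementptr double* %a, i64 %i
    %v = load double* %p
    %w = fmul double %v, 2.0
    store double %w, double* %p
    %i.next = add i64 %i, 1
    %c = icmp ult i64 %i.next, %n
    ; backedge tagged as parallel: no (defined) inter-iteration deps
    br i1 %c, label %loop, label %exit, !parallel !0
  exit:
    ret void
  }

  !0 = metadata !{metadata !"loop", metadata !1} ; loop in region !1
  !1 = metadata !{metadata !"region"}            ; parent parallel region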
>
> > - Loops -
> >
> > Parallel loops are indicated by tagging all backedge branches with
> > 'parallel' metadata. This metadata has the following entries:
> > - The string "loop"
> > - A metadata reference to the parent parallel-region metadata
> > - Optionally, a string specifying the scheduling mode: "static",
> > "dynamic", "guided", "runtime", or "auto" (the default)
> > - Optionally, an integer specifying the number of loop levels
> > over which to parallelize (the default is 1)
> > - If applicable, a list of metadata references specifying
> > ordered and serial/critical regions within the loop.
>
> IMHO the generic metadata used to mark parallelism (basically, in
> this case, to denote the independence of iterations) should be
> separated from OpenMP-specific details such as the scheduling mode.
> After all, there are, and will be, more parallel programming
> languages/standards than just OpenMP that could generate this new
> metadata and get the mapping to the parallel hardware (via thread
> library calls or autovectorization, for example) automagically.
I think that making the metadata more modular sounds like a good idea.
Regarding keeping scheduling separate, care is required to ensure
correctness. A large constraint on the design of a metadata API is that
different pieces of metadata can be independently dropped by
transformation passes, and that must remain safe w.r.t. the correctness
of the code. For example, if a user specified that an OpenMP loop is to
be parallelized with runtime scheduling, then any parallel loop we
generate must honor the runtime scheduling mode.
I've tried to propose metadata with a sufficient amount of
cross-referencing so that dropping any piece of metadata will preserve
correctness (even if that means losing a parallel region).
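To make the cross-referencing concrete, here is roughly the form I have
in mind (the exact spellings below are illustrative, not final):

  ; on the loop backedge:
  br i1 %cond, label %loop.body, label %loop.exit, !parallel !1

  !0 = metadata !{metadata !"region"}            ; parent parallel region
  !1 = metadata !{metadata !"loop", metadata !0, ; back-reference to !0
                  metadata !"runtime",           ; scheduling mode
                  i32 1}                         ; parallelize 1 loop level

Because !1 refers back to its parent region !0, a pass that drops or
invalidates the region metadata leaves !1 visibly dangling; the cleanup
pass described below can then strip !1 as well, serializing the loop
instead of miscompiling it.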
>
> > -- Late Passes (Lowering) --
> >
> > The parallelization lowering will be done by IR level passes in
> > CodeGen prior to SelectionDAG conversion. Currently, this means
> > after loop-strength reduction. Like loop-strength reduction, these
> > IR level passes will get a TLI object pointer and will have
> > target-specific override capabilities.
> >
> > ParallelizationCleanup - This pass will be scheduled prior to the
> > other parallelization lowering passes (and anywhere else we
> > decide). Its job is to remove parallelization metadata that had
> > been rendered inconsistent by earlier optimization passes. When a
> > parallelization region is removed, any parallelization intrinsics
> > that can be removed are then also removed.
> >
> > ParallelizationLowering - This pass will actually lower parallelization
> > constructs into a combination of runtime-library calls and,
> > optionally, target-specific intrinsics. I think that an initial
> > generic implementation will target libgomp.
>
> A vectorization pass could trivially vectorize parallel loops
> here, without emitting runtime calls, etc.
I agree. I think that vectorization is best done earlier in the
optimization schedule. Vectorization, however, should appropriately
update loop metadata to allow for proper integration with
parallelization, etc. Lowering to runtime libraries (for
multi-threading in whatever form) should be done relatively late in
the process (because further higher-level optimizations are often not
possible after that point).
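As a sketch of what a generic libgomp-targeting lowering might produce
(GOMP_parallel_start/end and their signatures are libgomp's public
entry points; the outlining scheme here is just illustrative):

  ; the parallel-region body, outlined; captured state arrives via %data
  define internal void @parreg(i8* %data) {
  entry:
    ret void
  }

  define void @foo(i8* %data) {
  entry:
    ; fork the team; the master thread then executes the region itself
    call void @GOMP_parallel_start(void (i8*)* @parreg, i8* %data, i32 0)
    call void @parreg(i8* %data)
    ; join
    call void @GOMP_parallel_end()
    ret void
  }

  declare void @GOMP_parallel_start(void (i8*)*, i8*, i32)
  declare void @GOMP_parallel_end()

(Passing 0 for the thread count lets the runtime choose.) A vectorizer
running earlier would instead consume the loop metadata directly and
update or drop it as appropriate.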
Thanks for your comments! Please feel free to propose specific metadata
forms and/or intrinsics to capture your ideas; then we can work on
combining them.
-Hal
>
> BR,
--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory