[cfe-dev] [LLVMdev] [RFC] Parallelization metadata and intrinsics in LLVM (for OpenMP, etc.)

Tue Aug 14 00:22:35 PDT 2012

On 08/13/2012 10:54 PM, Hal Finkel wrote:
> I had thought about uses for shared-memory OpenCL implementations, but
> I don't know enough about the use cases to make a specific proposal. Is
> your metadata documented anywhere?

It is currently a quick "brute force hack", which is why I got interested
in your proposal. We just wanted to communicate the OpenCL work item information
further down in the compiler as easily as possible and didn't have time
to beautify it.

Currently, all instructions of the "chained" OpenCL kernel instances
(work items) are annotated with their work item ID, their "parallel region
ID" (identifying the region between barriers that the instruction
originates from) and a sequence ID. So, lots of metadata bloat.

These annotations allow finding the matching instructions later on to
vectorize multiple work items together by just combining the matching
instructions from the different WIs. The alias analyzer uses this
metadata to return NO_ALIAS for any memory access combination where
the accesses are from different work items within the same parallel
region (the spec says that if they do alias, the results are undefined,
so it is the programmer's fault).
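
To make the two uses above concrete, here is a minimal Python sketch. It is
not the actual pocl code: the names Inst, wi, region, seq,
match_across_work_items and alias are made up for this example, and the real
annotations sit on LLVM IR instructions rather than on Python objects.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Inst:
    """Toy stand-in for an annotated instruction (hypothetical layout)."""
    opcode: str
    wi: int        # work item ID
    region: int    # parallel region ID (region between barriers)
    seq: int       # sequence ID inside the region
    is_mem: bool = False


def match_across_work_items(insts):
    """Group the copies of one original instruction from different WIs."""
    groups = defaultdict(list)
    for inst in insts:
        groups[(inst.region, inst.seq)].append(inst)
    # Sort each group by work item ID so that lane k corresponds to WI k.
    return {key: sorted(group, key=lambda x: x.wi)
            for key, group in groups.items()}


def alias(a, b):
    """NO_ALIAS for memory accesses from different WIs in the same region."""
    if a.is_mem and b.is_mem and a.region == b.region and a.wi != b.wi:
        return "NO_ALIAS"
    return "MAY_ALIAS"  # otherwise stay conservative


# Four work items, each with one load (seq 0) and one store (seq 1).
insts = ([Inst("load", wi, 0, 0, True) for wi in range(4)]
         + [Inst("store", wi, 0, 1, True) for wi in range(4)])
groups = match_across_work_items(insts)
```

Each entry of groups is a candidate for one vector instruction, and alias()
only answers NO_ALIAS when both the region and the differing work item IDs
are known from the annotations.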

With your annotations this hack could probably be cleaned up by using the
"parallel for loop" metadata which the vectorizer and/or "thread lib call
injector" (or the static instruction scheduler for a VLIW/TTA) can then
use to parallelize the kernel as desired.

Note that the usefulness is not limited to shared-memory multicore (or
even multicore at all) kernel execution devices. All non-SIMT targets
need to lay out the code for all the work-items (as if they were
parallel for loops, unrolled or vectorized or not) for valid OpenCL
kernel execution when there is more than one WI per work-group, and can
thus potentially benefit from this.
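
As a rough sketch of what "laying out the code for all the work-items" means,
the following Python model runs each region between barriers as a loop over
the work items. The function names and the args dictionary are invented for
the illustration; a real implementation would do this on the IR.

```python
def run_work_group(regions, local_size, args):
    """Run every parallel region for all work items before the next region.

    The boundary between two consecutive regions plays the role of the
    barrier: no WI enters region k+1 before all WIs have left region k.
    """
    for region in regions:
        for wi in range(local_size):   # the "work item loop"
            region(wi, args)


def region0(wi, a):
    # Code before the barrier: each WI writes its own slot of local memory.
    a["tmp"][wi] = a["src"][wi] * 2


def region1(wi, a):
    # Code after the barrier: each WI reads a slot written by another WI.
    n = len(a["src"])
    a["dst"][wi] = a["tmp"][n - 1 - wi]


args = {"src": [1, 2, 3, 4], "tmp": [0] * 4, "dst": [0] * 4}
run_work_group([region0, region1], 4, args)
```

The inner loop is the one that could carry the "parallel for loop" metadata,
since its iterations are independent by the OpenCL rules.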

> I agree, and this is specifically why I don't want to support OpenMP by
> lowering it into runtime calls in the frontend. I want to allow for
> other optimizations (vectorization, etc.) in combination
> with (or instead of) multi-threading. I think that my current proposal
> allows for that.

Yes it should, as far as I can see. If the loop body is a function and
the iteration count (or a multiple of it) is known, one should be able to
vectorize multiple copies of the function without dependence checking.
In the multi-WI OpenCL C case this function would contain the code for a
single work item in one region between barriers (implicit or not).
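
A small Python model of that idea, under the assumption that the iterations
really are independent: run_widened executes `width` copies of the body per
step and never checks dependences. The lanes are run in reverse order only to
show that the order does not matter; all names here are hypothetical.

```python
def run_scalar(body, n, data):
    """Reference execution: one iteration at a time, in order."""
    for i in range(n):
        body(i, data)


def run_widened(body, n, width, data):
    """Execute `width` copies of the body per step, with no dependence check."""
    if n % width != 0:
        raise ValueError("sketch assumes the trip count is a multiple of width")
    for base in range(0, n, width):
        # Independent lanes may run in any order, or all at once in a vector.
        for lane in reversed(range(width)):
            body(base + lane, data)


def body(i, d):
    d["out"][i] = d["in"][i] + 1


scalar = {"in": list(range(8)), "out": [0] * 8}
widened = {"in": list(range(8)), "out": [0] * 8}
run_scalar(body, 8, scalar)
run_widened(body, 8, 4, widened)
```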

I'm unsure whether forcing the parallel regions to be extracted into
functions brings unnecessary problems. Another option would be to
mark the basic blocks that form parallel regions. Maybe all of the BBs
could be marked with a PR identifier MD? This would require BB
metadata (is that supported?).
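
A toy sketch of the block-marking alternative. It assumes, purely for
illustration, that the region ID is stored on each block's terminator
instruction, which would be one way around missing BB-level metadata. The
"pr_id" key and the dictionary layout are invented.

```python
from collections import defaultdict


def regions_from_blocks(blocks):
    """Map parallel region ID -> list of basic block names.

    `blocks` maps a block name to its instruction list; the last instruction
    (the terminator) carries the hypothetical "pr_id" annotation.
    """
    regions = defaultdict(list)
    for name, insts in blocks.items():
        pr_id = insts[-1].get("pr_id")
        if pr_id is not None:      # blocks outside any region are skipped
            regions[pr_id].append(name)
    return dict(regions)


blocks = {
    "entry": [{"op": "br"}],
    "body": [{"op": "load"}, {"op": "br", "pr_id": 0}],
    "body.cont": [{"op": "store"}, {"op": "br", "pr_id": 0}],
    "after_barrier": [{"op": "ret", "pr_id": 1}],
}
regions = regions_from_blocks(blocks)
```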

>> Also, one user of this metadata could be the alias analysis: it should
>> be easy to write an AA that can exploit the parallelism
>> information. Parallel regions by definition do not have (defined)
>> dependencies between each other (between synchronization points) which
>> should be useful information for optimization purposes even if
>> parallel hardware was not targeted.
>
> I really like this idea! -- and it sounds like you may already have
> something like this in POCL?

Yes, an OpenCL AA that exploits the work-item independence and address
space independence. With your annotations there could be a generic
AA for the "independence information from parallelism metadata" part and
a separate OpenCL-specific AA for the rest.
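
The split could look roughly like this in a Python model: a generic analysis
that only looks at the parallelism information, an OpenCL-specific one that
only looks at address spaces (assuming the disjoint named address spaces of
OpenCL 1.x), and a query that asks them in turn. The dictionary keys and
function names are assumptions made for the example.

```python
def parallelism_aa(a, b):
    """Generic part: accesses from different WIs of one parallel region."""
    if a["region"] == b["region"] and a["wi"] != b["wi"]:
        return "NO_ALIAS"
    return None  # no opinion, let the next analysis decide


def opencl_addrspace_aa(a, b):
    """OpenCL-specific part: different address spaces cannot overlap."""
    if a["addrspace"] != b["addrspace"]:
        return "NO_ALIAS"
    return None


def query(a, b, analyses=(parallelism_aa, opencl_addrspace_aa)):
    """Return the first definite answer, else the conservative default."""
    for aa in analyses:
        result = aa(a, b)
        if result is not None:
            return result
    return "MAY_ALIAS"


x = {"wi": 0, "region": 0, "addrspace": "global"}
y = {"wi": 1, "region": 0, "addrspace": "global"}
z = {"wi": 0, "region": 0, "addrspace": "local"}
```

Here query(x, y) is answered by the generic part and query(x, z) by the
OpenCL part, so the generic analysis would be reusable for OpenMP input too.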

> Regarding having scheduling be separate, care is required to ensure
> correctness. A large constraint on the design of a metadata API is that

OK, I see.

I suppose it's not a big deal to add the scheduling property, at
least if one can (later) add scheduling modes defined by standards
other than OpenMP as well. That is, the modes would not be named like
"static" but "openmp31_static" or similar. For OpenCL work item loops
the scheduling mode could be "auto" or left empty.
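
For illustration only, a consumer of such a property could split the name
into a standard and a mode like this. The "<standard>_<mode>" format and the
parse_schedule name are just my assumption of how it might look, not
something from the proposal.

```python
def parse_schedule(mode):
    """Split e.g. "openmp31_static" into ("openmp31", "static").

    An empty string or "auto" means the compiler is free to choose.
    """
    if mode in ("", "auto"):
        return (None, "auto")
    standard, sep, kind = mode.partition("_")
    if not standard or not sep or not kind:
        raise ValueError(f"expected '<standard>_<mode>', got {mode!r}")
    return (standard, kind)


omp = parse_schedule("openmp31_static")
ocl = parse_schedule("")
```

A bare "static" is rejected here on purpose, since without the prefix it is
not clear which standard's semantics are meant.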

> I agree. I think that vectorization is best done earlier in the
> optimization schedule. Vectorization, however, should appropriately
> update loop metadata to allow for proper integration with
> parallelization, etc. Lowering to runtime libraries (for
> multi-threading in whatever form) should be done relatively late in
> the process (because further higher-level optimizations are often not
> possible after that point).

Yes, to enable automatic mixing of vectorization and threading from
a single (data parallel) kernel.
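
A rough Python model of such mixing, assuming independent work items: the
work items are first split into per-thread chunks, and each chunk is then
stepped `width` items at a time as a stand-in for vector execution. The
run_kernel name and its parameters are invented for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def run_kernel(body, n_items, data, n_threads=2, width=4):
    """Split work items across threads, then step each chunk by `width`."""
    chunk = max(1, (n_items + n_threads - 1) // n_threads)

    def run_chunk(start):
        end = min(start + chunk, n_items)
        for base in range(start, end, width):
            # Stand-in for one vector operation covering up to `width` WIs.
            for wi in range(base, min(base + width, end)):
                body(wi, data)

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # list() forces completion and re-raises any worker exception.
        list(pool.map(run_chunk, range(0, n_items, chunk)))


def body(wi, d):
    d["out"][wi] = d["in"][wi] * d["in"][wi]


data = {"in": list(range(10)), "out": [0] * 10}
run_kernel(body, 10, data)
```

Each work item writes only its own output slot, so neither the threads nor
the "vector" steps need any synchronization inside the region.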

-- 
Pekka