[cfe-dev] [LLVMdev] [RFC] Parallelization metadata and intrinsics in LLVM (for OpenMP, etc.)

Wed Oct 3 11:16:57 PDT 2012

On Wed, 3 Oct 2012 17:30:54 +0100
James Courtier-Dutton <james.dutton at gmail.com> wrote:

> On 3 October 2012 06:17, Chris Lattner <clattner at apple.com> wrote:
> > On Oct 2, 2012, at 11:42 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> >> As I've stated, whether the metadata is preserved is not really the
> >> relevant metric. It is fine for a pass that does not understand
> >> parallelization metadata to drop it. The important part is that
> >> dropping the metadata, and moving instructions to which that
> >> metadata is attached, must not cause miscompiles. For example:
> >>
> >> - Instructions with unknown side effects or dependencies must not
> >> be moved from outside a parallel region to inside a parallel
> >> region.
> >> - Serialized subregions inside of parallel regions cannot be
> >> deleted without deleting the enclosing parallel region.
> >>
> >> The outstanding proposals have ways of dealing with these things.
> >> In the case of my proposal, it is through cross-referencing the
> >> metadata sufficiently and using function boundaries to prevent
> >> unwanted code motion.
> >
> > I haven't looked at your proposal, but I completely agree in
> > principle that using procedure boundaries is a good way to handle
> > this.
> >
> >> In Intel's case, it is by using the barriers implied by the
> >> intrinsics calls.
> >
> > That's just it - intrinsics using metadata don't imply barriers
> > that would restrict code motion.
> >
> 
> Would another approach be to work from the bottom up?
> Determine the details of what optimizations you wish to be able to do
> on a parallel program and then implement functionality in the LLVM IR
> to achieve it.
> I.e. A new type of barrier to restrict code motion.
> New barrier types, or special zones where only a subset of machine
> instructions can be used when lowering.
> There are already items in the LLVM IR for atomics, adding a new type
> of barrier might be all that is needed to achieve the optimizations
> wished for.

Agreed.

Generally speaking, I have two primary requirements:

1. Enabling loop optimizations. To understand the problem, consider the
following:

#pragma omp parallel for
for (int i = 0; i < n; ++i) {
  if (i > n) do_a(i, n);
  else do_b(i, n);
}

Under normal circumstances, the compiler would be able to eliminate the
comparison (the loop condition i < n implies that i > n is never true)
and simplify the loop. This seems like a silly example, but in the
context of expanding C++ templated code, it happens often. If we lower
the OpenMP constructs too early, then we lose this ability, because
such lowering would transform the loop into something like this:

void __loop1(__cxt *c) {
  int start = __loop_get_start(); // uses TLS
  int end = __loop_get_end(); // uses TLS
  int n = c->n;
  for (int i = start; i < end; ++i) {
    if (i > n) do_a(i, n);
    else do_b(i, n);
  }
}

__loop_in_parallel(__loop1, 0, n);

When optimizing the loop inside the __loop1 function, the relationship
between 'end' and 'n' has been lost, and there is no way for the
compiler to eliminate the comparison in the loop (without a combination
of both IPO and a specific understanding of the runtime calls).
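
To make the difference concrete, here is a minimal C sketch of the three
forms of the loop. It is only an illustration under stated assumptions:
the counting bodies of do_a/do_b, struct ctx, chunk_start/chunk_end, and
loop_get_start/loop_get_end are hypothetical stand-ins, not part of any
real OpenMP runtime or of the proposals under discussion.

```c
/* Hypothetical stand-ins for the callees in the example above. */
static int calls_a, calls_b;
static void do_a(int i, int n) { (void)i; (void)n; ++calls_a; }
static void do_b(int i, int n) { (void)i; (void)n; ++calls_b; }

/* Before lowering: the bound i < n is visible in the loop itself, so
   (i > n) is provably false and the branch can be folded away. */
static void loop_before_lowering(int n) {
  for (int i = 0; i < n; ++i) {
    if (i > n) do_a(i, n);
    else do_b(i, n);
  }
}

/* What the optimizer may reduce the loop above to. */
static void loop_simplified(int n) {
  for (int i = 0; i < n; ++i)
    do_b(i, n);
}

/* After early lowering: 'end' comes from a runtime query (modelled here
   with plain globals standing in for TLS state), so nothing inside this
   function relates 'end' to 'n'. */
struct ctx { int n; };
static int chunk_start, chunk_end;
static int loop_get_start(void) { return chunk_start; }
static int loop_get_end(void) { return chunk_end; }

static void loop_after_lowering(struct ctx *c) {
  int start = loop_get_start();
  int end = loop_get_end();
  int n = c->n;
  for (int i = start; i < end; ++i) {
    if (i > n) do_a(i, n); /* stays: end <= n is not known here */
    else do_b(i, n);
  }
}
```

All three functions behave identically when the runtime hands out a
chunk within [0, n), but only the first form lets the compiler prove
that locally.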

Implementation of other loop optimizations, like fusion and splitting,
also seems to be much more difficult in the presence of early lowering.
Proving non-aliasing of pointers in the context structure might also be
tricky.
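
As a rough illustration of the aliasing concern, consider the following
sketch of an outlined loop body. The names (struct saxpy_ctx,
saxpy_outlined) are hypothetical and chosen only for this example; the
point is that every operand is reached through the context pointer.

```c
/* Hypothetical context structure that an early-lowering frontend might
   build to pass captured variables to the outlined loop body. */
struct saxpy_ctx {
  float *x;
  float *y;
  float a;
  int n;
};

/* Outlined body: y[i] += a * x[i] over the chunk [start, end).
   Looking at this function alone, the compiler cannot tell that the
   store to c->y[i] leaves c->x[...] and c->a untouched, since all of
   them are float objects reachable through pointers. Unless
   non-aliasing is proven or communicated, c->a may have to be reloaded
   on every iteration and vectorization may need a runtime overlap
   check. */
static void saxpy_outlined(struct saxpy_ctx *c, int start, int end) {
  for (int i = start; i < end; ++i)
    c->y[i] += c->a * c->x[i];
}
```

In the original, unlowered loop the same variables would typically be
distinct locals or arguments, for which this information is much easier
to establish.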

2. Enabling target-specific implementations of underlying concepts,
specifically atomics and synchronization, but also thread startup and
handling. The former are important on almost all systems; the latter
are, for the moment, important mainly on embedded and heterogeneous
systems. Doing this without cluttering the frontend with
target-specific code would be nice.
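
By way of analogy only, C11 atomics show the kind of separation I have
in mind: the source states the operation generically and the choice of
instruction sequence is left to the target. The function name record_hit
below is hypothetical, and this sketch says nothing about how an OpenMP
atomic construct should actually be represented in the IR.

```c
#include <stdatomic.h>

/* Generic atomic increment. The source does not say whether this
   becomes a single locked instruction, a load-linked/store-conditional
   loop, or a call into a support library; that decision is left to the
   target's lowering. */
static void record_hit(atomic_int *counter) {
  atomic_fetch_add(counter, 1);
}
```

The frontend-visible part stays the same across targets, which is the
property that would be nice to have for the parallelization constructs
as well.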

 -Hal

> 
> Kind Regards
> 
> James

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory