[LLVMdev] parallel loop metadata simplification

Sun Mar 3 02:29:27 PST 2013

On 03/02/2013 08:44 PM, Tobias Grosser wrote:
> If the use of ivdep is correct, it seems necessary to _not_ annotate the loads
> and stores from and to 't'. Only after 't' is moved into a register, the loop is
> actually parallel on the IR level.

I didn't realize this is a problem in general, because in pocl we
explicitly "privatize" the OpenCL C kernel private variables in the
generated work-item loops. In that case 't' would not be a single scalar
but a per-iteration variable: either an array with one element per iteration
or a private vreg.

We have two cases for the kernel private variables in OpenCL C:

1) Variables accessed outside one loop (private variables that
span multiple parallel regions). These need to be allocated to function-scope
per-iteration (work-item) arrays.
2) Temporary variables accessed only inside one loop. These can stay as
loop-private variables. There is no need to allocate stack space for all the
iterations at once, only for those executed in parallel.
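
To make the two cases concrete, here is a rough C sketch of how a kernel with
one barrier might look after it has been turned into work-item loops. This is
hand-written for illustration and is not actual pocl output; the names
kernel_as_loops, WORK_ITEMS and local_buf are made up.

```c
enum { WORK_ITEMS = 4 };

/* Hypothetical kernel split at one barrier into two work-item loops. */
void kernel_as_loops(const int *in, int *out) {
    int x[WORK_ITEMS];         /* case 1: private variable live across the
                                  barrier, one slot per work-item */
    int local_buf[WORK_ITEMS]; /* stand-in for a local memory buffer */

    /* Parallel region before the barrier. */
    for (int wi = 0; wi < WORK_ITEMS; ++wi) {
        x[wi] = in[wi] + 1;
        local_buf[wi] = x[wi];
    }

    /* (the barrier was here in the original kernel) */

    /* Parallel region after the barrier. */
    for (int wi = 0; wi < WORK_ITEMS; ++wi) {
        /* case 2: temporary used only inside this loop, so it can stay a
           loop-private variable */
        int tmp = local_buf[(wi + 1) % WORK_ITEMS];
        out[wi] = x[wi] * 2 + tmp;
    }
}
```

Only 'x' needs storage for every work-item at once; 'tmp' needs an instance
only for the iterations that are actually in flight.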

It seems Hal was right that allocas need to be treated specially. But
how do we deal with such cases without excluding the parallel-safe alloca
cases? E.g., a programmer-written array (on the stack) that is accessed safely
from the parallel iterations, or scalars that are only read inside the loop?
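
As a small illustration of what I mean by the parallel-safe cases, consider
the following C sketch. The function and variable names are invented for this
mail; the point is only that both stack objects are allocas, yet no iteration
depends on another one.

```c
#include <stddef.h>

enum { N = 4 };

/* Both 'scratch' and 'scale' live on the stack, but the loop has no
   loop-carried dependence through either of them. */
void safe_stack_accesses(const int *in, int *out) {
    int scratch[N];       /* programmer-written stack array */
    const int scale = 3;  /* scalar that is only read inside the loop */

    for (size_t i = 0; i < N; ++i) {
        scratch[i] = in[i] * scale;  /* iteration i touches only element i */
        out[i] = scratch[i] + 1;
    }
}
```

A rule that refuses to annotate any access to an alloca would treat this loop
as serial even though it is not.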

The general problem is that parallel loops, due to their differing
semantics, should be treated differently from standard serial C loops during
the Clang codegen in order to handle cases like this properly. E.g., the alloca
in this case should not simply be pushed outside the loop, because that turns
the parallel loop into a serial one. Instead, each iteration should have its
own private instance of the loop-body-scope temporary variables.
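
The effect can be sketched in plain C by executing the iterations in lock
step (first statement of every iteration, then the second statement of every
iteration), which is one order a vectorizer or a work-item scheduler could
legally pick for a parallel loop. The loop body assumed here is
't = in[i] * 2; out[i] = t + 1;' and the function names are hypothetical.

```c
#include <stddef.h>

enum { N = 4 };

/* One stack slot shared by all iterations, as when the alloca has been
   pushed outside the loop. */
void lockstep_shared(const int *in, int *out) {
    int t = 0;
    for (size_t i = 0; i < N; ++i)
        t = in[i] * 2;   /* every iteration overwrites the same slot */
    for (size_t i = 0; i < N; ++i)
        out[i] = t + 1;  /* every iteration reads the last value written */
}

/* One private instance of 't' per iteration. */
void lockstep_private(const int *in, int *out) {
    int t[N];
    for (size_t i = 0; i < N; ++i)
        t[i] = in[i] * 2;
    for (size_t i = 0; i < N; ++i)
        out[i] = t[i] + 1;
}
```

With the shared slot every output element ends up with the value computed by
the last iteration, while the privatized version gives the same result as the
serial loop.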

In conclusion, at this stage it is not safe to just blindly annotate all
memory accesses with the llvm.mem.parallel_access. It seems quite easy to
produce broken code that way. The easy way forward is to skip marking allocas
altogether and hope mem2reg/SROA makes the loop parallel, but unfortunately
that also serializes some of the valid parallel loop cases. An improved version
would generate loop-scope (temporary) variables in a parallel-loop-aware way.

BTW, I noticed that Cilk Plus actually has two different parallel loop
constructs. Have you, Cilk Plus developers, thought about the parallel loop
code generation yet?

BR,
-- 
--Pekka