[LLVMdev] LoopVectorizer in OpenCL C work group autovectorization

Fri Jan 25 11:18:41 PST 2013

Hi Nadav,

On 01/25/2013 07:11 PM, Nadav Rotem wrote:
> What you need is outer loop vectorization while the loop vectorizer is an
> inner loop vectorizer. If you decide to use the Loop Vectorizer then you
> won't be able to vectorize kernels that have inner loops or kernels that have
> barriers in them. If you look at the AMD OpenCL SDK you will see that most of
> the workloads have barriers, inner loops.

Barriers are the problem of the "parallel region formation phase" of
pocl. It's a distinct problem from the actual parallelization method
such as vectorization (or, e.g., unroll+VLIW schedule).

Non-divergent iteration count kernel loops can be executed in lock step
and also vectorized. The parallel region/wiloop can be formed
inside the kernel loop which can be then vectorized.

kernel_for_loop {
    parallel_wiloop over x {
      .. the original kernel loop body ..
    }
}

Vectorizing divergent loops needs masking or similar, e.g., as presented
in the WFV paper, but this doesn't need to be an OpenCL specific optimization
as it helps vectorization in general.

It's a case of the inner-loop iteration count depending on the outer
loop.

parallel_wiloop over x {
    kernel_for_loop i := 0...x { // or similar variable range depending on x
      ...
    }
}

to

kernel_for_loop {
    parallel_wiloop {
       // the whole body predicated with the kernel_for_loop condition
       // that includes 'x' somewhere
    }
}

> Another problem that you may run into is 'early exits'.  In many kernels you
> will see something like  " if (get_global_id(0)>  N) return; "

Now in pocl this ends up being a parallel region similar to this:

parallel_wiloop over x {
    if (x > N) goto ret;
    ... kernel code here
ret:
}

Not the easiest case to parallelize but might be doable because N
can be used to modify the wiloop iteration range.

parallel_wiloop x:= 0...N-1 {
    ... kernel code here
}

Anyways, one cannot be expected to defeat all the bad kernel coding practices.

> Not to mention that it will be very important for you to vectorize function
> calls. Vectorization needs to happen before inlining because you don't want
> to vectorize a cloud of instructions when you can convert a single function
> call.  Think about image samplers or slightly more complex builtins that have
> control flow in them.

I think function calls are one thing, builtins/intrinsics another.

Vectorizing builtins is something that is partially OpenCL specific (if the
builtins itself are OpenCL-specific), but I think there should be benefit in
a generic implementation of that case also. I.e., converting builtin/intrinsics
calls to their vector counterparts, if available.

Say,

for_loop {
   call @llvm.sinf32(a[x]);
   ...
}

is useful to be vectorizable if the target ISA can do SIMD sinf.

In any case, it's clear some kernels are not vectorizable (at least
beneficially so), e.g. due to non-predicateable (huh!) control flow, but
that does not have much to do with the actual vectorizing method or the input
language used.

BR,
-- 
Pekka