[LLVMdev] Is there a pass to break down <4 x float> to scalars

Pekka Jääskeläinen pekka.jaaskelainen at tut.fi
Fri Oct 25 08:10:21 PDT 2013


On 10/25/2013 05:38 PM, Richard Sandiford wrote:
> Sounds like a nice compromise if it could be made to work.  Would it be
> LLVM that reverts to the autovectorised version, or pocl?

In my opinion LLVM, because this benefits not only the OpenCL
work-group (WG) autovectorization of pocl, but any code that uses
explicit vector instructions and might be autovectorized more
efficiently if those were devectorized first. E.g., C code that uses
vector data types declared via Clang's vector attributes.
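
For illustration (a minimal sketch using Clang's ext_vector_type
attribute; the GCC-style vector_size attribute works similarly):

    /* A 4-wide float vector type declared via a Clang attribute. */
    typedef float float4 __attribute__((ext_vector_type(4)));

    float4 add4(float4 a, float4 b) {
        /* This becomes explicit <4 x float> ops in the IR;
           devectorizing them first would let the autovectorizer
           choose its own width for the target. */
        return a + b;
    }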

> Yeah, I vaguely remember some objections to handling target-specific
> intrinsics at the IR level, which I heard is what put others off doing
> the pass.  In my case life is much simpler: there are no intrinsics
> and there's no native vector support.  So in some ways I've only done
> the easy bit.  I'm just hoping it's also the less controversial bit.

One solution is to try to scalarize the intrinsic calls too (when the
matching scalar counterparts are known) and, if that fails, keep them
intact (which potentially leads to additional unpack/pack etc.
overheads if their autovectorization fails).
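
For instance, something like this (a rough sketch in C; sqrtf stands
in here for a known scalar counterpart of a vector math intrinsic):

    #include <math.h>

    typedef float float4 __attribute__((ext_vector_type(4)));

    /* Scalarized form of a vector sqrt call: extract each lane,
       call the matching scalar function, and repack the results. */
    float4 vec_sqrt_scalarized(float4 v) {
        float4 r;
        for (int i = 0; i < 4; ++i)
            r[i] = sqrtf(v[i]);
        return r;
    }

    /* If no scalar counterpart is known, the vector call stays
       intact, and the packing/unpacking around it is the overhead
       mentioned above. */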

> Do the OpenCL loops that don't get vectorised (because they already
> have some vector ops) also contain vector intrinsics, or is it usually
> generic vector IR?  Would a pass that just scalarises the generic
> operations but keeps intrinsics as-is be any use to you, or would the
> intrinsics really need to be handled too?

Yes to both. It would be useful even without intrinsics support, but
the handling described above might improve the results for some
kernels.

The OpenCL builtins (math functions) have vector versions, so those
are called if one uses the vector data types. Thus one sometimes ends
up with vector intrinsics in the bitcode.
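
For example (a sketch; the exact lowering depends on the target and
on how the builtin library is implemented):

    /* OpenCL C: calling the vector overload of a builtin. */
    float4 g(float4 v) {
        return sqrt(v);  /* the float4 overload of sqrt */
    }

    /* Depending on the builtin library, this may end up in the
       bitcode as a vector intrinsic, e.g.
       call <4 x float> @llvm.sqrt.v4f32(<4 x float> %v). */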

In case I wasn't clear: there are two dimensions along which one can
autovectorize OpenCL C kernels: within the SPMD kernel description
itself (a single work-item), or across the "implicit parallel loops"
over all the work-items in the work-group. I was referring to the
latter, as that is where the massive data parallelism, and thus the
more scalable vectorization opportunities, usually lie.
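
To illustrate the latter dimension (a rough sketch):

    /* SPMD kernel description: written from the point of view of
       a single work-item. */
    __kernel void vec_add(__global const float *a,
                          __global const float *b,
                          __global float *c) {
        size_t i = get_global_id(0);
        c[i] = a[i] + b[i];
    }

    /* Conceptually, the work-group executes an implicit parallel
       loop over its work-items, which is what the work-group
       autovectorization targets:

       for (size_t wi = 0; wi < get_local_size(0); ++wi)
           ... kernel body with the work-item id wi ...
    */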

-- 
Pekka


