[LLVMdev] Is there pass to break down <4 x float> to scalars

Fri Oct 25 07:38:54 PDT 2013

Pekka Jääskeläinen <pekka.jaaskelainen at tut.fi> writes:
> E.g., the last time I checked, the inner loop vectorizer (which pocl exploits)
> just refused to vectorize loops with vector instructions. It might not
> be so drastic with the SLP or the BB vectorizer, but in general, it might
> make sense to let the vectorizer make the decisions on how best to map the
> parallel (scalar) operations to the vector hardware, and just help it
> with the parallelism knowledge propagated from the parallel program.
> One can then fall back to the original (hand-vectorized) code if
> the autovectorization fails, to still get some vector hardware
> utilization.

Sounds like a nice compromise if it could be made to work.  Would it be
LLVM that falls back to the hand-vectorised version, or pocl?

> On 10/25/2013 04:15 PM, Richard Sandiford wrote:
>> To be honest I hadn't really thought about targets with vector units
>> at all. :-)  I was just assuming that we'd want to keep vector operations
>> together if there's native support.  E.g. ISTR comments about not wanting
>> to rewrite vec_selects because it can be hard to synthesise optimal
>> sequences from a single canonical form.  But I might have got that wrong.
>> Also, llvmpipe uses intrinsics for some things, so it might be strange
>> if we decompose IR operations but leave the intrinsics alone.
>
> The issue of intrinsics and vectorization was discussed some time ago.
> There it might be better to devectorize to a scalar version of the
> intrinsics (if available), since at least the loop vectorizer can also
> vectorize a selected set of intrinsics, and the target might have direct
> machine instructions for those (which could not be exploited easily from
> "inlined" versions).

Yeah, I vaguely remember some objections to handling target-specific
intrinsics at the IR level, which I heard is what put others off doing
the pass.  In my case life is much simpler: there are no intrinsics
and there's no native vector support.  So in some ways I've only done
the easy bit.  I'm just hoping it's also the less controversial bit.

Do the OpenCL loops that don't get vectorised (because they already
have some vector ops) also contain vector intrinsics, or is it usually
generic vector IR?  Would a pass that just scalarises the generic
operations but keeps intrinsics as-is be any use to you, or would the
intrinsics really need to be handled too?
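
To make the question concrete, here is a minimal sketch in Python of the
kind of pass described above, using a toy instruction list rather than real
LLVM IR.  Everything in it is an assumption made for illustration: the tuple
format, the names scalarise, GENERIC_OPS, "extract" and "build_vector", and
the "hypothetical.intrinsic" call are invented and are not LLVM APIs.
Generic element-wise operations are split into one scalar operation per
lane, while anything else (such as an intrinsic call) is kept whole and has
its vector operands rebuilt from the lanes first.

```python
# Toy model only: each instruction is a (dest, opcode, operands) tuple and
# every named value is assumed to be a 4-lane vector such as <4 x float>.
LANES = 4
GENERIC_OPS = {"fadd", "fsub", "fmul"}  # element-wise ops the sketch can split


def scalarise(instructions):
    """Split generic vector ops into per-lane scalar ops; keep the rest as-is."""
    out = []
    has_lanes = set()    # vectors whose per-lane scalars are available
    needs_build = set()  # vectors that currently exist only as per-lane scalars

    def lanes(name):
        # Return the per-lane scalar names, emitting extracts on first use.
        if name not in has_lanes:
            out.extend((f"{name}.{i}", "extract", (name, i)) for i in range(LANES))
            has_lanes.add(name)
        return [f"{name}.{i}" for i in range(LANES)]

    for dest, opcode, operands in instructions:
        if opcode in GENERIC_OPS:
            per_operand = [lanes(op) for op in operands]
            for i in range(LANES):
                out.append((f"{dest}.{i}", opcode, tuple(p[i] for p in per_operand)))
            has_lanes.add(dest)
            needs_build.add(dest)
        else:
            # Kept as-is (e.g. an intrinsic call): rebuild any operand that
            # was scalarised so the instruction still sees a whole vector.
            for op in operands:
                if op in needs_build:
                    out.append((op, "build_vector", tuple(lanes(op))))
                    needs_build.discard(op)
            out.append((dest, opcode, operands))
    return out


program = [
    ("t", "fadd", ("a", "b")),
    ("r", "call @hypothetical.intrinsic", ("t",)),
]
result = scalarise(program)
for instruction in result:
    print(instruction)
```

For the two-instruction program above, the fadd becomes four scalar fadds
(preceded by extracts of a and b), and t is rebuilt from its lanes just
before the intrinsic call, which is left untouched.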

Thanks for the feedback.

Richard