[LLVMdev] Is there pass to break down <4 x float> to scalars

Fri Oct 25 06:15:20 PDT 2013

Renato Golin <renato.golin at linaro.org> writes:
> On 25 October 2013 11:06, Richard Sandiford <rsandifo at linux.vnet.ibm.com> wrote:
>> It would also need some TargetTransformInfo hooks to decide which
>> vectors should be decomposed.
>
> If I got it right, this may not be necessary, or it may even be harmful.
>
> Say you decide that <4 x i32> vectors should be left alone, so that your
> pass only scalarises the others. But when the vectorizer passes again (to
> try and use CPU vector instructions), it might not match the scalarised
> version with the vector, and you end up with data movement between scalar
> and vector pipelines, which normally slows down CPUs (at least in ARM's
> case). Also, problematic cases like <5 x i32> could be better split into
> 3+2 pairs, rather than 4+1.
>
> If you scalarise everything, then the vectorizers will have a better chance
> of spotting patterns and vectorising the whole lot, then based on target
> transform info.
>
> Is that what you had in mind?

To be honest I hadn't really thought about targets with vector units
at all. :-)  I was just assuming that we'd want to keep vector operations
together if there's native support.  E.g. ISTR comments about not wanting
to rewrite vec_selects because it can be hard to synthesise optimal
sequences from a single canonical form.  But I might have got that wrong.
Also, llvmpipe uses intrinsics for some things, so it might be strange
if we decompose IR operations but leave the intrinsics alone.

I'd half wondered whether, as an extension, the pass should split wide
vectors into supported widths.  I hadn't thought about the possibility of
decomposing everything and then reassembling it though.  I can see how
that would cope with more cases, like you say.
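
To make the 4+1 versus 3+2 point concrete, here is a rough Python sketch
of the two ways of splitting a vector's lanes.  It is not LLVM code: the
function names and the single max_width parameter are made up for
illustration, and a real pass would presumably ask the target which
widths are legal rather than take one number.

```python
def split_greedy(num_lanes, max_width):
    """Peel off full-width chunks first; any remainder is the last chunk."""
    if num_lanes <= 0 or max_width <= 0:
        raise ValueError("num_lanes and max_width must be positive")
    chunks = [max_width] * (num_lanes // max_width)
    if num_lanes % max_width:
        chunks.append(num_lanes % max_width)
    return chunks


def split_balanced(num_lanes, max_width):
    """Use as many chunks as the greedy split, but even out their sizes."""
    if num_lanes <= 0 or max_width <= 0:
        raise ValueError("num_lanes and max_width must be positive")
    count = -(-num_lanes // max_width)      # ceiling division
    base, extra = divmod(num_lanes, count)  # 'extra' chunks get one more lane
    return [base + 1] * extra + [base] * (count - extra)


print(split_greedy(5, 4))     # [4, 1]
print(split_balanced(5, 4))   # [3, 2]
print(split_greedy(16, 4))    # [4, 4, 4, 4]
```

Both functions give the same answer when the lane count is a multiple of
the width; they only differ on awkward cases like <5 x i32>.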

Thanks,
Richard