[llvm-dev] LoopVectorizer: shufflevectors

Tue Sep 4 05:45:50 PDT 2018

On Tue, 4 Sep 2018 at 13:14, Jonas Paulsson via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> I have run into this in the past also and it surprised me to again see
> (on SystemZ) that the vectorized loop did many seemingly unnecessary
> shuffles.

Hi Jonas,

This is a known side-effect of vectorisers (not just LLVM's) that have
to handle multiple hardware architectures. GCC has its own set of bad
patterns, too.

> This seems to be an issue which is due to keeping instcombine simple and
> fast, as well as a conservativeness to not produce any new shuffles not
> already in the input program (see comment in
> InstCombiner::visitShuffleVectorInst). For some reason a bit unclear to
> me the backend will get into trouble then.

Specifically, interleaved access generation will invariably lead to
*additional* shuffles, because it's trying to create the pattern that
will, later, be selected into one or a few instructions.
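
To make the shape of that pattern concrete, here is a minimal sketch in
plain C that emulates what the vectorised loop looks like for a
stride-2 access: one wide load followed by one shuffle per member of
the group. The function names, the `shuffle` helper and the factor
`VF = 4` are all made up for illustration; this is not what the
vectoriser literally emits, only the structure of it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { VF = 4 }; /* hypothetical vectorisation factor */

/* Scalar loop with a stride-2 (interleaved) access pattern. */
void sum_pairs_scalar(const int32_t *in, int32_t *out, size_t n) {
    for (size_t i = 0; i < n; ++i)
        out[i] = in[2 * i] + in[2 * i + 1];
}

/* Emulates a shufflevector: pick VF lanes of `src` according to `mask`. */
static void shuffle(const int32_t *src, const int *mask, int32_t *dst) {
    for (int l = 0; l < VF; ++l)
        dst[l] = src[mask[l]];
}

/* Same computation in the shape of the vectorised loop:
 * one wide load of 2*VF elements, then two de-interleaving shuffles. */
void sum_pairs_vector_shape(const int32_t *in, int32_t *out, size_t n) {
    static const int even_mask[VF] = {0, 2, 4, 6};
    static const int odd_mask[VF]  = {1, 3, 5, 7};
    size_t i = 0;
    for (; i + VF <= n; i += VF) {
        const int32_t *wide = &in[2 * i]; /* stands in for the wide load */
        int32_t even[VF], odd[VF];
        shuffle(wide, even_mask, even);   /* extra shuffle #1 */
        shuffle(wide, odd_mask, odd);     /* extra shuffle #2 */
        for (int l = 0; l < VF; ++l)
            out[i + l] = even[l] + odd[l];
    }
    for (; i < n; ++i)                    /* scalar remainder */
        out[i] = in[2 * i] + in[2 * i + 1];
}
```

The two shuffles are not in the input program; they only pay off if the
back-end folds "wide load + strided masks" into one or a few
instructions.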

The middle-end relies on the back-end knowing how to select the large
patterns, as well as on other middle-end passes not destroying them.
Cleaning that up may very well lead to poorer code.

> Should improved optimization of shufflevector instructions handle all of
> them globally, or just the new ones produced by the vectorizers?

It's probably a lot simpler to improve the SystemZ model so that it
has the same arch flags / cost model completeness as the other
targets.

The transformations done by the vectoriser are target-agnostic, but
they still ask the targets if certain patterns are possible and
profitable before doing so.

> Or does this really have to be done on the DAG by each backend? Or
> perhaps this is really just a local issue with the loop vectorizer?

It's a dance between what the middle-end enquires of the target info
and what the back-end can actually generate.

You may have to expose some flags (to turn certain behaviour on), then
tune the cost model (to make them profitable in most cases), then
implement the pattern recognition in the DAGISel (also GlobalISel), so
that the generated code can be optimally selected.

If LLVM were compiling to a single target, emitting IR that conforms to
one specific pattern would be the most logical choice (don't pollute,
simplify further passes, reduce pattern complexity), so it may sound a
lot simpler in arch-specific compilers.

But in a target-agnostic compiler you need to "emulate" that using the
three steps above: target info, cost model, ISel patterns.
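
As a rough illustration of why all three steps have to agree, here is a
toy model in C. Everything in it is hypothetical: the `target_model`
struct, its fields and the cost numbers are invented for this sketch
and are not LLVM's TargetTransformInfo or any real target's values.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical target description, invented for illustration.
 * It is not LLVM's TargetTransformInfo interface. */
struct target_model {
    bool has_interleaved_access; /* step 1: flag exposed by the target     */
    int  group_cost;             /* step 2: modelled cost of one group     */
    int  scalar_access_cost;     /* step 2: modelled cost of one scalar op */
    bool isel_matches_group;     /* step 3: back-end selects the pattern   */
    int  wide_load_cost;         /* fallback: one wide load ...            */
    int  generic_shuffle_cost;   /* ... plus one generic shuffle per lane  */
};

/* Middle-end view: is the interleaved form allowed and modelled as cheaper? */
bool middle_end_wants_interleave(const struct target_model *t,
                                 int factor, int vf) {
    if (!t->has_interleaved_access)
        return false;
    return t->group_cost < t->scalar_access_cost * factor * vf;
}

/* Back-end view: cost of what actually ends up being emitted. */
int emitted_cost(const struct target_model *t, int factor, int vf) {
    if (!middle_end_wants_interleave(t, factor, vf))
        return t->scalar_access_cost * factor * vf; /* stays scalar */
    if (t->isel_matches_group)
        return t->group_cost;                       /* pattern selected */
    /* Pattern emitted but not recognised: the shuffles are lowered one by one. */
    return t->wide_load_cost + t->generic_shuffle_cost * factor * vf;
}
```

With a flag and a cost entry but no ISel pattern, the middle-end still
asks for the interleaved form and the emitted code ends up worse than
the scalar loop, which is roughly the "many seemingly unnecessary
shuffles" symptom.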

Hope that helps.

-- 
cheers,
--renato