[llvm-dev] Data structure improvement for the SLP vectorizer

Tue Mar 14 18:00:46 PDT 2017

There was some discussion of this on the llvm-commits list, but I
wanted to raise the topic for discussion here. The background of the
-commits discussion was that r296863 added the ability to sort memory
access when the SLP vectorizer reached a load (the SLP vectorizer
starts at a store or some other sink, and tries to go up the tree
vectorizing as it goes along - if the input is in a different order
than the output, you need a shufflevector somewhere in the middle).
However, that commit had to be reverted because the SLP tree is not
actually a tree and the same load bundle could be used by several
different users, causing problems with representing the required
shuffle on the load bundle node (Please add any history I'm missing
here).

Now, I'm jumping into this because I want to teach the vectorizer
about a similar operation, namely a 2->4 (or 4->8 for AVX512)
broadcast. This operation also requires a load followed by a
shufflevector, so it would be nice to use the same representation for
both of these.

Now, the primary problem is that we don't really have anywhere to put
this permutation information. The two solutions seem to be duplicating
the nodes (and maybe deduplicating them later again when materializing
the tree), or taking the hit and representing the graph more
explicitly, such that we can annotate information on the actual edges
of the graph.

The latter seems nicer to me, we'd essentially have a DAG of scalar
bundles with vector shuffles in between them. Right now the vectorizer
can't really handle intermediate vector shuffles at all as far as I
can tell (with a few hand coded exceptions for common patterns). From
my perspective that seems a bit odd. AVX512 for example can represent
most (all?) shufflevector operations in one instruction, so failing to
vectorize a large tree because it would require one of them seems very
problematic. This is also related to my desire for vector->larger
vector broadcasting. Especially when targeting AVX512, there can be
whole 4-wide subtrees that the SLP vectorizer isn't currently looking
at, because it can't handle the 4->8 expansion. Whatever the new
representation, I'd like the SLP vectorizer to be able to handle that.

I hope that summarizes the problem. My understanding of the situation
is admittedly incomplete - I came to this because I need the
vectorizer to correctly handle my narrow-width subtrees, but I'm
hoping we can collectively come up with a design that would cover all
the relevant use cases.