[PATCH] D14261: [X86][SSE] Recursive search for zeroable shuffle elements

Sun Nov 15 09:09:57 PST 2015

chandlerc added a comment.

In http://reviews.llvm.org/D14261#289721, @RKSimon wrote:

> In http://reviews.llvm.org/D14261#289720, @chandlerc wrote:
>
> > I'm really not convinced this is the correct approach.
>
>
> Is it mainly the recursion that is concerning you? There is more that we could do at the DAGCombiner or X86 level to merge shuffles, but it would still be recursion.

It's not the same at all though. When we combine N shuffles into 1 shuffle, we can do that in O(N) because we can combine as we go. When we instead recursively match during lowering here we can do O(N^2) work.

> Other targets have shuffle instructions that set elements to constants (zero and allbits being the most common) as well as as input permutations, but I don't see much that makes use of this at all.

I'm not really concerned about helping other targets here, I'm concerned about making this use the correct model.

> I've been trying to think of ways to canonicalize shuffles with zeros/constants inputs to reduce this but can't think of anything that would really help.

Can you describe why not? I'm not seeing it.

Specifically, I can't come up with any way that we would need to look through more than N-1 VECTOR_SHUFFLE nodes to find N inputs.

> 

> 

> > Instead, I think we should combine SHUFFLE_VECTOR nodes into a single node so that we don't need this kind of recursive logic. If that doesn't work for some reason, I think that needs to be pretty clearly explained.

> 

> 

> The X86 shuffle combining could do some of this, but it would still be recursive and it would be repeating a lot of what is done at lowering already.

As above, the point is to do *less* of this at lowering time. Even if we still need some of this logic during lowering in order to handle iterative decomposition, we should minimize it and always avoid unbounded recursion.

> insertps is a curious instruction in that it really takes 3 inputs (well, 2 inputs + zero), which makes combining rather tricky with our existing methods - as another approach I could disable recursion in computeZeroableShuffleElements by default and just have the insertps lowering use it?

Ah, I think this is really the critical thing to realize.

Because of the simplification provided by it, we should essentially re-associate shuffles to cause all constant inputs to come from a single build_vector, and that a build_vector of constants should be the last input shuffled:

For example:

  S1 = (shuffle A, (build constant, constant, ...))
  S2 = (shuffle B, S1)

should combine to:

  S1 = (shuffle A, B)
  S2 = (shuffle S1, (build constant, constant, ...))

So that we can always identify constant inputs at the bottom of the shuffle tree. We can still indicate *unused* inputs in S1 with undef shuffle indices.

Does this make sense to you? Is there some *other* thing broken by this style of reassociation?

Repository:
  rL LLVM

http://reviews.llvm.org/D14261