patch: make instcombine remove shuffles by reordering vector elements

Hal Finkel hfinkel at anl.gov
Sat May 4 18:27:31 PDT 2013


----- Original Message -----
> From: "Nadav Rotem" <nrotem at apple.com>
> To: "Nick Lewycky" <nicholas at mxc.ca>
> Cc: "llvm-commits LLVM" <llvm-commits at cs.uiuc.edu>
> Sent: Saturday, May 4, 2013 3:25:25 PM
> Subject: Re: patch: make instcombine remove shuffles by reordering vector elements
> 
> Hi Nick,
> 
> I no longer have the motivating C++ source. A while ago, when
> looking at .ll files produced with the loop vectorizer enabled, I
> noticed redundant shuffles and made a note to myself to write this
> optimization.
> 
> Thanks for bringing this up. One pattern that your patch can optimize
> is vectorization of loops with reverse iterators. For example:
> 
> for (i = n; i > 0; i--) {
>   A[i] = B[i] * 2;
> }
> 
> In this loop we will load the memory, reverse it, perform the
> operation, reverse it again, and store it. I am in favor of teaching
> inst-combine to remove both shuffles.
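
(For reference, the vectorized body of such a loop contains a pattern
roughly like the following. This is a hand-written IR sketch, not
actual vectorizer output; the function name @f and the <4 x i32> width
are illustrative:

    define void @f(<4 x i32>* %A, <4 x i32>* %B) {
      %b  = load <4 x i32>* %B
      %br = shufflevector <4 x i32> %b, <4 x i32> undef,
                          <4 x i32> <i32 3, i32 2, i32 1, i32 0>  ; reverse
      %m  = mul <4 x i32> %br, <i32 2, i32 2, i32 2, i32 2>
      %mr = shufflevector <4 x i32> %m, <4 x i32> undef,
                          <4 x i32> <i32 3, i32 2, i32 1, i32 0>  ; reverse back
      store <4 x i32> %mr, <4 x i32>* %A
      ret void
    }

Because the multiply is elementwise, the two reverse shuffles cancel,
and the body is equivalent to just the load, the multiply, and the
store.)
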
> 
> The good news is that when this transform fires it always deletes one
> shuffle. We can still run the transform only to reorder
> insertelement indices (make it return false for all shufflevectors
> in CanEvaluateShuffled).
> 
> It would be great if you could write your optimization in such a way
> that it only deletes shuffles (it should not create new shuffles).
> Also, I would like to limit the recursion depth to make sure we
> don't spend too much time on any one instruction. I am also okay with
> re-ordering insertelement instructions, but I am not sure how common
> they are.
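
(As a concrete instance of the insertelement case, an illustrative IR
sketch in which %a and %b are placeholder scalars:

    %v0 = insertelement <4 x i32> undef, i32 %a, i32 0
    %v1 = insertelement <4 x i32> %v0, i32 %b, i32 1
    %s  = shufflevector <4 x i32> %v1, <4 x i32> undef,
                        <4 x i32> <i32 1, i32 0, i32 3, i32 2>

can be rewritten, by swapping the insert indices, as:

    %v0 = insertelement <4 x i32> undef, i32 %a, i32 1
    %v1 = insertelement <4 x i32> %v0, i32 %b, i32 0

which yields the same <%b, %a, undef, undef> vector with the shuffle
deleted and no new shuffle created.)
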
> 
> If the backend is capable of lowering a bad shuffle to multiple
> shuffles instead of scalarizing, then we can conclude that this patch
> at worst moves shuffles around.
> 
> If the backend doesn't do that but we have some way of determining
> whether a shuffle is good or bad, we can use that to ensure we don't
> fold two good shuffles into a single bad one.
> 
> The problem is that it is really difficult to do that. The search
> space is huge (think about MIC/Xeon Phi shuffles, where the vector
> size is 512 bits), and the instruction set is sparse.

Can you further justify this? I can certainly see saying that, given a set of N shuffles, finding an optimal decomposition into M shuffles of a restricted form is hard. It is not clear to me, however, that this is really the problem that needs to be solved. AVX cross-lane shuffles are expensive, but are they really so expensive that, say, one cross-lane shuffle is not preferable to three or four cheaper shuffles? Maybe we only need to search for decompositions into two or three cheaper shuffles, and that should be doable.
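
To make that concrete, consider a single cross-lane <8 x float> shuffle
(the mask is hand-picked for illustration, not taken from any benchmark):

    %r = shufflevector <8 x float> %x, <8 x float> undef,
         <8 x i32> <i32 4, i32 1, i32 6, i32 3, i32 0, i32 5, i32 2, i32 7>

It decomposes into a 128-bit lane swap followed by a per-element blend,
each of which can lower to a single cheap AVX instruction (vperm2f128,
vblendps):

    %swap = shufflevector <8 x float> %x, <8 x float> undef,
            <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 0, i32 1, i32 2, i32 3>
    %r = shufflevector <8 x float> %x, <8 x float> %swap,
         <8 x i32> <i32 8, i32 1, i32 10, i32 3, i32 12, i32 5, i32 14, i32 7>

A search over two-shuffle decompositions of this form seems tractable.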

And if the search space is still too large for <512 x i1> shuffles, it might at least be approachable for shuffles of doubles, floats, etc.

> We lower x86
> shuffles with 1000 lines of C++ code.

Maybe that's not so bad ;) The PPC backend has a whole perfect-shuffle generation framework to handle these kinds of things for Altivec. Have you ever looked at PPCPerfectShuffle.h and utils/PerfectShuffle/PerfectShuffle.cpp?

> So, there is no way to
> determine ahead of time if a shuffle is good or bad. Also,
> InstCombine is not the right place to do these kinds of
> transformations because they are very target-specific.

Why not use TTI here? The implementation might be too slow for use in InstCombine, but are there other issues?

Thanks again,
Hal

> 
> Thanks,
> Nadav


