patch: make instcombine remove shuffles by reordering vector elements

Sun May 5 07:15:13 PDT 2013

----- Original Message -----
> From: "Nadav Rotem" <nrotem at apple.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "llvm-commits LLVM" <llvm-commits at cs.uiuc.edu>, "Nick Lewycky" <nicholas at mxc.ca>
> Sent: Saturday, May 4, 2013 11:21:02 PM
> Subject: Re: patch: make instcombine remove shuffles by reordering vector elements
> 
> 
> Hi Hal,
> 
> 
> 
> 
> 
> 
> 
> 
> The problem is that it is really difficult to do that. The search
> space is huge (think about MIC/Xeon Phi shuffles where the vector
> size is 512 bits), and the instruction set is sparse.
> 
> Can you further justify this?
> 
> 
> 
> Thanks for bringing this up. The 2012 CGO keynote discusses this
> problem and it also mentions that the perfect-shuffle tables can't
> be used for X86.
> 
> 
> http://www.nondot.org/sabre/2012-04-02-CGOKeynote.pdf
> 
> 
> It does not mention 512 or 256 bit vectors explicitly, but we can do
> the math. On a 512-bit vector, each of the 64 i8 elements can choose
> one of 64+64+1 possibilities (an element of either input, or undef).
> We can't store all of the possibilities in a table, not to mention
> that different instruction sets require different tables.
> 

Agreed.
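
For a rough sense of scale, the count above can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function name shuffle_mask_count is made up for this example and is not part of LLVM.

```python
# Illustrative only: count the distinct masks of a two-input shuffle that
# produces n result elements. Each result element picks one of the n
# elements of the first input, one of the n elements of the second input,
# or undef, which gives n + n + 1 choices per element.
def shuffle_mask_count(n: int) -> int:
    choices_per_element = n + n + 1
    return choices_per_element ** n


if __name__ == "__main__":
    # 4-element vectors: small enough to enumerate in a table.
    print("4 elements: ", shuffle_mask_count(4))
    # 64 x i8 in a 512-bit vector: far too many masks to tabulate.
    big = shuffle_mask_count(64)
    print("64 elements:", len(str(big)), "decimal digits")
```

With 4 elements the count is 9^4 = 6561, which is why a precomputed table is feasible there; with 64 elements it is 129^64.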

> 
> 
> 
> 
> I can certainly see saying that, given a set of N shuffles, finding
> an optimal decomposition into M shuffles of a restricted form is
> hard. It is not clear to me, however, that this is really the
> problem that needs to be solved. AVX cross-lane shuffles are
> expensive, but are they really so expensive that, say, one
> cross-lane shuffle is not preferable to three or four cheaper
> shuffles?
> 
> 
> Cross-lane shuffles on AVX can only be done using multiple shuffles.

I suppose this is what your 1000 lines of code do. Are there some useful comments somewhere that describe the current heuristics?
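
To make "cross-lane" concrete, here is a small sketch of how one might test whether a shuffle mask moves any element between 128-bit lanes. It is not taken from the X86 backend; the name crosses_lanes and the convention of -1 for undef are assumptions made for this example.

```python
# Illustrative only: decide whether a shuffle mask moves data across lanes.
# mask[dst] is the source index for result element dst; indices >= len(mask)
# refer to the second input, and a negative index means undef.
def crosses_lanes(mask, elems_per_lane):
    n = len(mask)
    for dst, src in enumerate(mask):
        if src < 0:
            continue  # undef never forces a cross-lane move
        src_pos = src % n  # position within whichever input it came from
        if src_pos // elems_per_lane != dst // elems_per_lane:
            return True
    return False


if __name__ == "__main__":
    # 8 x float in a 256-bit register: two lanes of 4 elements each.
    print(crosses_lanes([1, 0, 3, 2, 5, 4, 7, 6], 4))  # swaps within lanes
    print(crosses_lanes([4, 5, 6, 7, 0, 1, 2, 3], 4))  # swaps the two lanes
```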

> 
> 
> 
> Maybe we only need to search for decompositions into two or three
> cheaper shuffles, and that should be doable.
> 
> And if the search space is still too large for <512 x i1> shuffles,
> it still might be approachable for shuffles of doubles, floats, etc.
> 
> 
> 
> We only need to worry about legal types. So, i8 and above.
> 
> 
> 
> 
> 
> Why not use TTI here? The implementation might be too slow for use in
> InstCombine, but are there other issues?
> 
> 
> 
> Even at the codegen level we can't estimate the cost of shuffles. But
> even if we could estimate the cost of shuffles, InstCombine is not
> the right place for this kind of transformation. InstCombine should
> canonicalize the code, and not lower it. Only the late passes, such
> as LSR and CGP, should use TTI. Jakob mentioned this in his LLVM Euro
> talk last week.

I agree with this 99.9%, but this may be one exception worth considering. The problem is that it may not be possible to choose a useful canonical form for shuffles because of effective information loss. Different targets have different requirements, and it seems as though it might be computationally impractical to move between the different preferred forms (this is the key factor). As a result, we cannot choose a canonical form that is not actively harmful to some targets (I consider not performing general shuffle combination actively harmful to targets that can efficiently represent arbitrary shuffles -- although I admit that there are also register-pressure effects to consider).
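
As a concrete picture of what I mean by general shuffle combination, here is a minimal sketch for the single-input case: two back-to-back shuffles fold into one by composing their masks. This is not the code from the patch; compose_masks, apply_mask, and the use of -1 for undef are names and conventions invented for the example.

```python
# Illustrative only: fold shuffle(shuffle(v, inner), outer) into a single
# shuffle of v, for single-input shuffles. UNDEF marks an undefined lane.
UNDEF = -1


def apply_mask(vec, mask):
    """Apply a single-input shuffle mask; undef lanes become None."""
    return [None if i == UNDEF else vec[i] for i in mask]


def compose_masks(inner, outer):
    """Return the mask m with apply_mask(v, m) equal to
    apply_mask(apply_mask(v, inner), outer)."""
    return [UNDEF if i == UNDEF else inner[i] for i in outer]


if __name__ == "__main__":
    v = ["a", "b", "c", "d"]
    inner = [3, 2, 1, 0]       # reverse
    outer = [1, 1, UNDEF, 0]   # splat-like pick with one undef lane
    combined = compose_masks(inner, outer)
    assert apply_mask(apply_mask(v, inner), outer) == apply_mask(v, combined)
    print(combined)
```

The combined mask is an arbitrary permutation in general, which is fine on a target with a general permute instruction and potentially worse than the original pair on a target without one.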

Thanks again,
Hal

> 
> 
> Thanks,
> Nadav