patch: make instcombine remove shuffles by reordering vector elements

Sun May 5 07:24:23 PDT 2013

----- Original Message -----
> From: "Hal Finkel" <hfinkel at anl.gov>
> To: "Nadav Rotem" <nrotem at apple.com>
> Cc: "llvm-commits LLVM" <llvm-commits at cs.uiuc.edu>
> Sent: Sunday, May 5, 2013 9:15:13 AM
> Subject: Re: patch: make instcombine remove shuffles by reordering vector elements
> 
> 
> 
> ----- Original Message -----
> > From: "Nadav Rotem" <nrotem at apple.com>
> > To: "Hal Finkel" <hfinkel at anl.gov>
> > Cc: "llvm-commits LLVM" <llvm-commits at cs.uiuc.edu>, "Nick Lewycky"
> > <nicholas at mxc.ca>
> > Sent: Saturday, May 4, 2013 11:21:02 PM
> > Subject: Re: patch: make instcombine remove shuffles by reordering
> > vector elements
> > 
> > 
> > Hi Hal,
> > 
> > The problem is that it is really difficult to do that. The search
> > space is huge (think about Mic/XeonPhi shuffles where the vector
> > size is 512 bits), and the instruction set is sparse.
> > 
> > Can you further justify this?
> > 
> > 
> > 
> > Thanks for bringing this up. The 2012 CGO keynote discusses this
> > problem and it also mentions that the perfect-shuffle tables can't
> > be used for X86.
> > 
> > 
> > http://www.nondot.org/sabre/2012-04-02-CGOKeynote.pdf
> > 
> > 
> > It does not mention 512 or 256 bit vectors explicitly, but we can
> > do the math. On a 512-bit vector, each of the 64 i8 elements can
> > choose one of 64+64+1 possibilities. We can't store all of the
> > possibilities in a table, not to mention that different instruction
> > sets require different tables.
> > 
> 
> Agreed.
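
To put a number on the table-size argument above, here is a minimal sketch of the arithmetic. It assumes the usual two-input shuffle model, where each output lane may select any lane of either input vector or be undef; the function name shuffle_mask_count is hypothetical and used only for illustration.

```python
def shuffle_mask_count(num_lanes: int, num_inputs: int = 2) -> int:
    """Count the distinct shuffle masks for a vector with num_lanes lanes.

    Assumption: every output lane independently picks one lane from any
    of the num_inputs input vectors, or is left undef (the +1 below).
    """
    choices_per_lane = num_lanes * num_inputs + 1
    return choices_per_lane ** num_lanes


# 512-bit vector of i8: 64 lanes, each with 64 + 64 + 1 = 129 choices.
wide = shuffle_mask_count(64)
print("decimal digits in 129**64:", len(str(wide)))

# For comparison, a 4-lane vector has 9**4 possible masks.
print("4-lane masks:", shuffle_mask_count(4))
```

The 4-lane count is small enough to tabulate, while the 64-lane count is far beyond anything that could be stored, which is the point being made in the quoted text.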
> 
> > 
> > 
> > 
> > 
> > I can certainly see saying that, given a set of N shuffles, finding
> > an optimal decomposition into M shuffles of a restricted form is
> > hard. It is not clear to me, however, that this is really the
> > problem that needs to be solved. AVX cross-lane shuffles are
> > expensive, but are they really so expensive that, say, one
> > cross-lane shuffle is not preferable to three or four cheaper
> > shuffles?
> > 
> > 
> > Cross-lane shuffles on AVX can only be done using multiple
> > shuffles.
> 
> I suppose this is what your 1000 lines of code does. Are there some
> useful comments somewhere that describe the current heuristics?
> 
> > 
> > 
> > 
> > Maybe we only need to search for decompositions into two or three
> > cheaper shuffles, and that should be doable.
> > 
> > And if the search space is still too large for <512 x i1> shuffles,
> > it still might be approachable for shuffles of doubles, floats,
> > etc.
> > 
> > 
> > 
> > We only need to worry about legal types. So, i8 and above.
> > 
> > 
> > 
> > 
> > 
> > Why not use TTI here? The implementation might be too slow for use
> > in InstCombine, but are there other issues?
> > 
> > 
> > 
> > Even at the codegen level we can't estimate the cost of shuffles.
> > But even if we could estimate the cost of shuffles, InstCombine is
> > not the right place for this kind of transformation. InstCombine
> > should canonicalize the code, and not lower it. Only the late
> > passes, such as LSR and CGP, should use TTI. Jakob mentioned this in
> > his LLVM Euro talk last week.
> 
> I agree with this 99.9%, but this may be one exception worth
> considering. The problem is that it may not be possible to choose a
> useful canonical form for shuffles because of effective information
> loss. Different targets have different requirements, and it seems as
> though it might be computationally impractical to move between the
> different preferred forms (this is the key factor). As a result, we
> cannot choose a canonical form that is not actively harmful to some
> targets (I consider not performing general shuffle combination
> actively harmful to targets that can efficiently represent arbitrary
> shuffles -- although I admit that there are also register-pressure
> effects to consider).

Another way of looking at it is this: Combining shuffles is not really a matter of canonicalization. *shuffle* is the canonical form, as opposed to representing the operation using insert/extract_elmt pairs, or using bit-wise operations, etc. Beyond that, just like with other vectorization matters, we need target information to make useful decisions.
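
For readers unfamiliar with what "combining shuffles" means mechanically, here is a minimal sketch that models a two-input shuffle in Python and folds two chained shuffles into one. This is an illustration only, not the patch under discussion: the helper names (shuffle, compose_masks) are hypothetical, undef is modeled as -1/None, and the outer shuffle is assumed to read only its first operand.

```python
UNDEF = -1  # stand-in for an undef mask element


def shuffle(a, b, mask):
    """Model of a two-input shuffle: index i < len(a) selects a[i],
    larger indices select from b, and UNDEF yields None."""
    both = list(a) + list(b)
    return [None if m == UNDEF else both[m] for m in mask]


def compose_masks(inner_mask, outer_mask):
    """Fold shuffle(shuffle(a, b, inner_mask), <unused>, outer_mask)
    into a single mask over (a, b).

    Assumption: the outer shuffle never reads its second operand.
    """
    n = len(inner_mask)
    combined = []
    for m in outer_mask:
        if m == UNDEF:
            combined.append(UNDEF)
        elif m < n:
            combined.append(inner_mask[m])
        else:
            raise ValueError("outer shuffle reads its second operand")
    return combined


a, b = [10, 11, 12, 13], [20, 21, 22, 23]
inner = [0, 4, 1, 5]   # interleave the low halves of a and b
outer = [3, 2, 1, 0]   # reverse the result
unused = [None] * 4

two_step = shuffle(shuffle(a, b, inner), unused, outer)
one_step = shuffle(a, b, compose_masks(inner, outer))
print(two_step == one_step)
```

The combined mask is always computable; whether the single resulting shuffle is cheaper than the two original ones on a given target is exactly the question that needs target information.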

 -Hal

> 
> Thanks again,
> Hal
> 
> > 
> > 
> > Thanks,
> > Nadav