patch: make instcombine remove shuffles by reordering vector elements

Nadav Rotem nrotem at apple.com
Sat May 4 21:21:02 PDT 2013


Hi Hal, 

>> 
>> The problem is that it is really difficult to do that. The search
>> space is huge (think about Mic/XeonPhi shuffles where the vector
>> size is 512 bits), and the instruction set is sparse. 
> 
> Can you further justify this?

Thanks for bringing this up. The 2012 CGO keynote discusses this problem, and it also notes that perfect-shuffle tables can't be used for X86.

http://www.nondot.org/sabre/2012-04-02-CGOKeynote.pdf

It does not mention 512- or 256-bit vectors explicitly, but we can do the math. On a 512-bit vector, each of the 64 i8 elements can choose one of 64+64+1 possibilities (any element of either input vector, or undef), which gives 129^64 possible masks. We can't store all of those in a table, not to mention that different instruction sets require different tables.
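
To make the arithmetic concrete, here is a rough sketch in Python. It is only an illustration: mask_count is a made-up helper, and it assumes that each result element independently picks any element of either input vector or undef. The 4-element case is included only for scale.

```python
# Back-of-the-envelope count of distinct two-input shuffle masks.
# Assumption: each result element picks any element of either input
# vector (2 * n choices) or is undef (1 more choice).

def mask_count(num_elements: int) -> int:
    """Number of distinct masks for a two-input shuffle of this width."""
    choices_per_element = 2 * num_elements + 1
    return choices_per_element ** num_elements

small = mask_count(4)   # 4-element shuffle: 9**4 masks
wide = mask_count(64)   # 64 x i8 in a 512-bit vector: 129**64 masks

print(small)            # small enough to tabulate
print(len(str(wide)))   # decimal digits in the 512-bit count
```

The 4-element count is a few thousand entries, which a table can hold; the 64-element count is on the order of 10^135, which it cannot.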

> I can certainly see saying that, given a set of N shuffles, finding an optimal decomposition into M shuffles of a restricted form is hard. It is not clear to me, however, that this is really the problem that needs to be solved. AVX cross-lane shuffles are expensive, but are they really so expensive that, say, one cross-lane shuffle is not preferable to three or four cheaper shuffles?

Cross-lane shuffles on AVX can only be done using multiple shuffles. 
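
For readers less familiar with the term: AVX splits a 256-bit register into two 128-bit lanes, and a shuffle is "cross-lane" when some result element comes from a different lane than the one it lands in. A minimal sketch of that check follows. The crosses_lanes helper is hypothetical, not LLVM code, and it assumes mask indices 0..n-1 name the first input, n..2n-1 the second, and None stands for undef.

```python
def crosses_lanes(mask, lane_elems):
    """Return True if any result element is taken from a different lane.

    mask       -- list of source indices (None = undef), two-input numbering
    lane_elems -- number of elements per 128-bit lane (4 for 8 x float)
    """
    n = len(mask)
    for dst, src in enumerate(mask):
        if src is None:
            continue  # undef elements place no constraint
        src_pos = src % n  # position inside whichever input it names
        if src_pos // lane_elems != dst // lane_elems:
            return True
    return False

# 8 x float on AVX: two lanes of 4 elements each.
in_lane = crosses_lanes([1, 0, 3, 2, 5, 4, 7, 6], 4)   # swaps within lanes
swapped = crosses_lanes([4, 5, 6, 7, 0, 1, 2, 3], 4)   # exchanges the lanes
```

Here in_lane is False and swapped is True.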

> Maybe we only need to search for decompositions into two or three cheaper shuffles, and that should be doable.
> 
> And if the search space is still too large for <512 x i1> shuffles, it still might be approachable for shuffles of doubles, floats, etc.

We only need to worry about legal types, so i8 and above.

> 
> Why not use TTI here? The implementation might be too slow for use in InstCombine, but are there other issues?
> 

Even at the codegen level we can't estimate the cost of shuffles. But even if we could, InstCombine is not the right place for this kind of transformation. InstCombine should canonicalize the code, not lower it. Only the late passes, such as LSR and CGP, should use TTI. Jakob mentioned this in his LLVM Euro talk last week.

Thanks,
Nadav