patch: make instcombine remove shuffles by reordering vector elements

Chris Lattner clattner at apple.com
Mon May 6 22:44:52 PDT 2013


On May 6, 2013, at 4:32 PM, Sean Silva <silvas at purdue.edu> wrote:
> There is a nice lattice structure on the set of shuffle compositions (roughly, "instruction sequences") imposed by the prefix relation. Also, the above cost function should be monotonic on this lattice, since shuffle costs are strictly positive. At the very least, the lattice structure together with a cost function monotonic on the lattice should allow some amount of pruning over a brute-force search. A cost function like this would also be amenable to A* search (although I have no clue what heuristic function would be appropriate).
> 
> That's all I was able to come up with today (while waiting at the DMV...). I still need to think some more about how to impose more structure (than just "black-box" functions) on the set of all possible shuffles (and hence on the shuffle classes). Primarily, this structure would enable reasoning about "dead ends" when composing shuffles. For example, with AVX, if the desired shuffle does not move elements across 128-bit lanes, then all PSHUFB shuffles that cross lanes are "dead ends" in the search space (this applies more to the problem of computing perfect shuffles online, rather than building tables).
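
For concreteness, here is a minimal sketch of the pruning that this monotonicity buys, assuming a toy encoding where a mask is a packed integer and the primitive shuffles are black-box mask transformers (every type and name below is hypothetical):

#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

// Toy branch-and-bound over shuffle compositions.  Every primitive shuffle
// has strictly positive cost, so the cost of a prefix is a lower bound on
// the cost of any sequence extending it: a prefix that already costs as
// much as the best complete sequence can be pruned.
struct Primitive {
  std::function<uint64_t(uint64_t)> Apply; // compose onto a packed mask
  unsigned Cost;                           // strictly positive
};

// Best must be seeded with the cost of a known fallback sequence; since
// every Cost is >= 1, that also bounds the recursion depth.
static void search(uint64_t Cur, uint64_t Target,
                   const std::vector<Primitive> &Prims,
                   unsigned CostSoFar, unsigned &Best) {
  if (Cur == Target) {
    Best = std::min(Best, CostSoFar);
    return;
  }
  for (const Primitive &P : Prims) {
    // Monotonicity on the prefix lattice: extending a sequence never
    // lowers its cost, so this branch can no longer beat the incumbent.
    if (CostSoFar + P.Cost >= Best)
      continue;
    search(P.Apply(Cur), Target, Prims, CostSoFar + P.Cost, Best);
  }
}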

This is an interesting way to look at it.  Here is a related, but different way, with some backstory:

The perfect shuffle code does a really good job on PPC for 4-element vectors.  That width is really important for GLSL, OpenCL, and related graphics technologies, so it is an important special case, but the code doesn't handle PPC's other cases, including v16i8.  As others have noted, "perfect shuffle" doesn't work in the general case because of the huge table size... but what if we changed the tables?

At least on PPC, perfect shuffle is only used when a shuffle has a cost of 3 or less.  If you look at the table that gets indexed into, we have:

// 31 entries have cost 0
// 292 entries have cost 1
// 1384 entries have cost 2
// 3061 entries have cost 3
// 1733 entries have cost 4
// 60 entries have cost 5

... which means that ~73% of the table entries (everything with cost 3 or less) are actually usable, which makes this dense encoding a good fit for PPC.  I haven't proven this, but I suspect that if you bumped the table out to a vector width of 8 or wider, the vast majority of it would be dead: it may be possible to generate every crazy shuffle, but the generated code would never be used, so it isn't worth storing in the table.
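
To put rough numbers on that: the six counts above sum to 6561 = 9^4, consistent with each of the 4 lanes naming one of the 8 source elements or undef, so a worked version of the claim (assuming the same (2N+1)^N dense encoding at wider widths) looks like:

  usable entries (cost <= 3):  31 + 292 + 1384 + 3061 = 4768, and 4768 / 6561 ~= 72.7%
  dense table size, (2N+1)^N:  9^4 = 6561 (N=4),  17^8 ~= 7.0e9 (N=8),  33^16 ~= 2.0e24 (N=16)

A dense 8-wide table would already have billions of rows, so storing only the live entries is the only plausible way to widen it.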

It would be an interesting experiment to bump the vector width to 8 or 16 and then see how many entries are live.  The table could then be switched to a sparse format, where each "row" includes the shuffle mask as well as how to generate that shuffle, and lookups would binary search it.  The table would certainly be smaller than the current encoding, but it isn't clear to me just how much smaller it would be (particularly given undefs in lanes, which make a lot of shuffles "easy").
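
Here is a minimal sketch of what that sparse lookup could look like; the struct layout, the mask packing, and all of the names are hypothetical, not the actual LLVM tables:

#include <algorithm>
#include <cstdint>
#include <iterator>

// Each row stores the packed shuffle mask it implements plus an encoding
// of how to generate it (cost plus opcode sequence).  The table is emitted
// sorted by Mask so that lookups can binary search it.
struct SparseShuffleEntry {
  uint32_t Mask; // packed mask, e.g. 4 bits per lane for an 8-lane shuffle
  uint32_t Gen;  // encoded generation sequence (cost in the high bits, say)
};

// Placeholder rows; a real table would be generated offline and would hold
// only the masks whose expansion is cheap enough to ever be used.
static const SparseShuffleEntry SparseShuffleTable[] = {
    {0x01234567, 0x1001}, // made-up mask and generation encoding
    {0x10325476, 0x2002}, // made-up mask and generation encoding
};

static const SparseShuffleEntry *lookupShuffle(uint32_t Mask) {
  const SparseShuffleEntry *End = std::end(SparseShuffleTable);
  const SparseShuffleEntry *I = std::lower_bound(
      std::begin(SparseShuffleTable), End, Mask,
      [](const SparseShuffleEntry &E, uint32_t M) { return E.Mask < M; });
  // A miss means the mask is not worth a table entry: fall back to a
  // generic expansion.
  return (I != End && I->Mask == Mask) ? I : nullptr;
}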

If this were promising, the next step would be to look at canonicalizing the input shuffle.  It doesn't make sense to store both <4, 0, 0, 1> and <0, 4, 4, 5>, for example, since they are the same shuffle with its operands swapped (and why is perfect shuffle doing this??).  Maybe with a binary search and sparse representation we could get away with storing *just* the concrete shuffles, then having a lookup with undef elements pick the best shuffle from the concrete shuffles that it matches...
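
Here is a minimal sketch of that canonicalization for the 4-wide case, assuming indices 0-7 address the two concatenated inputs and -1 marks an undef lane (the name is hypothetical):

#include <array>

// Canonical form: the first defined lane refers to the first input vector.
// Swapping the operands of a two-input shuffle of 4-element vectors maps
// each defined lane index i to i ^ 4, which is exactly how <4,0,0,1> and
// <0,4,4,5> turn into each other.  Returns true if the caller must swap
// the shuffle's operands to compensate.
static bool canonicalizeMask(std::array<int, 4> &Mask) {
  for (int Lane : Mask) {
    if (Lane < 0)
      continue; // undef lane; look at the next one
    if (Lane < 4)
      return false; // first defined lane already names input 0
    // First defined lane names input 1: flip every defined index.
    for (int &L : Mask)
      if (L >= 0)
        L ^= 4;
    return true;
  }
  return false; // all-undef mask; nothing to do
}

With this, the table needs only one entry per operand-swapped pair, and a lookup just remembers whether to swap the shuffle's operands when emitting code.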

If that were working and compact enough, then you could start working in other sorts of operations and other kinds of vector lane elements (like loads, constant zeros, etc.).  I think that lots of cleverness could be applied to this. :-)

-Chris


