patch: make instcombine remove shuffles by reordering vector elements

Wed May 8 00:03:35 PDT 2013

On Mon, May 6, 2013 at 11:44 PM, Chris Lattner <clattner at apple.com> wrote:

> On May 6, 2013, at 4:32 PM, Sean Silva <silvas at purdue.edu> wrote:
> > There is a nice lattice structure on the set of shuffle compositions
> > (roughly, "instruction sequences") imposed by the prefix relation. Also,
> > the above cost function should be monotonic on this lattice since shuffle
> > costs are strictly positive. At the very least, the lattice structure
> > together with a cost function monotonic on the lattice should allow some
> > amount of pruning over a brute force search. A cost function like this also
> > would be amenable to A* search (although I have no clue what heuristic
> > function would be appropriate).
> >
> > That's all I was able to come up with today (while waiting at the
> > DMV...). I still need to think some more about how to impose more structure
> > (than just "black box" functions) on the set of all possible shuffles (and
> > hence on the shuffle classes). Primarily, this structure would enable
> > reasoning about "dead ends" when composing shuffles. For example with AVX,
> > if the desired shuffle does not move elements across 128-bit lanes, then
> > all PSHUFB shuffles that cross lanes are "dead ends" in the search space
> > (this applies more to the problem of computing perfect shuffles online,
> > rather than building tables).
>
> This is an interesting way to look at it.  Here is a related, but
> different way, with some backstory:
>
> The perfect shuffle code does a really good job on PPC, for 4-element
> vectors.  This width is really important for GLSL, OpenCL and related
> graphics technologies (so it is an important special case) but doesn't
> handle PPC's other cases, including v16i8.  As others have noted, "perfect
> shuffle" doesn't work in the general case because of huge table size... but
> what if we changed the tables?
>
> At least on PPC, perfect shuffle is only used when a shuffle has a cost of
> 3 or less.  If you look at the table that gets indexed into, we have:
>

It looks like the check in PPCISelLowering.cpp is actually < 3:
    if (Cost < 3)
      return GeneratePerfectShuffle(PFEntry, V1, V2, DAG, dl);

which reduces the used part of the table to about 26% (1707 of the 6561
entries have cost 0, 1, or 2), so roughly 74% of the table is unused.
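
For anyone who wants to check the arithmetic, here is a minimal sketch that
recomputes both percentages from the cost histogram quoted below. The
histogram values are copied from that comment; the function names are
hypothetical and exist only for this illustration, they are not part of
PPCISelLowering.cpp.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Cost histogram copied from the quoted table comment:
// index = cost, value = number of table entries with that cost.
const std::array<int, 6> kCostHistogram = {31, 292, 1384, 3061, 1733, 60};

// Total number of entries in the table (sum over all costs).
int totalEntries() {
  int total = 0;
  for (int n : kCostHistogram)
    total += n;
  return total;
}

// Number of entries with cost strictly below `limit`,
// mirroring a check of the form `Cost < limit`.
int entriesBelow(int limit) {
  int used = 0;
  for (std::size_t cost = 0; cost < kCostHistogram.size(); ++cost)
    if (static_cast<int>(cost) < limit)
      used += kCostHistogram[cost];
  return used;
}

// Fraction of the table that is reachable under `Cost < limit`.
double usedFraction(int limit) {
  assert(limit >= 0);
  return static_cast<double>(entriesBelow(limit)) / totalEntries();
}
```

With these numbers, usedFraction(3) covers 1707 of 6561 entries (the
`Cost < 3` check), while usedFraction(4) covers 4768 of 6561 entries, which
corresponds to "cost of 3 or less" and is where the ~73% figure comes from.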

>
> // 31 entries have cost 0
> // 292 entries have cost 1
> // 1384 entries have cost 2
> // 3061 entries have cost 3
> // 1733 entries have cost 4
> // 60 entries have cost 5
>
> ... which means that ~73% of the table entries are actually used, which
> makes this a good encoding for PPC.  If you bump out the table to a vector
> width of 8 or wider though, I have not proven this, but I suspect that the
> vast majority of the table would be dead: it may be possible to generate
> every crazy shuffle, but the generated code would never be used, so it
> isn't worth storing in the table.
>
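
To put a rough number on the "huge table size" concern, here is a small
sketch of how the table grows with vector width. It assumes each of the N
result lanes independently picks one of the 2N input elements or "undef",
which matches the 6561 = 9^4 entries in the 4-element histogram above. The
helper name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of entries a perfect-shuffle-style table
// would need for a two-input shuffle of `numElts` lanes, assuming every
// lane selects one of 2 * numElts source elements or undef.
std::uint64_t perfectShuffleTableSize(unsigned numElts) {
  // This sketch only covers small widths, where the result fits in 64 bits.
  assert(numElts <= 8);
  const std::uint64_t choices = 2ull * numElts + 1; // 2N sources + undef
  std::uint64_t size = 1;
  for (unsigned i = 0; i < numElts; ++i)
    size *= choices;
  return size;
}
```

Under that assumption, 4 lanes give 9^4 = 6561 entries, while 8 lanes
already give 17^8 = 6975757441 entries, so a dense table is impractical at
that width even before asking how much of it would be dead.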

-- Sean Silva