[llvm] r214042 - [x86] Add a much more powerful framework for combining x86 shuffle

Sun Jul 27 18:37:14 PDT 2014

On Sun, Jul 27, 2014 at 6:18 PM, Pete Cooper <peter_cooper at apple.com> wrote:

> > While the Intel manuals claim it should be used when it replaces 5 or
> > more instructions (!!!!) my experience is that it is actually very fast
> > on modern chips, and so I've gone with a much more aggressive model of
> > replacing any sequence of 3 or more instructions
> Can you give us performance results from the llvm test suite or some other
> hand written tests which back this up?
>

I don't think you'll see huge differences in the test suite...

The way I've been exploring this is by looking at benchmarks that have a
really hot function the loop vectorizer fires on. A great example is the
reference C code in x264 (not the inline assembly, obviously). In that
benchmark I've seen PSHUFB be essentially the *same* performance as any
other shuffle instruction. See below:

>
> Also, can you let us know what 'modern' means here? Atom processors
> support shuffles and will get this transform. Should this transformation be
> done on them?
>
> I'd like to understand whether this applies to processors prior to sandy
> bridge for example.

Ok, my interpretation of what is going on here is largely informed by
Agner's timings but matches what I see in benchmarks.

According to Agner, going as far back on the main x86 stack as I ever
really bother tuning (core2 and nehalem) PSHUFB and every other PSHUF and
PUNPCK instruction has the same cost: 1 uop, and 0.5 recip. throughput. So
why does Intel's manual discourage PSHUFB's use so heavily? I think for two
reasons:

1) To use PSHUFB, you have to either tie up a register for the shuffle mask
or load the mask from memory
2) (A minor point) w/o VEX encoding, you can't make a copy and PSHUFB in
the same instruction (similar to PUNPCK)

I think Intel is really trying to account for #1; we don't do much for #2
under any circumstances. However, I see very few fast, vectorized loops
that are under such register pressure that #1 really kicks in, and I only
see memory accesses for PSHUFB when it is not in a loop and thus I doubt
the performance matters much.
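
To make point #1 concrete, here is a minimal scalar sketch of what the
128-bit PSHUFB computes, and of why a chain of byte shuffles can be folded
into a single one. This is an illustration only, not code from the patch:
the names Bytes16, pshufbModel and composeMasks are hypothetical. The second
argument of pshufbModel is the 16-byte mask operand that has to live in a
register or in memory.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper type: one 128-bit vector viewed as 16 bytes.
using Bytes16 = std::array<std::uint8_t, 16>;

// Scalar model of 128-bit PSHUFB: each result byte is zero when the mask
// byte has its high bit set, otherwise it is the source byte selected by
// the low 4 bits of the mask byte.
Bytes16 pshufbModel(const Bytes16 &src, const Bytes16 &mask) {
  Bytes16 out{};
  for (int i = 0; i < 16; ++i)
    out[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
  return out;
}

// Fold two byte shuffles into one mask, so that applying the result once
// gives the same bytes as applying `first` and then `second`.
Bytes16 composeMasks(const Bytes16 &first, const Bytes16 &second) {
  Bytes16 out{};
  for (int i = 0; i < 16; ++i)
    out[i] = (second[i] & 0x80) ? 0x80 : first[second[i] & 0x0F];
  return out;
}
```

Because masks compose this way, any sequence of single-input byte shuffles
collapses to one PSHUFB with a precomputed mask; the price is that extra
mask operand.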

Now, Atom is an interesting question, and I'll admit I hadn't looked at it
carefully. Whoever is maintaining the Atom port these days may well want
to add a subtarget threshold here, because Agner seems to indicate that on
Atom PSHUFB *is* crazy expensive, indeed about 6x the cost of a normal
instruction, which suddenly matches the Intel guidelines (replace over 5
instructions...)
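
A subtarget threshold of that kind could look roughly like the sketch
below. Everything in it is an assumption made for illustration: the
Subtarget enum and both function names are hypothetical and are not LLVM
APIs, the default of 3 is the model described in the commit message, and
the Atom value of 5 is simply Intel's guideline, not a measured number.

```cpp
// Hypothetical stand-in for a real subtarget query.
enum class Subtarget { Generic, Atom };

// Minimum number of shuffle instructions that one PSHUFB must replace
// before the combine is considered profitable (assumed values).
constexpr int pshufbReplaceThreshold(Subtarget st) {
  return st == Subtarget::Atom ? 5 : 3;
}

// Decide whether a chain of `chainLength` shuffles should become a PSHUFB.
constexpr bool shouldCombineToPSHUFB(Subtarget st, int chainLength) {
  return chainLength >= pshufbReplaceThreshold(st);
}
```

With these values a 3-instruction chain is combined on the generic
subtarget but left alone on Atom, where only chains of 5 or more would be
replaced.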

However, this patch still isn't (and can't be) a regression really. The
current code essentially only forms single shuffle instructions or PSHUFB.
So I don't expect this combine to realistically fire today on anything
other than canonicalizing PSHUFD to UNPCK variants when we have VEX
encodings and the shorter encoding is a win. It should only start firing
when we enable the new shuffle lowering. And even then, it won't regress
Atom because today Atom is already getting PSHUFBs everywhere even when
they could be avoided. =/

-Chandler