<div dir="ltr"><div class="gmail_extra"><br><div class="gmail_quote">On Sun, Jul 27, 2014 at 6:18 PM, Pete Cooper <span dir="ltr"><<a href="mailto:peter_cooper@apple.com" target="_blank">peter_cooper@apple.com</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="">> While the Intel manuals claim it should be used when it replaces 5 or<br>

> more instructions (!!!!) my experience is that it is actually very fast<br>

> on modern chips, and so I've gone with a much more aggressive model of<br>

> replacing any sequence of 3 or more instructions<br>

</div>Can you give us performance results from the llvm test suite or some other hand written tests which back this up?<br></blockquote><div><br></div><div>I don't think you'll see huge differences in the test suite...</div>

<div><br></div><div>The way I've been exploring this is by looking at benchmarks that have a really hot function the loop vectorizer fires on. A great example is the reference C code in x264 (not the inline assembly, obviously). In that benchmark I've seen PSHUFB perform essentially the *same* as any other shuffle instruction. See below:</div>

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

Also, can you let us know what 'modern' means here? Atom processors support shuffles and will get this transform. Should this transformation be done on them?<br>

<br>

I'd like to understand whether this applies to processors prior to sandy bridge for example.</blockquote></div><div class="gmail_extra"><br></div>Ok, my interpretation of what is going on here is largely informed by Agner's timings but matches what I see in benchmarks.</div>

<div class="gmail_extra"><br></div><div class="gmail_extra">According to Agner, going as far back on the main x86 stack as I ever really bother tuning for (Core 2 and Nehalem), PSHUFB and every other PSHUF and PUNPCK instruction have the same cost: 1 uop and 0.5 reciprocal throughput. So why does Intel's manual discourage PSHUFB's use so heavily? I think for two reasons:</div>

<div class="gmail_extra"><br></div><div class="gmail_extra">1) To use PSHUFB, you have to either tie up a register with the shuffle mask or load the mask from memory</div><div class="gmail_extra">2) (A minor point) Without VEX encoding, you can't make a copy and PSHUFB in the same instruction (similar to PUNPCK)</div>
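<div class="gmail_extra"><br></div><div class="gmail_extra">As a rough illustration of point 1 (a scalar model written for this discussion, not code from the patch), the sketch below emulates 128-bit PSHUFB and PUNPCKLBW on 16-byte values. The function names are hypothetical. It shows that a fixed shuffle such as PUNPCKLBW can be expressed as a single PSHUFB, but only by supplying an extra 16-byte control mask, which is the operand that has to live in a register or be loaded from memory.</div>
<pre>
```python
def pshufb(src, mask):
    """Scalar model of 128-bit PSHUFB: src and mask are 16-byte sequences."""
    if len(src) != 16 or len(mask) != 16:
        raise ValueError("PSHUFB on XMM registers operates on 16 bytes")
    out = []
    for m in mask:
        if m >= 0x80:
            # High bit of the control byte set: the result byte is zeroed.
            out.append(0)
        else:
            # Otherwise the low 4 bits of the control byte pick a source byte.
            out.append(src[m % 16])
    return bytes(out)


def punpcklbw(a, b):
    """Scalar model of PUNPCKLBW: interleave the low 8 bytes of a and b."""
    out = []
    for i in range(8):
        out.extend((a[i], b[i]))
    return bytes(out)


# The same byte duplication done two ways: a fixed-function unpack, and
# PSHUFB with an explicit mask (0, 0, 1, 1, ..., 7, 7).
value = bytes(range(16, 32))
dup_mask = bytes(i // 2 for i in range(16))
same = pshufb(value, dup_mask) == punpcklbw(value, value)
```
</pre>
<div class="gmail_extra">Here <code>dup_mask</code> plays the role of the constant that PSHUFB needs and the unpack instruction does not.</div>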

<div class="gmail_extra"><br></div><div class="gmail_extra">I think Intel is really trying to account for #1; we don't do much for #2 under any circumstances. However, I see very few fast, vectorized loops that are under such register pressure that #1 really kicks in, and I only see memory accesses for PSHUFB when it is not in a loop, so I doubt the performance matters much there.</div>

<div class="gmail_extra"><br></div><div class="gmail_extra">Now, Atom is an interesting question, and I'll admit I hadn't looked at it carefully. Whoever is maintaining the Atom port these days may well want to add a subtarget threshold here, because Agner seems to indicate that on Atom PSHUFB *is* crazy expensive, about 6x the cost of a normal instruction, which suddenly matches the Intel guidelines (replace over 5 instructions...).</div>
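<div class="gmail_extra"><br></div><div class="gmail_extra">A minimal sketch of what such a subtarget threshold could look like, assuming a simple "replace N or more instructions" rule. The table, the subtarget names, and the function name are all hypothetical and do not correspond to anything in LLVM; the value 3 is the model described in the patch, and the value 6 for Atom is only an assumption derived from the "over 5 instructions" guideline.</div>
<pre>
```python
# Hypothetical minimum number of shuffle instructions a single PSHUFB must
# replace before the combine is considered profitable, per subtarget.
PSHUFB_MIN_REPLACED = {
    "default": 3,  # the patch's aggressive model: 3 or more instructions
    "atom": 6,     # assumption: PSHUFB costs roughly 6 normal instructions
}


def should_combine_to_pshufb(num_instrs, subtarget="default"):
    """Return True if replacing num_instrs shuffles with one PSHUFB pays off."""
    threshold = PSHUFB_MIN_REPLACED.get(subtarget, PSHUFB_MIN_REPLACED["default"])
    return num_instrs >= threshold
```
</pre>
<div class="gmail_extra">With these illustrative numbers, a 3-instruction sequence would be combined on the default subtarget but left alone on Atom.</div>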

<div class="gmail_extra"><br></div><div class="gmail_extra">However, this patch still isn't (and can't be) a regression really. The current code essentially only forms single shuffle instructions or PSHUFB. So I don't expect this combine to realistically fire today on anything other than canonicalizing PSHUFD to UNPCK variants when we have VEX encodings and the shorter encoding is a win. It should only start firing when we enable the new shuffle lowering. And even then, it won't regress Atom, because today Atom is already getting PSHUFBs everywhere even when they could be avoided. =/</div>

<div class="gmail_extra"><br></div><div class="gmail_extra">-Chandler</div><div class="gmail_extra"><br></div></div>