[llvm-bugs] [Bug 27885] New: shuffle lowering does never use shufps for int vectors

Wed May 25 21:55:13 PDT 2016

https://llvm.org/bugs/show_bug.cgi?id=27885

            Bug ID: 27885
           Summary: shuffle lowering does never use shufps for int vectors
           Product: libraries
           Version: trunk
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: enhancement
          Priority: P
         Component: Backend: X86
          Assignee: unassignedbugs at nondot.org
          Reporter: sroland at vmware.com
                CC: llvm-bugs at lists.llvm.org
    Classification: Unclassified

Some seemingly simple integer shuffles are pretty difficult to do with SSE,
and in fact even with AVX2. Unfortunately LLVM never uses shufps with i32
vectors, and so misses quite a few opportunities in shuffle lowering (shufps
is one of the more powerful shuffle instructions).

This example:
define <4 x i32> @noshufps(<4 x i32> %val1, <4 x i32> %val2) {
entry:
   %res = shufflevector <4 x i32> %val1, <4 x i32> %val2,
                        <4 x i32> <i32 0, i32 3, i32 4, i32 7>
   ret <4 x i32> %res
}

will produce (with corei7-avx):
        vpshufd $196, %xmm1, %xmm1      # xmm1 = xmm1[0,1,0,3]
        vpshufd $236, %xmm0, %xmm0      # xmm0 = xmm0[0,3,2,3]
        vpblendw $240, %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[0,1,2,3],xmm1[4,5,6,7]

(With just SSE2 the vpblendw is replaced by unpcklqdq, and with AVX2 it is
replaced by vpblendd, but those differences are not significant.)
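
To make the cost of the emitted sequence concrete, here is a minimal
lane-level sketch in Python. It is a toy model, not an emulator: the helper
names pshufd, pblendw and shufflevector are illustrative only, and pblendw is
modelled on dwords under the assumption that the immediate never splits a
dword (true for $240).

```python
def pshufd(src, imm):
    """Result dword i is src[(imm >> 2*i) & 3]."""
    return [src[(imm >> (2 * i)) & 3] for i in range(4)]


def pblendw(a, b, imm):
    """Word blend: bit j of imm set means 16-bit word j is taken from b.

    Vectors are lists of dwords here, so words 2i and 2i+1 form dword i.
    Immediates that split a dword are rejected (not modelled).
    """
    out = []
    for i in range(4):
        lo = (imm >> (2 * i)) & 1
        hi = (imm >> (2 * i + 1)) & 1
        if lo != hi:
            raise ValueError("immediate splits a dword; not modelled")
        out.append(b[i] if lo else a[i])
    return out


def shufflevector(a, b, mask):
    """IR semantics: indices 0..3 pick from a, 4..7 pick from b."""
    both = a + b
    return [both[m] for m in mask]


val1 = ["a0", "a1", "a2", "a3"]
val2 = ["b0", "b1", "b2", "b3"]

t1 = pshufd(val2, 196)          # val2[0,1,0,3]
t0 = pshufd(val1, 236)          # val1[0,3,2,3]
emitted = pblendw(t0, t1, 240)  # low two dwords from t0, high two from t1
wanted = shufflevector(val1, val2, [0, 3, 4, 7])
```

Both emitted and wanted come out as the same four lanes, so the three
instructions are correct, just more work than the shuffle needs.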

This is definitely suboptimal, since a single shufps could do it. In fact, if
the vectors are floats, LLVM does exactly that. Unfortunately it is not
possible to force a shufps by bitcasting the vectors to floats; LLVM is too
clever for that, and at least one of the inputs or the output vector needs to
be recognized as a "real" float.
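
As a rough illustration of why one shufps suffices, the sketch below models
the instruction's lane selection in Python and derives the immediate for the
mask above. The function names are hypothetical, and the check only covers
the direct operand order (low two lanes from the first input, high two from
the second); a real lowering would also consider swapped operands.

```python
def shufps(a, b, imm):
    """Low two result lanes select from a, high two select from b."""
    return [
        a[imm & 3],
        a[(imm >> 2) & 3],
        b[(imm >> 4) & 3],
        b[(imm >> 6) & 3],
    ]


def shufps_imm(mask):
    """Return the shufps immediate for a 4-element two-input shuffle mask,
    or None if the mask does not fit the direct operand order."""
    if len(mask) != 4:
        raise ValueError("expected a 4-element mask")
    lo, hi = mask[:2], mask[2:]
    if all(0 <= m < 4 for m in lo) and all(4 <= m < 8 for m in hi):
        sel = [lo[0], lo[1], hi[0] - 4, hi[1] - 4]
        return sum(s << (2 * i) for i, s in enumerate(sel))
    return None


val1 = ["a0", "a1", "a2", "a3"]
val2 = ["b0", "b1", "b2", "b3"]

imm = shufps_imm([0, 3, 4, 7])   # 204, i.e. 0xCC
result = shufps(val1, val2, imm)
```

With this model the mask <0, 3, 4, 7> maps to the immediate 0xCC and yields
val1[0], val1[3], val2[0], val2[3] in one step.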

I'm not entirely sure whether this is even on purpose, but on just about
anything other than maybe Nehalem a single shufps would be much better.
Nehalem has quite bad bypass delays (2 clocks) from ivec to float and back,
but according to Agner Fog's guides, on just about everything else one of the
following holds: the delay is just one clock (maybe K8/K10); there is no
delay at all (Sandy Bridge and newer fall into this category); or shufps (and
similar float shuffles) are actually in the ivec domain in the first place
(Core2 45nm and the Bulldozers belong in that category). Core2 65nm is weird,
but there it doesn't matter whether the shuffle is float or int either.

Such patterns are not all that uncommon. I actually noticed this while
analyzing the output of the vector shift emulation that stands in for the
ridiculously missing true vector shift: in the end it did a
movsd/movsd/pshufd/pshufd/unpckldq to get the shift result back into
position, whereas movsd/movsd/shufps would easily have been enough. With
SSE4.1 this particular case isn't a problem, as it will use 3 pblendw
instead. But shufps is also very useful, for instance, for inverting an
unpckldq (that is, selecting every 2nd 32-bit element in a 2-vector shuffle).
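
The "inverting an unpckldq" case can be sketched the same way. The Python
below is a toy lane model with illustrative helper names (unpckl, unpckh,
shufps); the immediates 0x88 and 0xDD are the ones that pick the even and odd
32-bit elements of the two inputs, respectively.

```python
def unpckl(a, b):
    """Interleave the low halves: a0, b0, a1, b1."""
    return [a[0], b[0], a[1], b[1]]


def unpckh(a, b):
    """Interleave the high halves: a2, b2, a3, b3."""
    return [a[2], b[2], a[3], b[3]]


def shufps(x, y, imm):
    """Low two result lanes select from x, high two select from y."""
    return [
        x[imm & 3],
        x[(imm >> 2) & 3],
        y[(imm >> 4) & 3],
        y[(imm >> 6) & 3],
    ]


a = [10, 11, 12, 13]
b = [20, 21, 22, 23]

lo = unpckl(a, b)               # [10, 20, 11, 21]
hi = unpckh(a, b)               # [12, 22, 13, 23]

evens = shufps(lo, hi, 0x88)    # elements 0,2 of each input: mask <0,2,4,6>
odds = shufps(lo, hi, 0xDD)     # elements 1,3 of each input: mask <1,3,5,7>
```

Here evens recovers a and odds recovers b, so each de-interleaved vector
costs a single shufps in this model.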
