[llvm-bugs] [Bug 44379] New: Byte shuffles pessimized into shuffle chain

via llvm-bugs llvm-bugs at lists.llvm.org
Wed Dec 25 02:31:12 PST 2019


https://bugs.llvm.org/show_bug.cgi?id=44379

            Bug ID: 44379
           Summary: Byte shuffles pessimized into shuffle chain
           Product: new-bugs
           Version: trunk
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: enhancement
          Priority: P
         Component: new bugs
          Assignee: unassignedbugs at nondot.org
          Reporter: sneves at dei.uc.pt
                CC: htmldeveloper at gmail.com, llvm-bugs at lists.llvm.org

Consider this IR, which rotates each 64-bit word by 16 bits in variously sized vectors, expressed as byte shuffles:

define <16 x i8> @f1(<16 x i8>) local_unnamed_addr #0 {
  %2 = shufflevector <16 x i8> %0, <16 x i8> undef,
       <16 x i32> <i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1,
                   i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 8, i32 9>
  ret <16 x i8> %2
}

define <32 x i8> @f2(<32 x i8>) local_unnamed_addr #0 {
  %2 = shufflevector <32 x i8> %0, <32 x i8> undef,
       <32 x i32> <i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1,
                   i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 8, i32 9,
                   i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 16, i32 17,
                   i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 24, i32 25>
  ret <32 x i8> %2
}

define <64 x i8> @f3(<64 x i8>) local_unnamed_addr #0 {
  %2 = shufflevector <64 x i8> %0, <64 x i8> undef,
       <64 x i32> <i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1,
                   i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 8, i32 9,
                   i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 16, i32 17,
                   i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 24, i32 25,
                   i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 32, i32 33,
                   i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 40, i32 41,
                   i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 48, i32 49,
                   i32 58, i32 59, i32 60, i32 61, i32 62, i32 63, i32 56, i32 57>
  ret <64 x i8> %2
}
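
For reference, byte shuffles of this shape can come from C written with clang's vector extensions. A minimal sketch of the 128-bit case (the typedef and function name are made up for illustration):

#include <stdint.h>

typedef uint8_t v16u8 __attribute__((vector_size(16)));

/* Rotate each 64-bit word by 16 bits, written directly as a byte shuffle;
 * __builtin_shufflevector becomes a shufflevector instruction, giving IR
 * essentially like @f1 above. */
v16u8 rot16_shuffle(v16u8 x) {
  return __builtin_shufflevector(x, x,
                                 2, 3, 4, 5, 6, 7, 0, 1,
                                 10, 11, 12, 13, 14, 15, 8, 9);
}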

The output of these functions using `llc -O3 -mattr=+ssse3,+avx2,+avx512bw` is:

f1:                                     # @f1
        vpshuflw        xmm0, xmm0, 57  # xmm0 = xmm0[1,2,3,0,4,5,6,7]
        vpshufhw        xmm0, xmm0, 57  # xmm0 = xmm0[0,1,2,3,5,6,7,4]
        ret
f2:                                     # @f2
        vpshuflw        ymm0, ymm0, 57  # ymm0 = ymm0[1,2,3,0,4,5,6,7,9,10,11,8,12,13,14,15]
        vpshufhw        ymm0, ymm0, 57  # ymm0 = ymm0[0,1,2,3,5,6,7,4,8,9,10,11,13,14,15,12]
        ret
f3:                                     # @f3
        vpshuflw        zmm0, zmm0, 57  # zmm0 = zmm0[1,2,3,0,4,5,6,7,9,10,11,8,12,13,14,15,17,18,19,16,20,21,22,23,25,26,27,24,28,29,30,31]
        vpshufhw        zmm0, zmm0, 57  # zmm0 = zmm0[0,1,2,3,5,6,7,4,8,9,10,11,13,14,15,12,16,17,18,19,21,22,23,20,24,25,26,27,29,30,31,28]
        ret

This seems nonsensical. Let's try defining a rotation arithmetically instead, this time rotating 32-bit words by 16:

define <4 x i32> @f0(<4 x i32>) local_unnamed_addr #0 {
  %2 = lshr <4 x i32> %0, <i32 16, i32 16, i32 16, i32 16>
  %3 = shl <4 x i32> %0, <i32 16, i32 16, i32 16, i32 16>
  %4 = or <4 x i32> %3, %2
  ret <4 x i32> %4
}
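
For comparison, @f0 is just the familiar shift-based rotate as one would write it in C with clang/GCC vector extensions; a sketch (names are illustrative):

#include <stdint.h>

typedef uint32_t v4u32 __attribute__((vector_size(16)));

/* Rotate each 32-bit word by 16 bits via shift + or, matching @f0. */
v4u32 rot16_arith(v4u32 x) {
  return (x << 16) | (x >> 16);
}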

_Now_ we get the byte shuffle:

.LCPI0_0:
        .byte   2                       # 0x2
        .byte   3                       # 0x3
        .byte   0                       # 0x0
        .byte   1                       # 0x1
        .byte   6                       # 0x6
        .byte   7                       # 0x7
        .byte   4                       # 0x4
        .byte   5                       # 0x5
        .byte   10                      # 0xa
        .byte   11                      # 0xb
        .byte   8                       # 0x8
        .byte   9                       # 0x9
        .byte   14                      # 0xe
        .byte   15                      # 0xf
        .byte   12                      # 0xc
        .byte   13                      # 0xd
f0:                                     # @f0
        pshufb  xmm0, xmmword ptr [rip + .LCPI0_0] # xmm0 = xmm0[2,3,0,1,6,7,4,5,10,11,8,9,14,15,12,13]
        ret


However, adding the feature +fast-variable-shuffle produces the intended
pshufb in all cases. (Side note: when +avx512vl is enabled, these byte shuffles
should be pattern-matched into vprold/vprolq, but that is a separate issue.)
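
Concretely, the workaround amounts to something like

        llc -O3 -mattr=+ssse3,+avx2,+avx512bw,+fast-variable-shuffle

which yields the intended pshufb/vpshufb for f1, f2, and f3 as well.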

From a cursory look at the LLVM source code, +fast-variable-shuffle appears to
be enabled only for Haswell and later, even though byte shuffles have been fast
since at least the 45nm Core 2 Duo. Since the default optimization target is
based on Sandy Bridge, which does not have +fast-variable-shuffle (despite byte
shuffles there having twice the throughput they have on Haswell and Skylake!),
the byte shuffles are replaced by the worse `vpshuflw + vpshufhw` pair.

So while +fast-variable-shuffle will fix this particular case, this looks like
an optimization failure somewhere else, given that the rotation implemented
arithmetically is translated into the byte shuffle just fine.
