[llvm-bugs] [Bug 45604] New: Wrong code generated for vector shuffle on x86_64-linux-gnu

Sat Apr 18 21:24:40 PDT 2020

https://bugs.llvm.org/show_bug.cgi?id=45604

            Bug ID: 45604
           Summary: Wrong code generated for vector shuffle on
                    x86_64-linux-gnu
           Product: libraries
           Version: trunk
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: enhancement
          Priority: P
         Component: Backend: X86
          Assignee: unassignedbugs at nondot.org
          Reporter: kazu at google.com
                CC: craig.topper at gmail.com, llvm-bugs at lists.llvm.org,
                    llvm-dev at redking.me.uk, spatel+llvm at rotateright.com

A recent patch:

https://github.com/llvm/llvm-project/commit/e30d29ebc12a29b2843d40e575809d642920b1a4

triggers causes the following testcase to produce an incorrect result:

// driver.cc
#include <stdio.h>

extern "C" void foo(unsigned short* dst, unsigned short *src);

unsigned short src[8] __attribute__((aligned(16))) = {
  0x0100, 0x0101, 0x0102, 0x0103, 0x0104, 0x0105, 0x0106, 0x0107,
};

unsigned short dst[32] __attribute__((aligned(16)));

int main() {
  foo(dst, src);
  for (int i = 0; i < 16; i++) printf("%04x ", dst[i]);
  printf("\n");
  for (int i = 0; i < 16; i++) printf("%04x ", dst[i + 16]);
  printf("\n");
  return 0;
}

; vec.ll
define void @foo(<32 x i16>* %dst, <8 x i16>* %src) {
  %v1 = load <8 x i16>, <8 x i16>* %src, align 16
  %v2 = shufflevector <8 x i16> %v1,
                      <8 x i16> zeroinitializer,
                      <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32
6, i32 7,
                                  i32 8, i32 9, i32 10, i32 11, i32 12, i32 13,
i32 14, i32 15>
  %v3 = shufflevector <16 x i16> %v2,
                      <16 x i16> <i16 11, i16 11, i16 11, i16 11, i16 11, i16
11, i16 11, i16 11,
                                  i16 0, i16 0, i16 0, i16 0, i16 0, i16 0, i16
0, i16 0>,
                      <32 x i32> <i32 0, i32 8, i32 16, i32 24, i32 1, i32 9,
i32 17, i32 25,
                                  i32 2, i32 10, i32 18, i32 26, i32 3, i32 11,
i32 19, i32 27,
                                  i32 4, i32 12, i32 20, i32 28, i32 5, i32 13,
i32 21, i32 29,
                                  i32 6, i32 14, i32 22, i32 30, i32 7, i32 15,
i32 23, i32 31>
  store <32 x i16> %v3, <32 x i16>* %dst, align 16
  ret void
}

Build and run like so on x86_64:

$ path/to/clang -O2 driver.cc vec.ll
$ ./a.out
0100 0101 000b 0000 0101 0000 000b 0000 0102 0000 000b 0000 0103 0000 000b 0000 
0104 0000 000b 0000 0105 0000 000b 0000 0106 0000 000b 0000 0107 0000 000b 0000

Notice that the second number (immediately after 0100) should be 0000, not
0101, because we are printing out:

  %v3[0] %v3[1] ...

which comes from:

  %v2[0] %v2[8] ...

which in turn comes from:

  %v1[0] zeroinitializer[0] ...

If I revert Simon's patch, I get 0000 as expected.

The patch in question introduces code like so around
llvm/lib/Target/X86/X86ISelLowering.cpp:7466:

    while (Scl.getOpcode() == ISD::TRUNCATE ||
           Scl.getOpcode() == ISD::ANY_EXTEND ||
           Scl.getOpcode() == ISD::ZERO_EXTEND)
      Scl = Scl.getOperand(0);

The test case above reaches this "while" loop with the selection DAG nodes like
so:

t128: v4i32 = scalar_to_vector t127                       <- N
  t127: i32 = zero_extend t87                             <- SCl (initially)
    t87: i16 = truncate t86
      t86: i32 = extract_vector_elt t85, Constant:i64<0>  <- SrcExtract
        t85: v4i32 = bitcast t7                           <- SrcVec

Note that we extract an i32 value, truncate it down to i16, and zero-extend it
back to i32.

I haven't completely understood the code being patched, but here is my current
theory.

The author of the patch seems to want to express the nodes between
scalar_to_vector and extract_vector_elt, inclusive, as one shuffle instruction
on t85.

However, if we express the expression as a shuffle instruction on the bottom
lane of t85, we end up with 32 bits with the rest being 0. What we want is the
bottom 16 bits of t85 with the rest being 0 because we have truncation down to
i16.

If we want to strip ISD::TRUNCATE, we should probably determine the smallest
number of bits required from the source operand (16 bits from t85 in this
case).  Then issue a shuffle on a v8i16 value sandwiched with bitcasts like so:

  t??: v4i32 = bitcast t??
    t??: v8i16 = shuffle t??
      t??: v8i16 bitcast t85

-- 
You are receiving this mail because:
You are on the CC list for the bug.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-bugs/attachments/20200419/c1738f74/attachment.html>