[LLVMbugs] [Bug 22563] New: Incorrect code generation with arrays of __m256 variables

Thu Feb 12 04:32:07 PST 2015

http://llvm.org/bugs/show_bug.cgi?id=22563

            Bug ID: 22563
           Summary: Incorrect code generation with arrays of __m256
                    variables
           Product: new-bugs
           Version: 3.6
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P
         Component: new bugs
          Assignee: unassignedbugs at nondot.org
          Reporter: jasonr at 3db-labs.com
                CC: llvmbugs at cs.uiuc.edu
    Classification: Unclassified

Created attachment 13854
  --> http://llvm.org/bugs/attachment.cgi?id=13854&action=edit
C++ source file that demonstrates the issue

NOTE: This was originally posted on Stack Overflow[1]. After getting some
concurrence that this is likely a clang/LLVM bug, I posted it here.

I'm encountering what appears to be a bug causing incorrect code generation
with clang 3.4, 3.5, and 3.6. The source that actually triggered the problem is
quite complicated, but I've been able to reduce it to a self-contained example
that is attached to this report.

A summary of the code: I have a simple type called `simd_pack` that contains
one member, an array of one `__m256i` value. In my application, there are
operators and functions that take these types, but the problem can be
illustrated by the attached example. Specifically, `test_broken()` should read
from the `in1` array and then just copy its value over to the `out` array.
Therefore, the call to `memcmp()` in `main()` should return zero. I compile the
attached file using the following:

    clang++-3.6 bug_test.cc -o bug_test -mavx -O3
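
For readers without the attachment, the shape of the test described above can be sketched roughly as follows. This is not the attached file. The `vec256` typedef is a stand-in for `__m256i` built with the GCC/Clang `vector_size` extension so that the sketch compiles without AVX flags, the `memcpy()` calls stand in for the unaligned vector load and store, and `run_copy_test()` is a hypothetical driver playing the role of the `memcmp()` check in `main()`. As written, it is not expected to reproduce the miscompile.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Assumption: stand-in for __m256i, a 32-byte vector type defined with the
// GCC/Clang vector_size extension so no AVX compiler flags are required.
typedef long long vec256 __attribute__((vector_size(32)));

// The wrapper described in the report: one member, an array of one vector.
struct simd_pack {
    vec256 data[1];
};

// Copies len bytes from in1 to out in 32-byte blocks, passing each block
// through a simd_pack. A tail shorter than 32 bytes is left untouched.
void test_broken(signed char *out, signed char *in1, std::size_t len)
{
    for (std::size_t i = 0; i + sizeof(vec256) <= len; i += sizeof(vec256)) {
        simd_pack p;
        std::memcpy(&p.data[0], in1 + i, sizeof(vec256)); // unaligned load
        std::memcpy(out + i, &p.data[0], sizeof(vec256)); // unaligned store
    }
}

// Hypothetical driver: returns 0 when the output matches the input.
int run_copy_test()
{
    const std::size_t len = 256;
    assert(len % sizeof(vec256) == 0); // whole number of 32-byte blocks
    signed char in1[len];
    signed char out[len];
    for (std::size_t i = 0; i < len; ++i) {
        in1[i] = static_cast<signed char>(i % 100);
        out[i] = 0;
    }
    test_broken(out, in1, len);
    return std::memcmp(out, in1, len);
}
```

A correct build makes `run_copy_test()` return zero, which is the condition the attached test checks.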

I find that on optimization levels `-O0` and `-O1`, the test passes, while on
levels `-O2` and `-O3`, the test fails. I've tried compiling the same file with
gcc 4.4, 4.6, 4.7, and 4.8, as well as Intel C++ 13.0, and the test passes on
all optimization levels.

Taking a closer look at the generated code, here's the assembly generated on
optimization level `-O3`:

    0000000000400a40 <test_broken(signed char*, signed char*, unsigned long)>:
      400a40:       55                      push   %rbp
      400a41:       48 89 e5                mov    %rsp,%rbp
      400a44:       48 81 e4 e0 ff ff ff    and    $0xffffffffffffffe0,%rsp
      400a4b:       48 83 ec 40             sub    $0x40,%rsp
      400a4f:       48 83 fa 20             cmp    $0x20,%rdx
      400a53:       72 2f                   jb     400a84 <test_broken(signed char*, signed char*, unsigned long)+0x44>
      400a55:       31 c0                   xor    %eax,%eax
      400a57:       66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
      400a5e:       00 00 
      400a60:       c5 fc 10 04 06          vmovups (%rsi,%rax,1),%ymm0
      400a65:       c5 f8 29 04 24          vmovaps %xmm0,(%rsp)
      400a6a:       c5 fc 28 04 24          vmovaps (%rsp),%ymm0
      400a6f:       c5 fc 11 04 07          vmovups %ymm0,(%rdi,%rax,1)
      400a74:       48 8d 48 20             lea    0x20(%rax),%rcx
      400a78:       48 83 c0 3f             add    $0x3f,%rax
      400a7c:       48 39 d0                cmp    %rdx,%rax
      400a7f:       48 89 c8                mov    %rcx,%rax
      400a82:       72 dc                   jb     400a60 <test_broken(signed char*, signed char*, unsigned long)+0x20>
      400a84:       48 89 ec                mov    %rbp,%rsp
      400a87:       5d                      pop    %rbp
      400a88:       c5 f8 77                vzeroupper 
      400a8b:       c3                      retq   
      400a8c:       0f 1f 40 00             nopl   0x0(%rax)

I'll reproduce the key part for emphasis:

      400a60:       c5 fc 10 04 06          vmovups (%rsi,%rax,1),%ymm0
      400a65:       c5 f8 29 04 24          vmovaps %xmm0,(%rsp)
      400a6a:       c5 fc 28 04 24          vmovaps (%rsp),%ymm0
      400a6f:       c5 fc 11 04 07          vmovups %ymm0,(%rdi,%rax,1)

The generated code is strange. It first loads 256 bits into `ymm0` using the
unaligned move that I asked for, then it stores `xmm0` (which only contains the
lower 128 bits of the data that was read) to the stack, then immediately reads
256 bits into `ymm0` from the stack location that was just written to. The
effect is that `ymm0`'s upper 128 bits (which get written to the output buffer)
are garbage, causing the test to fail.
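
The effect of those four instructions can be modelled at the byte level. The following is only an illustrative sketch, not SIMD code: plain byte arrays stand in for `ymm0` and for the 32-byte stack slot at `(%rsp)`, and the name `model_bad_copy` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

const std::size_t kYmmBytes = 32; // width of a ymm register
const std::size_t kXmmBytes = 16; // width of an xmm register (low half of ymm)

// Byte-level model of the emitted sequence. stack_slot represents the
// 32-byte stack location; its prior contents are whatever the caller put there.
void model_bad_copy(const unsigned char *in, unsigned char *out,
                    unsigned char *stack_slot)
{
    assert(in && out && stack_slot);
    unsigned char ymm0[kYmmBytes];
    std::memcpy(ymm0, in, kYmmBytes);         // vmovups (%rsi,%rax,1),%ymm0
    std::memcpy(stack_slot, ymm0, kXmmBytes); // vmovaps %xmm0,(%rsp)
    std::memcpy(ymm0, stack_slot, kYmmBytes); // vmovaps (%rsp),%ymm0
    std::memcpy(out, ymm0, kYmmBytes);        // vmovups %ymm0,(%rdi,%rax,1)
}
```

In this model the low 16 bytes of `out` always match `in`, while the high 16 bytes are whatever was previously in the upper half of the stack slot, which is why the `memcmp()` check fails.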

Are there any particular optimization steps that could be disabled to work
around this issue, or a different way to express my intent in code that might
not trigger it? I apologize for the lack of a reduced bitcode test case as
explained here[2], but I'm not familiar enough with the toolchain to drive the
tools properly.

[1]: http://stackoverflow.com/questions/28462707/is-this-incorrect-code-generation-with-arrays-of-m256-values-a-clang-bug
[2]: http://llvm.org/docs/HowToSubmitABug.html
