[cfe-dev] libc++ std::cout alignment trouble (was: Re: [llvm] r240144 - [SLP] Vectorize for all-constant entries.)

Sat Oct 10 14:17:38 PDT 2015

Hi,

This is a smaller reproducer of the issue:

$ cat test.ll
target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.10.0"

@cout = global [24 x i8] zeroinitializer, align 8

define void @foo() {
entry:
  store i64 100, i64* bitcast ([24 x i8]* @cout to i64*), align 8
  store i64 200, i64* bitcast (i8* getelementptr inbounds ([24 x i8], [24 x i8]* @cout, i64 0, i64 8) to i64*), align 8
  ret void
}

$ opt -slp-vectorizer < test.ll -S
; ModuleID = '<stdin>'
target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.10.0"

@cout = global [24 x i8] zeroinitializer, align 8

define void @foo() {
entry:
  store <2 x i64> <i64 100, i64 200>, <2 x i64>* bitcast ([24 x i8]* @cout to <2 x i64>*), align 8
  ret void
}

$ opt -slp-vectorizer < test.ll  | opt -instcombine -S
; ModuleID = '<stdin>'
target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.10.0"

@cout = global [24 x i8] zeroinitializer, align 16

define void @foo() {
entry:
  store <2 x i64> <i64 100, i64 200>, <2 x i64>* bitcast ([24 x i8]* @cout to <2 x i64>*), align 16
  ret void
}

It looks like instcombine intentionally rewrites the alignment. It happens in getOrEnforceKnownAlignment called by InstCombiner::visitStoreInst. Then compiler performs several checks and decides that it can bump up the alignment, which is presumable illegal in this case.

Reid,
it looks like you relatively recently touched code in enforceKnownAlignment. Could you please comment on whether it’s correct behavior or not?

Thanks,
Michael

> On Oct 10, 2015, at 12:24 PM, Michael Zolotukhin via cfe-dev <cfe-dev at lists.llvm.org> wrote:
> 
> Hi Dimitry,
> 
> Thanks for the report! I took a quick look at the test and found the following:
> 
> 1) Before SLP we have this:
> 
> @cout = global [24 x i8] zeroinitializer, align 8
> …
> define void @_Z7do_initv() #0 {
> entry:
>   store i32 (...)** bitcast (i8** getelementptr inbounds ([10 x i8*], [10 x i8*]* @_ZTV1BIciE, i64 0, i64 3) to i32 (...)**), i32 (...)*** bitcast ([24 x i8]* @cout to i32 (...)***), align 8, !tbaa !2
>   store i32 (...)** bitcast (i8** getelementptr inbounds ([10 x i8*], [10 x i8*]* @_ZTV1BIciE, i64 0, i64 8) to i32 (...)**), i32 (...)*** bitcast (i8* getelementptr inbounds ([24 x i8], [24 x i8]* @cout, i64 0, i64 8) to i32 (...)***), align 8, !tbaa !2
>   ret void
> }
> 
> 2) After SLP we get this:
> 
> @cout = global [24 x i8] zeroinitializer, align 8
> …
> define void @_Z7do_initv() #2 {
> entry:
>   store <2 x i32 (...)**> <i32 (...)** bitcast (i8** getelementptr inbounds ([10 x i8*], [10 x i8*]* @_ZTV1BIciE, i64 0, i64 3) to i32 (...)**), i32 (...)** bitcast (i8** getelementptr inbounds ([10 x i8*], [10 x i8*]* @_ZTV1BIciE, i64 0, i64 8) to i32 (...)**)>, <2 x i32 (...)**>* bitcast ([24 x i8]* @cout to <2 x i32 (...)**>*), align 8, !tbaa !2
>   ret void
> }
> 
> 3) And then, after instcobmine, we get this:
> 
> @cout = global [24 x i8] zeroinitializer, align 16
> …
> ; Function Attrs: nounwind ssp uwtable
> define void @_Z7do_initv() #2 {
> entry:
>   store <2 x i32 (...)**> <i32 (...)** bitcast (i8** getelementptr inbounds ([10 x i8*], [10 x i8*]* @_ZTV1BIciE, i64 0, i64 3) to i32 (...)**), i32 (...)** bitcast (i8** getelementptr inbounds ([10 x i8*], [10 x i8*]* @_ZTV1BIciE, i64 0, i64 8) to i32 (...)**)>, <2 x i32 (...)**>* bitcast ([24 x i8]* @cout to <2 x i32 (...)**>*), align 16, !tbaa !2
>   ret void
> }
> 
> I’ll take a look at instombine to understand why it replaces "align 8" with "align 16” in this case. Maybe it’s just a bug, or maybe we shouldn’t vectorize this case for some reason.
> 
> Thanks,
> Michael
> 
> 
>> On Oct 10, 2015, at 11:53 AM, Dimitry Andric <dimitry at andric.com <mailto:dimitry at andric.com>> wrote:
>> 
>> On 19 Jun 2015, at 19:40, Michael Zolotukhin <mzolotukhin at apple.com <mailto:mzolotukhin at apple.com>> wrote:
>>> Author: mzolotukhin
>>> Date: Fri Jun 19 12:40:15 2015
>>> New Revision: 240144
>>> 
>>> URL: http://llvm.org/viewvc/llvm-project?rev=240144&view=rev <http://llvm.org/viewvc/llvm-project?rev=240144&view=rev>
>>> Log:
>>> [SLP] Vectorize for all-constant entries.
>>> 
>>> Differential Revision: http://reviews.llvm.org/D10558 <http://reviews.llvm.org/D10558>
>> 
>> Hi,
>> 
>> tl;dr: after some research, I found that this commit appears to change the behavior of llvm in such a way as to cause trouble for libc++ users on FreeBSD, due to alignment changes.
>> 
>> Recently, I upgraded llvm and clang to 3.7.0 in the FreeBSD base system.  Shortly afterwards, I received reports from users that some of their previously compiled C++ applications (dynamically linked to libc++.so.1) were crashing with SIGBUS errors.  The stack traces always looked like this:
>> 
>>    #0  0x0000000800c95cfd in std::__1::basic_ostream<char, std::__1::char_traits<char> >::basic_ostream (this=<optimized out>, __sb=<optimized out>) at /usr/src/lib/libc++/../../contrib/libc++/include/ostream:280
>>    #1  std::__1::ios_base::Init::Init (this=<optimized out>) at /usr/src/lib/libc++/../../contrib/libc++/src/iostream.cpp:53
>>    #2  0x0000000800c96029 in ?? () at /usr/src/lib/libc++/../../contrib/libc++/src/iostream.cpp:44 from /usr/obj
>>    /usr/src/lib/libc++/libc++.so.1
>>    #3  0x0000000800cef352 in ?? () from /usr/obj/usr/src/lib/libc++/libc++.so.1
>>    #4  0x0000000800628c00 in ?? ()
>>    #5  0x00007fffffffdf18 in ?? ()
>>    #6  0x00007fffffffdea0 in ?? ()
>>    #7  0x0000000800c925c6 in _init () from /usr/obj/usr/src/lib/libc++/libc++.so.1
>>    #8  0x00007fffffffdea0 in ?? ()
>> 
>> E.g. it crashed in ios_base::Init::Init(), at this line:
>> 
>>    ostream* cout_ptr = ::new(cout) ostream(::new(__cout) __stdoutbuf<char>(stdout, &mb_cout));
>> 
>> and the relevant disassembly is:
>> 
>>    0x0000000800c95ccf <+255>:   callq  0x800c96210 <std::__1::__stdoutbuf<char>::__stdoutbuf(__sFILE*, __mbstate_t*)>
>>    0x0000000800c95cd4 <+260>:   mov    0x27d565(%rip),%rax        # 0x800f13240
>>    0x0000000800c95cdb <+267>:   lea    0x40(%rax),%rcx
>>    0x0000000800c95cdf <+271>:   movq   %rcx,%xmm0
>>    0x0000000800c95ce4 <+276>:   lea    0x18(%rax),%rcx
>>    0x0000000800c95ce8 <+280>:   movq   %rcx,%xmm1
>>    0x0000000800c95ced <+285>:   punpcklqdq %xmm0,%xmm1
>>    0x0000000800c95cf1 <+289>:   movdqa %xmm1,-0x40(%rbp)
>>    0x0000000800c95cf6 <+294>:   mov    0x27d86b(%rip),%rcx        # 0x800f13568
>> => 0x0000000800c95cfd <+301>:   movdqa %xmm1,(%rcx)
>>    0x0000000800c95d01 <+305>:   mov    (%rax),%r14
>>    0x0000000800c95d04 <+308>:   lea    (%rcx,%r14,1),%rbx
>>    0x0000000800c95d08 <+312>:   mov    %rbx,%rdi
>>    0x0000000800c95d0b <+315>:   mov    %r15,%rsi
>>    0x0000000800c95d0e <+318>:   callq  0x800c92c4c <_ZNSt3__18ios_base4initEPv at plt>
>> 
>> In this case, %rcx was the address of std::cout, 0x609068 (or some other address that was aligned to 8 bytes, *not* 16 bytes), which causes movdqa to result in a SIGBUS.
>> 
>> These crashing executables were compiled by clang 3.6.1 (or earlier versions), and it turns out that all of them had a global symbol for std::cout, which was aligned to 8 bytes.  For example, in case of the original report, one of the executables showed the following in readelf output:
>> 
>>    Relocation section '.rela.dyn' at offset 0x2070 contains 6 entries:
>>        Offset             Info             Type               Symbol's Value  Symbol's Name + Addend
>>    [...]
>>    0000000000609068  0000003c00000005 R_X86_64_COPY          0000000000609068 _ZNSt3__14coutE + 0
>> 
>> and further down:
>> 
>>    Symbol table '.dynsym' contains 87 entries:
>>       Num:    Value          Size Type    Bind   Vis      Ndx Name
>>    [...]
>>        60: 0000000000609068   160 OBJECT  GLOBAL DEFAULT   25 _ZNSt3__14coutE
>> 
>> (Note: the executable gets a copy of std::cout, because the linker finds a global data object with copy relocations.)
>> 
>> The std::cout symbol is explicitly declared with an alignment in iostream.cpp, as follows:
>> 
>>    _ALIGNAS_TYPE (ostream)  _LIBCPP_FUNC_VIS char cout[sizeof(ostream)];
>> 
>> The alignment of ostream is 8 bytes, therefore the alignment of cout is also 8 bytes.
>> 
>> When libc++ was previously compiled by clang 3.6.1, the assembly generated from ios_base::Init::Init() looked different than above, and std::cout was explicitly aligned to 8 bytes, in the .bss section:
>> 
>>        #DEBUG_VALUE: Init:this <- RDI
>>        .loc    3 669 5                 # /usr/src/lib/libc++/../../contrib/libc++/include/ios:669:5
>>        movq    $0, 136(%r15,%rbx)
>>        .loc    3 670 5                 # /usr/src/lib/libc++/../../contrib/libc++/include/ios:670:5
>>        movl    $-1, 144(%r15,%rbx)
>> .Ltmp44:
>>        .loc    2 53 77                 # /usr/src/lib/libc++/../../contrib/libc++/src/iostream.cpp:53:77
>>        movq    __stdoutp at GOTPCREL(%rip), %r15
>>        movq    (%r15), %rsi
>>        .loc    2 53 45 is_stmt 0       # /usr/src/lib/libc++/../../contrib/libc++/src/iostream.cpp:53:45
>>        leaq    _ZNSt3__1L6__coutE(%rip), %r12
>>        leaq    _ZNSt3__1L7mb_coutE(%rip), %rdx
>>        movq    %r12, %rdi
>> .Ltmp45:
>>        callq   _ZNSt3__111__stdoutbufIcEC2EP7__sFILEP11__mbstate_t
>>        .loc    20 280 1 is_stmt 1      # /usr/src/lib/libc++/../../contrib/libc++/include/ostream:280:1
>> .Ltmp46:
>>        movq    _ZTVNSt3__113basic_ostreamIcNS_11char_traitsIcEEEE at GOTPCREL(%rip), %rax
>>        leaq    24(%rax), %rcx
>>        movq    %rcx, -64(%rbp)         # 8-byte Spill
>>        movq    _ZNSt3__14coutE at GOTPCREL(%rip), %rbx
>>        movq    %rcx, (%rbx)
>>        leaq    64(%rax), %rcx
>>        movq    %rcx, -48(%rbp)         # 8-byte Spill
>>        movq    %rcx, 8(%rbx)
>>        .loc    20 281 5                # /usr/src/lib/libc++/../../contrib/libc++/include/ostream:281:5
>> .Ltmp47:
>>        movq    (%rax), %r14
>>        leaq    (%rbx,%r14), %rdi
>>        .loc    3 668 15                # /usr/src/lib/libc++/../../contrib/libc++/include/ios:668:15
>> .Ltmp48:
>> .Ltmp6:
>>        movq    %r12, %rsi
>>        callq   _ZNSt3__18ios_base4initEPv at PLT
>> [... skip to .bss ...]
>>        .type   _ZNSt3__14coutE, at object # @_ZNSt3__14coutE
>>        .globl  _ZNSt3__14coutE
>>        .align  8
>> _ZNSt3__14coutE:
>>        .zero   160
>>        .size   _ZNSt3__14coutE, 160
>> 
>> In contrast, when you compile the same file with clang 3.7.0, the assembly becomes similar to the crashing case:
>> 
>>        #DEBUG_VALUE: Init:this <- RDI
>>        .loc    3 669 12                # /usr/src/lib/libc++/../../contrib/libc++/include/ios:669:12
>>        movq    $0, 136(%rbx)
>>        .loc    3 670 13                # /usr/src/lib/libc++/../../contrib/libc++/include/ios:670:13
>>        movl    $-1, 144(%rbx)
>> .Ltmp44:
>>        .loc    2 53 77                 # /usr/src/lib/libc++/../../contrib/libc++/src/iostream.cpp:53:77
>>        movq    __stdoutp at GOTPCREL(%rip), %r12
>>        movq    (%r12), %rsi
>>        .loc    2 53 59 is_stmt 0       # /usr/src/lib/libc++/../../contrib/libc++/src/iostream.cpp:53:59
>>        leaq    _ZNSt3__1L6__coutE(%rip), %r15
>>        leaq    _ZNSt3__1L7mb_coutE(%rip), %rdx
>>        movq    %r15, %rdi
>> .Ltmp45:
>>        callq   _ZNSt3__111__stdoutbufIcEC2EP7__sFILEP11__mbstate_t
>>        .loc    20 280 1 is_stmt 1      # /usr/src/lib/libc++/../../contrib/libc++/include/ostream:280:1
>> .Ltmp46:
>>        movq    _ZTVNSt3__113basic_ostreamIcNS_11char_traitsIcEEEE at GOTPCREL(%rip), %rax
>>        leaq    64(%rax), %rcx
>>        movd    %rcx, %xmm0
>>        leaq    24(%rax), %rcx
>>        movd    %rcx, %xmm1
>>        punpcklqdq      %xmm0, %xmm1    # xmm1 = xmm1[0],xmm0[0]
>>        movdqa  %xmm1, -64(%rbp)        # 16-byte Spill
>>        movq    _ZNSt3__14coutE at GOTPCREL(%rip), %rcx
>>        movdqa  %xmm1, (%rcx)
>> .Ltmp47:
>>        .loc    20 281 5                # /usr/src/lib/libc++/../../contrib/libc++/include/ostream:281:5
>>        movq    (%rax), %r14
>> .Ltmp48:
>>        .loc    20 281 5 is_stmt 0      # /usr/src/lib/libc++/../../contrib/libc++/include/ostream:281:5
>>        leaq    (%rcx,%r14), %rbx
>>        .loc    3 668 5 is_stmt 1       # /usr/src/lib/libc++/../../contrib/libc++/include/ios:668:5
>> .Ltmp49:
>> .Ltmp6:
>>        movq    %rbx, %rdi
>>        movq    %r15, %rsi
>>        callq   _ZNSt3__18ios_base4initEPv at PLT
>> 
>> and the definition of of std::cout is now with 16 bytes alignment instead:
>> 
>>        .type   _ZNSt3__14coutE, at object # @_ZNSt3__14coutE
>>        .globl  _ZNSt3__14coutE
>>        .align  16
>> _ZNSt3__14coutE:
>>        .zero   160
>>        .size   _ZNSt3__14coutE, 160
>> 
>> Bisecting has shown that r240144 is the commit where this behavior changed, i.e. before this commit, std::cout is aligned at 8 bytes, and the instructions to access it take this into account.  After this commit, it becomes aligned at 16 bytes, and it is then accessed using SSE (movdqa, etc).  (Note that it is certainly possible that r240144 is just exposing a deeper issue, instead of being the root cause!)
>> 
>> So unfortunately, this new alignment makes a dynamic libc++ library, compiled by clang after this commit, incompatible with previously built applications.  This is a big problem for FreeBSD, since we rather value backwards compatibility. :-)
>> 
>> I have had several discussions with people who indicate that 16 byte alignment is an x86_64 ABI requirement, at least for "large enough" objects.  This may very well be true, but on the other hand, if the program authors explicitly specify that their objects must be aligned to X bytes, where X < 16, then the compiler should obey this, right?  And changing existing behavior is incompatible with previously compiled programs.
>> 
>> As a temporary workaround, I have now reverted r240144 in FreeBSD, which restores the previous behavior, and aligns std::cout to 8 bytes again.  But I would like to bring this to your attention, in the hope that we can find out what the real fix should be.
>> 
>> And to finish this way too long mail, I am attaching a minimized test case, which shows the behavior, and which should be appropriate to attach to a PR, if that is needed later on.
>> 
>> This test case should be compiled with -O2, to see the effects.  Clang 3.6.x will result in the following IR for std::cout:
>> 
>>    @cout = global [24 x i8] zeroinitializer, align 8
>> 
>> while clang 3.7.0 will result in:
>> 
>>    @cout = global [24 x i8] zeroinitializer, align 16
>> 
>> -Dimitry
>> <cout-align.cpp>
> 
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20151010/0b353ca9/attachment.html>