<html><head><meta http-equiv="Content-Type" content="text/html charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><div class="">Hi Dimitry,</div><div class=""><br class=""></div>Thanks for the report! I took a quick look at the test and found the following:<div class=""><br class=""></div><div class="">1) Before SLP we have this:</div><div class=""><br class=""></div><div class=""><font face="Menlo" class="">@cout = global [24 x i8] zeroinitializer, <b class="">align 8</b></font></div><div class=""><font face="Menlo" class="">…</font></div><div class=""><font face="Menlo" class="">define void @_Z7do_initv() #0 {</font></div><div class=""><div class=""><font face="Menlo" class="">entry:</font></div><div class=""><font face="Menlo" class=""> store i32 (...)** bitcast (i8** getelementptr inbounds ([10 x i8*], [10 x i8*]* @_ZTV1BIciE, i64 0, i64 3) to i32 (...)**), i32 (...)*** bitcast ([24 x i8]* @cout to i32 (...)***), align 8, !tbaa !2</font></div><div class=""><font face="Menlo" class=""> store i32 (...)** bitcast (i8** getelementptr inbounds ([10 x i8*], [10 x i8*]* @_ZTV1BIciE, i64 0, i64 8) to i32 (...)**), i32 (...)*** bitcast (i8* getelementptr inbounds ([24 x i8], [24 x i8]* @cout, i64 0, i64 8) to i32 (...)***), <b class="">align 8</b>, !tbaa !2</font></div><div class=""><font face="Menlo" class=""> ret void</font></div><div class=""><font face="Menlo" class="">}</font></div></div><div class=""><br class=""></div><div class="">2) After SLP we get this:</div><div class=""><div class=""><br class=""></div><div class=""><font face="Menlo" class="">@cout = global [24 x i8] zeroinitializer, <b class="">align 8</b></font></div><div class=""><div class=""><font face="Menlo" class="">…</font></div></div><div class=""><font face="Menlo" class="">define void @_Z7do_initv() #2 {</font></div><div class=""><font face="Menlo" class="">entry:</font></div><div class=""><font face="Menlo" 
class=""> store <2 x i32 (...)**> <i32 (...)** bitcast (i8** getelementptr inbounds ([10 x i8*], [10 x i8*]* @_ZTV1BIciE, i64 0, i64 3) to i32 (...)**), i32 (...)** bitcast (i8** getelementptr inbounds ([10 x i8*], [10 x i8*]* @_ZTV1BIciE, i64 0, i64 8) to i32 (...)**)>, <2 x i32 (...)**>* bitcast ([24 x i8]* @cout to <2 x i32 (...)**>*), <b class="">align 8</b>, !tbaa !2</font></div><div class=""><font face="Menlo" class=""> ret void</font></div><div class=""><font face="Menlo" class="">}</font></div></div><div class=""><br class=""></div><div class="">3) And then, after instcombine, we get this:</div><div class=""><div class=""><br class=""></div><div class=""><font face="Menlo" class="">@cout = global [24 x i8] zeroinitializer, <b class="">align 16</b></font></div></div><div class=""><div class=""><font face="Menlo" class="">…</font></div><div class=""><font face="Menlo" class="">; Function Attrs: nounwind ssp uwtable</font></div><div class=""><font face="Menlo" class="">define void @_Z7do_initv() #2 {</font></div><div class=""><font face="Menlo" class="">entry:</font></div><div class=""><font face="Menlo" class=""> store <2 x i32 (...)**> <i32 (...)** bitcast (i8** getelementptr inbounds ([10 x i8*], [10 x i8*]* @_ZTV1BIciE, i64 0, i64 3) to i32 (...)**), i32 (...)** bitcast (i8** getelementptr inbounds ([10 x i8*], [10 x i8*]* @_ZTV1BIciE, i64 0, i64 8) to i32 (...)**)>, <2 x i32 (...)**>* bitcast ([24 x i8]* @cout to <2 x i32 (...)**>*), <b class="">align 16</b>, !tbaa !2</font></div><div class=""><font face="Menlo" class=""> ret void</font></div><div class=""><font face="Menlo" class="">}</font></div></div><div class=""><br class=""></div><div class="">I’ll take a look at instcombine to understand why it replaces "align 8" with "align 16" in this case. 
Maybe it’s just a bug, or maybe we shouldn’t vectorize this case for some reason.</div><div class=""><br class=""></div><div class="">Thanks,</div><div class="">Michael</div><div class=""><br class=""></div><div class=""><br class=""><div><blockquote type="cite" class=""><div class="">On Oct 10, 2015, at 11:53 AM, Dimitry Andric <<a href="mailto:dimitry@andric.com" class="">dimitry@andric.com</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div class="">On 19 Jun 2015, at 19:40, Michael Zolotukhin <<a href="mailto:mzolotukhin@apple.com" class="">mzolotukhin@apple.com</a>> wrote:<br class=""><blockquote type="cite" class="">Author: mzolotukhin<br class="">Date: Fri Jun 19 12:40:15 2015<br class="">New Revision: 240144<br class=""><br class="">URL: <a href="http://llvm.org/viewvc/llvm-project?rev=240144&view=rev" class="">http://llvm.org/viewvc/llvm-project?rev=240144&view=rev</a><br class="">Log:<br class="">[SLP] Vectorize for all-constant entries.<br class=""><br class="">Differential Revision: <a href="http://reviews.llvm.org/D10558" class="">http://reviews.llvm.org/D10558</a><br class=""></blockquote><br class="">Hi,<br class=""><br class="">tl;dr: after some research, I found that this commit appears to change the behavior of llvm in such a way as to cause trouble for libc++ users on FreeBSD, due to alignment changes.<br class=""><br class="">Recently, I upgraded llvm and clang to 3.7.0 in the FreeBSD base system. Shortly afterwards, I received reports from users that some of their previously compiled C++ applications (dynamically linked to libc++.so.1) were crashing with SIGBUS errors. 
The stack traces always looked like this:<br class=""><br class=""> #0 0x0000000800c95cfd in std::__1::basic_ostream<char, std::__1::char_traits<char> >::basic_ostream (this=<optimized out>, __sb=<optimized out>) at /usr/src/lib/libc++/../../contrib/libc++/include/ostream:280<br class=""> #1 std::__1::ios_base::Init::Init (this=<optimized out>) at /usr/src/lib/libc++/../../contrib/libc++/src/iostream.cpp:53<br class=""> #2 0x0000000800c96029 in ?? () at /usr/src/lib/libc++/../../contrib/libc++/src/iostream.cpp:44 from /usr/obj<br class=""> /usr/src/lib/libc++/libc++.so.1<br class=""> #3 0x0000000800cef352 in ?? () from /usr/obj/usr/src/lib/libc++/libc++.so.1<br class=""> #4 0x0000000800628c00 in ?? ()<br class=""> #5 0x00007fffffffdf18 in ?? ()<br class=""> #6 0x00007fffffffdea0 in ?? ()<br class=""> #7 0x0000000800c925c6 in _init () from /usr/obj/usr/src/lib/libc++/libc++.so.1<br class=""> #8 0x00007fffffffdea0 in ?? ()<br class=""><br class="">E.g. it crashed in ios_base::Init::Init(), at this line:<br class=""><br class=""> ostream* cout_ptr = ::new(cout) ostream(::new(__cout) __stdoutbuf<char>(stdout, &mb_cout));<br class=""><br class="">and the relevant disassembly is:<br class=""><br class=""> 0x0000000800c95ccf <+255>: callq 0x800c96210 <std::__1::__stdoutbuf<char>::__stdoutbuf(__sFILE*, __mbstate_t*)><br class=""> 0x0000000800c95cd4 <+260>: mov 0x27d565(%rip),%rax # 0x800f13240<br class=""> 0x0000000800c95cdb <+267>: lea 0x40(%rax),%rcx<br class=""> 0x0000000800c95cdf <+271>: movq %rcx,%xmm0<br class=""> 0x0000000800c95ce4 <+276>: lea 0x18(%rax),%rcx<br class=""> 0x0000000800c95ce8 <+280>: movq %rcx,%xmm1<br class=""> 0x0000000800c95ced <+285>: punpcklqdq %xmm0,%xmm1<br class=""> 0x0000000800c95cf1 <+289>: movdqa %xmm1,-0x40(%rbp)<br class=""> 0x0000000800c95cf6 <+294>: mov 0x27d86b(%rip),%rcx # 0x800f13568<br class=""> => 0x0000000800c95cfd <+301>: movdqa %xmm1,(%rcx)<br class=""> 0x0000000800c95d01 <+305>: mov (%rax),%r14<br class=""> 0x0000000800c95d04 
<+308>: lea (%rcx,%r14,1),%rbx<br class=""> 0x0000000800c95d08 <+312>: mov %rbx,%rdi<br class=""> 0x0000000800c95d0b <+315>: mov %r15,%rsi<br class=""> 0x0000000800c95d0e <+318>: callq 0x800c92c4c <_ZNSt3__18ios_base4initEPv@plt><br class=""><br class="">In this case, %rcx was the address of std::cout, 0x609068 (or some other address that was aligned to 8 bytes, *not* 16 bytes), which causes movdqa to result in a SIGBUS.<br class=""><br class="">These crashing executables were compiled by clang 3.6.1 (or earlier versions), and it turns out that all of them had a global symbol for std::cout, which was aligned to 8 bytes. For example, in case of the original report, one of the executables showed the following in readelf output:<br class=""><br class=""> Relocation section '.rela.dyn' at offset 0x2070 contains 6 entries:<br class=""> Offset Info Type Symbol's Value Symbol's Name + Addend<br class=""> [...]<br class=""> 0000000000609068 0000003c00000005 R_X86_64_COPY 0000000000609068 _ZNSt3__14coutE + 0<br class=""><br class="">and further down:<br class=""><br class=""> Symbol table '.dynsym' contains 87 entries:<br class=""> Num: Value Size Type Bind Vis Ndx Name<br class=""> [...]<br class=""> 60: 0000000000609068 160 OBJECT GLOBAL DEFAULT 25 _ZNSt3__14coutE<br class=""><br class="">(Note: the executable gets a copy of std::cout, because the linker finds a global data object with copy relocations.)<br class=""><br class="">The std::cout symbol is explicitly declared with an alignment in iostream.cpp, as follows:<br class=""><br class=""> _ALIGNAS_TYPE (ostream) _LIBCPP_FUNC_VIS char cout[sizeof(ostream)];<br class=""><br class="">The alignment of ostream is 8 bytes, therefore the alignment of cout is also 8 bytes.<br class=""><br class="">When libc++ was previously compiled by clang 3.6.1, the assembly generated from ios_base::Init::Init() looked different than above, and std::cout was explicitly aligned to 8 bytes, in the .bss section:<br class=""><br class=""> 
#DEBUG_VALUE: Init:this <- RDI<br class=""> .loc 3 669 5 # /usr/src/lib/libc++/../../contrib/libc++/include/ios:669:5<br class=""> movq $0, 136(%r15,%rbx)<br class=""> .loc 3 670 5 # /usr/src/lib/libc++/../../contrib/libc++/include/ios:670:5<br class=""> movl $-1, 144(%r15,%rbx)<br class="">.Ltmp44:<br class=""> .loc 2 53 77 # /usr/src/lib/libc++/../../contrib/libc++/src/iostream.cpp:53:77<br class=""> movq __stdoutp@GOTPCREL(%rip), %r15<br class=""> movq (%r15), %rsi<br class=""> .loc 2 53 45 is_stmt 0 # /usr/src/lib/libc++/../../contrib/libc++/src/iostream.cpp:53:45<br class=""> leaq _ZNSt3__1L6__coutE(%rip), %r12<br class=""> leaq _ZNSt3__1L7mb_coutE(%rip), %rdx<br class=""> movq %r12, %rdi<br class="">.Ltmp45:<br class=""> callq _ZNSt3__111__stdoutbufIcEC2EP7__sFILEP11__mbstate_t<br class=""> .loc 20 280 1 is_stmt 1 # /usr/src/lib/libc++/../../contrib/libc++/include/ostream:280:1<br class="">.Ltmp46:<br class=""> movq _ZTVNSt3__113basic_ostreamIcNS_11char_traitsIcEEEE@GOTPCREL(%rip), %rax<br class=""> leaq 24(%rax), %rcx<br class=""> movq %rcx, -64(%rbp) # 8-byte Spill<br class=""> movq _ZNSt3__14coutE@GOTPCREL(%rip), %rbx<br class=""> movq %rcx, (%rbx)<br class=""> leaq 64(%rax), %rcx<br class=""> movq %rcx, -48(%rbp) # 8-byte Spill<br class=""> movq %rcx, 8(%rbx)<br class=""> .loc 20 281 5 # /usr/src/lib/libc++/../../contrib/libc++/include/ostream:281:5<br class="">.Ltmp47:<br class=""> movq (%rax), %r14<br class=""> leaq (%rbx,%r14), %rdi<br class=""> .loc 3 668 15 # /usr/src/lib/libc++/../../contrib/libc++/include/ios:668:15<br class="">.Ltmp48:<br class="">.Ltmp6:<br class=""> movq %r12, %rsi<br class=""> callq _ZNSt3__18ios_base4initEPv@PLT<br class="">[... 
skip to .bss ...]<br class=""> .type _ZNSt3__14coutE,@object # @_ZNSt3__14coutE<br class=""> .globl _ZNSt3__14coutE<br class=""> .align 8<br class="">_ZNSt3__14coutE:<br class=""> .zero 160<br class=""> .size _ZNSt3__14coutE, 160<br class=""><br class="">In contrast, when you compile the same file with clang 3.7.0, the assembly becomes similar to the crashing case:<br class=""><br class=""> #DEBUG_VALUE: Init:this <- RDI<br class=""> .loc 3 669 12 # /usr/src/lib/libc++/../../contrib/libc++/include/ios:669:12<br class=""> movq $0, 136(%rbx)<br class=""> .loc 3 670 13 # /usr/src/lib/libc++/../../contrib/libc++/include/ios:670:13<br class=""> movl $-1, 144(%rbx)<br class="">.Ltmp44:<br class=""> .loc 2 53 77 # /usr/src/lib/libc++/../../contrib/libc++/src/iostream.cpp:53:77<br class=""> movq __stdoutp@GOTPCREL(%rip), %r12<br class=""> movq (%r12), %rsi<br class=""> .loc 2 53 59 is_stmt 0 # /usr/src/lib/libc++/../../contrib/libc++/src/iostream.cpp:53:59<br class=""> leaq _ZNSt3__1L6__coutE(%rip), %r15<br class=""> leaq _ZNSt3__1L7mb_coutE(%rip), %rdx<br class=""> movq %r15, %rdi<br class="">.Ltmp45:<br class=""> callq _ZNSt3__111__stdoutbufIcEC2EP7__sFILEP11__mbstate_t<br class=""> .loc 20 280 1 is_stmt 1 # /usr/src/lib/libc++/../../contrib/libc++/include/ostream:280:1<br class="">.Ltmp46:<br class=""> movq _ZTVNSt3__113basic_ostreamIcNS_11char_traitsIcEEEE@GOTPCREL(%rip), %rax<br class=""> leaq 64(%rax), %rcx<br class=""> movd %rcx, %xmm0<br class=""> leaq 24(%rax), %rcx<br class=""> movd %rcx, %xmm1<br class=""> punpcklqdq %xmm0, %xmm1 # xmm1 = xmm1[0],xmm0[0]<br class=""> movdqa %xmm1, -64(%rbp) # 16-byte Spill<br class=""> movq _ZNSt3__14coutE@GOTPCREL(%rip), %rcx<br class=""> movdqa %xmm1, (%rcx)<br class="">.Ltmp47:<br class=""> .loc 20 281 5 # /usr/src/lib/libc++/../../contrib/libc++/include/ostream:281:5<br class=""> movq (%rax), %r14<br class="">.Ltmp48:<br class=""> .loc 20 281 5 is_stmt 0 # /usr/src/lib/libc++/../../contrib/libc++/include/ostream:281:5<br 
class=""> leaq (%rcx,%r14), %rbx<br class=""> .loc 3 668 5 is_stmt 1 # /usr/src/lib/libc++/../../contrib/libc++/include/ios:668:5<br class="">.Ltmp49:<br class="">.Ltmp6:<br class=""> movq %rbx, %rdi<br class=""> movq %r15, %rsi<br class=""> callq _ZNSt3__18ios_base4initEPv@PLT<br class=""><br class="">and the definition of std::cout now has 16-byte alignment instead:<br class=""><br class=""> .type _ZNSt3__14coutE,@object # @_ZNSt3__14coutE<br class=""> .globl _ZNSt3__14coutE<br class=""> .align 16<br class="">_ZNSt3__14coutE:<br class=""> .zero 160<br class=""> .size _ZNSt3__14coutE, 160<br class=""><br class="">Bisecting has shown that r240144 is the commit where this behavior changed, i.e. before this commit, std::cout is aligned at 8 bytes, and the instructions to access it take this into account. After this commit, it becomes aligned at 16 bytes, and it is then accessed using SSE (movdqa, etc). (Note that it is certainly possible that r240144 is just exposing a deeper issue, instead of being the root cause!)<br class=""><br class="">So unfortunately, this new alignment makes a dynamic libc++ library, compiled by clang after this commit, incompatible with previously built applications. This is a big problem for FreeBSD, since we rather value backwards compatibility. :-)<br class=""><br class="">I have had several discussions with people who indicate that 16 byte alignment is an x86_64 ABI requirement, at least for "large enough" objects. This may very well be true, but on the other hand, if the program authors explicitly specify that their objects must be aligned to X bytes, where X < 16, then the compiler should obey this, right? And changing existing behavior is incompatible with previously compiled programs.<br class=""><br class="">As a temporary workaround, I have now reverted r240144 in FreeBSD, which restores the previous behavior, and aligns std::cout to 8 bytes again. 
But I would like to bring this to your attention, in the hope that we can find out what the real fix should be.<br class=""><br class="">And to finish this way too long mail, I am attaching a minimized test case, which shows the behavior, and which should be appropriate to attach to a PR, if that is needed later on.<br class=""><br class="">This test case should be compiled with -O2, to see the effects. Clang 3.6.x will result in the following IR for std::cout:<br class=""><br class=""> @cout = global [24 x i8] zeroinitializer, align 8<br class=""><br class="">while clang 3.7.0 will result in:<br class=""><br class=""> @cout = global [24 x i8] zeroinitializer, align 16<br class=""><br class="">-Dimitry<br class=""><span id="cid:6A8C6927-64C4-4DC2-95F2-D79C7D297DE1@hsd1.ca.comcast.net"><cout-align.cpp></span></div></div></blockquote></div><br class=""></div></body></html>