[llvm-dev] Possible bug in x86 frame lowering with SSE instructions?

Tue Oct 27 04:22:07 PDT 2020

For what it's worth, I found the patch:

https://reviews.llvm.org/D87615

On Tue, Oct 27, 2020 at 5:52 AM Jonathan Smith <jvstech+llvm at gmail.com> wrote:
>
> Interesting. Thank you.
>
> I'm still curious to know what commit fixed this problem, although it
> sounds like it's also a problem with how Solaris is implementing the
> ABI.
>
> I suppose it's time for me to go hunting through commits.
>
> On Tue, Oct 27, 2020 at 2:21 AM Wang, Pengfei <pengfei.wang at intel.com> wrote:
> >
> > Hi Jonathan,
> >
> > It seems the trunk code solves this problem. https://godbolt.org/z/Y1Wdbj
> > I took a look at the x86 ABI: https://gitlab.com/x86-psABIs/i386-ABI/-/tree/hjl/x86/1.1#
> > It says "In other words, the value (%esp + 4) is always a multiple of 16 (32 or 64) when control is transferred to the function entry point."
> > So if the OS follows the ABI, ESP should always be 0xXXXXXXXC on entry to a function, and it becomes 0xXXXXXXX8 after "push ebp", which happens to be 8-byte aligned.
> >
> > Thanks
> > Pengfei
> >
> > -----Original Message-----
> > From: llvm-dev <llvm-dev-bounces at lists.llvm.org> On Behalf Of Jonathan Smith via llvm-dev
> > Sent: Tuesday, October 27, 2020 6:51 AM
> > To: llvm-dev <llvm-dev at lists.llvm.org>
> > Subject: [llvm-dev] Possible bug in x86 frame lowering with SSE instructions?
> >
> > Hello, everyone.
> >
> > I'm looking for some insight into a bug I encountered while testing some custom IR passes on Solaris (x86) and Linux. I don't know if it's a bug with the x86 backend or the way the frame is set up by Solaris -- or if I'm simply doing something I shouldn't be doing. The bug manifests even if I don't run any of my passes, so I'm certain those aren't the issue.
> >
> > Given the following test C code:
> >
> >     int main(int argc, char **argv) {
> >       int x[10] = {1,2,3};
> >       return 0;
> >     }
> >
> > I compile it to IR with the following arguments:
> >
> >   clang --target=i386-sun-solaris -S -emit-llvm -Xclang -disable-O0-optnone -x c -c array-test.c -o array-test.ll
> >
> > This yields the following IR:
> >
> >     target datalayout = "e-m:e-p:32:32-p270:32:32-p271:32:32-p272:64:64-f64:32:64-f80:32-n8:16:32-S128"
> >     target triple = "i386-sun-solaris"
> >
> >     ; Function Attrs: noinline nounwind
> >     define dso_local i32 @main(i32 %0, i8** %1) #0 {
> >       %3 = alloca i32, align 4
> >       %4 = alloca i32, align 4
> >       %5 = alloca i8**, align 4
> >       %6 = alloca [10 x i32], align 4
> >       store i32 0, i32* %3, align 4
> >       store i32 %0, i32* %4, align 4
> >       store i8** %1, i8*** %5, align 4
> >       %7 = bitcast [10 x i32]* %6 to i8*
> >       call void @llvm.memset.p0i8.i32(i8* align 4 %7, i8 0, i32 40, i1 false)
> >       %8 = bitcast i8* %7 to [10 x i32]*
> >       %9 = getelementptr inbounds [10 x i32], [10 x i32]* %8, i32 0, i32 0
> >       store i32 1, i32* %9, align 4
> >       %10 = getelementptr inbounds [10 x i32], [10 x i32]* %8, i32 0, i32 1
> >       store i32 2, i32* %10, align 4
> >       %11 = getelementptr inbounds [10 x i32], [10 x i32]* %8, i32 0, i32 2
> >       store i32 3, i32* %11, align 4
> >       ret i32 0
> >     }
> >
> >     ; Function Attrs: argmemonly nounwind willreturn writeonly
> >     declare void @llvm.memset.p0i8.i32(i8* nocapture writeonly, i8, i32, i1 immarg) #1
> >
> >     attributes #0 = { noinline nounwind
> >       "correctly-rounded-divide-sqrt-fp-math"="false"
> >       "disable-tail-calls"="false" "frame-pointer"="all"
> >       "less-precise-fpmad"="false" "min-legal-vector-width"="0"
> >       "no-infs-fp-math"="false" "no-jump-tables"="false"
> >       "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false"
> >       "no-trapping-math"="true" "stack-protector-buffer-size"="8"
> >       "target-cpu"="pentium4"
> >       "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87"
> >       "unsafe-fp-math"="false" "use-soft-float"="false" }
> >     attributes #1 = { argmemonly nounwind willreturn writeonly }
> >
> > Normally, I would run custom passes at this point via opt. But the error I'm getting occurs with or without this step.
> >
> > Without changing anything else, I run this IR through llc with the following arguments:
> >
> >     llc --x86-asm-syntax=intel --filetype=asm array-test.ll -o=array-test.s
> >
> > This results in the following assembly:
> >
> >             .text
> >             .intel_syntax noprefix
> >             .file   "/home/user/code/array-test.ll"
> >             .globl  main                            # -- Begin function main
> >             .p2align        4, 0x90
> >             .type   main,@function
> >     main:                                   # @main
> >     # %bb.0:
> >             push    ebp
> >             mov     ebp, esp
> >             sub     esp, 56
> >             mov     dword ptr [ebp - 4], 0
> >             xorps   xmm0, xmm0
> >             movaps  xmmword ptr [ebp - 56], xmm0
> >             movaps  xmmword ptr [ebp - 40], xmm0
> >             mov     dword ptr [ebp - 20], 0
> >             mov     dword ptr [ebp - 24], 0
> >             mov     dword ptr [ebp - 56], 1
> >             mov     dword ptr [ebp - 52], 2
> >             mov     dword ptr [ebp - 48], 3
> >             xor     eax, eax
> >             add     esp, 56
> >             pop     ebp
> >             ret
> >     .Lfunc_end0:
> >             .size   main, .Lfunc_end0-main
> >                                             # -- End function
> >             .ident  "clang version 12.0.0 (https://github.com/llvm/llvm-project.git 62dbbcf6d7c67b02fd540a5a1e55c494bf88adea)"
> >             .section        ".note.GNU-stack","",@progbits
> >
> > Other than the target being i386-sun-solaris, this is the exact same code that is generated if I target i386-pc-linux-gnu.
> >
> > If I run this on Linux (Ubuntu 18.04 in this case), there are no problems. If I run this on Solaris, however, a segfault occurs on the first `movaps` instruction. I believe this is because the stack is 4-byte aligned on Solaris whereas it's 8-byte aligned on Linux, so the 56- and 40-byte offsets for the array stores just happen to work on Linux -- while they end up being 8 bytes off on Solaris.
> >
> > Running llc with --stackrealign fixes the problem:
> >
> >     main:                                   # @main
> >     # %bb.0:
> >             push    ebp
> >             mov     ebp, esp
> >             and     esp, -16
> >             sub     esp, 64
> >             mov     dword ptr [esp + 12], 0
> >             xorps   xmm0, xmm0
> >             movaps  xmmword ptr [esp + 16], xmm0
> >             movaps  xmmword ptr [esp + 32], xmm0
> >             mov     dword ptr [esp + 52], 0
> >             mov     dword ptr [esp + 48], 0
> >             mov     dword ptr [esp + 16], 1
> >             mov     dword ptr [esp + 20], 2
> >             mov     dword ptr [esp + 24], 3
> >             xor     eax, eax
> >             mov     esp, ebp
> >             pop     ebp
> >             ret
> >
> > Running clang with -fomit-frame-pointer also fixes the problem, but I have no idea why. Adding --stack-alignment=16 does *not* fix the problem. If I explicitly add the -O0 flag to llc, the `X86TargetLowering::getOptimalMemOpType()` function doesn't lower the array stores to `movaps`:
> >
> >     main:                                   # @main
> >     # %bb.0:
> >             push    ebp
> >             mov     ebp, esp
> >             push    esi
> >             sub     esp, 68
> >             mov     eax, dword ptr [ebp + 12]
> >             mov     ecx, dword ptr [ebp + 8]
> >             xor     edx, edx
> >             mov     dword ptr [ebp - 8], 0
> >             lea     esi, [ebp - 48]
> >             mov     dword ptr [esp], esi
> >             mov     dword ptr [esp + 4], 0
> >             mov     dword ptr [esp + 8], 40
> >             mov     dword ptr [ebp - 52], eax       # 4-byte Spill
> >             mov     dword ptr [ebp - 56], ecx       # 4-byte Spill
> >             mov     dword ptr [ebp - 60], edx       # 4-byte Spill
> >             call    memset
> >             mov     dword ptr [ebp - 48], 1
> >             mov     dword ptr [ebp - 44], 2
> >             mov     dword ptr [ebp - 40], 3
> >             mov     eax, dword ptr [ebp - 60]       # 4-byte Reload
> >             add     esp, 68
> >             pop     esi
> >             pop     ebp
> >             ret
> >
> > I've spent the better part of ten hours trying to debug the X86 backend code (and I am, admittedly, not the best at knowing where to look). I determined the `X86FrameLowering::emitPrologue()` function will *only* emit the proper offset adjustment if `X86RegisterInfo::needsStackRealignment()` returns `true`, and the only thing that seems to force it to return `true` is if --stackrealign is used (which sets the "stackrealign" function attribute on `main`).
> >
> > I don't know if this is truly a bug in the X86 backend (an assumption about the ABI on Linux vs. Solaris? Maybe? I'm truly guessing...) or if this is a result of me using -disable-O0-optnone in Clang without -O0 in llc.
> >
> > Any insight would be helpful, and thanks for reading my rather verbose message.