[llvm-dev] Possible bug in x86 frame lowering with SSE instructions?

Mon Oct 26 15:51:11 PDT 2020

Hello, everyone.

I'm looking for some insight into a bug I encountered while testing
some custom IR passes on Solaris (x86) and Linux. I don't know if it's
a bug with the x86 backend or the way the frame is set up by Solaris
-- or if I'm simply doing something I shouldn't be doing. The bug
manifests even if I don't run any of my passes, so I'm certain those
aren't the issue.

Given the following test C code:

    int main(int argc, char **argv) {
      int x[10] = {1,2,3};
      return 0;
    }

I compile it to IR with the following arguments:

    clang --target=i386-sun-solaris -S -emit-llvm -Xclang -disable-O0-optnone \
        -x c -c array-test.c -o array-test.ll

This yields the following IR:

    target datalayout = "e-m:e-p:32:32-p270:32:32-p271:32:32-p272:64:64-f64:32:64-f80:32-n8:16:32-S128"
    target triple = "i386-sun-solaris"

    ; Function Attrs: noinline nounwind
    define dso_local i32 @main(i32 %0, i8** %1) #0 {
      %3 = alloca i32, align 4
      %4 = alloca i32, align 4
      %5 = alloca i8**, align 4
      %6 = alloca [10 x i32], align 4
      store i32 0, i32* %3, align 4
      store i32 %0, i32* %4, align 4
      store i8** %1, i8*** %5, align 4
      %7 = bitcast [10 x i32]* %6 to i8*
      call void @llvm.memset.p0i8.i32(i8* align 4 %7, i8 0, i32 40, i1 false)
      %8 = bitcast i8* %7 to [10 x i32]*
      %9 = getelementptr inbounds [10 x i32], [10 x i32]* %8, i32 0, i32 0
      store i32 1, i32* %9, align 4
      %10 = getelementptr inbounds [10 x i32], [10 x i32]* %8, i32 0, i32 1
      store i32 2, i32* %10, align 4
      %11 = getelementptr inbounds [10 x i32], [10 x i32]* %8, i32 0, i32 2
      store i32 3, i32* %11, align 4
      ret i32 0
    }

    ; Function Attrs: argmemonly nounwind willreturn writeonly
    declare void @llvm.memset.p0i8.i32(i8* nocapture writeonly, i8, i32, i1 immarg) #1

    attributes #0 = { noinline nounwind
      "correctly-rounded-divide-sqrt-fp-math"="false"
      "disable-tail-calls"="false" "frame-pointer"="all"
      "less-precise-fpmad"="false" "min-legal-vector-width"="0"
      "no-infs-fp-math"="false" "no-jump-tables"="false"
      "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false"
      "no-trapping-math"="true" "stack-protector-buffer-size"="8"
      "target-cpu"="pentium4"
      "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87"
      "unsafe-fp-math"="false" "use-soft-float"="false" }
    attributes #1 = { argmemonly nounwind willreturn writeonly }

Normally, I would run custom passes at this point via opt. But the
error I'm getting occurs with or without this step.

Without changing anything else, I run this IR through llc with the
following arguments:

    llc --x86-asm-syntax=intel --filetype=asm array-test.ll -o=array-test.s

This results in the following assembly:

            .text
            .intel_syntax noprefix
            .file   "/home/user/code/array-test.ll"
            .globl  main                            # -- Begin function main
            .p2align        4, 0x90
            .type   main,@function
    main:                                   # @main
    # %bb.0:
            push    ebp
            mov     ebp, esp
            sub     esp, 56
            mov     dword ptr [ebp - 4], 0
            xorps   xmm0, xmm0
            movaps  xmmword ptr [ebp - 56], xmm0
            movaps  xmmword ptr [ebp - 40], xmm0
            mov     dword ptr [ebp - 20], 0
            mov     dword ptr [ebp - 24], 0
            mov     dword ptr [ebp - 56], 1
            mov     dword ptr [ebp - 52], 2
            mov     dword ptr [ebp - 48], 3
            xor     eax, eax
            add     esp, 56
            pop     ebp
            ret
    .Lfunc_end0:
            .size   main, .Lfunc_end0-main
                                            # -- End function
            .ident  "clang version 12.0.0 (https://github.com/llvm/llvm-project.git 62dbbcf6d7c67b02fd540a5a1e55c494bf88adea)"
            .section        ".note.GNU-stack","",@progbits

Other than the target being i386-sun-solaris, this is exactly the same
code that is generated if I target i386-pc-linux-gnu.

If I run this on Linux (Ubuntu 18.04 in this case), there are no
problems. If I run this on Solaris, however, a segfault occurs on the
first `movaps` instruction, which requires a 16-byte-aligned memory
operand. I believe the issue is that the stack is only 4-byte aligned
on Solaris whereas it's 16-byte aligned on Linux, so the 56- and
40-byte offsets for the array stores just happen to work on Linux --
while they end up being 8 bytes off on Solaris.
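
To make the offset arithmetic concrete, here is a minimal C sketch that
replays the prologue above on a plain integer. It assumes the caller has
esp 16-byte aligned at the `call` in the Linux case; the addresses and the
function names are hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Address of the first movaps operand, given the value esp holds in the
   caller right before the call instruction (hypothetical input). */
uint32_t first_movaps_addr(uint32_t esp_before_call) {
    uint32_t esp = esp_before_call;
    esp -= 4;            /* call pushes the return address */
    esp -= 4;            /* push ebp */
    uint32_t ebp = esp;  /* mov ebp, esp */
    return ebp - 56;     /* movaps xmmword ptr [ebp - 56], xmm0 */
}

/* movaps faults unless its memory operand is 16-byte aligned. */
int is_16_byte_aligned(uint32_t addr) {
    return (addr & 15u) == 0;
}

/* Self-check with made-up stack addresses. */
void check_alignment_examples(void) {
    /* 16-byte-aligned stack at the call: the operand lands on a boundary. */
    assert(is_16_byte_aligned(first_movaps_addr(0x1000u)));
    /* Stack that is only 4-byte aligned: the operand is misaligned. */
    assert(!is_16_byte_aligned(first_movaps_addr(0x1008u)));
}
```

With a 16-byte-aligned esp at the call, ebp ends up 8 bytes past a 16-byte
boundary, so `ebp - 56` (and `ebp - 40`) are aligned. Any other 4-byte-aligned
starting value breaks that.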

Running llc with --stackrealign fixes the problem:

    main:                                   # @main
    # %bb.0:
            push    ebp
            mov     ebp, esp
            and     esp, -16
            sub     esp, 64
            mov     dword ptr [esp + 12], 0
            xorps   xmm0, xmm0
            movaps  xmmword ptr [esp + 16], xmm0
            movaps  xmmword ptr [esp + 32], xmm0
            mov     dword ptr [esp + 52], 0
            mov     dword ptr [esp + 48], 0
            mov     dword ptr [esp + 16], 1
            mov     dword ptr [esp + 20], 2
            mov     dword ptr [esp + 24], 3
            xor     eax, eax
            mov     esp, ebp
            pop     ebp
            ret
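
As a rough illustration of why the realigned prologue is safe regardless of
the incoming alignment, the following C sketch mirrors the `and esp, -16` /
`sub esp, 64` pair on an integer. The function name and the sample ebp values
are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the realigning part of the prologue for a given ebp value. */
uint32_t realigned_esp(uint32_t ebp) {
    uint32_t esp = ebp & ~(uint32_t)15;  /* and esp, -16: round down to 16 */
    return esp - 64;                     /* sub esp, 64: keeps the alignment */
}

/* Self-check: [esp + 16] and [esp + 32] are aligned for any 4-byte-aligned ebp. */
void check_realignment_examples(void) {
    for (uint32_t ebp = 0xFF0u; ebp <= 0x1000u; ebp += 4) {
        uint32_t esp = realigned_esp(ebp);
        assert(((esp + 16) & 15u) == 0);  /* first movaps operand */
        assert(((esp + 32) & 15u) == 0);  /* second movaps operand */
    }
}
```

Because the locals are then addressed relative to esp rather than ebp, the
`movaps` operands no longer depend on how the caller aligned the stack.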

Running clang with -fomit-frame-pointer also fixes the problem, but I
have no idea why. Adding --stack-alignment=16 does *not* fix the
problem. If I explicitly add the -O0 flag to llc, the
`X86TargetLowering::getOptimalMemOpType()` function doesn't lower the
array initialization to `movaps`, and a call to `memset` is emitted
instead:

    main:                                   # @main
    # %bb.0:
            push    ebp
            mov     ebp, esp
            push    esi
            sub     esp, 68
            mov     eax, dword ptr [ebp + 12]
            mov     ecx, dword ptr [ebp + 8]
            xor     edx, edx
            mov     dword ptr [ebp - 8], 0
            lea     esi, [ebp - 48]
            mov     dword ptr [esp], esi
            mov     dword ptr [esp + 4], 0
            mov     dword ptr [esp + 8], 40
            mov     dword ptr [ebp - 52], eax       # 4-byte Spill
            mov     dword ptr [ebp - 56], ecx       # 4-byte Spill
            mov     dword ptr [ebp - 60], edx       # 4-byte Spill
            call    memset
            mov     dword ptr [ebp - 48], 1
            mov     dword ptr [ebp - 44], 2
            mov     dword ptr [ebp - 40], 3
            mov     eax, dword ptr [ebp - 60]       # 4-byte Reload
            add     esp, 68
            pop     esi
            pop     ebp
            ret

I've spent the better part of ten hours trying to debug the X86
backend code (and I am, admittedly, not the best at knowing where to
look). I determined the `X86FrameLowering::emitPrologue()` function
will *only* emit the proper offset adjustment if
`X86RegisterInfo::needsStackRealignment()` returns `true`, and the
only thing that seems to force it to return `true` is if
--stackrealign is used (which sets the "stackrealign" function
attribute on `main`).
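
For reference, a possible source-level way to request the same realignment
for a single function is the `force_align_arg_pointer` attribute. As far as I
can tell, Clang turns it into the "stackrealign" function attribute, but
treat that as an assumption to verify against the emitted IR. The macro and
function names below are hypothetical.

```c
#include <assert.h>

/* The attribute is x86-specific, so it is guarded to keep the sketch portable. */
#if defined(__i386__) || defined(__x86_64__)
#define REALIGN_STACK __attribute__((force_align_arg_pointer))
#else
#define REALIGN_STACK
#endif

/* Same array initialization as in the test case, in a helper function. */
REALIGN_STACK int sum_first_three(void) {
    int x[10] = {1, 2, 3};
    return x[0] + x[1] + x[2];
}

/* Self-check: the attribute must not change the function's result. */
void check_sum(void) {
    assert(sum_first_three() == 6);
}
```

Passing `-mstackrealign` to clang is the whole-translation-unit equivalent,
which would avoid having to pass --stackrealign to llc separately.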

I don't know if this is truly a bug in the X86 backend (an assumption
about the ABI on Linux vs. Solaris? Maybe? I'm truly guessing...) or
if this is a result of me using -disable-O0-optnone in Clang without
-O0 in llc.

Any insight would be helpful, and thanks for reading my rather verbose message.