[Lldb-commits] [PATCH] Profile Assembly Until Ret Instruction

Wed Jul 30 18:32:42 PDT 2014

GCC seems to generate a row for the epilogue.
Do you think this is a clang bug, or at least a discrepancy between clang and
gcc?

Source:
#include <stdio.h>

int f() {
    puts("HI\n");
    return 5;
}

Compile option: only -g

gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1)
clang version 3.5.0 (213114)

Env: Ubuntu 14.04, x86_64

dwarfdump -F of clang binary:
<    2><0x00400530:0x00400559><f><fde offset 0x00000088 length: 0x0000001c><eh aug data len 0x0>
        0x00400530: <off cfa=08(r7) > <off r16=-8(cfa) >
        0x00400531: <off cfa=16(r7) > <off r6=-16(cfa) > <off r16=-8(cfa) >
        0x00400534: <off cfa=16(r6) > <off r6=-16(cfa) > <off r16=-8(cfa) >

dwarfdump -F of gcc binary:
<    1><0x0040052d:0x00400542><f><fde offset 0x00000070 length: 0x0000001c><eh aug data len 0x0>
        0x0040052d: <off cfa=08(r7) > <off r16=-8(cfa) >
        0x0040052e: <off cfa=16(r7) > <off r6=-16(cfa) > <off r16=-8(cfa) >
        0x00400531: <off cfa=16(r6) > <off r6=-16(cfa) > <off r16=-8(cfa) >
        0x00400541: <off cfa=08(r7) > <off r6=-16(cfa) > <off r16=-8(cfa) >
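
To make the difference concrete, here is a minimal sketch (not lldb code) of looking up the CFA rule that applies at a given pc from the rows above. The names UnwindRow, ClangRows, GccRows and CfaRuleAt are made up for illustration. On x86_64, DWARF register 6 is rbp and register 7 is rsp. The sketch assumes the ret is the last byte of each function's address range, i.e. 0x400541 for gcc and 0x400558 for clang; I have not checked the disassembly.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// One row of a CFI table: starting at start_pc, CFA = <register> + offset.
// All names here are hypothetical; this is not lldb's UnwindPlan API.
struct UnwindRow {
    uint64_t start_pc;
    int cfa_reg;     // DWARF register number (x86_64: 6 = rbp, 7 = rsp)
    int cfa_offset;
};

// Rows copied from the dwarfdump output above, sorted by address.
std::vector<UnwindRow> ClangRows() {
    return {{0x400530, 7, 8}, {0x400531, 7, 16}, {0x400534, 6, 16}};
}

std::vector<UnwindRow> GccRows() {
    return {{0x40052d, 7, 8}, {0x40052e, 7, 16}, {0x400531, 6, 16},
            {0x400541, 7, 8}};
}

// Return the CFA rule in effect at pc: the last row with start_pc <= pc.
// Only registers 6 (rbp) and 7 (rsp) are handled in this sketch.
std::string CfaRuleAt(const std::vector<UnwindRow> &rows, uint64_t pc) {
    const UnwindRow *match = nullptr;
    for (const UnwindRow &row : rows) {
        if (row.start_pc > pc)
            break;
        match = &row;
    }
    if (match == nullptr)
        return "none";
    std::string reg = (match->cfa_reg == 6) ? "rbp" : "rsp";
    return reg + "+" + std::to_string(match->cfa_offset);
}
```

With these rows, CfaRuleAt(GccRows(), 0x400541) yields "rsp+8", while CfaRuleAt(ClangRows(), 0x400558) still yields "rbp+16", which is stale once rbp has been popped.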

On Wed, Jul 30, 2014 at 5:43 PM, Jason Molenda <jmolenda at apple.com> wrote:

> I'm open to trying to trust eh_frame at frame 0 for x86_64.  The lack of
> epilogue descriptions in eh_frame is the biggest problem here.
>
> When you "step" or "next" in the debugger, the debugger instruction steps
> across the source line until it gets to the next source line.  Every time
> it stops after an instruction step, it confirms that it is (1) between the
> start and end pc values for the source line, and (2) that the "stack id"
> (start address of the function + CFA address) is the same.  If it stops and
> the stack id has changed, for a "next" command, it will backtrace one stack
> frame to see if it stepped into a function.  If so, it sets a breakpoint on
> the return address and continues.
>
> If you switch lldb to prefer eh_frame instructions for x86_64, e.g.
>
> Index: source/Plugins/Process/Utility/RegisterContextLLDB.cpp
> ===================================================================
> --- source/Plugins/Process/Utility/RegisterContextLLDB.cpp      (revision 214344)
> +++ source/Plugins/Process/Utility/RegisterContextLLDB.cpp      (working copy)
> @@ -791,6 +791,22 @@
>          }
>      }
>
> +    // For x86_64 debugging, let's try using the eh_frame instructions even if this is the currently
> +    // executing function (frame zero).
> +    Target *target = exe_ctx.GetTargetPtr();
> +    if (target
> +        && (target->GetArchitecture().GetCore() == ArchSpec::eCore_x86_64_x86_64h
> +            || target->GetArchitecture().GetCore() == ArchSpec::eCore_x86_64_x86_64))
> +    {
> +        unwind_plan_sp = func_unwinders_sp->GetUnwindPlanAtCallSite (m_current_offset_backed_up_one);
> +        int valid_offset = -1;
> +        if (IsUnwindPlanValidForCurrentPC(unwind_plan_sp, valid_offset))
> +        {
> +            UnwindLogMsgVerbose ("frame uses %s for full UnwindPlan, preferred over assembly profiling on x86_64", unwind_plan_sp->GetSourceName().GetCString());
> +            return unwind_plan_sp;
> +        }
> +    }
> +
>      // Typically the NonCallSite UnwindPlan is the unwind created by inspecting the assembly language instructions
>      if (behaves_like_zeroth_frame)
>      {
>
>
> you'll find that you have to "next" twice to step out of a function.  Why?
>  With a simple function like:
>
> * thread #1: tid = 0xaf31e, 0x0000000100000eb9 a.out`foo + 25 at a.c:5, queue = 'com.apple.main-thread', stop reason = step over
>     #0: 0x0000000100000eb9 a.out`foo + 25 at a.c:5
>    2    int foo ()
>    3    {
>    4        puts("HI");
> -> 5        return 5;
>    6    }
>    7
>    8    int bar ()
> (lldb) disass
> a.out`foo at a.c:3:
>    0x100000ea0:  pushq  %rbp
>    0x100000ea1:  movq   %rsp, %rbp
>    0x100000ea4:  subq   $0x10, %rsp
>    0x100000ea8:  leaq   0x6b(%rip), %rdi          ; "HI"
>    0x100000eaf:  callq  0x100000efa               ; symbol stub for: puts
>    0x100000eb4:  movl   $0x5, %ecx
> -> 0x100000eb9:  movl   %eax, -0x4(%rbp)
>    0x100000ebc:  movl   %ecx, %eax
>    0x100000ebe:  addq   $0x10, %rsp
>    0x100000ec2:  popq   %rbp
>    0x100000ec3:  retq
>
>
> If you do "next", lldb will instruction step, comparing the stack ID at
> every stop, until it gets to 0x100000ec3 at which point the stack ID will
> change.  The CFA address (which the eh_frame tells us is rbp+16) just
> changed to the caller's CFA address because we're about to return.  The
> eh_frame instructions really need to tell us that the CFA is now rsp+8 at
> 0x100000ec3.
>
> The end result is that you need to "next" twice to step out of a function.
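
A small sketch of the stack ID comparison described above, using the foo disassembly. This is a hypothetical model, not lldb's actual StackID or RegisterContext code, and the register values are invented. It shows that with only the rbp+16 rule the stack ID computed at the retq differs from the one computed in the function body, whereas an extra rsp+8 row for the retq keeps it stable.

```cpp
#include <cstdint>

// Register values at one stop. Hypothetical model, not lldb's RegisterContext.
struct Regs {
    uint64_t rsp;
    uint64_t rbp;
};

// The "stack id" described above: function start address plus CFA.
struct StackID {
    uint64_t function_start;
    uint64_t cfa;
};

bool SameFrame(const StackID &a, const StackID &b) {
    return a.function_start == b.function_start && a.cfa == b.cfa;
}

const uint64_t kFooStart = 0x100000ea0;  // pushq %rbp
const uint64_t kRetPC = 0x100000ec3;     // retq

// Made-up register values. kEntryRsp is rsp on entry to foo (it points at
// the return address); kCallerRbp is the caller's frame pointer.
const uint64_t kEntryRsp = 0x7fff5000f008;
const uint64_t kCallerRbp = 0x7fff5000f020;

// After "movq %rsp, %rbp" and "subq $0x10, %rsp".
Regs RegsInBody() { return {kEntryRsp - 0x18, kEntryRsp - 8}; }
// After "popq %rbp": rsp is back at the return address, rbp is the caller's.
Regs RegsAtRet() { return {kEntryRsp, kCallerRbp}; }

// Stack ID for foo at pc. Without an epilogue row, rbp+16 is used everywhere
// after the prologue; with one, the rule switches to rsp+8 at the retq.
// Prologue pcs before 0x100000ea4 are not modeled.
StackID FooStackID(uint64_t pc, const Regs &regs, bool has_epilogue_row) {
    if (has_epilogue_row && pc == kRetPC)
        return {kFooStart, regs.rsp + 8};
    return {kFooStart, regs.rbp + 16};
}
```

Comparing FooStackID(0x100000eb9, RegsInBody(), false) with FooStackID(kRetPC, RegsAtRet(), false) reports a different frame, because rbp+16 now evaluates to the caller's CFA. Passing true for has_epilogue_row makes the two stack IDs match.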
>
> AssemblyParse_x86 has a special bit where it looks for the 'ret'
> instruction sequence at the end of the function -
>
>     // Now look at the byte at the end of the AddressRange for a limited attempt at describing the
>     // epilogue.  We're looking for the sequence
>
>     //  [ 0x5d ] mov %rbp, %rsp
>     //  [ 0xc3 ] ret
>     //  [ 0xe8 xx xx xx xx ] call __stack_chk_fail  (this is sometimes the final insn in the function)
>
>     // We want to add a Row describing how to unwind when we're stopped on the 'ret' instruction where the
>     // CFA is no longer defined in terms of rbp, but is now defined in terms of rsp like on function entry.
>
>
> and adds an extra row of unwind details for that instruction.
>
>
> I mention x86_64 as being a possible good test case here because I worry
> about the i386 picbase sequence (call next-instruction; pop %ebx) which
> occurs a lot.  But for x86_64, my main concern is the epilogues.
>
>
>
> > On Jul 30, 2014, at 2:52 PM, Tong Shen <endlessroad at google.com> wrote:
> >
> > Thanks Jason! That's a very informative post; it clarifies things a lot :-)
> >
> > Well, I have to admit that my patch is specifically for a certain kind of
> > function, and now I see that's not the general case.
> >
> > I did some experiments with gdb. gdb uses CFI for frame 0, on either x86
> > or x86_64. It looks up the FDE for frame 0 and does the CFA calculation
> > according to that.
> >
> > - For compiler-generated functions: I think there are 2 usage scenarios
> >   for frame 0: breakpoint and signal.
> >     - Breakpoints are usually at source line boundaries rather than
> >       arbitrary instruction boundaries, so generally we won't be caught
> >       at locations where the stack pointer is changing, and the CFI is
> >       still valid.
> >     - For signals, a synchronous unwind table may not be sufficient.
> >       But only stack-changing instructions will cause an incorrect CFA
> >       calculation, so it's not always the case.
> > - For hand-written assembly functions: from what I've seen, most of the
> >   time CFI is present and actually asynchronous.
> > So it seems that in most cases, even with only a synchronous unwind table,
> > CFI is still correct.
> >
> > I believe we can trust eh_frame for frame 0 and use assembly profiling
> > as a fallback. If both fail, maybe the code owner should use
> > -fasynchronous-unwind-tables :-)
> >
> >
> > On Tue, Jul 29, 2014 at 4:59 PM, Jason Molenda <jmolenda at apple.com> wrote:
> > It was a tricky one and got lost in the shuffle of a busy week.  I was
> > always reluctant to try profiling all the instructions in a function.  On
> > x86, compiler generated code (gcc/clang anyway) is very simplistic about
> > setting up the stack frame at the start and only having one epilogue - so
> > anything fancier risked making mistakes and could possibly have a
> > performance impact as we run functions through the disassembler.
> >
> > For hand-written assembly functions (which can be very creative with
> > their prologue/epilogue and where it is placed), my position is that they
> > should write eh_frame instructions in their assembly source to tell lldb
> > where to find things.  There are one or two libraries on Mac OS X where we
> > break the "ignore eh_frame for the currently executing function" rule
> > because there are many hand-written assembly functions in there and the
> > eh_frame is going to beat our own analysis.
> >
> >
> > After I wrote the x86 unwinder, Greg and Caroline implemented the arm
> > unwinder where it emulates every instruction in the function looking for
> > prologue/epilogue instructions.  We haven't seen it having a particularly
> > bad impact performance-wise (lldb only does this disassembly for functions
> > that it finds on stacks during an execution run, and it saves the result so
> > it won't re-compute it for a given function).  The clang armv7 codegen
> > often has mid-function epilogues (early returns) which definitely
> > complicated things and made it necessary to step through the entire
> > function bodies.  There's a bunch of code I added to support these
> > mid-function epilogues - I have to save the register save state when I see
> > an instruction which looks like an epilogue, and when I see the final ret
> > instruction (aka restoring the saved lr contents into pc), I re-install the
> > register save state from before the epilogue started.
> >
> > These things always make me a little nervous because the instruction
> > analyzer obviously is doing a static analysis so it knows nothing about
> > flow control.  Tong's patch stops when it sees the first CALL instruction -
> > but that's not right, that's just solving the problem for his particular
> > function which doesn't have any CALL instructions before his prologue. :)
> > You could imagine a function which saves a couple of registers, calls
> > another function, then saves a couple more because it needs more scratch
> > registers.
> >
> > If we're going to change to profiling deep into the function -- and I'm
> > not opposed to doing that, it's been fine on arm -- we should just do the
> > entire function I think.
> >
> >
> > Another alternative would be to trust eh_frame on x86_64 at frame 0.
> > This is one of those things where there's not a great solution.  The
> > unwind instructions in eh_frame are only guaranteed to be accurate for
> > synchronous unwinds -- that is, they are only guaranteed to be accurate at
> > places where an exception could be thrown - at call sites.  So, for
> > instance, there's no reason why the compiler has to describe the function
> > prologue instructions at all.  There's no requirement that the eh_frame
> > instructions describe the epilogue instructions.  The information about
> > spilled registers only needs to be emitted where we could throw an
> > exception, or where a callee could throw an exception.
> >
> > clang/gcc both emit detailed instructions for the prologue setup.  But
> > for i386 codegen if the compiler needs to access some pc-relative data, it
> > will do a "call next-instruction; pop %eax" to get the current pc value.
> > (x86_64 has rip-relative addressing so this isn't needed)  If you're
> > debugging -fomit-frame-pointer code, that means your CFA is expressed in
> > terms of the stack pointer and the stack pointer just changed mid-function
> > --- and eh_frame instructions don't describe this.
> >
> > The end result: If you want accurate unwinds 100% of the time, you can't
> > rely on the unwind instructions from eh_frame.  But they'll get you
> > accurate unwinds 99.9% of the time ...  also, last I checked, neither clang
> > nor gcc describes the epilogue instructions.
> >
> >
> > In *theory* the unwind instructions from the DWARF debug_frame section
> > should be asynchronous -- they should describe how to find the CFA address
> > for every instruction in the function.  Which makes sense - you want
> > eh_frame to be compact because it's bundled into the executable, so it
> > should only have the information necessary for exception handling and you
> > can put the verbose stuff in debug_frame DWARF for debuggers.  But instead
> > (again, last time I checked), the compilers put the exact same thing in
> > debug_frame even if you use the -fasynchronous-unwind-tables (or whatever
> > that switch was) option.
> >
> >
> > So I don't know, maybe we should just start trusting eh_frame at frame 0
> > and write off those .1% cases where it isn't correct instead of trying to
> > get too fancy with the assembly analysis code.
> >
> >
> >
> > > On Jul 29, 2014, at 4:17 PM, Todd Fiala <tfiala at google.com> wrote:
> > >
> > > Hey Jason,
> > >
> > > Do you have any feedback on this?
> > >
> > > Thanks!
> > >
> > > -Todd
> > >
> > >
> > > > On Fri, Jul 25, 2014 at 1:42 PM, Tong Shen <endlessroad at google.com> wrote:
> > > Sorry, wrong version of patch...
> > >
> > >
> > > > On Fri, Jul 25, 2014 at 1:41 PM, Tong Shen <endlessroad at google.com> wrote:
> > > Hi Molenda, lldb-commits,
> > >
> > > > For now, the x86 assembly profiler will stop after 10 "non-prologue"
> > > > instructions. In practice that may not be sufficient. For example, we
> > > > have a hand-written assembly function which has hundreds of
> > > > instructions before the actual (stack-adjusting) prologue instructions.
> > >
> > > > One way is to change the limit to 1000, but there will always be
> > > > functions that break the limit :-) I believe the right thing to do here
> > > > is to parse all instructions before "ret"/"call" as prologue
> > > > instructions.
> > >
> > > Here's what I changed:
> > > > - For "push %rbx" and "mov %rbx, -8(%rbp)": only add the first row for
> > > >   that register. These instructions may appear multiple times in the
> > > >   function body, but as long as one of them appears, its first
> > > >   appearance should be in the prologue. (If it's not in the prologue,
> > > >   the function will not use %rbx, so these 2 instructions should not
> > > >   appear at all.)
> > > > - Also monitor "add %rsp 0x20".
> > > > - Remove the non-prologue instruction count.
> > > - Add "call" instruction detection, and stop parsing after it.
> > >
> > > Thanks.
> > >
> > > --
> > > Best Regards, Tong Shen
> > >
> > >
> > >
> > > --
> > > Best Regards, Tong Shen
> > >
> > >
> > >
> > >
> > >
> > > --
> > > > Todd Fiala |   Software Engineer |     tfiala at google.com | 650-943-3180
> > >
> >
> >
> >
> >
> > --
> > Best Regards, Tong Shen
>
>

-- 
Best Regards, Tong Shen