[Lldb-commits] [PATCH] Profile Assembly Until Ret Instruction

Thu Aug 14 18:31:43 PDT 2014

Hi Jason,

Turns out we still need CFI for frame 0 in certain situations...

A possible approach is to disassemble the machine code and manually adjust the
CFI for frame 0. For example, if we see "pop ebp; => ret", we redefine the CFA
in terms of esp (esp+4 on i386, since the return address is at [esp]); if we
see "call next-insn; => pop %ebp", we set cfa_offset+=4.

Patch attached; for now it only implements the adjustment for "pop ebp; ret".

If you think this approach is OK, I will go ahead and add other tricks (i386
PC-relative addressing, more styles of epilogue, etc.).
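
To make the idea concrete, here is a minimal sketch of the "pop ebp; ret" adjustment. It is not the code in the attached patch: the `CfaRule` struct and the `AdjustCfaAtRet` function are hypothetical names used only for illustration. The only facts it relies on are the i386 encodings 0x5d (`pop %ebp`) and 0xc3 (`ret`), and that the return address sits at [esp] when we are stopped on `ret`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified CFA rule: CFA = value_of(reg) + offset.
enum class Reg { ESP, EBP };

struct CfaRule {
    Reg reg;
    int32_t offset;
};

// If the instruction at `pc_offset` is `ret` (0xc3) and the byte just before
// it is `pop %ebp` (0x5d), the frame pointer has already been restored, so
// the CFA is expressed as esp+4 instead of whatever the CFI row says.
// In every other case the rule taken from the CFI is returned unchanged.
CfaRule AdjustCfaAtRet(const std::vector<uint8_t> &code, size_t pc_offset,
                       CfaRule from_cfi) {
    assert(pc_offset < code.size());
    const uint8_t kPopEbp = 0x5d;
    const uint8_t kRet = 0xc3;
    if (pc_offset >= 1 && code[pc_offset] == kRet &&
        code[pc_offset - 1] == kPopEbp) {
        return CfaRule{Reg::ESP, 4};
    }
    return from_cfi;
}
```

Matching raw bytes like this is only an approximation: a 0x5d that happens to be the last byte of a longer instruction would be misread, so a real implementation would want to go through the disassembler.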

Thank you for your time!

On Thu, Jul 31, 2014 at 12:50 PM, Tong Shen <endlessroad at google.com> wrote:

> I think gdb's rationale for using CFI for leaf functions is:
> - gcc always generates CFI for the prologue, so at function entry we know
> the correct CFA;
> - any stack-pointer-altering operation after that (mid-function &
> epilogue) we can recognize and handle.
> So basically, it assumes 2, hacks its way through 3 & 4, and pretends we
> are at 5.
> The number of hacks we need seems to be small in the x86 world, so this
> tradition is still here.
>
> Here's what gdb does for the epilogue: normally when you run 'n', it will run
> one instruction at a time till the next line/different stack id. But when it
> sees "pop %rbp; ret", it won't step into these instructions. Instead it
> will execute past them directly.
> I didn't experiment with x86 pc-relative addressing; but I guess it will
> also recognize and execute past this pattern directly.
>
> So for compiler-generated functions, what we do now with the assembly
> parser can be done with CFI + those gdb hacks.
> And for hand-written assembly, I think CFI is almost always precise at the
> instruction level. In this case, utilizing CFI instead of the assembly parser
> will be a big help.
>
> So maybe we can apply those hacks, and trust CFI only for x86 & x86_64
> targets?
>
>
> On Thu, Jul 31, 2014 at 12:02 AM, Jason Molenda <jmolenda at apple.com>
> wrote:
>
>> I think we could think of five levels of eh_frame information:
>>
>>
>> 1 unwind instructions at exception throw locations & locations where a
>> callee may throw an exception
>>
>> 2 unwind instructions that describe the prologue
>>
>> 3 unwind instructions that describe the epilogue at the end of the
>> function
>>
>> 4 unwind instructions that describe mid-function epilogues (I see these
>> on arm all the time, don't see them on x86 with compiler generated code -
>> but we don't use eh_frame on arm at Apple, I'm just mentioning it for
>> completeness)
>>
>> 5 unwind instructions that describe any changes mid-function needed to
>> unwind at all instructions ("asynchronous unwind information")
>>
>>
>> The eh_frame section only guarantees #1.  gcc and clang always do #1 and
>> #2.  Modern gcc's do #3.  I don't know if gcc would do #4 on arm but it's
>> not important, I just mention it for completeness.  And no one does #5 (as
>> far as I know), even in the DWARF debug_frame section.
>>
>> I think it may be possible to detect if an eh_frame entry fulfills #3 by
>> checking whether the CFA definition on the last row is the same as the initial
>> CFA definition.  But I'm not sure how a debugger could use heuristics to
>> determine much else.
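
The last-row check described above can be sketched as follows. This is a simplified illustration under stated assumptions, not LLDB's UnwindPlan API: `Row` and `EpilogueLooksDescribed` are hypothetical names, and a plan with fewer than two rows is treated as giving no evidence either way. The row values in the usage note are taken from the dwarfdump output quoted further down in this thread (r7 is rsp and r6 is rbp on x86_64).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified unwind row: at `offset` bytes into the function,
// CFA = value_of(DWARF register cfa_reg) + cfa_offset.
struct Row {
    uint64_t offset;
    int cfa_reg;
    int32_t cfa_offset;
};

// Returns true when the final row defines the CFA exactly as the first row
// does, which suggests the epilogue is described. Rows are expected to be
// sorted by offset.
bool EpilogueLooksDescribed(const std::vector<Row> &rows) {
    if (rows.size() < 2)
        return false;  // A single row tells us nothing about the epilogue.
    const Row &first = rows.front();
    const Row &last = rows.back();
    assert(first.offset <= last.offset);
    return first.cfa_reg == last.cfa_reg &&
           first.cfa_offset == last.cfa_offset;
}
```

With the gcc rows (cfa=8(r7), 16(r7), 16(r6), 8(r7)) the check passes; with the clang rows (cfa=8(r7), 16(r7), 16(r6)) it does not. As noted elsewhere in the thread, a function that never restores the stack would also fail this check, so a negative result is not proof of an incomplete eh_frame.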
>>
>>
>> In fact, detecting #3 may be the easiest thing to detect.  I'm not sure
>> if the debugger could really detect #2 except maybe if the function had a
>> standard prologue (push rbp, mov rsp rbp) and the eh_frame didn't describe
>> the effects of these instructions, the debugger could know that the
>> eh_frame does not describe the prologue.
>>
>>
>>
>>
>> > On Jul 30, 2014, at 6:58 PM, Tong Shen <endlessroad at google.com> wrote:
>> >
>> > Ah I understand now.
>> >
>> > Now the prologue seems to always be included in CFI for gcc & clang; and
>> newer gcc includes the epilogue as well.
>> > Maybe we can detect and use them when they are available?
>> >
>> >
>> > On Wed, Jul 30, 2014 at 6:44 PM, Jason Molenda <jmolenda at apple.com>
>> wrote:
>> > Ah, it looks like gcc changed since I last looked at its eh_frame
>> output.
>> >
>> > It's not a bug -- the eh_frame unwind instructions only need to be
>> accurate at instructions where an exception can be thrown, or where a
>> callee function can throw an exception.  There's no requirement to include
>> prologue or epilogue instructions in the eh_frame.
>> >
>> > And unfortunately from lldb's perspective, when we see eh_frame we'll
>> never know how descriptive it is.  If it's old-gcc or clang, it won't
>> include epilogue instructions.  If it's from another compiler, it may not
>> include any prologue/epilogue instructions at all.
>> >
>> > Maybe we could look over the UnwindPlan rows and see if the CFA
>> definition of the last row matches the initial row's CFA definition.  That
>> would show that the epilogue is described.  Unless it is a tail-call (aka
>> noreturn) function - in which case the stack is never restored.
>> >
>> >
>> >
>> >
>> > > On Jul 30, 2014, at 6:32 PM, Tong Shen <endlessroad at google.com>
>> wrote:
>> > >
>> > > GCC seems to generate a row for epilogue.
>> > > Do you think this is a clang bug, or at least a discrepancy between
>> clang & gcc?
>> > >
>> > > Source:
>> > > int f() {
>> > >       puts("HI\n");
>> > >       return 5;
>> > > }
>> > >
>> > > Compile option: only -g
>> > >
>> > > gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1)
>> > > clang version 3.5.0 (213114)
>> > >
>> > > Env: Ubuntu 14.04, x86_64
>> > >
>> > > dwarfdump -F of clang binary:
>> > > <    2><0x00400530:0x00400559><f><fde offset 0x00000088 length: 0x0000001c><eh aug data len 0x0>
>> > >         0x00400530: <off cfa=08(r7) > <off r16=-8(cfa) >
>> > >         0x00400531: <off cfa=16(r7) > <off r6=-16(cfa) > <off r16=-8(cfa) >
>> > >         0x00400534: <off cfa=16(r6) > <off r6=-16(cfa) > <off r16=-8(cfa) >
>> > >
>> > > dwarfdump -F of gcc binary:
>> > > <    1><0x0040052d:0x00400542><f><fde offset 0x00000070 length: 0x0000001c><eh aug data len 0x0>
>> > >         0x0040052d: <off cfa=08(r7) > <off r16=-8(cfa) >
>> > >         0x0040052e: <off cfa=16(r7) > <off r6=-16(cfa) > <off r16=-8(cfa) >
>> > >         0x00400531: <off cfa=16(r6) > <off r6=-16(cfa) > <off r16=-8(cfa) >
>> > >         0x00400541: <off cfa=08(r7) > <off r6=-16(cfa) > <off r16=-8(cfa) >
>> > >
>> > >
>> > > On Wed, Jul 30, 2014 at 5:43 PM, Jason Molenda <jmolenda at apple.com>
>> wrote:
>> > > I'm open to trying to trust eh_frame at frame 0 for x86_64.  The lack
>> of epilogue descriptions in eh_frame is the biggest problem here.
>> > >
>> > > When you "step" or "next" in the debugger, the debugger instruction
>> steps across the source line until it gets to the next source line.  Every
>> time it stops after an instruction step, it confirms that it is (1) between
>> the start and end pc values for the source line, and (2) that the "stack
>> id" (start address of the function + CFA address) is the same.  If it stops
>> and the stack id has changed, for a "next" command, it will backtrace one
>> stack frame to see if it stepped into a function.  If so, it sets a
>> breakpoint on the return address and continues.
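
The check performed after each instruction step, as described above, can be sketched like this. It is a simplified model rather than LLDB's actual thread-plan code: `StackId`, `StepAction`, and `AfterInstructionStep` are hypothetical names, and the address values in the usage note are made up.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stack id: start address of the function plus the CFA.
struct StackId {
    uint64_t function_start;
    uint64_t cfa;
    bool operator==(const StackId &other) const {
        return function_start == other.function_start && cfa == other.cfa;
    }
};

enum class StepAction { KeepStepping, StopAtNewLine, CheckForStepIn };

// Decide what to do after one instruction step of a "next".
// [line_begin, line_end) is the pc range of the source line being stepped.
StepAction AfterInstructionStep(const StackId &start_id,
                                const StackId &current_id, uint64_t pc,
                                uint64_t line_begin, uint64_t line_end) {
    assert(line_begin < line_end);
    if (!(current_id == start_id)) {
        // The stack id changed: backtrace one frame to see whether we
        // stepped into a function, and if so run to its return address.
        return StepAction::CheckForStepIn;
    }
    if (pc >= line_begin && pc < line_end)
        return StepAction::KeepStepping;
    return StepAction::StopAtNewLine;
}
```

Because the CFA is part of the stack id, an unwind rule that computes the wrong CFA at a single instruction is enough to make the id look different and interrupt the step, even though the frame itself has not changed.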
>> > >
>> > > If you switch lldb to prefer eh_frame instructions for x86_64, e.g.
>> > >
>> > > Index: source/Plugins/Process/Utility/RegisterContextLLDB.cpp
>> > > ===================================================================
>> > > --- source/Plugins/Process/Utility/RegisterContextLLDB.cpp  (revision 214344)
>> > > +++ source/Plugins/Process/Utility/RegisterContextLLDB.cpp  (working copy)
>> > > @@ -791,6 +791,22 @@
>> > >          }
>> > >      }
>> > >
>> > > +    // For x86_64 debugging, let's try using the eh_frame instructions even if this is the currently
>> > > +    // executing function (frame zero).
>> > > +    Target *target = exe_ctx.GetTargetPtr();
>> > > +    if (target
>> > > +        && (target->GetArchitecture().GetCore() == ArchSpec::eCore_x86_64_x86_64h
>> > > +            || target->GetArchitecture().GetCore() == ArchSpec::eCore_x86_64_x86_64))
>> > > +    {
>> > > +        unwind_plan_sp = func_unwinders_sp->GetUnwindPlanAtCallSite (m_current_offset_backed_up_one);
>> > > +        int valid_offset = -1;
>> > > +        if (IsUnwindPlanValidForCurrentPC(unwind_plan_sp, valid_offset))
>> > > +        {
>> > > +            UnwindLogMsgVerbose ("frame uses %s for full UnwindPlan, preferred over assembly profiling on x86_64", unwind_plan_sp->GetSourceName().GetCString());
>> > > +            return unwind_plan_sp;
>> > > +        }
>> > > +    }
>> > > +
>> > >      // Typically the NonCallSite UnwindPlan is the unwind created by inspecting the assembly language instructions
>> > >      if (behaves_like_zeroth_frame)
>> > >      {
>> > >
>> > >
>> > > you'll find that you have to "next" twice to step out of a function.
>>  Why?  With a simple function like:
>> > >
>> > > * thread #1: tid = 0xaf31e, 0x0000000100000eb9 a.out`foo + 25 at a.c:5, queue = 'com.apple.main-thread', stop reason = step over
>> > >     #0: 0x0000000100000eb9 a.out`foo + 25 at a.c:5
>> > >    2    int foo ()
>> > >    3    {
>> > >    4        puts("HI");
>> > > -> 5        return 5;
>> > >    6    }
>> > >    7
>> > >    8    int bar ()
>> > > (lldb) disass
>> > > a.out`foo at a.c:3:
>> > >    0x100000ea0:  pushq  %rbp
>> > >    0x100000ea1:  movq   %rsp, %rbp
>> > >    0x100000ea4:  subq   $0x10, %rsp
>> > >    0x100000ea8:  leaq   0x6b(%rip), %rdi          ; "HI"
>> > >    0x100000eaf:  callq  0x100000efa               ; symbol stub for: puts
>> > >    0x100000eb4:  movl   $0x5, %ecx
>> > > -> 0x100000eb9:  movl   %eax, -0x4(%rbp)
>> > >    0x100000ebc:  movl   %ecx, %eax
>> > >    0x100000ebe:  addq   $0x10, %rsp
>> > >    0x100000ec2:  popq   %rbp
>> > >    0x100000ec3:  retq
>> > >
>> > >
>> > > if you do "next" lldb will instruction step, comparing the stack ID
>> at every stop, until it gets to 0x100000ec3 at which point the stack ID
>> will change.  The CFA address (which the eh_frame tells us is rbp+16) just
>> changed to the caller's CFA address because we're about to return.  The
>> eh_frame instructions really need to tell us that the CFA is now rsp+8 at
>> 0x100000ec3.
>> > >
>> > > The end result is that you need to "next" twice to step out of a
>> function.
>> > >
>> > > AssemblyParse_x86 has a special bit where it looks for the 'ret'
>> instruction sequence at the end of the function -
>> > >
>> > >    // Now look at the byte at the end of the AddressRange for a
>> limited attempt at describing the
>> > >     // epilogue.  We're looking for the sequence
>> > >
>> > >     //  [ 0x5d ] pop %rbp
>> > >     //  [ 0xc3 ] ret
>> > >     //  [ 0xe8 xx xx xx xx ] call __stack_chk_fail  (this is
>> sometimes the final insn in the function)
>> > >
>> > >     // We want to add a Row describing how to unwind when we're
>> stopped on the 'ret' instruction where the
>> > >     // CFA is no longer defined in terms of rbp, but is now defined
>> in terms of rsp like on function entry.
>> > >
>> > >
>> > > and adds an extra row of unwind details for that instruction.
>> > >
>> > >
>> > > I mention x86_64 as being a possible good test case here because I
>> worry about the i386 picbase sequence (call next-instruction; pop %ebx)
>> which occurs a lot.  But for x86_64, my main concern is the epilogues.
>> > >
>> > >
>> > >
>> > > > On Jul 30, 2014, at 2:52 PM, Tong Shen <endlessroad at google.com>
>> wrote:
>> > > >
>> > > > Thanks Jason! That's a very informative post, it clarifies things a lot
>> :-)
>> > > >
>> > > > Well I have to admit that my patch is specifically for certain kinds
>> of functions, and now I see that's not the general case.
>> > > >
>> > > > I did some experiments with gdb. gdb uses CFI for frame 0, on either
>> x86 or x86_64. It looks for the FDE of frame 0, and does CFA calculations
>> according to that.
>> > > >
>> > > > - For compiler generated functions: I think there are 2 usage
>> scenarios for frame 0: breakpoint and signal.
>> > > >     - Breakpoints are usually at source line boundary instead of
>> instruction boundary, and generally we won't be caught at stack pointer
>> changing locations, so CFI is still valid.
>> > > >     - For signals, a synchronous unwind table may not be sufficient
>> here. But only stack-changing instructions will cause incorrect CFA
>> calculation, so it's not always the case.
>> > > > - For hand written assembly functions: from what I've seen, most of
>> the time CFI is present and actually asynchronous.
>> > > > So it seems that in most cases, even with only synchronous unwind
>> table, CFI is still correct.
>> > > >
>> > > > I believe we can trust eh_frame for frame 0 and use assembly
>> profiling as fallback. If both failed, maybe code owner should use
>> -fasynchronous-unwind-tables :-)
>> > > >
>> > > >
>> > > > On Tue, Jul 29, 2014 at 4:59 PM, Jason Molenda <jmolenda at apple.com>
>> wrote:
>> > > > It was a tricky one and got lost in the shuffle of a busy week.  I
>> was always reluctant to try profiling all the instructions in a function.
>>  On x86, compiler generated code (gcc/clang anyway) is very simplistic
>> about setting up the stack frame at the start and only having one epilogue
>> - so anything fancier risked making mistakes and could possibly have a
>> performance impact as we run functions through the disassembler.
>> > > >
>> > > > For hand-written assembly functions (which can be very creative
>> with their prologue/epilogue and where it is placed), my position is that
>> they should write eh_frame instructions in their assembly source to tell
>> lldb where to find things.  There are one or two libraries on Mac OS X where
>> we break the "ignore eh_frame for the currently executing function" rule because
>> there are many hand-written assembly functions in there and the eh_frame is
>> going to beat our own analysis.
>> > > >
>> > > >
>> > > > After I wrote the x86 unwinder, Greg and Caroline implemented the
>> arm unwinder where it emulates every instruction in the function looking
>> for prologue/epilogue instructions.  We haven't seen it having a
>> particularly bad impact performance-wise (lldb only does this disassembly
>> for functions that it finds on stacks during an execution run, and it saves
>> the result so it won't re-compute it for a given function).  The clang
>> armv7 codegen often has mid-function epilogues (early returns) which
>> definitely complicated things and made it necessary to step through the
>> entire function bodies.  There's a bunch of code I added to support these
>> mid-function epilogues - I have to save the register save state when I see
>> an instruction which looks like an epilogue, and when I see the final ret
>> instruction (aka restoring the saved lr contents into pc), I re-install the
>> register save state from before the epilogue started.
>> > > >
>> > > > These things always make me a little nervous because the
>> instruction analyzer obviously is doing a static analysis so it knows
>> nothing about flow control.  Tong's patch stops when it sees the first CALL
>> instruction - but that's not right, that's just solving the problem for his
>> particular function which doesn't have any CALL instructions before his
>> prologue. :) You could imagine a function which saves a couple of
>> registers, calls another function, then saves a couple more because it
>> needs more scratch registers.
>> > > >
>> > > > If we're going to change to profiling deep into the function -- and
>> I'm not opposed to doing that, it's been fine on arm -- we should just do
>> the entire function I think.
>> > > >
>> > > >
>> > > > Another alternative would be to trust eh_frame on x86_64 at frame
>> 0.  This is one of those things where there's not a great solution.  The
>> unwind instructions in eh_frame are only guaranteed to be accurate for
>> synchronous unwinds -- that is, they are only guaranteed to be accurate at
>> places where an exception could be thrown - at call sites.  So, for
>> instance, there's no reason why the compiler has to describe the function
>> prologue instructions at all.  There's no requirement that the eh_frame
>> instructions describe the epilogue instructions.  The information about
>> spilled registers only needs to be emitted where we could throw an
>> exception, or where a callee could throw an exception.
>> > > >
>> > > > clang/gcc both emit detailed instructions for the prologue setup.
>>  But for i386 codegen if the compiler needs to access some pc-relative
>> data, it will do a "call next-instruction; pop %eax" to get the current pc
>> value.  (x86_64 has rip-relative addressing so this isn't needed)  If
>> you're debugging -fomit-frame-pointer code, that means your CFA is
>> expressed in terms of the stack pointer and the stack pointer just changed
>> mid-function --- and eh_frame instructions don't describe this.
>> > > >
>> > > > The end result: If you want accurate unwinds 100% of the time, you
>> can't rely on the unwind instructions from eh_frame.  But they'll get you
>> accurate unwinds 99.9% of the time ...  also, last I checked, neither clang
>> nor gcc describe the epilogue instructions.
>> > > >
>> > > >
>> > > > In *theory* the unwind instructions from the DWARF debug_frame
>> section should be asynchronous -- they should describe how to find the CFA
>> address for every instruction in the function.  Which makes sense - you
>> want eh_frame to be compact because it's bundled into the executable, so it
>> should only have the information necessary for exception handling and you
>> can put the verbose stuff in debug_frame DWARF for debuggers.  But instead
>> (again, last time I checked), the compilers put the exact same thing in
>> debug_frame even if you use the -fasynchronous-unwind-tables (or whatever
>> that switch was) option.
>> > > >
>> > > >
>> > > > So I don't know, maybe we should just start trusting eh_frame at
>> frame 0 and write off those .1% cases where it isn't correct instead of
>> trying to get too fancy with the assembly analysis code.
>> > > >
>> > > >
>> > > >
>> > > > > On Jul 29, 2014, at 4:17 PM, Todd Fiala <tfiala at google.com>
>> wrote:
>> > > > >
>> > > > > Hey Jason,
>> > > > >
>> > > > > Do you have any feedback on this?
>> > > > >
>> > > > > Thanks!
>> > > > >
>> > > > > -Todd
>> > > > >
>> > > > >
>> > > > > On Fri, Jul 25, 2014 at 1:42 PM, Tong Shen <
>> endlessroad at google.com> wrote:
>> > > > > Sorry, wrong version of patch...
>> > > > >
>> > > > >
>> > > > > On Fri, Jul 25, 2014 at 1:41 PM, Tong Shen <
>> endlessroad at google.com> wrote:
>> > > > > Hi Molenda, lldb-commits,
>> > > > >
>> > > > > For now, the x86 assembly profiler will stop after 10 "non-prologue"
>> instructions. In practice that may not be sufficient. For example, we have a
>> hand-written assembly function which has hundreds of instructions before the
>> actual (stack-adjusting) prologue instructions.
>> > > > >
>> > > > > One way is to change the limit to 1000; but there will always be
>> functions that break the limit :-) I believe the right thing to do here is
>> parsing all instructions before "ret"/"call" as prologue instructions.
>> > > > >
>> > > > > Here's what I changed:
>> > > > > - For "push %rbx" and "mov %rbx, -8(%rbp)": only add the first row
>> for that register. They may appear multiple times in the function body. But as
>> long as one of them appears, the first appearance should be in the prologue (if
>> it's not in the prologue, this function will not use %rbx, so these 2
>> instructions should not appear at all).
>> > > > > - Also monitor "add %rsp 0x20".
>> > > > > - Remove non prologue instruction count.
>> > > > > - Add "call" instruction detection, and stop parsing after it.
>> > > > >
>> > > > > Thanks.
>> > > > >
>> > > > > --
>> > > > > Best Regards, Tong Shen
>> > > > >
>> > > > >
>> > > > >
>> > > > > --
>> > > > > Best Regards, Tong Shen
>> > > > >
>> > > > > _______________________________________________
>> > > > > lldb-commits mailing list
>> > > > > lldb-commits at cs.uiuc.edu
>> > > > > http://lists.cs.uiuc.edu/mailman/listinfo/lldb-commits
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > > --
>> > > > > Todd Fiala |   Software Engineer |     tfiala at google.com |
>> 650-943-3180
>> > > > >

-- 
Best Regards, Tong Shen
-------------- next part --------------
A non-text attachment was scrubbed...
Name: adjust_cfi_for_frame_zero.patch
Type: text/x-patch
Size: 6118 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/lldb-commits/attachments/20140814/b194a07b/attachment.bin>