[Lldb-commits] [PATCH] Profile Assembly Until Ret Instruction
Tong Shen
endlessroad at google.com
Wed Jul 30 18:37:25 PDT 2014
Sorry, assemblies:
clang binary:
0x0000000000400530 <+0>: push %rbp
0x0000000000400531 <+1>: mov %rsp,%rbp
0x0000000000400534 <+4>: sub $0x10,%rsp
0x0000000000400538 <+8>: movabs $0x400604,%rdi
0x0000000000400542 <+18>: mov $0x0,%al
0x0000000000400544 <+20>: callq 0x400410 <puts@plt>
0x0000000000400549 <+25>: mov $0x5,%ecx
0x000000000040054e <+30>: mov %eax,-0x4(%rbp)
0x0000000000400551 <+33>: mov %ecx,%eax
0x0000000000400553 <+35>: add $0x10,%rsp
0x0000000000400557 <+39>: pop %rbp
0x0000000000400558 <+40>: retq
gcc binary:
0x000000000040052d <+0>: push %rbp
0x000000000040052e <+1>: mov %rsp,%rbp
0x0000000000400531 <+4>: mov $0x4005e4,%edi
0x0000000000400536 <+9>: callq 0x400410 <puts@plt>
0x000000000040053b <+14>: mov $0x5,%eax
0x0000000000400540 <+19>: pop %rbp
0x0000000000400541 <+20>: retq
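
To make the difference at the final retq concrete, here is a minimal sketch (not lldb code) of looking up which CFA rule is active for a given pc. The Row struct, the row_for_pc helper and the two table names are made up for illustration; the table contents are transcribed from the -F rows quoted further down, using the DWARF x86_64 register numbers r6 = rbp and r7 = rsp.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// DWARF register numbers on x86_64 that matter here.
enum DwarfReg { kRbp = 6, kRsp = 7 };

// One unwind row: from start_pc onward, CFA = cfa_reg + cfa_offset.
struct Row {
    uint64_t start_pc;
    DwarfReg cfa_reg;
    int64_t cfa_offset;
};

// Return the last row whose start_pc is <= pc.
// Assumes rows is non-empty, sorted by start_pc, and pc is inside the FDE.
const Row &row_for_pc(const std::vector<Row> &rows, uint64_t pc) {
    assert(!rows.empty() && rows.front().start_pc <= pc);
    const Row *best = &rows.front();
    for (const Row &r : rows) {
        if (r.start_pc <= pc) best = &r;
    }
    return *best;
}

// Rows for f() in the clang binary (FDE 0x400530..0x400559).
const std::vector<Row> kClangRows = {
    {0x400530, kRsp, 8},
    {0x400531, kRsp, 16},
    {0x400534, kRbp, 16},
};

// Rows for f() in the gcc binary (FDE 0x40052d..0x400542).
const std::vector<Row> kGccRows = {
    {0x40052d, kRsp, 8},
    {0x40052e, kRsp, 16},
    {0x400531, kRbp, 16},
    {0x400541, kRsp, 8},  // the extra row covering the retq
};
```

With these tables, row_for_pc(kClangRows, 0x400558) still yields rbp+16 at the retq, while row_for_pc(kGccRows, 0x400541) yields rsp+8, which is the epilogue row the clang output lacks.
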
On Wed, Jul 30, 2014 at 6:32 PM, Tong Shen <endlessroad at google.com> wrote:
> GCC seems to generate a row for the epilogue.
> Do you think this is a clang bug, or at least a discrepancy between clang
> & gcc?
>
> Source:
> int f() {
> puts("HI\n");
> return 5;
> }
>
> Compile option: only -g
>
> gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1)
> clang version 3.5.0 (213114)
>
> Env: Ubuntu 14.04, x86_64
>
> dwarfdump -F of clang binary:
> < 2><0x00400530:0x00400559><f><fde offset 0x00000088 length: 0x0000001c><eh aug data len 0x0>
> 0x00400530: <off cfa=08(r7) > <off r16=-8(cfa) >
> 0x00400531: <off cfa=16(r7) > <off r6=-16(cfa) > <off r16=-8(cfa) >
> 0x00400534: <off cfa=16(r6) > <off r6=-16(cfa) > <off r16=-8(cfa) >
>
> dwarfdump -F of gcc binary:
> < 1><0x0040052d:0x00400542><f><fde offset 0x00000070 length: 0x0000001c><eh aug data len 0x0>
> 0x0040052d: <off cfa=08(r7) > <off r16=-8(cfa) >
> 0x0040052e: <off cfa=16(r7) > <off r6=-16(cfa) > <off r16=-8(cfa) >
> 0x00400531: <off cfa=16(r6) > <off r6=-16(cfa) > <off r16=-8(cfa) >
> 0x00400541: <off cfa=08(r7) > <off r6=-16(cfa) > <off r16=-8(cfa) >
>
>
> On Wed, Jul 30, 2014 at 5:43 PM, Jason Molenda <jmolenda at apple.com> wrote:
>
>> I'm open to trying to trust eh_frame at frame 0 for x86_64. The lack of
>> epilogue descriptions in eh_frame is the biggest problem here.
>>
>> When you "step" or "next" in the debugger, the debugger instruction steps
>> across the source line until it gets to the next source line. Every time
>> it stops after an instruction step, it confirms that it is (1) between the
>> start and end pc values for the source line, and (2) that the "stack id"
>> (start address of the function + CFA address) is the same. If it stops and
>> the stack id has changed, for a "next" command, it will backtrace one stack
>> frame to see if it stepped into a function. If so, it sets a breakpoint on
>> the return address and continues.
>>
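
A rough sketch of the per-stop check described above, assuming a simplified model where a stack ID is just (function start address, CFA). StackID, StepAction and after_instruction_step are hypothetical names, not lldb's actual classes, and the order of the two checks is my assumption.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stack ID: start address of the function plus the CFA.
struct StackID {
    uint64_t func_start;
    uint64_t cfa;
};

inline bool same_stack(const StackID &a, const StackID &b) {
    return a.func_start == b.func_start && a.cfa == b.cfa;
}

enum class StepAction {
    KeepStepping,    // same frame, still inside the source line's pc range
    StopOnNewLine,   // same frame, pc left the source line's range
    StackIdChanged   // frame looks different: backtrace one frame to see
                     // whether we stepped into a function
};

// Decide what to do after one instruction step of a "next".
// [line_lo, line_hi) is the pc range of the source line being stepped over.
StepAction after_instruction_step(const StackID &start_id,
                                  const StackID &now_id,
                                  uint64_t pc,
                                  uint64_t line_lo,
                                  uint64_t line_hi) {
    assert(line_lo < line_hi);
    if (!same_stack(start_id, now_id))
        return StepAction::StackIdChanged;
    if (pc >= line_lo && pc < line_hi)
        return StepAction::KeepStepping;
    return StepAction::StopOnNewLine;
}
```

The point of the sketch is that a wrong CFA is enough to land in the StackIdChanged branch even though the pc never left the function.
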
>> If you switch lldb to prefer eh_frame instructions for x86_64, e.g.
>>
>> Index: source/Plugins/Process/Utility/RegisterContextLLDB.cpp
>> ===================================================================
>> --- source/Plugins/Process/Utility/RegisterContextLLDB.cpp (revision 214344)
>> +++ source/Plugins/Process/Utility/RegisterContextLLDB.cpp (working copy)
>> @@ -791,6 +791,22 @@
>> }
>> }
>>
>> + // For x86_64 debugging, let's try using the eh_frame instructions even if this is the currently
>> + // executing function (frame zero).
>> + Target *target = exe_ctx.GetTargetPtr();
>> + if (target
>> + && (target->GetArchitecture().GetCore() == ArchSpec::eCore_x86_64_x86_64h
>> + || target->GetArchitecture().GetCore() == ArchSpec::eCore_x86_64_x86_64))
>> + {
>> + unwind_plan_sp = func_unwinders_sp->GetUnwindPlanAtCallSite (m_current_offset_backed_up_one);
>> + int valid_offset = -1;
>> + if (IsUnwindPlanValidForCurrentPC(unwind_plan_sp, valid_offset))
>> + {
>> + UnwindLogMsgVerbose ("frame uses %s for full UnwindPlan, preferred over assembly profiling on x86_64", unwind_plan_sp->GetSourceName().GetCString());
>> + return unwind_plan_sp;
>> + }
>> + }
>> +
>> // Typically the NonCallSite UnwindPlan is the unwind created by inspecting the assembly language instructions
>> if (behaves_like_zeroth_frame)
>> {
>>
>>
>> you'll find that you have to "next" twice to step out of a function.
>> Why? With a simple function like:
>>
>> * thread #1: tid = 0xaf31e, 0x0000000100000eb9 a.out`foo + 25 at a.c:5,
>> queue = 'com.apple.main-thread', stop reason = step over
>> #0: 0x0000000100000eb9 a.out`foo + 25 at a.c:5
>> 2 int foo ()
>> 3 {
>> 4 puts("HI");
>> -> 5 return 5;
>> 6 }
>> 7
>> 8 int bar ()
>> (lldb) disass
>> a.out`foo at a.c:3:
>> 0x100000ea0: pushq %rbp
>> 0x100000ea1: movq %rsp, %rbp
>> 0x100000ea4: subq $0x10, %rsp
>> 0x100000ea8: leaq 0x6b(%rip), %rdi ; "HI"
>> 0x100000eaf: callq 0x100000efa ; symbol stub for: puts
>> 0x100000eb4: movl $0x5, %ecx
>> -> 0x100000eb9: movl %eax, -0x4(%rbp)
>> 0x100000ebc: movl %ecx, %eax
>> 0x100000ebe: addq $0x10, %rsp
>> 0x100000ec2: popq %rbp
>> 0x100000ec3: retq
>>
>>
>> if you do "next" lldb will instruction step, comparing the stack ID at
>> every stop, until it gets to 0x100000ec3 at which point the stack ID will
>> change. The CFA address (which the eh_frame tells us is rbp+16) just
>> changed to the caller's CFA address because we're about to return. The
>> eh_frame instructions really need to tell us that the CFA is now rsp+8 at
>> 0x100000ec3.
>>
>> The end result is that you need to "next" twice to step out of a function.
>>
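
Here is a small worked example of why the stack ID flips at the retq. It is only arithmetic on made-up register values, assuming the caller also keeps a frame pointer so that its rbp equals its own CFA minus 16; Regs and the helper functions are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

struct Regs {
    uint64_t rsp;
    uint64_t rbp;
};

// The rule eh_frame gives for the function body: CFA = rbp + 16.
uint64_t cfa_from_rbp(const Regs &r) { return r.rbp + 16; }

// The rule that is actually right when stopped on the retq: CFA = rsp + 8.
uint64_t cfa_from_rsp(const Regs &r) { return r.rsp + 8; }

// Register state when stopped on the retq, i.e. after "pop %rbp":
// rsp points at the return address (CFA - 8) and rbp already holds
// the caller's frame pointer again.
Regs regs_at_ret(uint64_t callee_cfa, uint64_t caller_rbp) {
    assert(callee_cfa >= 8);
    Regs r = {callee_cfa - 8, caller_rbp};
    return r;
}
```

With a callee CFA of 0x7fff5000 and a caller CFA of 0x7fff5040 (so caller rbp = 0x7fff5030), the rbp-based rule evaluates to the caller's CFA at the retq, while the rsp-based rule still evaluates to the callee's CFA. That mismatch is what makes the stack ID look like it changed.
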
>> AssemblyParse_x86 has a special bit where it looks for the 'ret'
>> instruction sequence at the end of the function -
>>
>> // Now look at the byte at the end of the AddressRange for a limited attempt at describing the
>> // epilogue. We're looking for the sequence
>>
>> // [ 0x5d ] pop %rbp
>> // [ 0xc3 ] ret
>> // [ 0xe8 xx xx xx xx ] call __stack_chk_fail (this is sometimes the final insn in the function)
>>
>> // We want to add a Row describing how to unwind when we're stopped on the 'ret' instruction where the
>> // CFA is no longer defined in terms of rbp, but is now defined in terms of rsp like on function entry.
>>
>>
>> and adds an extra row of unwind details for that instruction.
>>
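
For illustration, the byte check described in that comment could look roughly like this. It is a standalone sketch under my own assumptions (the function name, the out-parameter, and looking only at the last few bytes of the function), not the actual AssemblyParse_x86 code. Note that 0x5d is the one-byte encoding of pop %rbp.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Look at the tail of a function's bytes for
//     5d              pop %rbp
//     c3              ret
//     e8 xx xx xx xx  call rel32 (optional trailing call)
// On a match, store the offset of the ret in ret_offset and return true.
bool find_epilogue_ret(const std::vector<uint8_t> &code, std::size_t &ret_offset) {
    const std::size_t n = code.size();

    // Function ends with "pop %rbp; ret".
    if (n >= 2 && code[n - 2] == 0x5d && code[n - 1] == 0xc3) {
        ret_offset = n - 1;
        return true;
    }

    // Function ends with "pop %rbp; ret; call rel32".
    if (n >= 7 && code[n - 7] == 0x5d && code[n - 6] == 0xc3 && code[n - 5] == 0xe8) {
        ret_offset = n - 6;
        return true;
    }

    return false;
}
```

The row to add at ret_offset would then say CFA = rsp + 8, the same rule as on function entry. A real implementation has to worry about false matches, since these byte values can also occur inside other instructions.
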
>>
>> I mention x86_64 as being a possible good test case here because I worry
>> about the i386 picbase sequence (call next-instruction; pop $ebx) which
>> occurs a lot. But for x86_64, my main concern is the epilogues.
>>
>>
>>
>> > On Jul 30, 2014, at 2:52 PM, Tong Shen <endlessroad at google.com> wrote:
>> >
>> > Thanks Jason! That's a very informative post; it clarifies things a lot :-)
>> >
>> > Well, I have to admit that my patch is specifically for a certain kind of
>> function, and now I see that's not the general case.
>> >
>> > I did some experiments with gdb. gdb uses CFI for frame 0, on both x86 and
>> x86_64. It looks up the FDE for frame 0 and does the CFA calculation
>> according to that.
>> >
>> > - For compiler-generated functions: I think there are 2 usage scenarios
>> for frame 0: breakpoint and signal.
>> > - Breakpoints are usually at source line boundaries instead of
>> instruction boundaries, and generally we won't be caught at locations where
>> the stack pointer is changing, so CFI is still valid.
>> > - For signals, a synchronous unwind table may not be sufficient.
>> But only stack-changing instructions will cause an incorrect CFA calculation,
>> so it's not always the case.
>> > - For hand-written assembly functions: from what I've seen, most of the
>> time CFI is present and actually asynchronous.
>> > So it seems that in most cases, even with only a synchronous unwind
>> table, CFI is still correct.
>> >
>> > I believe we can trust eh_frame for frame 0 and use assembly profiling
>> as a fallback. If both fail, maybe the code owner should use
>> -fasynchronous-unwind-tables :-)
>> >
>> >
>> > On Tue, Jul 29, 2014 at 4:59 PM, Jason Molenda <jmolenda at apple.com>
>> wrote:
>> > It was a tricky one and got lost in the shuffle of a busy week. I was
>> always reluctant to try profiling all the instructions in a function. On
>> x86, compiler generated code (gcc/clang anyway) is very simplistic about
>> setting up the stack frame at the start and only having one epilogue - so
>> anything fancier risked making mistakes and could possibly have a
>> performance impact as we run functions through the disassembler.
>> >
>> > For hand-written assembly functions (which can be very creative with
>> their prologue/epilogue and where it is placed), my position is that they
>> should write eh_frame instructions in their assembly source to tell lldb
>> where to find things. There are one or two libraries on Mac OS X where we
>> break the "ignore eh_frame for the currently executing function" rule because
>> there are many hand-written assembly functions in there and the eh_frame is
>> going to beat our own analysis.
>> >
>> >
>> > After I wrote the x86 unwinder, Greg and Caroline implemented the arm
>> unwinder where it emulates every instruction in the function looking for
>> prologue/epilogue instructions. We haven't seen it having a particularly
>> bad impact performance-wise (lldb only does this disassembly for functions
>> that it finds on stacks during an execution run, and it saves the result so
>> it won't re-compute it for a given function). The clang armv7 codegen
>> often has mid-function epilogues (early returns) which definitely
>> complicated things and made it necessary to step through the entire
>> function bodies. There's a bunch of code I added to support these
>> mid-function epilogues - I have to save the register save state when I see
>> an instruction which looks like an epilogue, and when I see the final ret
>> instruction (aka restoring the saved lr contents into pc), I re-install the
>> register save state from before the epilogue started.
>> >
>> > These things always make me a little nervous because the instruction
>> analyzer obviously is doing a static analysis so it knows nothing about
>> flow control. Tong's patch stops when it sees the first CALL instruction -
>> but that's not right, that's just solving the problem for his particular
>> function which doesn't have any CALL instructions before his prologue. :)
>> You could imagine a function which saves a couple of registers, calls
>> another function, then saves a couple more because it needs more scratch
>> registers.
>> >
>> > If we're going to change to profiling deep into the function -- and I'm
>> not opposed to doing that, it's been fine on arm -- we should just do the
>> entire function I think.
>> >
>> >
>> > Another alternative would be to trust eh_frame on x86_64 at frame 0.
>> This is one of those things where there's not a great solution. The
>> unwind instructions in eh_frame are only guaranteed to be accurate for
>> synchronous unwinds -- that is, they are only guaranteed to be accurate at
>> places where an exception could be thrown - at call sites. So, for
>> instance, there's no reason why the compiler has to describe the function
>> prologue instructions at all. There's no requirement that the eh_frame
>> instructions describe the epilogue instructions. The information about
>> spilled registers only needs to be emitted where we could throw an
>> exception, or where a callee could throw an exception.
>> >
>> > clang/gcc both emit detailed instructions for the prologue setup. But
>> for i386 codegen if the compiler needs to access some pc-relative data, it
>> will do a "call next-instruction; pop %eax" to get the current pc value.
>> (x86_64 has rip-relative addressing so this isn't needed) If you're
>> debugging -fomit-frame-pointer code, that means your CFA is expressed in
>> terms of the stack pointer and the stack pointer just changed mid-function
>> --- and eh_frame instructions don't describe this.
>> >
>> > The end result: If you want accurate unwinds 100% of the time, you
>> can't rely on the unwind instructions from eh_frame. But they'll get you
>> accurate unwinds 99.9% of the time ... also, last I checked, neither clang
>> nor gcc describe the epilogue instructions.
>> >
>> >
>> > In *theory* the unwind instructions from the DWARF debug_frame section
>> should be asynchronous -- they should describe how to find the CFA address
>> for every instruction in the function. Which makes sense - you want
>> eh_frame to be compact because it's bundled into the executable, so it
>> should only have the information necessary for exception handling and you
>> can put the verbose stuff in debug_frame DWARF for debuggers. But instead
>> (again, last time I checked), the compilers put the exact same thing in
>> debug_frame even if you use the -fasynchronous-unwind-tables (or whatever
>> that switch was) option.
>> >
>> >
>> > So I don't know, maybe we should just start trusting eh_frame at frame
>> 0 and write off those .1% cases where it isn't correct instead of trying to
>> get too fancy with the assembly analysis code.
>> >
>> >
>> >
>> > > On Jul 29, 2014, at 4:17 PM, Todd Fiala <tfiala at google.com> wrote:
>> > >
>> > > Hey Jason,
>> > >
>> > > Do you have any feedback on this?
>> > >
>> > > Thanks!
>> > >
>> > > -Todd
>> > >
>> > >
>> > > On Fri, Jul 25, 2014 at 1:42 PM, Tong Shen <endlessroad at google.com>
>> wrote:
>> > > Sorry, wrong version of patch...
>> > >
>> > >
>> > > On Fri, Jul 25, 2014 at 1:41 PM, Tong Shen <endlessroad at google.com>
>> wrote:
>> > > Hi Molenda, lldb-commits,
>> > >
>> > > For now, the x86 assembly profiler will stop after 10 "non-prologue"
>> instructions. In practice that may not be sufficient. For example, we have a
>> hand-written assembly function which has hundreds of instructions before the
>> actual (stack-adjusting) prologue instructions.
>> > >
>> > > One way is to change the limit to 1000; but there will always be
>> functions that break the limit :-) I believe the right thing to do here is
>> to parse all instructions before "ret"/"call" as prologue instructions.
>> > >
>> > > Here's what I changed:
>> > > - For "push %rbx" and "mov %rbx, -8(%rbp)": only add the first row for
>> that register. They may appear multiple times in the function body. But as long
>> as one of them appears, the first appearance should be in the prologue (if it's
>> not in the prologue, this function will not use %rbx, so these 2 instructions
>> should not appear at all).
>> > > - Also monitor "add %rsp 0x20".
>> > > - Remove the non-prologue instruction count.
>> > > - Add "call" instruction detection, and stop parsing after it.
>> > >
>> > > Thanks.
>> > >
>> > > --
>> > > Best Regards, Tong Shen
>> > >
>> > >
>> > >
>> > > --
>> > > Best Regards, Tong Shen
>> > >
>> > > _______________________________________________
>> > > lldb-commits mailing list
>> > > lldb-commits at cs.uiuc.edu
>> > > http://lists.cs.uiuc.edu/mailman/listinfo/lldb-commits
>> > >
>> > >
>> > >
>> > >
>> > > --
>> > > Todd Fiala | Software Engineer | tfiala at google.com |
>> 650-943-3180
>> > >
>> >
>> >
>> >
>> >
>> > --
>> > Best Regards, Tong Shen
>>
>>
>
>
> --
> Best Regards, Tong Shen
>
--
Best Regards, Tong Shen