[Lldb-commits] [PATCH] D37651: Fix for bug 34532 - A few rough corners related to post-mortem debugging (core/minidump)

Mon Sep 11 12:25:39 PDT 2017

+Mark Mentovai

On Mon, Sep 11, 2017 at 12:24 PM, Leonard Mosescu <mosescu at google.com>
wrote:

> Thanks Jim, I'll give option #1 a try and upload a new patch.
>
>  think your divisions into options is too simplistic.  lldb is built of
>> many different subsystems, and a lot of "inconsistent state" can be
>> confined to one subsystem or other.  So for instance, if one CU's debug is
>> unrecoverably damaged, we can just pretend that that .o file was built w/o
>> debug info and the user can do lots of useful debugging still.  It is much
>> better if we can say "Found irrecoverable errors in this CU, dumping debug
>> info" and continue rather than crashing because one CU that the user may
>> not even care about is all horked up.  Similarly a single JIT execution can
>> get itself all balled up and yet if we can discard the work it did we can
>> continue doing lots of other things.
>
>
> The key here is the recoverable part - if you can isolate the state and
> recover (throw-away, rollback), then I agree, that can be a preferable
> approach. My point is about truly irrecoverable states - WTF states where
> you don't expect to get into, can't do anything meaningful in response and
> will likely cause cascading issues.
>
> As a side note, I would *love* a generic roll-back mechanism in LLDB. Not
> only that it can give us a viable alternative to failing fast, but it can
> also allow things like LLDB-command interruption (which is one thing I'd
> like to take a closer look when I get a chance)
>
> On Mon, Sep 11, 2017 at 10:55 AM, Jim Ingham <jingham at apple.com> wrote:
>
>>
>> > On Sep 11, 2017, at 10:24 AM, Leonard Mosescu <mosescu at google.com>
>> wrote:
>> >
>> > I think everyone is in agreement that input should be sanitized, and
>> ill-formed input should not trigger an abrupt termination if there's a way
>> to recover (which is not always feasible).
>> >
>> > The conversation started around fail-fast for inconsistent and
>> unrecoverable state, where I think the math is very clear in this case.
>> Using one the the comments about the long-lasting debugging sessions as
>> example, let's say that we have a bug and the debugger state gets corrupted
>> after 1h of debugging:
>> > 1. debugger crashes immediately: lost 1h of debugging -> angry bug
>> report (or ideally automated bug report)
>> > 2. debugger limps along for a few more hours behaving erratically (not
>> necessarily in an obvious fashion)
>> > a. developer gives up after many hours or debugging without
>> understanding what's going on
>> > b. developer figures out that the debugger is unreliable
>> >
>> > While it is choosing between two evils, I think #1 is clearly the right
>> choice: #2 results in more lost time and frustration, less chance of
>> getting a bug report and can result in a general impression that the
>> debugger is not to be trusted (which for a developer tool is infinitely
>> worse than an unstable tool)
>>
>> I think your divisions into options is too simplistic.  lldb is built of
>> many different subsystems, and a lot of "inconsistent state" can be
>> confined to one subsystem or other.  So for instance, if one CU's debug is
>> unrecoverably damaged, we can just pretend that that .o file was built w/o
>> debug info and the user can do lots of useful debugging still.  It is much
>> better if we can say "Found irrecoverable errors in this CU, dumping debug
>> info" and continue rather than crashing because one CU that the user may
>> not even care about is all horked up.  Similarly a single JIT execution can
>> get itself all balled up and yet if we can discard the work it did we can
>> continue doing lots of other things.
>>
>> So I think there are lots of places where you can get into a bad state
>> locally, but that doesn't mean #2 is the outcome.  Sometimes you even get
>> into a state where a specific target has done something totally
>> incomprehensible.  In that case, you may have to jettison that target, but
>> that might not be the only target/debugger in this lldb session.  So you
>> still wouldn't want to quit, those targets are probably fine.
>>
>> As you note, this is a little off from the original discussion which was
>> more about setting up and enforcing logic flows using the lldb_private
>> API's.  If you really have to call B only after calling A, and you can
>> clearly signal to yourself that A was called, then it is okay put an assert
>> there.  But in your case, CanResume seems like an optimization - and you
>> are clearly using it as such to avoid having to fix something deeper down
>> in lldb.  But in that case DoResume can error for other reasons, and so the
>> system should deal with it.  So adding another hard requirement was
>> inelegant and exiting if it wasn't met just made that more obvious...
>>
>> >
>> > --------------------------------------
>> >
>> > Going back to the original fix, I think most observations are very
>> pertinent - yes, the result is not ideal. This is because the current
>> logic, changing the state in Process::Resume() before the whole operation
>> completes successfully is ... unfortunate. So here are the options I see:
>> >
>> > 1. Fix Process::Resume() so we only change the state late, after
>> everything succeds
>> > 2. Rollback state change if anything fails while resuming
>> > 3. Patch the specific case of "resume is not supported"
>> > 4. Do nothing
>> >
>> > The proposed fix is #3, since it's the less intrusive (ie. doesn't
>> change the core resume logic) patch for a very specific and narrow case.
>> IMO the "right" fix is #1, but it seems a more radical change. #2 seems
>> tempting in terms of code changes, but I find it less compelling overall
>> since rollbacks are generally fragile if the code is not designed with that
>> in mind.
>> >
>> > So, LLDB experts, what do you think? I'd be happy to go with #1 if
>> someone with experience in this area can review the change. Anything else
>> that I missed?
>> >
>>
>> Seems like #1 is the right choice if you're willing to give it a try.
>> You are right that this code is tricky, but it would be worth the effort to
>> fix it right, if for no other reason than to get some other eyes on it.
>> I'm happy to answer any questions as you go along.  This code has
>> ping-ponged back and forth between Greg and me, so we should both be able
>> to help you out.
>>
>> Jim
>>
>> > Thanks!
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Sun, Sep 10, 2017 at 1:52 PM, Jim Ingham via lldb-commits <
>> lldb-commits at lists.llvm.org> wrote:
>> >
>> >> On Sep 9, 2017, at 7:38 PM, Zachary Turner <zturner at google.com> wrote:
>> >>
>> >> Sure, but reading a core file involves handling user input. Obviously
>> that has to be sanitized and you can't crash on user input. I don't think
>> there's ever been any disagreement about that. On the other hand, if you
>> just type an expression in the debugger and expect an answer back, after
>> clang says the ast is valid, the whole stack should be asserts all the way
>> down, because everything is validated
>> >
>> > Yes, but even that ends up not matching with the facts on the ground,
>> because the debug information for types as used by lldb has to be produced
>> from debug info, and the debug info can be malformed and not infrequently
>> is.  So at some point clang is going to ask for something that it assumes
>> must always be available because if it had read the types in from header
>> files it would have asserted at an early stage but when it is fed types
>> from debug info incrementally - which is the only efficient way the
>> debugger can do it - is going to cause something to look wrong at a point
>> where that “shouldn’t be possible.”  clang's rigid structure means it can’t
>> cope with this and will crash on something the user didn’t themselves get
>> wrong.
>> >
>> > According to your classification, we should really consider debug
>> information “user input” as well.  But that means we’re wiring “user input”
>> into a pretty deep place in clang and you’d probably have to weaken your
>> strict assert model to handle that. And it makes me wonder if there’s
>> anything the debugger consumes that wouldn’t fall into this category?
>> >
>> > For instance, we also let plugins provide threads and their stacks, and
>> we try pretty hard not to crash and burn if these plugins get it a little
>> wrong.  So process state really needs to be treated as “user data” as
>> well.  And since we use SPI’s for most of this info that we are pretty much
>> the only clients of, they can be expected to be flakey as well, again a
>> fact that we should not inflict on users of the debugger if we can help it.
>> >
>> > The answer of trying to do everything dangerous out of process I think
>> will make a  slow and fragile architecture.  And while I agree with the
>> llvm.org decision not to use C++ exceptions because they make reasoning
>> about the program more difficult, replacing that with SIGSEGV as the
>> exception mechanism of choice seems really weak.  You are not the first
>> person to point out that we could use the crash-protected threads, but I’m
>> remain very unenthusiastic about this approach...
>> >
>> > Moreover, there are a bunch of problematic areas in lldb, but I don’t
>> think any of them are problematic for the reasons that you are describing
>> here.  There are areas of weak memory management in the data formatters &
>> some places where we don’t manage locking right in handling process stops &
>> the reconstruction of program state after that, and those cause the vast
>> majority of the infrequent but annoying crashes we see.  I don’t myself
>> believe that putting in this assert everywhere architecture would pay off
>> in reduced crashes to compensate for the brittleness it would introduce.
>> >
>> > Finally, and most crucially, no failure of any expression evaluation
>> and no malformed type is worth taking down a debug session, and it’s our
>> responsibility working on lldb to make sure it doesn't.  If the makes our
>> job a little harder, so be it.  This was Jason’s point earlier, but it is
>> worth reiterating because it makes lldb a very different kind of program
>> from all the other programs in the suite collected under the llvm project.
>> If the compiler gets into an inconsistent state compiling a source file,
>> it’s fine for it to just exit.  It wasn’t going to do any more useful work
>> anyway.  That is definitely NOT true of lldb, and makes reasonable the
>> notion that general solutions that seem appropriate for the other tools are
>> not appropriate for lldb.
>> >
>> >
>> > Jim
>> >
>> >
>> >> On Sat, Sep 9, 2017 at 6:53 PM Jim Ingham <jingham at apple.com> wrote:
>> >> I think we are talking at cross purposes.  Seems to me you are saying
>> “If we can assert that the answers to questions I ask must always
>> copacetic, then we can express that in software, and that will make things
>> so much easier".  I’m saying "my experience of the data debuggers have to
>> deal with is such that assuming such assertions will lead you to over-react
>> to such errors.  Instead you should write your code so that if somebody
>> gives you bad data you don’t fall over allowing the people who called you
>> to decide how important the error was.”  Every core file that was written
>> out by OS X for years had a section that was ill-formed.  Asserting when
>> you get an ill-formed object file might seem a good way to ensure that you
>> don’t have to make guesses that might lead you astray.  But the people who
>> need to load core files on OS X are rightly unmoved by such arguments if
>> lldb disappears out from under them when reading in core files.
>> >>
>> >> Jim
>> >>
>> >>
>> >>
>> >>> On Sep 9, 2017, at 1:31 PM, Zachary Turner <zturner at google.com>
>> wrote:
>> >>>
>> >>>
>> >>>
>> >>> On Sat, Sep 9, 2017 at 12:04 PM Jim Ingham <jingham at apple.com> wrote:
>> >>>
>> >>> I disagree here.  If you can reasonably unwind from an error you
>> should do so even if you can’t figure out how you could have gotten an
>> answer you didn’t expect.  If an API is returning a pointer to something
>> you should assume it might return nullptr unless the API explicitly states
>> otherwise.
>> >>> But that's exactly what an assert is.  It's an explicit statement by
>> the API about what should happen.  Which is why by adding them liberally,
>> these assumptions can then be propagated all the way up through many layers
>> of the code, vastly simplifying the codebase.
>> >>>
>> >>> if you have
>> >>>
>> >>> void *foo(int x) {
>> >>>   // do some stuff
>> >>>
>> >>>   assert(x < 0 || foo != nullptr);
>> >>> }
>> >>>
>> >>> Then you're documenting that if x is greater than 0, the caller
>> doesn't need to check the return value for nullptr.  Now instead of this:
>> >>>
>> >>> void *bar(unsigned x) {
>> >>>   void *ptr = foo(x);
>> >>>   if (!ptr) {
>> >>>     // log an error
>> >>>     return nullptr;
>> >>>   }
>> >>>   return ptr;
>> >>> }
>> >>>
>> >>> You just have
>> >>>
>> >>> void *bar(unsigned x) {
>> >>>   void *ptr = foo(x);
>> >>>   assert(x);
>> >>>   return x;
>> >>> }
>> >>>
>> >>> And now the caller of bar doesn't have to check either.  The code has
>> greatly reduced complexity due to the butterfly efflect of propagating
>> these assumptions up.
>> >>>
>> >>> This is a simple example but the point is that building assumptions
>> into your API is a good thing, because you can enforce them and it vastly
>> simplifies the code.
>> >>
>> >
>> >
>> > _______________________________________________
>> > lldb-commits mailing list
>> > lldb-commits at lists.llvm.org
>> > http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-commits
>> >
>> >
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/lldb-commits/attachments/20170911/2272a2d4/attachment-0001.html>