[Lldb-commits] [PATCH] D37651: Fix for bug 34532 - A few rough corners related to post-mortem debugging (core/minidump)

Mon Sep 11 12:24:36 PDT 2017

Thanks Jim, I'll give option #1 a try and upload a new patch.

 think your divisions into options is too simplistic.  lldb is built of
> many different subsystems, and a lot of "inconsistent state" can be
> confined to one subsystem or other.  So for instance, if one CU's debug is
> unrecoverably damaged, we can just pretend that that .o file was built w/o
> debug info and the user can do lots of useful debugging still.  It is much
> better if we can say "Found irrecoverable errors in this CU, dumping debug
> info" and continue rather than crashing because one CU that the user may
> not even care about is all horked up.  Similarly a single JIT execution can
> get itself all balled up and yet if we can discard the work it did we can
> continue doing lots of other things.

The key here is the recoverable part - if you can isolate the state and
recover (throw-away, rollback), then I agree, that can be a preferable
approach. My point is about truly irrecoverable states - WTF states where
you don't expect to get into, can't do anything meaningful in response and
will likely cause cascading issues.

As a side note, I would *love* a generic roll-back mechanism in LLDB. Not
only that it can give us a viable alternative to failing fast, but it can
also allow things like LLDB-command interruption (which is one thing I'd
like to take a closer look when I get a chance)

On Mon, Sep 11, 2017 at 10:55 AM, Jim Ingham <jingham at apple.com> wrote:

>
> > On Sep 11, 2017, at 10:24 AM, Leonard Mosescu <mosescu at google.com>
> wrote:
> >
> > I think everyone is in agreement that input should be sanitized, and
> ill-formed input should not trigger an abrupt termination if there's a way
> to recover (which is not always feasible).
> >
> > The conversation started around fail-fast for inconsistent and
> unrecoverable state, where I think the math is very clear in this case.
> Using one the the comments about the long-lasting debugging sessions as
> example, let's say that we have a bug and the debugger state gets corrupted
> after 1h of debugging:
> > 1. debugger crashes immediately: lost 1h of debugging -> angry bug
> report (or ideally automated bug report)
> > 2. debugger limps along for a few more hours behaving erratically (not
> necessarily in an obvious fashion)
> > a. developer gives up after many hours or debugging without
> understanding what's going on
> > b. developer figures out that the debugger is unreliable
> >
> > While it is choosing between two evils, I think #1 is clearly the right
> choice: #2 results in more lost time and frustration, less chance of
> getting a bug report and can result in a general impression that the
> debugger is not to be trusted (which for a developer tool is infinitely
> worse than an unstable tool)
>
> I think your divisions into options is too simplistic.  lldb is built of
> many different subsystems, and a lot of "inconsistent state" can be
> confined to one subsystem or other.  So for instance, if one CU's debug is
> unrecoverably damaged, we can just pretend that that .o file was built w/o
> debug info and the user can do lots of useful debugging still.  It is much
> better if we can say "Found irrecoverable errors in this CU, dumping debug
> info" and continue rather than crashing because one CU that the user may
> not even care about is all horked up.  Similarly a single JIT execution can
> get itself all balled up and yet if we can discard the work it did we can
> continue doing lots of other things.
>
> So I think there are lots of places where you can get into a bad state
> locally, but that doesn't mean #2 is the outcome.  Sometimes you even get
> into a state where a specific target has done something totally
> incomprehensible.  In that case, you may have to jettison that target, but
> that might not be the only target/debugger in this lldb session.  So you
> still wouldn't want to quit, those targets are probably fine.
>
> As you note, this is a little off from the original discussion which was
> more about setting up and enforcing logic flows using the lldb_private
> API's.  If you really have to call B only after calling A, and you can
> clearly signal to yourself that A was called, then it is okay put an assert
> there.  But in your case, CanResume seems like an optimization - and you
> are clearly using it as such to avoid having to fix something deeper down
> in lldb.  But in that case DoResume can error for other reasons, and so the
> system should deal with it.  So adding another hard requirement was
> inelegant and exiting if it wasn't met just made that more obvious...
>
> >
> > --------------------------------------
> >
> > Going back to the original fix, I think most observations are very
> pertinent - yes, the result is not ideal. This is because the current
> logic, changing the state in Process::Resume() before the whole operation
> completes successfully is ... unfortunate. So here are the options I see:
> >
> > 1. Fix Process::Resume() so we only change the state late, after
> everything succeds
> > 2. Rollback state change if anything fails while resuming
> > 3. Patch the specific case of "resume is not supported"
> > 4. Do nothing
> >
> > The proposed fix is #3, since it's the less intrusive (ie. doesn't
> change the core resume logic) patch for a very specific and narrow case.
> IMO the "right" fix is #1, but it seems a more radical change. #2 seems
> tempting in terms of code changes, but I find it less compelling overall
> since rollbacks are generally fragile if the code is not designed with that
> in mind.
> >
> > So, LLDB experts, what do you think? I'd be happy to go with #1 if
> someone with experience in this area can review the change. Anything else
> that I missed?
> >
>
> Seems like #1 is the right choice if you're willing to give it a try.  You
> are right that this code is tricky, but it would be worth the effort to fix
> it right, if for no other reason than to get some other eyes on it.  I'm
> happy to answer any questions as you go along.  This code has ping-ponged
> back and forth between Greg and me, so we should both be able to help you
> out.
>
> Jim
>
> > Thanks!
> >
> >
> >
> >
> >
> >
> > On Sun, Sep 10, 2017 at 1:52 PM, Jim Ingham via lldb-commits <
> lldb-commits at lists.llvm.org> wrote:
> >
> >> On Sep 9, 2017, at 7:38 PM, Zachary Turner <zturner at google.com> wrote:
> >>
> >> Sure, but reading a core file involves handling user input. Obviously
> that has to be sanitized and you can't crash on user input. I don't think
> there's ever been any disagreement about that. On the other hand, if you
> just type an expression in the debugger and expect an answer back, after
> clang says the ast is valid, the whole stack should be asserts all the way
> down, because everything is validated
> >
> > Yes, but even that ends up not matching with the facts on the ground,
> because the debug information for types as used by lldb has to be produced
> from debug info, and the debug info can be malformed and not infrequently
> is.  So at some point clang is going to ask for something that it assumes
> must always be available because if it had read the types in from header
> files it would have asserted at an early stage but when it is fed types
> from debug info incrementally - which is the only efficient way the
> debugger can do it - is going to cause something to look wrong at a point
> where that “shouldn’t be possible.”  clang's rigid structure means it can’t
> cope with this and will crash on something the user didn’t themselves get
> wrong.
> >
> > According to your classification, we should really consider debug
> information “user input” as well.  But that means we’re wiring “user input”
> into a pretty deep place in clang and you’d probably have to weaken your
> strict assert model to handle that. And it makes me wonder if there’s
> anything the debugger consumes that wouldn’t fall into this category?
> >
> > For instance, we also let plugins provide threads and their stacks, and
> we try pretty hard not to crash and burn if these plugins get it a little
> wrong.  So process state really needs to be treated as “user data” as
> well.  And since we use SPI’s for most of this info that we are pretty much
> the only clients of, they can be expected to be flakey as well, again a
> fact that we should not inflict on users of the debugger if we can help it.
> >
> > The answer of trying to do everything dangerous out of process I think
> will make a  slow and fragile architecture.  And while I agree with the
> llvm.org decision not to use C++ exceptions because they make reasoning
> about the program more difficult, replacing that with SIGSEGV as the
> exception mechanism of choice seems really weak.  You are not the first
> person to point out that we could use the crash-protected threads, but I’m
> remain very unenthusiastic about this approach...
> >
> > Moreover, there are a bunch of problematic areas in lldb, but I don’t
> think any of them are problematic for the reasons that you are describing
> here.  There are areas of weak memory management in the data formatters &
> some places where we don’t manage locking right in handling process stops &
> the reconstruction of program state after that, and those cause the vast
> majority of the infrequent but annoying crashes we see.  I don’t myself
> believe that putting in this assert everywhere architecture would pay off
> in reduced crashes to compensate for the brittleness it would introduce.
> >
> > Finally, and most crucially, no failure of any expression evaluation and
> no malformed type is worth taking down a debug session, and it’s our
> responsibility working on lldb to make sure it doesn't.  If the makes our
> job a little harder, so be it.  This was Jason’s point earlier, but it is
> worth reiterating because it makes lldb a very different kind of program
> from all the other programs in the suite collected under the llvm project.
> If the compiler gets into an inconsistent state compiling a source file,
> it’s fine for it to just exit.  It wasn’t going to do any more useful work
> anyway.  That is definitely NOT true of lldb, and makes reasonable the
> notion that general solutions that seem appropriate for the other tools are
> not appropriate for lldb.
> >
> >
> > Jim
> >
> >
> >> On Sat, Sep 9, 2017 at 6:53 PM Jim Ingham <jingham at apple.com> wrote:
> >> I think we are talking at cross purposes.  Seems to me you are saying
> “If we can assert that the answers to questions I ask must always
> copacetic, then we can express that in software, and that will make things
> so much easier".  I’m saying "my experience of the data debuggers have to
> deal with is such that assuming such assertions will lead you to over-react
> to such errors.  Instead you should write your code so that if somebody
> gives you bad data you don’t fall over allowing the people who called you
> to decide how important the error was.”  Every core file that was written
> out by OS X for years had a section that was ill-formed.  Asserting when
> you get an ill-formed object file might seem a good way to ensure that you
> don’t have to make guesses that might lead you astray.  But the people who
> need to load core files on OS X are rightly unmoved by such arguments if
> lldb disappears out from under them when reading in core files.
> >>
> >> Jim
> >>
> >>
> >>
> >>> On Sep 9, 2017, at 1:31 PM, Zachary Turner <zturner at google.com> wrote:
> >>>
> >>>
> >>>
> >>> On Sat, Sep 9, 2017 at 12:04 PM Jim Ingham <jingham at apple.com> wrote:
> >>>
> >>> I disagree here.  If you can reasonably unwind from an error you
> should do so even if you can’t figure out how you could have gotten an
> answer you didn’t expect.  If an API is returning a pointer to something
> you should assume it might return nullptr unless the API explicitly states
> otherwise.
> >>> But that's exactly what an assert is.  It's an explicit statement by
> the API about what should happen.  Which is why by adding them liberally,
> these assumptions can then be propagated all the way up through many layers
> of the code, vastly simplifying the codebase.
> >>>
> >>> if you have
> >>>
> >>> void *foo(int x) {
> >>>   // do some stuff
> >>>
> >>>   assert(x < 0 || foo != nullptr);
> >>> }
> >>>
> >>> Then you're documenting that if x is greater than 0, the caller
> doesn't need to check the return value for nullptr.  Now instead of this:
> >>>
> >>> void *bar(unsigned x) {
> >>>   void *ptr = foo(x);
> >>>   if (!ptr) {
> >>>     // log an error
> >>>     return nullptr;
> >>>   }
> >>>   return ptr;
> >>> }
> >>>
> >>> You just have
> >>>
> >>> void *bar(unsigned x) {
> >>>   void *ptr = foo(x);
> >>>   assert(x);
> >>>   return x;
> >>> }
> >>>
> >>> And now the caller of bar doesn't have to check either.  The code has
> greatly reduced complexity due to the butterfly efflect of propagating
> these assumptions up.
> >>>
> >>> This is a simple example but the point is that building assumptions
> into your API is a good thing, because you can enforce them and it vastly
> simplifies the code.
> >>
> >
> >
> > _______________________________________________
> > lldb-commits mailing list
> > lldb-commits at lists.llvm.org
> > http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-commits
> >
> >
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/lldb-commits/attachments/20170911/b0519242/attachment-0001.html>