[Lldb-commits] [PATCH] D37651: Fix for bug 34532 - A few rough corners related to post-mortem debugging (core/minidump)

Sat Sep 9 12:04:02 PDT 2017

> On Sep 9, 2017, at 11:36 AM, Zachary Turner <zturner at google.com> wrote:
> 
> 
> 
> On Sat, Sep 9, 2017 at 11:18 AM Jim Ingham <jingham at apple.com <mailto:jingham at apple.com>> wrote:
>> On Sep 8, 2017, at 11:45 PM, Zachary Turner <zturner at google.com <mailto:zturner at google.com>> wrote:
>> 
> 
>> On Fri, Sep 8, 2017 at 8:19 PM Jason Molenda <jmolenda at apple.com <mailto:jmolenda at apple.com>> wrote:
>> Also keep in mind that debug sessions have a tendency to be long lived.  I may be working through a problem for half an hour -- or this may be the one rare instance where a bug reproduces -- and crashing because some bogus piece of debug info that I don't care about is invalid is not acceptable.  I've gotten those bug reports, where someone has spent all day getting to just the point where they have the problem in front of them, only to have the debugger crash, and they are rightfully enraged at the debugger for this.
>> 
>> An interactive tool like the debugger cannot use a model where asserting is acceptable.  Even if lldb is out-of-process, you're losing all of the work you'd done until then, and complex bugs can involve extraordinarily long running debug sessions.  The fact that llvm/clang can assert out from under us is to the detriment of lldb's quality.
>> 
>> J
>> Fwiw i think we all agree that it shouldn't crash.  What we disagree on is the best way of making it crash less.  It's easy to see why someone would argue that adding an assert increases the incidence of crashes.  I mean on the surface it seems ridiculous to argue otherwise.  
>> 
>> But otherwise is precisely what we all continue to argue.
>> 
>> Your position is that by going out of your way to avoid crashes, it will crash less and be more stable.  Pretty intuitive position if I'm being honest.  But I don't agree.  
>> 
>> I think you are trading short term metrics for long term technical debt
>> You fix one crash by adding a null check, without figuring out why null ever got in there in the first place.  
>> 
>> You get fewer bug reports because instead of crashing, which would hopefully trigger some automatic crash uploader in Xcode that automatically files a rdar, you just never find out about the bug in the first place.
>> 
>> You introduce even more bugs, because when you go to edit a function, its complexity and branching is through the roof, and you overlook something some corner case.
>> 
>> You have less test coverage because you introduce another untested branch in the code, reducing the already abysmal code coverage even further.
>> 
>> The way to reduce crashing is to *increase* the code coverage of your test suite.  That is the solution.  And you don't do that by adding null checks everywhere, you do it by removing them and asserting.
>> 
>> Yes, it means Xcode might crash.  But you know the bug exists!  How many bugs are out there right now that nobody even knows about because they are hidden by a null pointer check and instead of the user trying to figure out what's going and filing a bug report, they just give up and use gdb instead.  That's one more person you're never going to get back.
>> 
>> Crashing Xcode is annoying, but it's fixable.  But when your technical debt is going up and to the right, and your test coverage is going down and to the right, that's an existential threat to the project .
>> 
>> Btw, i still don't understand why asserts cause anything to crash, given that they're supposed to be turned off in a release build
> 
> 
>  I was objecting to the use of llvm_report_fatal_error.  That crashes no matter what.  And I was suggesting replacing that with an lldb_assert.  So I’m not sure we are so much in disagreement about that.
> You're right that crashes no matter what, but when an lldb_assert fires, you only find out about it if someone is kind enough to submit a bug report.  I'm assuming that if it were to crash you would find out about it with an automatically filed rdar with an attached core file.  (I might be wrong about this)

Well, not a core file since those are huge on OS X and in my experience seldom add enough value to be worth the extra cost.  But you would also have crashed.

>  
> 
> I was also objecting to the approach where instead of figuring out why the errors from one call - DoResmume - were not getting handled properly - we add ANOTHER earlier check - CanResume - and then we have to deal with what happens when there’s a code path that doesn’t get caught by the added check.  That is just adding weakness, and a chance that some obscure programmer error gets us confused.
> I agree with you here.
>  
> 
> I also am skeptical about the "just assert if things aren’t quite right" approach, because it removes the responsibility for thinking about how you would get to that state and how you would get out of it safely.  But more crucially, it exaggerates the importance of subsections of lldb’s functionality.  For instance, to the JIT, being asked to handle a relocation that it doesn’t support is a fatal error, so it seems okay by the fail soon lights to just fatal error.  But to lldb no expression is important enough to quit if it doesn’t work, so we would have really very much preferred if the JIT had been forced to back out from the error.  Ditto with llvm & clang and handling types.  Even if the types from one CU are horked up beyond belief, there’s lots of useful debugging you can still do.
> One way to handle this would be to have just the JIT run out of process.  Or to have it run in a separate thread with LLVM's crash recovery handling where a single thread can terminate without taking down the rest of the process.  See llvm::CrashRecoveryContext for example.  If the JIT thinks it's in a state where it can't continue, no amount of backing out is going to help.  Sure, there are often ways to redesign things so that it can fail better, I'm not denying that.  I'm just saying that when you get a bug that says "lldb crashed because x was null", the right fix is probably not a one line change that says "add a null check for x".  Maybe you can return a reference and propagate the requirements up through the API via the type system.  Maybe after some investigation it turns out it was null because you violated one of the API's constraints (which could happen for various reasons, including the API not properly documenting its assumptions, which definitely happens).
> 
>  

Sometimes it is violating some API constraint but more often you are dealing with the fact that you SHOULD get something back from some subsystem but e.g. even though a class with a base class should be able to produce its base class type but the debug info was written in such a way that in this case nobody knew the base class so the type is no good.  The debugger’s world is full of bad data and people lying to it just by the nature of the systems it has to engage with.  So you just have to deal with bad data and being lied to while limiting the damage as narrowly as possible.

> 
> 
> It also seems like you are suggesting some conflict between adding more testing and trying to handle errors rather than exiting when they arise.  There’s really no linkage between the two so far as I can see.
> 
> In the most trivial case, suppose you've got this program:
> 
> int main(int argc, char **argv) {
>   return 0;
> } 
> 
> And you write a test that foo.exe returns 0.
> 
> You've got 100% code coverage.  Now suppose you add this:
> 
> int main(int argc, char **argv) {
>   if (argc > 2)
>     return 1;
>   return 0;
> }
> 
> Your same test now has 50% code coverage.  By adding a branch with no test, you have reduced the code coverage of the system.  Now, you can obviously add a test for this branch, but when the situation is "someone reported in the field that x was null, so we're adding a null pointer check", then you've just made the problem worse.  You can't reproduce it so there's no test, and now maybe they're going to encounter another bug down the road where y is null and caushes a crash, so we have to add a null pointer check for y.  It's a domino effect.  Eventually you get to the point where the code is unintelligible, because nobody understands when (or even if) any of these conditions can happen.
> 
> But it gets worse.  If the problem is "foo() returned null for this guy on line 42 of this file, so let's add a null check", now you have to go look at all the other places foo() is called, of which there could be hundreds.  
> 
> A good rule of thumb is that if the code owner of a particular piece of code can look at a function and not explain in detail the circumstances under which a branch can be entered, then you can delete the branch.

I disagree here.  If you can reasonably unwind from an error you should do so even if you can’t figure out how you could have gotten an answer you didn’t expect.  If an API is returning a pointer to something you should assume it might return nullptr unless the API explicitly states otherwise.  So you should think about what you would do if that’s true. In places where we know we can always produce a result, we use references instead to signal that fact, and if we’re using pointers where we know we will always be able to get an answer, we definitely should fix those API’s.  But you better be really sure you will always get an answer.  Claiming by passing a reference that an operation can’t fail, finding it can fail and adding an assert is not better than admitting that the operation might fail and forcing callers to handle that.

Jim

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/lldb-commits/attachments/20170909/ac6710fe/attachment-0001.html>