[lldb-dev] LLDB hang loading Linux core files from live processes (Bug 26322)

Fri Nov 11 10:07:03 PST 2016

I think both are valid fixes. Threads in core files can have a non-zero signal. See comments below.

> On Nov 11, 2016, at 5:36 AM, Howard Hellyer via lldb-dev <lldb-dev at lists.llvm.org> wrote:
> 
> Hi Jim 
> 
> I was afraid someone would say that, but I've done some digging and found a difference between the core files generated by gcore and those generated by a crash or abort. 
> 
> Most of the core files have one SIGINFO structure in the core; I think it belongs to the preceding thread (the one that caught the signal). 
> In the core files generated by gcore all of the threads have a SIGINFO structure following their PRSTATUS structure. In the non-gcore files the value of info.si_signo in the PRSTATUS structure is a signal number. In the gcore file this is actually 0 but the SIGINFO structure following PRSTATUS has an si_signo value of 19. 
> 
> Looking at it with eu-readelf shows: 
> 
>   CORE                 336  PRSTATUS 
>     info.si_signo: 0, info.si_code: 0, info.si_errno: 0, cursig: 0 
>     sigpend: <> 
>     sighold: <> 
> ... lots of registers... 
>   CORE                 128  SIGINFO 
>     si_signo: 19, si_errno: 0, si_code: 0 
>     sender PID: 0, sender UID: 0 
> 
> I think gcore is being clever. It's including the "real" signal number the running thread had received at the time the core was taken (info.si_signo is 0) but also the signal it had used to interrupt the thread and gather its state. The value in PRSTATUS info.si_signo is the signal number that ends up in m_signo in ThreadElfCore and ultimately is looked for in the set of signals lldb should stop on in UnixSignals::GetShouldStop. 0 is not found in that set since there isn't a signal 0. I think gcore is doing all this so that it preserves the real signal state the process had before gcore attached to it, I guess in case you are trying to debug something to do with signals and need to see that state. (That's a bit of a guess mind you.) 
> 
> I can think of three solutions: 
> 
> - Read the signal information from the SIGINFO block for a thread if it's present. Core files generated by abort or a crash only seem to have a SIGINFO for one thread, which looks like it's the one that received/triggered the signal in the first place. This means adding something to parse that block out of the elf core as well as PRSTATUS and override the state from PRSTATUS if we see it. SIGINFO always seems to come after PRSTATUS, and probably has to, as PRSTATUS contains the pid and identifies that there is a new thread in the core, so if SIGINFO is found that signal number will just replace the first one.

You want to figure out which one is the accurate signal and use that. It doesn't matter how you do this, but this will be up to the ProcessElfCore or ThreadElfCore classes.
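
As a rough sketch of that selection, assuming the SIGINFO value should win whenever the note is present and non-zero. The ThreadNotes struct and EffectiveSignal function are made up for illustration and are not existing LLDB code:

```cpp
// Hypothetical, simplified view of the per-thread notes found in an ELF core.
// Illustration only; this is not how ThreadElfCore stores its data.
struct ThreadNotes {
    int prstatus_signo;  // info.si_signo from the PRSTATUS note
    bool has_siginfo;    // true if a SIGINFO note followed PRSTATUS
    int siginfo_signo;   // si_signo from the SIGINFO note (valid if has_siginfo)
};

// Prefer the SIGINFO signal when the note exists and carries a real signal,
// otherwise fall back to whatever PRSTATUS recorded (possibly 0).
inline int EffectiveSignal(const ThreadNotes &notes) {
    if (notes.has_siginfo && notes.siginfo_signo != 0)
        return notes.siginfo_signo;
    return notes.prstatus_signo;
}
```

With the gcore note values quoted above (PRSTATUS 0, SIGINFO 19) this picks 19, and a thread with no SIGINFO note keeps its PRSTATUS value.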
> 
> - Never allow a thread's signal number to be 0 when it comes from an elf core dump. (This is probably as much of a band aid as the first solution.)

Threads should be able to have no signal. If you have 10 threads and thread 6 crashes with SIGABRT, but all other threads were just running, I would expect all threads except for thread 6 to have 0 signal values, or no stop reason. If you end up with 10 threads and all have no signal information, I would say that you can just give the first thread a SIGSTOP to be safe.
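
A minimal sketch of that fallback, assuming the per-thread signal numbers have already been collected into a vector (0 meaning no signal). EnsureStopReason is a hypothetical helper, not an existing LLDB function:

```cpp
#include <csignal>
#include <vector>

// thread_signals[i] holds the signal number for thread i, 0 meaning "none".
// If no thread has any signal, give the first thread SIGSTOP so that at
// least one thread reports a stop reason. Otherwise leave everything alone.
inline void EnsureStopReason(std::vector<int> &thread_signals) {
    for (int signo : thread_signals) {
        if (signo != 0)
            return;  // some thread already has a real signal
    }
    if (!thread_signals.empty())
        thread_signals.front() = SIGSTOP;
}
```

In the 10-thread SIGABRT case above nothing changes; only a core where every thread reports 0 gets the synthetic SIGSTOP on its first thread.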

> 
> - Stick with the first solution of saying that we can never resume a core file. The only thing in this solution's favour is that it means the "real" thread state that gcore tried to preserve is known to lldb. Once the core file is loaded typing continue does result in an error message telling you that you can't resume from a core file. 

The suggested fix can be done in a cleaner way: have ProcessElfCore and ProcessMachCore override "Error Process::WillResume()" to just return an error:

Error ProcessElfCore::WillResume()
{
    return Error("can't resume a process in a core file");
}

So I think the correct fix is all three of the above.

Greg

> 
> I'll have a go at prototyping the solution to read the SIGINFO structure but I'd appreciate any thoughts on which is the "correct" fix. 
> 
> Thanks, 
> 
> 
> Howard Hellyer 
> IBM Runtime Technologies, IBM Systems         
> 
> 
> 
> 
> 
> From:        Jim Ingham <jingham at apple.com> 
> To:        Howard Hellyer/UK/IBM at IBMGB 
> Cc:        lldb-dev at lists.llvm.org 
> Date:        10/11/2016 18:48 
> Subject:        Re: [lldb-dev] LLDB hang loading Linux core files from live processes (Bug 26322) 
> Sent by:        jingham at apple.com 
> 
> 
> 
> I think that approach is kind of a bandaid.  
> 
> Core files can't resume, so it would be better to figure out why telling a core file which can't resume to resume caused us to go into a tailspin.  That should just fall out of WillResume returning false or some other better general signal.  Special-casing core files seems a bit of a hack.
> 
> That being said, if nobody has time to make a better solution, a bandaid is better than bleeding...
> 
> Jim
> 
> 
> 