[lldb-dev] LLDB hang loading Linux core files from live processes (Bug 26322)

Mon Nov 14 06:46:24 PST 2016

> You want to figure out which one is the accurate signal and use that. 
> Doesn't matter how you do this, but this will be up to the 
> ProcessELFCore or ThreadELFCore classes.

I'm going to do a little more research (books and google) to see if I can 
get an answer on this one. I'm actually having trouble finding core files 
(at least in my own collection) where threads have different signals in 
info.si_signo in PRSTATUS. For the ones I've checked that crashed or 
received a signal all the threads have the same value in info.si_signo. 
Typically just one thread (the one that triggered or received the signal) 
has a SIGINFO note. (My collection of cores is a bit random so that's not 
a comprehensive survey by any means.)

I'm getting the impression that the value in PRSTATUS may be for the whole 
process, with any thread that actually received a signal having a SIGINFO 
note containing that information, but I'm not totally sure either way yet. 
I haven't found anything that documents that behaviour. (If anyone 
knows of a good reference please let me know!) It would explain why all 
the threads in a core created by gcore have a SIGINFO note, as each one 
will be stopped in turn. It would also mean that for the non-gcore 
cores I've got (from crashes and kills) only one thread would have a 
non-zero signal, which sounds correct. Currently, running "thread list" on 
those core files shows all threads as having stopped on the same 
signal, with only one thread in a position where that signal makes sense. 
Switching away from info.si_signo is a slightly bigger change though!
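
A minimal sketch of that idea, assuming the hypothesis above holds (PRSTATUS 
is process-wide, SIGINFO is per-thread). ThreadNotes and SignalForThread are 
hypothetical names for illustration, not the real ThreadElfCore code:

```cpp
// Hypothetical, simplified view of the notes parsed for one thread in an ELF
// core file. These are illustrative names, not LLDB's real data structures.
struct ThreadNotes {
  int prstatus_signo = 0;   // info.si_signo from the PRSTATUS note
  bool has_siginfo = false; // true if a SIGINFO note followed PRSTATUS
  int siginfo_signo = 0;    // si_signo from that SIGINFO note, if present
};

// Assumption: PRSTATUS carries a process-wide value, so only a per-thread
// SIGINFO note says that this particular thread received a signal.
// Threads without a SIGINFO note are reported as having no signal (0).
inline int SignalForThread(const ThreadNotes &notes) {
  return notes.has_siginfo ? notes.siginfo_signo : 0;
}
```

With this rule a gcore thread (PRSTATUS 0, SIGINFO 19) reports 19, while a 
thread in a crash core that only has a PRSTATUS value reports 0.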

> > - Never allow a thread's signal number to be 0 when it comes from 
> > an ELF core dump. (This is probably as much of a band aid as the 
> > first solution.)
> 
> Threads should be able to have no signal. If you have 10 threads and
> thread 6 crashes with SIGABRT, but all other threads were just 
> running, I would expect all threads except for thread 6 to have 0 
> signal values, or no stop reason. If you end up with 10 threads and 
> all have no signal information, I would say that you can just give 
> the first thread a SIGSTOP to be safe.

I checked this with one of the gcore files by just setting the first 
thread's signal and leaving the others to pick up 0 as they used to. That 
works.

Putting in a check that makes sure at least one thread has some kind of 
signal seems reasonable. I'll add that as a fallback sanity check.
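
Roughly, the fallback could look like the sketch below. CoreThread and 
EnsureSomeThreadHasSignal are made-up names for illustration only, and the 
real check would live wherever the core's thread list is built:

```cpp
#include <csignal>
#include <vector>

// Hypothetical stand-in for a thread loaded from a core file.
struct CoreThread {
  int signo = 0; // 0 means "no signal / no stop reason"
};

// If no thread carries a signal, give the first thread SIGSTOP so that at
// least one thread has a stop reason. Returns true if the fallback was used.
inline bool EnsureSomeThreadHasSignal(std::vector<CoreThread> &threads) {
  for (const CoreThread &thread : threads) {
    if (thread.signo != 0)
      return false; // some thread already has a signal, nothing to do
  }
  if (threads.empty())
    return false; // no threads at all, nothing to patch up
  threads.front().signo = SIGSTOP;
  return true;
}
```

Threads that really did receive a signal are left alone; the SIGSTOP is only 
applied when every thread would otherwise report 0.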

> The suggested fix can be done in a cleaner way: have ProcessELFCore and 
> ProcessMachCore override "Error Process::WillResume()" to just return an 
> error:
> 
> Error ProcessELFCore::WillResume() 
> {
>     return Error("can't resume a process in a core file");
> }

I think that's called too late. It's not called until the decision has 
been made to resume the process. Also the base implementation already 
returns an error and I don't think either ProcessElfCore or 
ProcessMachCore override it.
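
To illustrate the second point with a toy model (ToyError, ToyProcess and 
ToyCoreProcess are invented for this example and are not LLDB's classes, and 
the error text is a placeholder): a subclass that doesn't override the hook 
already gets the failing base behaviour, so adding an override that also 
fails wouldn't change what the caller sees.

```cpp
#include <string>

// Toy model only: these types mimic the shape of the discussion and are not
// LLDB's real Process or Error classes.
struct ToyError {
  std::string message; // empty means success
  bool Fail() const { return !message.empty(); }
};

class ToyProcess {
public:
  virtual ~ToyProcess() = default;
  // Assumed base behaviour: resuming fails unless a subclass overrides this.
  virtual ToyError WillResume() { return ToyError{"resume not supported"}; }
};

// A core-file process with no override inherits the failing implementation.
class ToyCoreProcess : public ToyProcess {};
```

Calling WillResume() through a ToyProcess reference to a ToyCoreProcess 
returns an error either way, which is why the check has to happen earlier 
than this hook to avoid the hang.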

> So I think the correct fix is all three of the above.

I think it's close, and discussing the problem is actually helping a lot. 
Thanks for the help. I'll grab the bug and put up a patch - hopefully 
tomorrow.

Thanks,

Howard Hellyer 
IBM Runtime Technologies, IBM Systems 

Greg Clayton <gclayton at apple.com> wrote on 11/11/2016 18:07:03:

> From: Greg Clayton <gclayton at apple.com>
> To: Howard Hellyer/UK/IBM at IBMGB
> Cc: Jim Ingham <jingham at apple.com>, lldb-dev at lists.llvm.org
> Date: 11/11/2016 18:07
> Subject: Re: [lldb-dev] LLDB hang loading Linux core files from live
> processes (Bug 26322)
> 
> I think both are valid fixes. Threads in core files can have a non-
> zero signal. See comments below.
> 
> > On Nov 11, 2016, at 5:36 AM, Howard Hellyer via lldb-dev 
> > <lldb-dev at lists.llvm.org> wrote:
> > 
> > Hi Jim 
> > 
> > I was afraid someone would say that but I've done some digging and 
> > found a difference between the core files generated by gcore and 
> > those generated by a crash or abort. 
> > 
> > Most of the core files have one SIGINFO structure in the core; I 
> > think it belongs to the preceding thread (the one that caught the 
> > signal). 
> > In the core files generated by gcore all of the threads have a 
> > SIGINFO structure following their PRSTATUS structure. In the 
> > non-gcore files the value of info.si_signo in the PRSTATUS structure 
> > is a signal number. In the gcore file this is actually 0 but the 
> > SIGINFO structure following PRSTATUS has an si_signo value of 19. 
> > 
> > Looking at it with eu-readelf shows: 
> > 
> >   CORE                 336  PRSTATUS 
> >     info.si_signo: 0, info.si_code: 0, info.si_errno: 0, cursig: 0 
> >     sigpend: <> 
> >     sighold: <> 
> > ... lots of registers... 
> >   CORE                 128  SIGINFO 
> >     si_signo: 19, si_errno: 0, si_code: 0 
> >     sender PID: 0, sender UID: 0 
> > 
> > I think gcore is being clever. It's including the "real" signal 
> > number the running thread had received at the time the core was 
> > taken (info.si_signo is 0) but also the signal it had used to 
> > interrupt the thread and gather its state. The value in PRSTATUS 
> > info.si_signo is the signal number that ends up in m_signo in 
> > ThreadElfCore and ultimately is looked for in the set of signals 
> > lldb should stop on in UnixSignals::GetShouldStop. 0 is not found in 
> > that set since there isn't a signal 0. I think gcore is doing all 
> > this so that it preserves the real signal state the process had 
> > before gcore attached to it, I guess in case you are trying to debug 
> > something to do with signals and need to see that state. (That's a 
> > bit of a guess mind you.) 
> > 
> > I can think of three solutions: 
> > 
> > - Read the signal information from the SIGINFO block for a thread 
> > if it's present. Core files generated by abort or a crash only seem 
> > to have a SIGINFO for one thread, which looks like it's the one that 
> > received/triggered the signal in the first place. This means adding 
> > something to parse that block out of the ELF core as well as 
> > PRSTATUS and overriding the state from PRSTATUS if we see it. SIGINFO 
> > always seems to come after PRSTATUS, and probably has to, as PRSTATUS 
> > contains the pid and identifies that there is a new thread in the 
> > core, so if SIGINFO is found that signal number will just replace 
> > the first one.
> 
> You want to figure out which one is the accurate signal and use that. 
> Doesn't matter how you do this, but this will be up to the 
> ProcessELFCore or ThreadELFCore classes.
> > 
> > - Never allow a thread's signal number to be 0 when it comes from 
> > an ELF core dump. (This is probably as much of a band aid as the 
> > first solution.)
> 
> Threads should be able to have no signal. If you have 10 threads and
> thread 6 crashes with SIGABRT, but all other threads were just 
> running, I would expect all threads except for thread 6 to have 0 
> signal values, or no stop reason. If you end up with 10 threads and 
> all have no signal information, I would say that you can just give 
> the first thread a SIGSTOP to be safe.
> 
> > 
> > - Stick with the first solution of saying that we can never resume 
> > a core file. The only thing in this solution's favour is that it 
> > means the "real" thread state that gcore tried to preserve is known 
> > to lldb. Once the core file is loaded typing continue does result in 
> > an error message telling you that you can't resume from a core file. 
> 
> The suggested fix can be done in a cleaner way: have ProcessELFCore and 
> ProcessMachCore override "Error Process::WillResume()" to just return an 
> error:
> 
> Error ProcessELFCore::WillResume() 
> {
>     return Error("can't resume a process in a core file");
> }
> 
> So I think the correct fix is all three of the above.
> 
> Greg
> 
> > 
> > I'll have a go at prototyping the solution to read the SIGINFO 
> > structure but I'd appreciate any thoughts on which is the "correct" fix. 

> > 
> > Thanks, 
> > 
> > 
> > Howard Hellyer 
> > IBM Runtime Technologies, IBM Systems         