[lldb-dev] LLDB hang loading Linux core files from live processes (Bug 26322)
Howard Hellyer via lldb-dev
lldb-dev at lists.llvm.org
Mon Nov 14 06:46:24 PST 2016
> You want to figure out which one is the accurate signal and use that.
> Doesn't matter how you do this, but this will be up to the
> ProcessELFCore or ThreadELFCore classes.
I'm going to do a little more research (books and google) to see if I can
get an answer on this one. I'm actually having trouble finding core files
(at least in my own collection) where threads have different signals in
info.si_signo in PRSTATUS. For the ones I've checked that crashed or
received a signal all the threads have the same value in info.si_signo.
Typically just one thread (the one that triggered or received the
signal) has a SIGINFO note. (My collection of cores is a bit random so
that's not a comprehensive survey by any means.)
I'm getting the impression that the value in PRSTATUS may be for the whole
process with any thread that actually received a signal having a SIGINFO
note containing that information but I'm not totally sure either way yet.
I haven't found anything that documents that behaviour yet. (If anyone
knows of a good reference please let me know!) It would explain why all
the threads in a core created by gcore have a SIGINFO note as each one
will be stopped in turn. It would also mean that for the non-gcore created
cores I've got (from crashes and kills) only one thread would have a
non-zero signal which sounds correct. Currently for those core files
running "thread list" shows all threads as having stopped on the same
signal with only one thread in a position where that signal makes sense.
Switching to not use info.si_signo is a slightly bigger change though!
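To make the two options concrete, here is a rough sketch of the two policies I'm weighing up. This is not the lldb code; ThreadNotes, SignalPreferSiginfo and SignalFromSiginfoOnly are names I've made up for illustration, and it assumes the notes for each thread have already been parsed out of the core.

```cpp
// Hypothetical, simplified view of the notes found for one thread.
// None of these names exist in lldb; they are for illustration only.
struct ThreadNotes {
  int prstatus_signo;  // info.si_signo from the PRSTATUS note
  bool has_siginfo;    // true if a SIGINFO note followed PRSTATUS
  int siginfo_signo;   // si_signo from SIGINFO (ignored if !has_siginfo)
};

// Policy A: SIGINFO overrides PRSTATUS when it is present.
int SignalPreferSiginfo(const ThreadNotes &notes) {
  return notes.has_siginfo ? notes.siginfo_signo : notes.prstatus_signo;
}

// Policy B: treat PRSTATUS as process-wide and ignore it per thread;
// a thread only has a signal if it has its own SIGINFO note.
int SignalFromSiginfoOnly(const ThreadNotes &notes) {
  return notes.has_siginfo ? notes.siginfo_signo : 0;
}
```

With a crash core where every PRSTATUS says 6 but only one thread has a SIGINFO, policy A reports 6 for every thread while policy B reports 6 for one thread and 0 for the rest. For a gcore thread (PRSTATUS 0, SIGINFO 19) both give 19.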
> > - Never allow a thread's signal number to be 0 when it comes from
> > an elf core dump. (This is probably as much of a band aid as the
> > first solution.)
>
> Threads should be able to have no signal. If you have 10 threads and
> thread 6 crashes with SIGABRT, but all other threads were just
> running, I would expect all threads except for thread 6 to have 0
> signal values, or no stop reason. If you end up with 10 threads and
> all have no signal information, I would say that you can just give
> the first thread a SIGSTOP to be safe.
I checked this with one of the gcore files by just setting the first
thread's signal and leaving the others to pick up 0 as they used to. That
works.
Putting in a check that makes sure that at least one thread has some
kind of signal seems reasonable. I'll add that as a fallback sanity check.
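Roughly, the fallback check would look something like the sketch below. Again this is not lldb code: CoreThread and EnsureSomeThreadHasSignal are made-up names, and I've hard-coded 19 for SIGSTOP because that is the value seen in the gcore SIGINFO notes; the number is different on some other Linux architectures, so a real patch would need to get it from the target's signal table.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a thread loaded from a core file.
struct CoreThread {
  int signo;  // 0 means "no signal"
};

// Assumption: SIGSTOP as numbered on x86/ARM Linux, matching the
// si_signo of 19 seen in the gcore SIGINFO notes.
const int kAssumedSigStop = 19;

// If no thread in the core has a signal, give the first one SIGSTOP
// so that at least one thread has a stop reason.
void EnsureSomeThreadHasSignal(std::vector<CoreThread> &threads) {
  if (threads.empty())
    return;
  for (std::size_t i = 0; i < threads.size(); ++i) {
    if (threads[i].signo != 0)
      return;  // Some thread already has a signal; leave everything alone.
  }
  threads[0].signo = kAssumedSigStop;
}
```

Cores from crashes and kills are left untouched since one thread already has a non-zero signal; only the "all zero" case is patched up.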
> The suggested fix can be done in a cleaner way: Have ProcessELFCore and
> ProcessMachCore override "Error Process::WillResume()" to just return an
> error:
>
> Error ProcessELFCore::WillResume()
> {
>     return Error("can't resume a process in a core file");
> }
I think that's called too late. It's not called until the decision has
been made to resume the process. Also the base implementation already
returns an error and I don't think either ProcessElfCore or
ProcessMachCore override it.
> So I think the correct fix is all three of the above.
I think it's close and discussing the problem is actually helping a lot,
thanks for the help. I'll grab the bug and put up a patch - hopefully
tomorrow.
Thanks,
Howard Hellyer
IBM Runtime Technologies, IBM Systems
Greg Clayton <gclayton at apple.com> wrote on 11/11/2016 18:07:03:
> From: Greg Clayton <gclayton at apple.com>
> To: Howard Hellyer/UK/IBM at IBMGB
> Cc: Jim Ingham <jingham at apple.com>, lldb-dev at lists.llvm.org
> Date: 11/11/2016 18:07
> Subject: Re: [lldb-dev] LLDB hang loading Linux core files from live
> processes (Bug 26322)
>
> I think both are valid fixes. Threads in core files can have a non-
> zero signal. See comments below.
>
> > On Nov 11, 2016, at 5:36 AM, Howard Hellyer via lldb-dev <lldb-dev at lists.llvm.org> wrote:
> >
> > Hi Jim
> >
> > I was afraid someone would say that but I've done some digging and
> found a difference in the core files I get generated by gcore to
> those generated by a crash or abort.
> >
> > Most of the core files have one SIGINFO structure in the core, I
> > think it belongs to the preceding thread (the one that caught the
> > signal).
> > In the core files generated by gcore all of the threads have a
> SIGINFO structure following their PRSTATUS structure. In the non-
> gcore files the value of info.si_signo in the PRSTATUS structure is
> a signal number. In the gcore file this is actually 0 but the
> SIGINFO structure following PRSTATUS has an si_signo value of 19.
> >
> > Looking at it with eu-readelf shows:
> >
> > CORE 336 PRSTATUS
> > info.si_signo: 0, info.si_code: 0, info.si_errno: 0, cursig: 0
> > sigpend: <>
> > sighold: <>
> > ... lots of registers...
> > CORE 128 SIGINFO
> > si_signo: 19, si_errno: 0, si_code: 0
> > sender PID: 0, sender UID: 0
> >
> > I think gcore is being clever. It's including the "real" signal
> number the running thread had received at the time the core was
> taken (info.si_signo is 0) but also the signal it had used to
> interrupt the thread and gather its state. The value in PRSTATUS
> info.si_signo is the signal number that ends up in m_signo in
> ThreadElfCore and ultimately is looked for in the set of signals
> lldb should stop on in UnixSignals::GetShouldStop. 0 is not found in
> that set since there isn't a signal 0. I think gcore is doing all
> this so that it preserves the real signal state the process had
> before gcore attached to it, I guess in case you are trying to debug
> something to do with signals and need to see that state. (That's a
> bit of a guess mind you.)
> >
> > I can think of three solutions:
> >
> > - Read the signal information from the SIGINFO block for a thread
> > if it's present. Core files generated by abort or a crash only seem
> > to have a SIGINFO for one thread which looks like it's the one that
> > received/triggered the signal in the first place. This means adding
> > something to parse that block out of the elf core as well as
> > PRSTATUS and override the state from PRSTATUS if we see it. SIGINFO
> > always seems to come after PRSTATUS and probably has to as PRSTATUS
> > contains the pid and identifies that there is a new thread in the
> > core so if SIGINFO is found that signal number will just replace
> > the first one.
>
> You want to figure out which one is the accurate signal and use that.
> Doesn't matter how you do this, but this will be up to the
> ProcessELFCore or ThreadELFCore classes.
> >
> > - Never allow a thread's signal number to be 0 when it comes from
> > an elf core dump. (This is probably as much of a band aid as the
> > first solution.)
>
> Threads should be able to have no signal. If you have 10 threads and
> thread 6 crashes with SIGABRT, but all other threads were just
> running, I would expect all threads except for thread 6 to have 0
> signal values, or no stop reason. If you end up with 10 threads and
> all have no signal information, I would say that you can just give
> the first thread a SIGSTOP to be safe.
>
> >
> > - Stick with the first solution of saying that we can never resume
> > a core file. The only thing in this solution's favour is that it
> > means the "real" thread state that gcore tried to preserve is known
> > to lldb. Once the core file is loaded typing continue does result in
> > an error message telling you that you can't resume from a core file.
>
> The suggested fix can be done in a cleaner way: Have ProcessELFCore and
> ProcessMachCore override "Error Process::WillResume()" to just return an
> error:
>
> Error ProcessELFCore::WillResume()
> {
>     return Error("can't resume a process in a core file");
> }
>
> So I think the correct fix is all three of the above.
>
> Greg
>
> >
> > I'll have a go at prototyping the solution to read the SIGINFO
> structure but I'd appreciate any thoughts on which is the "correct" fix.
> >
> > Thanks,
> >
> >
> > Howard Hellyer
> > IBM Runtime Technologies, IBM Systems
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU