<font size=2 face="sans-serif">> You want to figure out which one is the

accurate signal and use that. </font>

<br><font size=2 face="sans-serif">> Doesn't matter how you do this,

but this will be up to the </font>

<br><font size=2 face="sans-serif">> ProcessELFCore or ThreadELFCore

classes.</font>

<br>

<br><font size=2 face="sans-serif">I'm going to do a little more research

(books and Google) to see if I can get an answer on this one. I'm actually

having trouble finding core files (at least in my own collection) where

threads have different signals in info.si_signo in PRSTATUS. For the ones

I've checked that crashed or received a signal all the threads have the

same value in info.si_signo. Typically just one thread (the one that

triggered or received the signal) has a SIGINFO note.

(My collection of cores is a bit random so

that's not a comprehensive survey by any means.)</font>

<br>

<br><font size=2 face="sans-serif">I'm getting the impression that the

value in PRSTATUS may be for the whole process, and that any thread that actually

received a signal has a SIGINFO note containing that information, but

I'm not totally sure either way. I haven't found anything that documents

that behaviour yet. (If anyone knows of a good reference please let me

know!) It would explain why all the threads in a core created by gcore

have a SIGINFO note as each one will be stopped in turn. It would also

mean that for the non-gcore created cores I've got (from crashes and kills)

only one thread would have a non-zero signal which sounds correct. Currently

for those core files running "thread list" shows all threads

as having stopped on the same signal with only one thread in a position

where that signal makes sense. Switching to not use info.si_signo is a

slightly bigger change though!</font>
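
<br>

<br><font size=2 face="sans-serif">As a rough sketch of that reading of the notes (illustration only: ThreadNotes and SignalForThread are made-up names, not existing lldb code, and the rule itself is still an unconfirmed assumption), a thread's signal would come from its SIGINFO note if it has one and would otherwise be 0, ignoring info.si_signo in PRSTATUS:</font>

```cpp
// Hypothetical, simplified view of the notes found for one thread in an
// ELF core file. These names are illustrative, not lldb's.
struct ThreadNotes {
    int prstatus_signo;  // info.si_signo from the PRSTATUS note
    bool has_siginfo;    // true if a SIGINFO note followed PRSTATUS
    int siginfo_signo;   // si_signo from SIGINFO (valid only if has_siginfo)
};

// Assumed rule: only a SIGINFO note says that this particular thread
// received a signal. PRSTATUS is treated as process-wide and ignored here.
int SignalForThread(const ThreadNotes &notes) {
    return notes.has_siginfo ? notes.siginfo_signo : 0;
}
```

<br><font size=2 face="sans-serif">Under that assumption a gcore thread (PRSTATUS 0, SIGINFO 19) would report signal 19, while a thread in a crash core that has no SIGINFO note would report 0 even if PRSTATUS carries the process's signal number.</font>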

<br>

<br><font size=2 face="sans-serif">> > - Never allow a thread's signal

number to be 0 when it comes from </font>

<br><font size=2 face="sans-serif">> an elf core dump. (This is probably

as much of a band aid as the </font>

<br><font size=2 face="sans-serif">> first solution.)</font>

<br><font size=2 face="sans-serif">> </font>

<br><font size=2 face="sans-serif">> Threads should be able to have

no signal. If you have 10 threads and</font>

<br><font size=2 face="sans-serif">> thread 6 crashes with SIGABRT,

but all other threads were just </font>

<br><font size=2 face="sans-serif">> running, I would expect all threads

except for thread 6 to have 0 </font>

<br><font size=2 face="sans-serif">> signal values, or no stop reason.

If you end up with 10 threads and </font>

<br><font size=2 face="sans-serif">> all have no signal information,

I would say that you can just give </font>

<br><font size=2 face="sans-serif">> the first thread a SIGSTOP to be

safe.</font>

<br>

<br><font size=2 face="sans-serif">I checked this with one of the gcore

files by just setting the first thread's signal and leaving the others to

pick up 0 as they used to. That works.</font>

<br>

<br><font size=2 face="sans-serif">Putting in a check that makes sure that

at least one thread has some kind of signal seems reasonable. I'll

add that as a fallback sanity check.</font>
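
<br>

<br><font size=2 face="sans-serif">Roughly, the fallback check could look like the sketch below. CoreThread and EnsureSomeThreadHasSignal are placeholder names for illustration, not the real ThreadElfCore interface; SIGSTOP for the first thread is the choice suggested above.</font>

```cpp
#include <signal.h>
#include <vector>

// Hypothetical stand-in for a thread loaded from a core file.
struct CoreThread {
    int signo = 0;  // 0 means "no signal / no stop reason"
};

// If no thread in the core carries a signal, give the first thread SIGSTOP
// so that at least one thread has a stop reason. Threads that already have
// a signal are left untouched.
void EnsureSomeThreadHasSignal(std::vector<CoreThread> &threads) {
    if (threads.empty())
        return;
    for (const CoreThread &thread : threads) {
        if (thread.signo != 0)
            return;  // some thread already has a signal; nothing to do
    }
    threads.front().signo = SIGSTOP;
}
```

<br><font size=2 face="sans-serif">This only changes cores where every thread reports 0, so a crash core where one thread has SIGABRT keeps its signals exactly as read.</font>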

<br>

<br><font size=2 face="sans-serif">> The suggested fix can be done in a

cleaner way: Have ProcessELFCore and </font>

<br><font size=2 face="sans-serif">> ProcessMachCore override "Error

Process::WillResume()" to just return an error:</font>

<br><font size=2 face="sans-serif">> </font>

<br><font size=2 face="sans-serif">> Error ProcessELFCore::WillResume()

</font>

<br><font size=2 face="sans-serif">> {</font>

<br><font size=2 face="sans-serif">>     return Error("can't

resume a process in a core file");</font>

<br><font size=2 face="sans-serif">> }</font>

<br>

<br><font size=2 face="sans-serif">I think that's called too late. It's

not called until the decision has been made to resume the process. Also

the base implementation already returns an error and I don't think either

ProcessElfCore or ProcessMachCore override it.</font>

<br>

<br><font size=2 face="sans-serif">> So I think the correct fix is all

three of the above.</font>

<br>

<br><font size=2 face="sans-serif">I think it's close, and discussing the

problem is actually helping a lot, so thanks for that. I'll grab the bug

and put up a patch - hopefully tomorrow.</font>

<br>

<br><font size=2 face="sans-serif">Thanks,</font>

<br>

<br><font size=2 face="sans-serif">Howard Hellyer </font>

<br><font size=2 face="sans-serif">IBM Runtime Technologies, IBM Systems

        </font>

<br><font size=1 face="Arial"><br>

</font>

<br>

<br><tt><font size=2>Greg Clayton <gclayton@apple.com> wrote on 11/11/2016

18:07:03:<br>

<br>

> From: Greg Clayton <gclayton@apple.com></font></tt>

<br><tt><font size=2>> To: Howard Hellyer/UK/IBM@IBMGB</font></tt>

<br><tt><font size=2>> Cc: Jim Ingham <jingham@apple.com>, lldb-dev@lists.llvm.org</font></tt>

<br><tt><font size=2>> Date: 11/11/2016 18:07</font></tt>

<br><tt><font size=2>> Subject: Re: [lldb-dev] LLDB hang loading Linux

core files from live<br>

> processes (Bug 26322)</font></tt>

<br><tt><font size=2>> <br>

> I think both are valid fixes. Threads in core files can have a non-<br>

> zero signal. See comments below.<br>

> <br>

> > On Nov 11, 2016, at 5:36 AM, Howard Hellyer via lldb-dev <lldb-<br>

> dev@lists.llvm.org> wrote:<br>

> > <br>

> > Hi Jim <br>

> > <br>

> > I was afraid someone would say that but I've done some digging

and<br>

> found a difference in the core files I get generated by gcore to <br>

> those generated by a crash or abort. <br>

> > <br>

> > Most of the core files have one SIGINFO structure in the core,

I <br>

> think it belongs to the preceding thread (the one that caught the

signal). <br>

> > In the core files generated by gcore all of the threads have

a <br>

> SIGINFO structure following their PRSTATUS structure. In the non-<br>

> gcore files the value of info.si_signo in the PRSTATUS structure is

<br>

> a signal number. In the gcore file this is actually 0 but the <br>

> SIGINFO structure following PRSTATUS has an si_signo value of 19.

<br>

> > <br>

> > Looking at it with eu-readelf shows: <br>

> > <br>

> >   CORE              

  336  PRSTATUS <br>

> >     info.si_signo: 0, info.si_code: 0, info.si_errno:

0, cursig: 0 <br>

> >     sigpend: <> <br>

> >     sighold: <> <br>

> > ... lots of registers... <br>

> >   CORE              

  128  SIGINFO <br>

> >     si_signo: 19, si_errno: 0, si_code: 0 <br>

> >     sender PID: 0, sender UID: 0 <br>

> > <br>

> > I think gcore is being clever. It's including the "real"

signal <br>

> number the running thread had received at the time the core was <br>

> taken (info.si_signo is 0) but also the signal it had used to <br>

> interrupt the thread and gather its state. The value in PRSTATUS

<br>

> info.si_signo is the signal number that ends up in m_signo in <br>

> ThreadElfCore and ultimately is looked for in the set of signals <br>

> lldb should stop on in UnixSignals::GetShouldStop. 0 is not found

in<br>

> that set since there isn't a signal 0. I think gcore is doing all

<br>

> this so that it preserves the real signal state the process had <br>

> before gcore attached to it, I guess in case you are trying to debug<br>

> something to do with signals and need to see that state. (That's a

<br>

> bit of a guess mind you.) <br>

> > <br>

> > I can think of three solutions: <br>

> > <br>

> > - Read the signal information from the SIGINFO block for a thread

<br>

> if it's present. Core files generated by abort or a crash only seem

<br>

> to have a SIGINFO for one thread which looks like it's the one that

<br>

> received/triggered the signal in the first place. This means adding

<br>

> something to parse that block out of the elf core as well as <br>

> PRSTATUS and override the state from PRSTATUS if we see it. SIGINFO

<br>

> always seems to come after PRSTATUS and probably has to as PRSTATUS

<br>

> contains the pid and identifies that there is a new thread in the

<br>

> core so if SIGINFO is found that signal number will just replace the first

one.<br>

> <br>

> You want to figure out which one is the accurate signal and use that.

<br>

> Doesn't matter how you do this, but this will be up to the <br>

> ProcessELFCore or ThreadELFCore classes.<br>

> > <br>

> > - Never allow a thread's signal number to be 0 when it comes from

<br>

> an elf core dump. (This is probably as much of a band aid as the <br>

> first solution.)<br>

> <br>

> Threads should be able to have no signal. If you have 10 threads and<br>

> thread 6 crashes with SIGABRT, but all other threads were just <br>

> running, I would expect all threads except for thread 6 to have 0

<br>

> signal values, or no stop reason. If you end up with 10 threads and

<br>

> all have no signal information, I would say that you can just give

<br>

> the first thread a SIGSTOP to be safe.<br>

> <br>

> > <br>

> > - Stick with the first solution of saying that we can never resume<br>

> a core file. The only thing in this solution's favour is that it <br>

> means the "real" thread state that gcore tried to preserve

is known <br>

> to lldb. Once the core file is loaded typing continue does result

in<br>

> an error message telling you that you can't resume from a core file.

<br>

> <br>

> The suggested fix can be done in a cleaner way: Have ProcessELFCore and

<br>

> ProcessMachCore override "Error Process::WillResume()" to just

return an error:<br>

> <br>

> Error ProcessELFCore::WillResume() <br>

> {<br>

>     return Error("can't resume a process in a core

file");<br>

> }<br>

> <br>

> So I think the correct fix is all three of the above.<br>

> <br>

> Greg<br>

> <br>

> > <br>

> > I'll have a go at prototyping the solution to read the SIGINFO

<br>

> structure but I'd appreciate any thoughts on which is the "correct"

fix. <br>

> > <br>

> > Thanks, <br>

> > <br>

> > <br>

> > Howard Hellyer <br>

> > IBM Runtime Technologies, IBM Systems        

</font></tt><font size=2 face="sans-serif"><br>


</font>