<font size=2 face="sans-serif">> You want to figure out which one is the
accurate signal and use that. </font>
<br><font size=2 face="sans-serif">> Doesn't matter how you do this,
but this will be up to the </font>
<br><font size=2 face="sans-serif">> ProcessELFCore or ThreadELFCore
classes.</font>
<br>
<br><font size=2 face="sans-serif">I'm going to do a little more research
(books and google) to see if I can get an answer on this one. I'm actually
having trouble finding core files (at least in my own collection) where
threads have different signals in info.si_signo in PRSTATUS. For the ones
I've checked that crashed or received a signal all the threads have the
same value in info.si_signo. Typically just one thread (the one that
triggered or received the signal) has a SIGINFO note.
(My collection of cores is a bit random so
that's not a comprehensive survey by any means.)</font>
<br>
<br><font size=2 face="sans-serif">I'm getting the impression that the
value in PRSTATUS may be for the whole process with any thread that actually
received a signal having a SIGINFO note containing that information but
I'm not totally sure either way yet. I haven't found anything that documents
that behaviour yet. (If anyone knows of a good reference please let me
know!) It would explain why all the threads in a core created by gcore
have a SIGINFO note as each one will be stopped in turn. It would also
mean that for the non-gcore created cores I've got (from crashes and kills)
only one thread would have a non-zero signal which sounds correct. Currently
for those core files running "thread list" shows all threads
as having stopped on the same signal with only one thread in a position
where that signal makes sense. Switching to not use info.si_signo is a
slightly bigger change though!</font>
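<br>
<br><font size=2 face="sans-serif">To make the idea concrete, here is a
minimal sketch of taking a thread's signal only from a SIGINFO note, assuming
my reading of the core layout is right. The Note and NoteType types and the
SignalsFromSigInfo function are hypothetical simplifications for illustration,
not the real ProcessElfCore parsing code.</font>

```cpp
#include <vector>

// Hypothetical, simplified model of the notes found in an ELF core file.
enum class NoteType { PRSTATUS, SIGINFO, Other };

struct Note {
    NoteType type;
    int signo;  // info.si_signo for PRSTATUS, si_signo for SIGINFO
};

// Walk the notes in file order. Each PRSTATUS note starts a new thread
// with no signal. A SIGINFO note that follows supplies the signal for
// that thread. The value in PRSTATUS is deliberately ignored here, on
// the (unconfirmed) assumption that it describes the whole process.
std::vector<int> SignalsFromSigInfo(const std::vector<Note>& notes) {
    std::vector<int> thread_signals;
    for (const Note& note : notes) {
        switch (note.type) {
        case NoteType::PRSTATUS:
            thread_signals.push_back(0);
            break;
        case NoteType::SIGINFO:
            // SIGINFO always seems to come after the PRSTATUS it belongs to.
            if (!thread_signals.empty())
                thread_signals.back() = note.signo;
            break;
        case NoteType::Other:
            break;
        }
    }
    return thread_signals;
}
```

<br><font size=2 face="sans-serif">With a crash core where only the first
thread has a SIGINFO note, only that thread ends up with a non-zero signal.
With a gcore file every thread picks up the signal from its own SIGINFO
note.</font>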
<br>
<br><font size=2 face="sans-serif">> > - Never allow a thread's signal
number to be 0 when it comes from </font>
<br><font size=2 face="sans-serif">> an elf core dump. (This is probably
as much of a band aid as the </font>
<br><font size=2 face="sans-serif">> first solution.)</font>
<br><font size=2 face="sans-serif">> </font>
<br><font size=2 face="sans-serif">> Threads should be able to have
no signal. If you have 10 threads and</font>
<br><font size=2 face="sans-serif">> thread 6 crashes with SIGABRT,
but all other threads were just </font>
<br><font size=2 face="sans-serif">> running, I would expect all threads
except for thread 6 to have 0 </font>
<br><font size=2 face="sans-serif">> signal values, or no stop reason.
If you end up with 10 threads and </font>
<br><font size=2 face="sans-serif">> all have no signal information,
I would say that you can just give </font>
<br><font size=2 face="sans-serif">> the first thread a SIGSTOP to be
safe.</font>
<br>
<br><font size=2 face="sans-serif">I checked this with one of the gcore
files by just setting the first thread's signal and leaving the others to
pick up 0 as they used to. That works.</font>
<br>
<br><font size=2 face="sans-serif">Putting in a check that makes sure that
at least one thread has some kind of signal seems reasonable. I'll
add that as a fallback sanity check.</font>
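<br>
<br><font size=2 face="sans-serif">Roughly, the fallback could look like the
sketch below. It assumes the per-thread signal numbers have already been
collected into a vector and that we are on a POSIX system where SIGSTOP is
defined. EnsureSomeThreadHasSignal is a hypothetical helper name, not an
existing lldb function.</font>

```cpp
#include <algorithm>
#include <csignal>
#include <vector>

// Hypothetical helper: if no thread in the core carries a signal,
// give the first thread SIGSTOP so there is at least one stop reason.
// Threads that already have a signal are left untouched.
void EnsureSomeThreadHasSignal(std::vector<int>& thread_signals) {
    if (thread_signals.empty())
        return;
    const bool any_signal =
        std::any_of(thread_signals.begin(), thread_signals.end(),
                    [](int signo) { return signo != 0; });
    if (!any_signal)
        thread_signals.front() = SIGSTOP;
}
```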
<br>
<br><font size=2 face="sans-serif">> The suggested fix can be done in a
cleaner way: Have ProcessELFCore and </font>
<br><font size=2 face="sans-serif">> ProcessMachCore override "Error
Process::WillResume()" just return an error:</font>
<br><font size=2 face="sans-serif">> </font>
<br><font size=2 face="sans-serif">> Error ProcessELFCore::WillResume()
</font>
<br><font size=2 face="sans-serif">> {</font>
<br><font size=2 face="sans-serif">> return Error("can't
resume a process in a core file");</font>
<br><font size=2 face="sans-serif">> }</font>
<br>
<br><font size=2 face="sans-serif">I think that's called too late. It's
not called until the decision has been made to resume the process. Also
the base implementation already returns an error and I don't think either
ProcessElfCore or ProcessMachCore override it.</font>
<br>
<br><font size=2 face="sans-serif">> So I think the correct fix is all
three of the above.</font>
<br>
<br><font size=2 face="sans-serif">I think it's close and discussing the
problem is actually helping a lot, thanks for the help. I'll grab the bug
and put up a patch - hopefully tomorrow.</font>
<br>
<br><font size=2 face="sans-serif">Thanks,</font>
<br>
<br><font size=2 face="sans-serif">Howard Hellyer </font>
<br><font size=2 face="sans-serif">IBM Runtime Technologies, IBM Systems
</font>
<br>
<br><tt><font size=2>Greg Clayton <gclayton@apple.com> wrote on 11/11/2016
18:07:03:<br>
<br>
> From: Greg Clayton <gclayton@apple.com></font></tt>
<br><tt><font size=2>> To: Howard Hellyer/UK/IBM@IBMGB</font></tt>
<br><tt><font size=2>> Cc: Jim Ingham <jingham@apple.com>, lldb-dev@lists.llvm.org</font></tt>
<br><tt><font size=2>> Date: 11/11/2016 18:07</font></tt>
<br><tt><font size=2>> Subject: Re: [lldb-dev] LLDB hang loading Linux
core files from live<br>
> processes (Bug 26322)</font></tt>
<br><tt><font size=2>> <br>
> I think both are valid fixes. Threads in core files can have a non-<br>
> zero signal. See comments below.<br>
> <br>
> > On Nov 11, 2016, at 5:36 AM, Howard Hellyer via lldb-dev <lldb-<br>
> dev@lists.llvm.org> wrote:<br>
> > <br>
> > Hi Jim <br>
> > <br>
> > I was afraid someone would say that but I've done some digging
and<br>
> found a difference in the core files I get generated by gcore to <br>
> those generated by a crash or abort. <br>
> > <br>
> > Most of the core files have one SIGINFO structure in the core,
I <br>
> think it belongs to the preceding thread (the one that caught the
signal). <br>
> > In the core files generated by gcore all of the threads have
a <br>
> SIGINFO structure following their PRSTATUS structure. In the non-<br>
> gcore files the value of info.si_signo in the PRSTATUS structure is
<br>
> a signal number. In the gcore file this is actually 0 but the <br>
> SIGINFO structure following PRSTATUS has an si_signo value of 19.
<br>
> > <br>
> > Looking at it with eu-readelf shows: <br>
> > <br>
> > CORE
336 PRSTATUS <br>
> > info.si_signo: 0, info.si_code: 0, info.si_errno:
0, cursig: 0 <br>
> > sigpend: <> <br>
> > sighold: <> <br>
> >     ... lots of registers... <br>
> > CORE
128 SIGINFO <br>
> > si_signo: 19, si_errno: 0, si_code: 0 <br>
> > sender PID: 0, sender UID: 0 <br>
> > <br>
> > I think gcore is being clever. It's including the "real"
signal <br>
> number the running thread had received at the time the core was <br>
> taken (info.si_signo is 0) but also the signal it had used to <br>
> interrupt the thread and gather its state. The value in PRSTATUS
<br>
> info.si_signo is the signal number that ends up in m_signo in <br>
> ThreadElfCore and ultimately is looked for in the set of signals <br>
> lldb should stop on in UnixSignals::GetShouldStop. 0 is not found
in<br>
> that set since there isn't a signal 0. I think gcore is doing all
<br>
> this so that it preserves the real signal state the process had <br>
> before gcore attached to it, I guess in case you are trying to debug<br>
> something to do with signals and need to see that state. (That's a
<br>
> bit of a guess mind you.) <br>
> > <br>
> > I can think of three solutions: <br>
> > <br>
> > - Read the signal information from the SIGINFO block for a thread
<br>
> if it's present. Core files generated by abort or a crash only seem
<br>
> to have a SIGINFO for one thread which looks like it's the one that
<br>
> received/triggered the signal in the first place. This means adding <br>
> something to parse that block out of the elf core as well as <br>
> PRSTATUS and override the state from PRSTATUS if we see it. SIGINFO
<br>
> always seems to come after PRSTATUS and probably has to as PRSTATUS
<br>
> contains the pid and identifies that there is a new thread in the
<br>
> core so if SIGINFO is found that signal number will just replace the first
one.<br>
> <br>
> You want to figure out which one is the accurate signal and use that.
<br>
> Doesn't matter how you do this, but this will be up to the <br>
> ProcessELFCore or ThreadELFCore classes.<br>
> > <br>
> > - Never allow a thread's signal number to be 0 when it comes from
<br>
> an elf core dump. (This is probably as much of a band aid as the <br>
> first solution.)<br>
> <br>
> Threads should be able to have no signal. If you have 10 threads and<br>
> thread 6 crashes with SIGABRT, but all other threads were just <br>
> running, I would expect all threads except for thread 6 to have 0
<br>
> signal values, or no stop reason. If you end up with 10 threads and
<br>
> all have no signal information, I would say that you can just give
<br>
> the first thread a SIGSTOP to be safe.<br>
> <br>
> > <br>
> > - Stick with the first solution of saying that we can never resume<br>
> a core file. The only thing in this solution's favour is that it <br>
> means the "real" thread state that gcore tried to preserve
is known <br>
> to lldb. Once the core file is loaded typing continue does result
in<br>
> an error message telling you that you can't resume from a core file.
<br>
> <br>
> The suggested fix can be done in a cleaner way: Have ProcessELFCore and
<br>
> ProcessMachCore override "Error Process::WillResume()" just
return an error:<br>
> <br>
> Error ProcessELFCore::WillResume() <br>
> {<br>
> return Error("can't resume a process in a core
file");<br>
> }<br>
> <br>
> So I think the correct fix is all three of the above.<br>
> <br>
> Greg<br>
> <br>
> > <br>
> > I'll have a go at prototyping the solution to read the SIGINFO
<br>
> structure but I'd appreciate any thoughts on which is the "correct"
fix. <br>
> > <br>
> > Thanks, <br>
> > <br>
> > <br>
> > Howard Hellyer <br>
> > IBM Runtime Technologies, IBM Systems
</font></tt><font size=2 face="sans-serif"><br>
</font>