[lldb-dev] A nasty failure in TestConcurrentEvents on linux

Pavel Labath labath at google.com
Thu Mar 19 08:47:14 PDT 2015


After a considerate amount of effort, I have managed to get a log trace of
a rare failure in TestConcurrentEvents
(test_two_watchpoints_one_signal_dwarf). Unfortunately, I do not see a
clear way how to avoid it, so I'm writing here in case anyone has an idea.

Here is my interpretation of the events, backed by log snippets (full log
in the attachment, I also can provide the corresponding core file if anyone
wants to dig around).

The inferior starts out by creating three threads. Of those, two are set up
to hit a watchpoint and one signals itself.

411 1426700114.595431000 [7295/729c]: MonitorChildProcessThreadFunction
::waitpid (pid = -29339, &status, options = 1073741824) => pid = 29346,
status = 0x00000a7f (STOPPED), signal = 10, exit_state = 0

On line 411 we see llgs catching the USR1(10) signal.

412 1426700114.595450000 [7295/729c]: NativeProcessLinux::MonitorSignal()
received signal SIGUSR1 (10) with code SI_TKILL, (siginfo pid = 29339 (not
from llgs), waitpid pid = 29346)

Here, we see that the ptrace(GETSIGINFO) call was successful, we get info
about the signal and continue on our way, stop all other threads (which hit
a watchpoint by the way), report this to lldb, etc. After some back and
forth, lldb tells us to resume the inferior and let it process the SIGUSR1
signal.

548 1426700116.644772000 [7295/729a]: NativeProcessLinux::Resume() resuming
thread = 29346 with signal SIGUSR1

However, here things get strange. Immediately after that we get a second
waitpid notification,

555 1426700116.644896000 [7295/729c]: MonitorChildProcessThreadFunction
::waitpid (pid = -29339, &status, options = 1073741824) => pid = 29346,
status = 0x00000a7f (STOPPED), signal = 10, exit_state = 0

This shouldn't happen as we have just reinjected the signal for the process
to handle, and no other signal was sent to the thread in the meantime. But
we don't know that so we try to process the signal as usual, by calling
ptrace(GETSIGINFO).

558 1426700116.644932000 [7295/729c]: ThreadStateCoordinator::EnqueueEvent
enqueued event: EventThreadDeath (tid=29346)
561 1426700116.644961000 [7295/729c]: NativeProcessLinux::MonitorCallback
GetSignalInfo failed: No such process, tid = 29346, signal = 10, status = 0
(thread/process killed, is not main thread, thread metadata removed)

These lines tell us that we are now in the branch where ptrace(GETSIGINFO)
failed with ESRCH. This can happen for two reasons:
- the thread in question has died without us noticing
- the thread is running
Since we have just received a waitpid notification, we know that the thread
should not be running and we assume that the thread has died between
waitpid and getsiginfo calls. We clean up our internal structures and
forget about the thread.

However, the thread is very much alive and kicking and we get another
notification from the thread on line 567. Now we try to find the thread in
our data structures and assert because it does not exist.


These events would seem to indicate that the USR1 notification on line 555
was bogus - there should be no additional signal, and even if there was, it
should have stopped the thread, but it didn't do that.

So it seems that we should just ignore bogus signals (= assume they don't
stop the thread and they don't kill it). However, it is not easy to
differentiate between the three scenarios:
- real signal - waitpid returns, getsiginfo suceeds - thread is stopped
- bogus signal - waitpid returns, getsiginfo fails  - thread is running
- thread death - waitpid returns, getsiginfo fails  - thread is dead


Does anyone have an idea how to address this? I have a few, but I don't
like any of them very much, so I'd rather keep them to myself for now.
Besides, this mail has gotten too long already. If you're still reading
this, give yourself a cookie. :)

pl
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/lldb-dev/attachments/20150319/87bacde5/attachment.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: core-lldb.log
Type: text/x-log
Size: 72373 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/lldb-dev/attachments/20150319/87bacde5/attachment.bin>


More information about the lldb-dev mailing list