[Lldb-commits] [lldb] [lldb] Detect a Darwin kernel issue and work around it (PR #81573)
Jason Molenda via lldb-commits
lldb-commits at lists.llvm.org
Mon Feb 12 21:51:52 PST 2024
https://github.com/jasonmolenda created https://github.com/llvm/llvm-project/pull/81573
On arm64 machines, when there is a hardware breakpoint or watchpoint set, and lldb has instruction-stepped a thread, and then done a Process::Resume, we will sometimes receive an extra "instruction step completed" mach exception and the pc has not advanced. From a user's perspective, they hit Continue and lldb stops again at the same spot. From the testsuite's perspective, this has been a constant source of testsuite failures for any test using hardware watchpoints and breakpoints, the arm64 CI bots seem especially good at hitting this issue.
Jim and I have been slowly looking at this for a few months now, and finally I decided to try to detect this situation in lldb and silently resume the process again when it happens.
We were already detecting this "got an insn-step finished mach exception but this thread was not instruction stepping" combination in StopInfoMachException where we take the mach exception and create a StopInfo object for it. We had a lot of logging we used to understand the failure as it was hit on the bots in assert builds.
This patch adds a new case to `Thread::GetPrivateStopInfo()` to call the StopInfo's (new) `IsContinueInterrupted()` method. In StopInfoMachException, where we previously had logging for assert builds, I now note it in an ivar, and when `Thread::GetPrivateStopInfo()` asks if this has happened, we check all of the combination of events that this comes up: We have a hardware breakpoint or watchpoint, we were not instruction stepping this thread but got an insn-step mach exception, the pc is the same as the previous stop's pc. And in that case, `Thread::GetPrivateStopInfo()` returns no StopInfo -- indicating that this thread would like to resume execution.
The `Thread` object has two StackFrameLists, `m_curr_frames_sp` and `m_prev_frames_sp`. When a thread resumes execution, we move `m_curr_frames_sp` in to `m_prev_frames_sp` and when it stops executing, w euse `m_prev_frames_sp` to seed the new `m_curr_frames_sp` if most of the stack is the same as before.
In this same location, I now save the Thread's RegisterContext::GetPC into an ivar, `m_prev_framezero_pc`. StopInfoMachException needs this information to check all of the conditions I outlined above for `IsContinueInterrupted`.
This has passed exhaustive testing and we do not have any testsuite failures for hardware watchpoints and breakpoints due to this kernel bug with the patch in place. In focusing on these tests for thousands of runs, I have found two other uncommon race conditions for the TestConcurrent* tests on arm64. TestConcurrentManyBreakpoints.py (which uses no hardware watchpoint/breakpoints) will sometimes only have 99 breakpoints when it expects 100, and any of the concurrent tests using the shared harness (I've seen it in
TestConcurrentWatchBreakDelay.py, TestConcurrentTwoBreakpointsOneSignal.py, TestConcurrentSignalDelayWatch.py) can fail when the test harness checks that there is only one thread still running at the end, and it finds two -- one of them under pthread_exit / pthread_terminate. Both of these failures happen on github main without my changes, and with my changes - they are unrelated race conditions in these tests, and I'm sure I'll be looking into them at some point if they hit the CI bots with frequency. On my computer, these are in the 0.3-0.5% of the time class. But the CI bots do have different timing.
>From 177a242271c99ed3fec4c5f211cd6d07c6d88488 Mon Sep 17 00:00:00 2001
From: Jason Molenda <jason at molenda.com>
Date: Mon, 12 Feb 2024 21:34:13 -0800
Subject: [PATCH] [lldb] Detect a Darwin kernel issue and work around it
On arm64 machines, when there is a hardware breakpoint or watchpoint
set, and lldb has instruction-stepped a thread, and then done a
Process::Resume, we will sometimes receive an extra "instruction
step completed" mach exception and the pc has not advanced. From
a user's perspective, they hit Continue and lldb stops again at the
same spot. From the testsuite's perspective, this has been a
constant source of testsuite failures for any test using hardware
watchpoints and breakpoints, the arm64 CI bots seem especially good
at hitting this issue.
Jim and I have been slowly looking at this for a few months now,
and finally I decided to try to detect this situation in lldb and
silently resume the process again when it happens.
We were already detecting this "got an insn-step finished mach
exception but this thread was not instruction stepping" combination
in StopInfoMachException where we take the mach exception and create
a StopInfo object for it. We had a lot of logging we used to
understand the failure as it was hit on the bots in assert builds.
This patch adds a new case to `Thread::GetPrivateStopInfo()` to
call the StopInfo's (new) `IsContinueInterrupted()` method. In
StopInfoMachException, where we previously had logging for assert
builds, I now note it in an ivar, and when `Thread::GetPrivateStopInfo()`
asks if this has happened, we check all of the combination of events
that this comes up: We have a hardware breakpoint or watchpoint,
we were not instruction stepping this thread but got an insn-step
mach exception, the pc is the same as the previous stop's pc. And
in that case, `Thread::GetPrivateStopInfo()` returns no StopInfo
-- indicating that this thread would like to resume execution.
The `Thread` object has two StackFrameLists, `m_curr_frames_sp` and
`m_prev_frames_sp`. When a thread resumes execution, we move
`m_curr_frames_sp` in to `m_prev_frames_sp` and when it stops
executing, w euse `m_prev_frames_sp` to seed the new `m_curr_frames_sp`
if most of the stack is the same as before.
In this same location, I now save the Thread's RegisterContext::GetPC
into an ivar, `m_prev_framezero_pc`. StopInfoMachException needs
this information to check all of the conditions I outlined above
for `IsContinueInterrupted`.
This has passed exhaustive testing and we do not have any testsuite
failures for hardware watchpoints and breakpoints due to this kernel
bug with the patch in place. In focusing on these tests for thousands
of runs, I have found two other uncommon race conditions for the
TestConcurrent* tests on arm64. TestConcurrentManyBreakpoints.py
(which uses no hardware watchpoint/breakpoints) will sometimes only
have 99 breakpoints when it expects 100, and any of the concurrent
tests using the shared harness (I've seen it in
TestConcurrentWatchBreakDelay.py, TestConcurrentTwoBreakpointsOneSignal.py,
TestConcurrentSignalDelayWatch.py) can fail when the test harness
checks that there is only one thread still running at the end, and
it finds two -- one of them under pthread_exit / pthread_terminate.
Both of these failures happen on github main without my changes,
and with my changes - they are unrelated race conditions in these
tests, and I'm sure I'll be looking into them at some point if they
hit the CI bots with frequency. On my computer, these are in the
0.3-0.5% of the time class. But the CI bots do have different
timing.
---
lldb/include/lldb/Target/StopInfo.h | 5 ++
lldb/include/lldb/Target/Thread.h | 7 ++
.../Process/Utility/StopInfoMachException.cpp | 83 +++++++++++++------
.../Process/Utility/StopInfoMachException.h | 11 ++-
lldb/source/Target/Thread.cpp | 19 ++++-
5 files changed, 94 insertions(+), 31 deletions(-)
diff --git a/lldb/include/lldb/Target/StopInfo.h b/lldb/include/lldb/Target/StopInfo.h
index 305fc5d0e0fbe5..1eb409fdefe392 100644
--- a/lldb/include/lldb/Target/StopInfo.h
+++ b/lldb/include/lldb/Target/StopInfo.h
@@ -79,6 +79,11 @@ class StopInfo : public std::enable_shared_from_this<StopInfo> {
virtual bool IsValidForOperatingSystemThread(Thread &thread) { return true; }
+ /// A Continue operation can result in a false stop event
+ /// before any execution has happened in certain cases on Darwin.
+ /// We need to silently continue again time.
+ virtual bool IsContinueInterrupted(Thread &thread) { return false; }
+
// Sometimes the thread plan logic will know that it wants a given stop to
// stop or not, regardless of what the ordinary logic for that StopInfo would
// dictate. The main example of this is the ThreadPlanCallFunction, which
diff --git a/lldb/include/lldb/Target/Thread.h b/lldb/include/lldb/Target/Thread.h
index e423dd4a6d2baa..15ff8661dd605c 100644
--- a/lldb/include/lldb/Target/Thread.h
+++ b/lldb/include/lldb/Target/Thread.h
@@ -11,6 +11,7 @@
#include <memory>
#include <mutex>
+#include <optional>
#include <string>
#include <vector>
@@ -1226,6 +1227,7 @@ class Thread : public std::enable_shared_from_this<Thread>,
friend class StackFrameList;
friend class StackFrame;
friend class OperatingSystem;
+ friend class StopInfoMachException;
// This is necessary to make sure thread assets get destroyed while the
// thread is still in good shape to call virtual thread methods. This must
@@ -1262,6 +1264,8 @@ class Thread : public std::enable_shared_from_this<Thread>,
lldb::StackFrameListSP GetStackFrameList();
+ std::optional<lldb::addr_t> GetPreviousFrameZeroPC();
+
void SetTemporaryResumeState(lldb::StateType new_state) {
m_temporary_resume_state = new_state;
}
@@ -1299,6 +1303,9 @@ class Thread : public std::enable_shared_from_this<Thread>,
///populated after a thread stops.
lldb::StackFrameListSP m_prev_frames_sp; ///< The previous stack frames from
///the last time this thread stopped.
+ std::optional<lldb::addr_t>
+ m_prev_framezero_pc; ///< Frame 0's PC the last
+ /// time this thread was stopped.
int m_resume_signal; ///< The signal that should be used when continuing this
///thread.
lldb::StateType m_resume_state; ///< This state is used to force a thread to
diff --git a/lldb/source/Plugins/Process/Utility/StopInfoMachException.cpp b/lldb/source/Plugins/Process/Utility/StopInfoMachException.cpp
index d756354f9bd278..d511612b7767c0 100644
--- a/lldb/source/Plugins/Process/Utility/StopInfoMachException.cpp
+++ b/lldb/source/Plugins/Process/Utility/StopInfoMachException.cpp
@@ -26,6 +26,8 @@
#include "lldb/Target/Thread.h"
#include "lldb/Target/ThreadPlan.h"
#include "lldb/Target/UnixSignals.h"
+#include "lldb/Utility/LLDBLog.h"
+#include "lldb/Utility/Log.h"
#include "lldb/Utility/StreamString.h"
#include <optional>
@@ -596,6 +598,7 @@ StopInfoSP StopInfoMachException::CreateStopReasonWithMachException(
if (exc_type == 0)
return StopInfoSP();
+ bool not_stepping_but_got_singlestep_exception = false;
uint32_t pc_decrement = 0;
ExecutionContext exe_ctx(thread.shared_from_this());
Target *target = exe_ctx.GetTargetPtr();
@@ -720,30 +723,8 @@ StopInfoSP StopInfoMachException::CreateStopReasonWithMachException(
// is set
is_actual_breakpoint = true;
is_trace_if_actual_breakpoint_missing = true;
-#ifndef NDEBUG
- if (thread.GetTemporaryResumeState() != eStateStepping) {
- StreamString s;
- s.Printf("CreateStopReasonWithMachException got EXC_BREAKPOINT [1,0] "
- "indicating trace event, but thread is not tracing, it has "
- "ResumeState %d",
- thread.GetTemporaryResumeState());
- if (RegisterContextSP regctx = thread.GetRegisterContext()) {
- if (const RegisterInfo *ri = regctx->GetRegisterInfoByName("esr")) {
- uint32_t esr =
- (uint32_t)regctx->ReadRegisterAsUnsigned(ri, UINT32_MAX);
- if (esr != UINT32_MAX) {
- s.Printf(" esr value: 0x%" PRIx32, esr);
- }
- }
- }
- thread.GetProcess()->DumpPluginHistory(s);
- llvm::report_fatal_error(s.GetData());
- lldbassert(
- false &&
- "CreateStopReasonWithMachException got EXC_BREAKPOINT [1,0] "
- "indicating trace event, but thread was not doing a step.");
- }
-#endif
+ if (thread.GetTemporaryResumeState() != eStateStepping)
+ not_stepping_but_got_singlestep_exception = true;
}
if (exc_code == 0x102) // EXC_ARM_DA_DEBUG
{
@@ -825,6 +806,56 @@ StopInfoSP StopInfoMachException::CreateStopReasonWithMachException(
break;
}
- return StopInfoSP(new StopInfoMachException(thread, exc_type, exc_data_count,
- exc_code, exc_sub_code));
+ return StopInfoSP(new StopInfoMachException(
+ thread, exc_type, exc_data_count, exc_code, exc_sub_code,
+ not_stepping_but_got_singlestep_exception));
+}
+
+// Detect an unusual situation on Darwin where:
+//
+// 0. We did an instruction-step before this.
+// 1. We have a hardware breakpoint or watchpoint set.
+// 2. We are resuming the process.
+// 3. The thread gets an "instruction-step completed" mach exception.
+// 4. The pc has not advanced - it is the same as before.
+//
+// This method returns true for that combination of events.
+bool StopInfoMachException::IsContinueInterrupted(Thread &thread) {
+ Log *log = GetLog(LLDBLog::Step);
+
+ // We got an instruction-step completed mach exception but we were not
+ // doing an instruction step on this thread.
+ if (!m_not_stepping_but_got_singlestep_exception)
+ return false;
+
+ RegisterContextSP reg_ctx_sp(thread.GetRegisterContext());
+ std::optional<addr_t> prev_pc = thread.GetPreviousFrameZeroPC();
+ if (!reg_ctx_sp || !prev_pc)
+ return false;
+
+ // The previous pc value and current pc value are the same.
+ if (*prev_pc != reg_ctx_sp->GetPC())
+ return false;
+
+ // We have a watchpoint -- this is the kernel bug.
+ ProcessSP process_sp = thread.GetProcess();
+ if (process_sp->GetWatchpointResourceList().GetSize()) {
+ LLDB_LOGF(log,
+ "Thread stopped with insn-step completed mach exception but "
+ "thread was not stepping; there is a hardware watchpoint set.");
+ return true;
+ }
+
+ // We have a hardware breakpoint -- this is the kernel bug.
+ auto &bp_site_list = process_sp->GetBreakpointSiteList();
+ for (auto &site : bp_site_list.Sites()) {
+ if (site->IsHardware() && site->IsEnabled()) {
+ LLDB_LOGF(log,
+ "Thread stopped with insn-step completed mach exception but "
+ "thread was not stepping; there is a hardware breakpoint set.");
+ return true;
+ }
+ }
+
+ return false;
}
diff --git a/lldb/source/Plugins/Process/Utility/StopInfoMachException.h b/lldb/source/Plugins/Process/Utility/StopInfoMachException.h
index 541ef5e69565de..aa4c00b4d0edce 100644
--- a/lldb/source/Plugins/Process/Utility/StopInfoMachException.h
+++ b/lldb/source/Plugins/Process/Utility/StopInfoMachException.h
@@ -31,9 +31,12 @@ class StopInfoMachException : public StopInfo {
// Constructors and Destructors
StopInfoMachException(Thread &thread, uint32_t exc_type,
uint32_t exc_data_count, uint64_t exc_code,
- uint64_t exc_subcode)
+ uint64_t exc_subcode,
+ bool not_stepping_but_got_singlestep_exception)
: StopInfo(thread, exc_type), m_exc_data_count(exc_data_count),
- m_exc_code(exc_code), m_exc_subcode(exc_subcode) {}
+ m_exc_code(exc_code), m_exc_subcode(exc_subcode),
+ m_not_stepping_but_got_singlestep_exception(
+ not_stepping_but_got_singlestep_exception) {}
~StopInfoMachException() override = default;
@@ -58,10 +61,14 @@ class StopInfoMachException : public StopInfo {
uint64_t exc_code, uint64_t exc_sub_code, uint64_t exc_sub_sub_code,
bool pc_already_adjusted = true, bool adjust_pc_if_needed = false);
+ bool IsContinueInterrupted(Thread &thread) override;
+
protected:
uint32_t m_exc_data_count;
uint64_t m_exc_code;
uint64_t m_exc_subcode;
+
+ bool m_not_stepping_but_got_singlestep_exception;
};
} // namespace lldb_private
diff --git a/lldb/source/Target/Thread.cpp b/lldb/source/Target/Thread.cpp
index 8ae2179c1281d0..4fa8cb0940bd25 100644
--- a/lldb/source/Target/Thread.cpp
+++ b/lldb/source/Target/Thread.cpp
@@ -221,7 +221,7 @@ Thread::Thread(Process &process, lldb::tid_t tid, bool use_invalid_index_id)
: process.GetNextThreadIndexID(tid)),
m_reg_context_sp(), m_state(eStateUnloaded), m_state_mutex(),
m_frame_mutex(), m_curr_frames_sp(), m_prev_frames_sp(),
- m_resume_signal(LLDB_INVALID_SIGNAL_NUMBER),
+ m_prev_framezero_pc(), m_resume_signal(LLDB_INVALID_SIGNAL_NUMBER),
m_resume_state(eStateRunning), m_temporary_resume_state(eStateRunning),
m_unwinder_up(), m_destroy_called(false),
m_override_should_notify(eLazyBoolCalculate),
@@ -250,6 +250,7 @@ void Thread::DestroyThread() {
std::lock_guard<std::recursive_mutex> guard(m_frame_mutex);
m_curr_frames_sp.reset();
m_prev_frames_sp.reset();
+ m_prev_framezero_pc.reset();
}
void Thread::BroadcastSelectedFrameChange(StackID &new_frame_id) {
@@ -422,6 +423,12 @@ lldb::StopInfoSP Thread::GetPrivateStopInfo(bool calculate) {
}
}
}
+
+ // If we were resuming the process and it was interrupted,
+ // return no stop reason. This thread would like to resume.
+ if (m_stop_info_sp && m_stop_info_sp->IsContinueInterrupted(*this))
+ return {};
+
return m_stop_info_sp;
}
@@ -1408,16 +1415,22 @@ StackFrameListSP Thread::GetStackFrameList() {
return m_curr_frames_sp;
}
+std::optional<addr_t> Thread::GetPreviousFrameZeroPC() {
+ return m_prev_framezero_pc;
+}
+
void Thread::ClearStackFrames() {
std::lock_guard<std::recursive_mutex> guard(m_frame_mutex);
GetUnwinder().Clear();
+ m_prev_framezero_pc.reset();
+ if (RegisterContextSP reg_ctx_sp = GetRegisterContext())
+ m_prev_framezero_pc = reg_ctx_sp->GetPC();
// Only store away the old "reference" StackFrameList if we got all its
// frames:
// FIXME: At some point we can try to splice in the frames we have fetched
- // into
- // the new frame as we make it, but let's not try that now.
+ // into the new frame as we make it, but let's not try that now.
if (m_curr_frames_sp && m_curr_frames_sp->GetAllFramesFetched())
m_prev_frames_sp.swap(m_curr_frames_sp);
m_curr_frames_sp.reset();
More information about the lldb-commits
mailing list