[Lldb-commits] [PATCH] D129814: Fix stepping over watchpoints in architectures that raise the exception before executing the instruction

Pavel Labath via Phabricator via lldb-commits lldb-commits at lists.llvm.org
Mon Jul 18 01:34:49 PDT 2022


labath accepted this revision.
labath added a comment.
This revision is now accepted and ready to land.

In D129814#3655878 <https://reviews.llvm.org/D129814#3655878>, @jingham wrote:

> In D129814#3654368 <https://reviews.llvm.org/D129814#3654368>, @labath wrote:
>
>> In D129814#3654276 <https://reviews.llvm.org/D129814#3654276>, @jasonmolenda wrote:
>>
>>> In D129814#3654230 <https://reviews.llvm.org/D129814#3654230>, @labath wrote:
>>>
>>>> Generally, this makes sense to me, but I do have one question (not necessarily for Jim). These tests work on (arm) linux, and I am wondering why that is the case. Could it be related to the fact that lldb-server preserves the stop reason for threads that are not running? I.e. if two threads hit a breakpoint (or whatever) and we then step one of them, the other thread will still report stop_reason=breakpoint. Do you know if this is supposed to happen (e.g. does debugserver do that)?
>>>
>>> I'd have to go back and test it to be 100% sure, but from the behavior I see on Darwin systems: when we receive a watchpoint hit on two threads at the same time and instruction-step the first thread, then ask debugserver for the current stop-reason on thread two, it says there is no reason.  The watchpoint exception on the second thread is lost.  I could imagine lldb-server behaving differently. (I think once we fetch the mach exceptions for the threads, they've been delivered, and when we ask the kernel for any pending mach exceptions a second time -- even if those threads haven't been allowed to run -- that state is lost, and debugserver didn't save the initial state it received.)
>>
>> Yes, that sounds plausible. In lldb-server (on linux) it happens differently because we store each stop reason inside the thread object, and since we can control each thread independently, there's no reason to touch it if we're not running the thread. Although it also wouldn't be hard to clear it on every resume.
>>
>> So which one of these behaviors would you say is the correct one?
>
> Darwin has always worked in such a way that, by the time it stops for an exception, it will have given the other threads enough of a chance to run that, if they were going to hit the exception, they will have hit it before the first exception is delivered to lldb.  In gdb we tried to hide that fact by only publishing one exception, but that was awkward, since when you tried to resume to do some piece of work, you'd immediately stop again without getting to run.  That just led to confusing interactions (and was a little harder to deal with in the debugger as well).  I think it's clearer to stop, see that 5 threads have hit your breakpoint, and then decide what to do, than to have one breakpoint reported and then, when you "continue", immediately turn around and get another without anybody really seeming to make progress; rinse, repeat...
>
> Rereading this I wasn't sure which question you were asking.  I answered "what should you tell lldb if lldb-server sees 5 threads stopped for reasons at one stop point."  But the other question is "what to do with stop infos if you only run one thread".  If lldb knows it has stopped a thread from running when it resumes the target, it also preserves the StopInfo for that thread.  That seems to me the most useful behavior.

Given your latest comment, I think you've understood my question, but to avoid confusion, the situation I was referring to is when lldb**-server** sees 5 threads stopped for some reason, and then lldb steps one of them. The question was whether lldb-server should still report the original stop reason of those threads (if asked -- theoretically lldb should know that the reason hasn't changed and might decide not to ask).

I agree that preserving the stop reason is the most useful behavior for **lldb** (the threads haven't run, so why should their state change?), but then one could apply the same logic to the lldb-/debugserver component as well.
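
To put the same question in protocol terms, here is a hypothetical exchange (the thread IDs and packet details are illustrative, not taken from an actual log):

  -> vCont;s:2        step only thread 2; threads 3..6 stay suspended
  <- T05thread:2;...  the step completes
  -> jThreadsInfo     (or qThreadStopInfo for thread 3)

     Should thread 3, which never ran, still report its original
     reason:watchpoint, or report no reason at all?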

In D129814#3656658 <https://reviews.llvm.org/D129814#3656658>, @jingham wrote:

> In D129814#3655750 <https://reviews.llvm.org/D129814#3655750>, @jasonmolenda wrote:
>
>> In D129814#3654368 <https://reviews.llvm.org/D129814#3654368>, @labath wrote:
>>
>>> In D129814#3654276 <https://reviews.llvm.org/D129814#3654276>, @jasonmolenda wrote:
>>>
>>>> In D129814#3654230 <https://reviews.llvm.org/D129814#3654230>, @labath wrote:
>>>>
>>>>> Generally, this makes sense to me, but I do have one question (not necessarily for Jim). These tests work on (arm) linux, and I am wondering why that is the case. Could it be related to the fact that lldb-server preserves the stop reason for threads that are not running? I.e. if two threads hit a breakpoint (or whatever) and we then step one of them, the other thread will still report stop_reason=breakpoint. Do you know if this is supposed to happen (e.g. does debugserver do that)?
>>>>
>>>> I'd have to go back and test it to be 100% sure, but from the behavior I see on Darwin systems: when we receive a watchpoint hit on two threads at the same time and instruction-step the first thread, then ask debugserver for the current stop-reason on thread two, it says there is no reason.  The watchpoint exception on the second thread is lost.  I could imagine lldb-server behaving differently. (I think once we fetch the mach exceptions for the threads, they've been delivered, and when we ask the kernel for any pending mach exceptions a second time -- even if those threads haven't been allowed to run -- that state is lost, and debugserver didn't save the initial state it received.)
>>>
>>> Yes, that sounds plausible. In lldb-server (on linux) it happens differently because we store each stop reason inside the thread object, and since we can control each thread independently, there's no reason to touch it if we're not running the thread. Although it also wouldn't be hard to clear it on every resume.
>>>
>>> So which one of these behaviors would you say is the correct one?
>>
>> Yeah, as I was writing my explanation for why we see this problem on Darwin, I realized the obvious next question was "why isn't this a debugserver fix" - I think that would be another valid way to approach this, although we still need to interoperate with older debugservers on iOS etc. devices from the past five years.  (To be fair, it's relatively uncommon for this issue to be hit in real-world programs; our API tests stress this area specifically to expose bugs.)  I do wonder about the failure mgorny reported in https://github.com/llvm/llvm-project/issues/48777, which sounds like exactly the same problem as we have on Darwin, but on a BSD or linux system running under AArch64 qemu (he doesn't say which OS he was running under qemu).
>>
>> That being said, compensating for this debugserver deficiency in lldb doesn't seem like a bad thing to me, given that the patch exists.
>
> I'm pretty sure "saving suspended threads' stop reasons across a resume" is not a requirement of the gdb-remote protocol, so I don't think we should rely on that behavior.

What makes you so sure of that? As you've said before, the idea of having multiple stop reasons per stop is our own invention, so it's not like we have a precedent or a spec we compare ourselves to.

The way I see it, one can fairly easily find justifications for both positions. In a world where we don't save the stop reasons, the reasons are not so much a property **of** a thread as they are **about** a thread. They are a property of a stop, and once we move on from that "stop", the stop reasons go away. If you wanted to preserve the stop reasons, you'd argue that a stop reason is an intrinsic property of a thread, and since the thread hasn't run, its stop reason shouldn't change.

I don't think either of them is wrong. Just, depending on how you look at things, one of them feels more natural. And if you're on a system where you have to actively suspend a thread to prevent it from running during a resume, you might lean towards a different interpretation than if you're able to control each thread individually (and "suspending a thread" means "doing nothing").
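
As a rough sketch (not actual lldb-server code; the types and names are invented for illustration), the two interpretations differ only in which threads get their reason cleared when the process resumes:

  #include <vector>

  // Stand-in types, not the real lldb ones.
  enum class StopReason { None, Breakpoint, Watchpoint, Trace };
  struct Thread {
    StopReason reason = StopReason::None;
    bool will_run = false; // did the client ask to resume/step this thread?
  };

  // "The reason describes the stop": once we move on from that stop,
  // every thread's reason goes away, whether the thread ran or not.
  void ResumeClearingAll(std::vector<Thread> &threads) {
    for (Thread &t : threads)
      t.reason = StopReason::None;
  }

  // "The reason is a property of the thread": only a thread that actually
  // runs loses it; suspended threads keep reporting the old reason.
  void ResumePreservingSuspended(std::vector<Thread> &threads) {
    for (Thread &t : threads)
      if (t.will_run)
        t.reason = StopReason::None;
  }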

> Plus, this is the wrong way to use the execution control part of lldb.  It's much harder to reason about the execution control machinery if we're in one round of "How should I react to stop X" and someone just moves the target to stop X+1 in the middle of that process...  We are in a bit of a tough spot with expressions that run all threads, but (a) that really only happens when users explicitly ask for it and (b) we already make a distinction between "resumes for expression evaluation" and "natural resumes".  We should certainly not make it worse.  So even if it hadn't caused this bug, that wasn't the right way to implement this bit of business.

Yeah, I'm not arguing against this patch. The reason I brought this up is that our stubs seem to behave differently in these situations. I think that is causing some (other) bugs, and I'd like to know where to fix them.



================
Comment at: lldb/source/Target/StopInfo.cpp:898-901
                 if (scalar_value.ULongLong(1) == 0) {
-                  // We have been vetoed.  This takes precedence over querying
-                  // the watchpoint whether it should stop (aka ignore count
-                  // and friends).  See also StopInfoWatchpoint::ShouldStop()
-                  // as well as Process::ProcessEventData::DoOnRemoval().
+                  // The condition failed, which we consider "not having hit
+                  // the watchpoint" so undo the hit count here.
+                  wp_sp->UndoHitCount();
----------------
jingham wrote:
> labath wrote:
> > How does this relate to the thread plan part of the patch?
> This is a drive-by fix.  The point of this effort was to make the hit counts correct for Watchpoints, and as I was reading through the code I noticed that we were actually failing to do that in two ways: mishandling the case of multiple simultaneous watchpoint hits, and mishandling failed condition hits.
> 
> I mentioned in the initial comments that I fixed them both in one patch, but I can separate them into two patches if that bugs you.
Sorry, I missed that part of the patch description. Well, the main question on my mind was whether I am missing something. Now that that's cleared up, it's not so much of a problem (I am assuming this is related to/tested by the change in `TestWatchpointConditionCmd.py`), though I would generally recommend splitting that off into a separate patch in the future (and I don't think you'd need to create a review for a simple change like this).


================
Comment at: lldb/source/Target/StopInfo.cpp:730
+        StopInfoWatchpoint *as_wp_stop_info 
+            = reinterpret_cast<StopInfoWatchpoint *>(m_stop_info_sp.get());
+        as_wp_stop_info->SetStepOverPlanComplete();
----------------
Change the type of `m_stop_info_sp` (and the relevant constructor argument) to `shared_ptr<StopInfoWatchpoint>` to avoid the cast here.

(btw, the right cast for a downcast like this is `static_cast`. reinterpret_cast can return wrong results in more [[ https://godbolt.org/z/5W7rz3fKK | complicated scenarios ]])
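
A minimal sketch of that suggestion, with heavily simplified stand-in classes (the real lldb types carry much more state, and the method that ends up calling SetStepOverPlanComplete() may differ):

  #include <memory>
  #include <utility>

  class StopInfo {
  public:
    virtual ~StopInfo() = default;
  };

  class StopInfoWatchpoint : public StopInfo {
  public:
    void SetStepOverPlanComplete() { m_step_over_plan_complete = true; }
  private:
    bool m_step_over_plan_complete = false;
  };

  class ThreadPlanStepOverWatchpoint {
  public:
    // Taking the derived type directly means no cast is needed when the
    // plan later reports back to the stop info.
    explicit ThreadPlanStepOverWatchpoint(
        std::shared_ptr<StopInfoWatchpoint> stop_info_sp)
        : m_stop_info_sp(std::move(stop_info_sp)) {}

    void NotifyStepOverDone() { m_stop_info_sp->SetStepOverPlanComplete(); }

  private:
    std::shared_ptr<StopInfoWatchpoint> m_stop_info_sp;
  };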


================
Comment at: lldb/source/Target/StopInfo.cpp:805
+      ThreadPlanSP step_over_wp_sp(new ThreadPlanStepOverWatchpoint(
+          *(thread_sp.get()), shared_from_this(), wp_sp));
+      Status error;
----------------
And use `static_pointer_cast<StopInfoWatchpoint>(shared_from_this())` here. That way the casting remains confined within the originating class.
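
Assuming the constructor is changed as suggested above, the call site would become something along these lines (paraphrasing the diff, not the final code):

      ThreadPlanSP step_over_wp_sp(new ThreadPlanStepOverWatchpoint(
          *thread_sp,
          std::static_pointer_cast<StopInfoWatchpoint>(shared_from_this()),
          wp_sp));
      Status error;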


================
Comment at: lldb/source/Target/StopInfo.cpp:992
+  bool m_using_step_over_plan = false;
+  // ThreadPlanSP m_step_over_wp_sp;
 };
----------------
delete


================
Comment at: lldb/test/API/commands/watchpoints/watchpoint_commands/condition/TestWatchpointConditionCmd.py:82-83
 
         # Use the '-v' option to do verbose listing of the watchpoint.
         # The hit count should now be 2.
         self.expect("watchpoint list -v",
----------------
I guess this comment was wrong before, but better to update it now (or delete it -- it's kind of self-evident).


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D129814/new/

https://reviews.llvm.org/D129814


