[lldb-dev] Running other threads while stepping

Thu Sep 5 19:28:17 PDT 2013

On Sep 5, 2013, at 5:20 PM, Ed Maste <emaste at freebsd.org> wrote:

> On 5 September 2013 19:37, Kaylor, Andrew <andrew.kaylor at intel.com> wrote:
>> When you say that you end up stopping on a breakpoint in another thread, do you mean that you stop at the user breakpoint that you set on the sleep call or that the other thread stops on the internal breakpoint associated with the step command?
> 
> It's the former - when I step I arrive at the sleep(1) in the main thread.
> 
>> If the other thread is hitting the user breakpoint, that sounds reasonable and it's probably just a matter of test timing whether or not it happens.  That is, you might hit the breakpoint in both threads at once and have it behave like a single stop (except that both thread stops should be reported), or you might hit just one and then hit the other during stepping.  In the log (assuming that Linux process logging is enabled), the case of both breakpoints being hit at once would show up as a breakpoint event being processed while we were trying to stop all threads.
> 
> In the FreeBSD case this race doesn't exist -- when the first thread
> hits the breakpoint the entire process is stopped, and there's no way
> to return another stop event until after the process is restarted.
> Looking at the log confirms - both threads (the 3rd hadn't been
> started yet) are set to running for the step.
> 
>> Incidentally, one of the changes in r166732 looks wrong.  The first change in ThreadPlanStepInRange.cpp was this:
>> 
>> +        if (m_stop_others == lldb::eOnlyThisThread)
>>             stop_others = false;
>> +        else
>> +            stop_others = true;
> 
> Yes, this looks backwards.  I did some further digging into one case
> of my issue and it comes from the equivalent test in
> ThreadPlanStepOverRange.cpp:
> 
>    if (m_stop_others == lldb::eOnlyThisThread)
>        stop_others = true;
>    else
>        stop_others = false;
> 
> which at least has the expected sense.
> 
> In my case m_stop_others is eOnlyDuringStepping.  With the above test
> stop_others is false and so ThreadPlanStepOverRange::ShouldStop calls
> m_thread.QueueThreadPlanForStepOut and queues a thread plan that runs
> the other threads.
> 
> Presumably I can explicitly suspend all threads but one before
> stepping.  From a user's perspective though it currently seems
> arbitrary whether or not other threads will run during a step.

Whether to run one or all threads is a balancing act between the desire on the one hand to have "step" and "next" not cause other threads to run, and on the other hand to not cause the process you're debugging to deadlock and stall because the thread you ARE stepping or nexting tried to acquire a lock held by another thread.

lldb's stepping logic makes the assumption that single stepping over instructions in the frame the user is currently moving through can be safely done when running only one thread, but anything that might cause arbitrary code to run is NOT safe to do with just one thread running.  The rational behind this is that you probably aren't ever going to try to step over the instruction that actually tries to acquire a lock, but you very well might step over a line of code that contains a function call that tries to acquire a lock.  In practice, this works pretty well.  

This isn't academic.  Back when I wasn't letting all threads run on the "step-out" that happened while doing "step-over" I got a bunch of reports of "step-over" leading to deadlocks in the application being debugged.  I haven't had any since making this work the way it is now.  It might seem a little arbitrary to users whether other threads get to run or not, but it is better than having the user have to manage cases where their process should have made some progress during this step but is isn't, say by interrupting it, and continuing the logical step operation you were in the middle of, but this time with more threads allowed to run.

The other option would be to do some timing kind of heuristic, where we let only one thread run with a timeout, and if it doesn't return after that timeout we interrupt it resume it with all threads running.  That might be worth trying, but you'd have to put in place a mechanism for ThreadPlans to register a timeout and a callback on timeout that could adjust and restart the process, and that isn't in place yet.  Not sure this would appear any more arbitrary than what we do now, however, though it might allow more code to run on only a single thread.

So, whenever you are single stepping from instruction to instruction in the frame you started in, we consider it safe to only run one thread.  But if you have to do a "step-out" you are allowing arbitrary code to run, and that is unsafe without allowing all threads to run.  So in that case, we allow all threads to run.  Note, the "stop_others" is only used if we end up pushing another plan, generally a step-out or step-through plan.  We use the value in m_stop_others in the step thread plan itself if we decide we've got more stepping in this range to do.

That's the way it is supposed to work...  It does look like ThreadPlanStepInRange test is some kind of thinko.  In practice, in this case it is probably safe to run only one thread when doing the "step through" since that generally involved running from a shared library stub to its target.  That could deadlock if the dynamic loader has to fix up the symbol, and another thread is in the middle of doing that.  But that doesn't seem to be very common, or at least the code is clearly wrong but I haven't had any reports of this causing deadlocks...

The StepOverRange one is correct.  The default for stepping is eOnlyDuringStepping, by which I really mean "only when stepping directly through the code in the current frame" and is a best effort kind of thing.  Changing the run mode to eOnlyThisThread should only be done if you KNOW there is no possibility of deadlocks, or are willing to handle working out of the deadlocks yourself.  For the purposes of testing, you could switch the mode to eOnlyThisThread in your steps if you want to force the order in which things happen.  But that shouldn't be the default.

> Incidentally I found what appears to be another bug here (in
> ThreadPlanStepOverRange::ShouldStop).  older_ctx_is_equivalent is
> initialized to true, then set to true if certain conditions are met.
> I presume this change is suitable:
> 
> --- a/source/Target/ThreadPlanStepOverRange.cpp
> +++ b/source/Target/ThreadPlanStepOverRange.cpp
> @@ -121,7 +121,7 @@ ThreadPlanStepOverRange::ShouldStop (Event *event_ptr)
>             // in so I left out the target check.  And sometimes the
> module comes in as the .o file from the
>             // inlined range, so I left that out too...
> 
> -            bool older_ctx_is_equivalent = true;
> +            bool older_ctx_is_equivalent = false;
>             if (m_addr_context.comp_unit)
>             {
>                 if (m_addr_context.comp_unit == older_context.comp_unit)

Yes, that seems wrong.  This code is there as a backstop for when the unwinder drops a frame at the beginning of new function/trampoline or whatever.  In that case, (older_ctx_is_equivalent == false) we will see if we are at a trampoline function that somebody knows how to get out of, and otherwise we will stop.

My guess is that doesn't happen very often anymore, so the sanity check is always succeeding anyway.  Your suggestion seems fine.

Jim