[lldb-dev] Locking issues

Fri Apr 19 10:47:50 PDT 2013

On Apr 18, 2013, at 7:17 PM, "Malea, Daniel" <daniel.malea at intel.com> wrote:

> Hi Greg,
> 
> Thanks again for the detailed explanation of this dilly of a pickle. I
> thought about this problem a little more.. It would be really nice to
> resolve this soon, since the different locking behaviour on MacOSX vs
> other platforms is something I think we all want to excise as quickly as
> possible.

Agreed.

> 
> My thoughts are inline below:
> 
> 
>> 
>> No, a python class gets instantiated when a process starts up and it gets
>> a chance to run python code in order to make new threads lists up and
>> relay them back to LLDB. The python code must use the public SB API.
> 
> Oh, that's somewhat surprising. Where is this class?

Clients must manually enable this when they want it via a setting:

(lldb) settings set target.process.python-os-plugin-path /path/to/module.py

Then it from this module it will load the class named "OperatingSystemPlugIn" and the process will instantiate this class with one instance per process that gets used in the OperatingSystemPython plug-in.

> I'm curious about
> what it does that's done in Python.. Aren't you worried about the overhead
> the Global Interpreter Lock might add to normal operations like step-over?

This is used mainly for bare board debugging where you have a kernel that has threads that aren't all on core, and it is also used for kernel panic triage. The type of places this is used is not in normal user space debugging, and if it ever is, a native plug-in should be used to get the performance that is needed. 

So no, I am not worried about any overhead. 

The kernels around here change from build to build and from arch to arch so the flexibility of having a python plug-in that can be distributed with each new build allows the kernel team to have a very flexible solution. 

> I have no data to back that up, but putting python stuff on the critical
> path seems like it can't be good for performance.

It isn't in the critical path for the clients that use it. The correct way for clients where it is in the critical path would be to write a native plug-in.

> Also, not that it's
> important to us, but does this also mean that building a working LLDB
> without Python support is no longer an option?

Again, go native if you really care. If you have and can use python and need to be able to change your OS plug-in because you are in the process of developing a new OS, then use the python way so you can see all of the threads you care about.

> 
>>> To confirm my understanding, it seems that "regular"
>>> python code (like the test cases that drive the debugger) which uses SB*
>>> ends up locking the "public" locks whereas "callback" code that I
>>> imagine
>>> gets invoked via OSPython ends up locking the "private" locks -- is this
>>> really true?
>> 
>> Only when the private state thread calls into the python code which then
>> wants to use the public API to lookup variables, evaluate expressions,
>> etc in order to make the new thread list.
>> 
>>> Are these two use-cases really running on different (main vs.
>>> non-main) threads?
>> 
>> When the private state thread is doing stuff, the process run lock is
>> taken. Any public API access that requires a stopped process won't work
>> when being done from the private state thread since the process is
>> implementing some sort of run.
> 
> Right-O. This makes sense.
> 
> 
>> 
>>> 
>>> If so, once the locks are acquired, don't both of these use-cases end up
>>> using the *same* shared Process data? Is there any coupling between the
>>> data that the locks are protecting and the locks themselves? Based on
>>> the
>>> Process::GetRunLock() it seems like an arbitrary lock is chosen to
>>> protect
>>> all of Process' private data.
>> 
>> No, the read/write locks don't protect the data, they just protect access
>> to the process to make sure it stays stopped while clients access things,
>> or make sure clients that want to read, stay locked out while process
>> control is going on. See below for more detail.
> 
> OK, here's where I think things start going sideways. It might be possible
> to accomplish the synchronization that's needed between the public clients
> and internal callers with two R/W locks, but I'm having trouble seeing how
> they are to be used in tandem. I'll attempt to explain (what I think is) a
> simpler approach below.
> 
> 
>>> 
>>> If the public/public run locks *are* coupled to the public/private
>>> states
>>> of the process respectively, why is it that the both write locks need to
>>> be kept for long periods of time? I can understand the approach of
>>> having
>>> an internal (private) "run" write lock being held while the process is
>>> running, but I'm having trouble conceptualizing why the public run lock
>>> needs to be held for similar durations. It seems any synchronization we
>>> do
>>> for the public state should be very short-held "critical section" type
>>> locks while functions query the last-known process state, or just return
>>> an error because the process is running. Any SB function that results
>>> in a
>>> public lock being held for longer than the duration of that function
>>> seems
>>> like a bug in the SB code -- does that make sense or am I way off?
>> 
>> Since we do async process control, so when a user says "step over", this
>> might result in the process being stopped and started 100 times. Before
>> doing any process control, we take the public process run lock by
>> acquiring the write lock. This lets any people currently in the process
>> with the public run lock that was acquired for reading, to get out. Once
>> the public run lock is taken, we keep it until the process run control
>> action ("step over") is complete. We must lock down access to the process
>> for the entire duration because it wouldn't be a good time for someone to
>> come in and ask for the stack frames of a thread in between run 56 and
>> 57. It stack frames would be meaningless from a public perspective.
>> Internally though, for any blessed sections of code (like
>> OperatingSystemPython) which we know will need access to the public API
>> from the private state thread, we allow them to not be locked out by
>> using the private run lock.
> 
> Hmm, OK. So, in the two R/W lock approach, there's places where both locks
> need to be write-acquired though, right? I'm not sure I can think of all
> the other interactions between the locks though. It might be equivalent to
> the thing I propose below -- can you confirm if this is true or not?

It should always be the case that the public process run lock will have its write lock taken first, then the private process run lock would take it and release it as many times as needed while it controls the process, then it would release the private write lock, then the public write lock.

> 
> 
>>> If we make new threads for callbacks, then we would either be locked out
>>> from accessing the process, or we would need to let clients access the
>>> process in between intermediate steps of the current process run
>>> control, which is not what anyone wants.
> 
> Agreed that those two things are bad, but I think a third option exists:
> 
> What we need is akin to a condition variable (that knows how to relinquish
> a lock during wait) so that we don't "lock ourselves out" and a shared
> lock to protect the "public" interfaces that do anything with the process'
> state, with some extra smartness about "long running" operations that
> require nobody mess around with the process state.
> 
> Here's what I'm thinking Process needs to have in order to solve this, in
> a little more concrete terms:
> 1. One bool (m_state_in_transition) -- true when we have a long-running
> operation in progress
> 2. One Condition Variable bool (m_callback_in_progress) -- this remains
> true when a python thread that has SB access is launched
> 3. One Mutex (m_lock)* -- to protect every function inside process that
> reads/writes state
> 
> The rules of the game are:
> 
> - Every function in Process needs to acquire the m_lock before any
> reads/writes happen. This is somewhat shitty because multiple reads would
> be serialized, but more on this later*

Yes, this is why the use read/write locks right now. The "read" clients (want the process to stay stopped) can all access process as they want, the sole "write" client (wants to resume the process) can keep any "read" clients out.

> - Any (non-const) function in Process (that modifies state like Resume(),
> Step(), etc) needs to error out if m_state_in_transition == true. These
> are forbidden in callbacks.

"const" really doesn't mean much with a multi-threaded async process API. So I really don't see how this would work. If you call process->Resume() it will cause some data to get scooted over to the private state thread and all the work is done from there.

> - Any const function in Process (that reads state but does not change it)
> needs to acquire m_lock

Again, see above "const" comment...

> - Any time an internal Process function needs to invoke some code that has
> access to SB APIs, the procedure is:
> -- set m_state_in_transition AND m_callback_in_progress to true
> -- spawn a new thread that¹s going to use the public APIs
> -- wait for m_callback_in_progress to turn false WITH m_mutex,
> relinquishing the lock so others can enter Process internals, BUT
> m_state_in_transition prevents any badness.
> - When any "long-running" function exits, m_state_in_transition is set to
> false.
> 
Yikes, I reall
> Aside: As an implementation detail, the first and the last steps can
> probably be combined in a single scoped object.
> 
> 
> 
> In concert with the above approach, any "callback" (or, 'blessed' I think
> you call it) thread that has access to SB but may run during one of the
> Process internals needs to set m_callback_in_progress to false when done;
> thereby joining itself with the Process internal function that invoked it.
> 
> 
> * Based on the existing design, it seems you want multiple reads to happen
> concurrently (for Xcode performance reasons). To allow multiple readers to
> access Process at once, upgrade the m_lock from a run-of-the-mill Mutex to
> a R/W lock. The problem with this is that we need the condition variable
> to play nice (i.e. unlock the R/W lock during wait().) Pthread doesn't
> support this with the rw_lock type as-is, but I'm sure it can be done if
> we implement ReadWriteLock the "hard" way with lower-level primitives. See
> http://stackoverflow.com/questions/2699993/using-pthread-condition-variable
> -with-rwlock for inspiration.

Yeah, I explored this option when we first did the read/write lock and as soon as you have more than one primitive object (mutex, condition, read/write lock, etc), you are asking for even more trouble.

> 
> 
> The natural benefit of the above approach is that all the
> locking/unlocking magic happen inside Process, and SB has to have 0
> knowledge about Process locks.

So how would you solve the problem where you want to play with items in the process and keep the process from going anywhere? Like

SBProcess process = ...
SBThread thread;
uint32_t num_threads = process.GetNumThreads();
for (uint32_t i=0; i<num_threads; ++i)
{
	thread = process.GetThreadAtIndex(i);
}

In the above code we let process do all the locking we could end up having the process change our from underneath us as we iterate?

Another example:

SBFrame frame;
uint32_t num_frames = thread.GetNumFrames()
for (...)
{
    frame = thread.GetFrameAtIndex(i);
}

> 
>>> Is it not possible to have both "driving" code and "callback"
>>> code go through exactly the same code path the SB* layer as well as the
>>> Process class?
>> 
>> You mean using the same locks? I am confused by your question. All code
>> uses the same public API through the SB* layer. The SB layer locks you
>> out when it is dangerous to do things while the process is running. Many
>> of the issues we ran into stemmed from Xcode accessing the SB* API from
>> multiple threads and all of the locks (the target lock and the process
>> read/write (stopped/running) locks are needed.
> 
> Sorry I was not too clear -- basically I'm worried about the complexity of
> having two discrete R/W locks here, and separate behaviour based on
> internal/external threads.

As opposed to taking a condition variable, locking it, and doing two of three other things? I am not clear how the new scheme simplifies things yet.

> Since processes have one state, it "feels" like
> there should be one lock.
> 
> 
>> Let me know what you think after reading the above explanation. The locks
>> are needed and necessary and we haven't been able to come up with another
>> way to control access to the things we need to do from the various
>> threads. We currently have two clients:
>> - public clients
>> - private process state thread clients
>> 
>> public clients get locked out, as they should, when the process is
>> running (continue, or stepping) so we don't get useless answers.
>> 
>> Do you really want to evaluate the value for "x" when we are in the
>> middle of a single step? No you want to get an error stating "error:
>> process is running".
> 
> Right, expression evaluation would be one of those "non-const" cases that
> should error out if Process m_in_transition is true.

I really don't understand the new scheme you are suggesting. What does m_in_transition mean? In transition between what and what?

> 
>> 
>> Do you really want stack frames for all threads in a process when the
>> process is in the middle of a single step? No you want to get an error
>> stating "error: process is running".
> 
> Right-O.
> 
>> 
>> The private clients though are limited to OperatingSystemPython right
>> now, but could be expanded to breakpoint actions in the near future. A
>> breakpoint might say it wants to have python run when the breakpoint gets
>> hit and it would be great to not have to do this on another thread just
>> so we can see that "x != 12", or "rax != 123".
> 
> I'm not sure another thread for the callback will add a lot of overhead
> compared to the overhead of the Python GIL. But I'm open here -- maybe
> there is a way to do the condition variable thing (or equivalent) without
> spawning another thread. We'd need a mutex-unlock-and-invoke-function sort
> of functionality, but the general principle would be the same; when
> non-Process code is running, all locks should be released, but the Process
> object should be in a state whereby it cannot be changed (I.e. certain
> functions are forbidden).
> 
>> Another thing we have to watch out for is that there are limitations on
>> things that can be done on the private state thread. For example you
>> wouldn't want to try to evaluate an expression (run the target) while you
>> are currently running the target for another expression. There is special
>> code for this for one special case (calling "mmap" to allocate memory by
>> manually calling a function without using the expression parser), but
>> those don't use any of the special locking we have implemented.
> 
> Since evaluating an expression can modify the process (for example
> 'expression -- i++') it should be treated as non-const unless some IR
> interpreter can guarantee otherwise (in a future optimization utopia.) It
> seems safe to turn the process m_in_transition to be true during
> evaluation of any expression, thereby preventing anything else from
> modifying or reading from Process. Of course, internally, the
> expression-evaluation would need to have it's own UncheckedResume()
> functionality that resumes the inferior without checking m_in_transition
> like the API Resume() would.
> 
> 
>>> Another part that complicates reviewing this code is the external
>>> locking
>>> -- wouldn't it be way better to encapsulate all the locking/unlocking
>>> inside the Process class? I like code that keeps the lock code close to
>>> the data they protect; what's the benefit of having SB* API classes
>>> acquire and maintain internal Process locks?
>> 
>> You might need to do multiple things. Like to get a backtrace for all
>> threads. Do you want to ask the process for thread at index 0, then
>> backtrace it. Then another thread resumes the process and it stops
>> somewhere else. Now you ask for thread at index 1 and you get a different
>> thread because the thread list has changed. Not a great experience. This
>> is why we externally lock.
> 
> Agreed; it would be a terrible bug.
> 
>>> I'd prefer to see this whole Process locking thing redesigned
>>> with internal rather than external locking. I think this will help nail
>>> down exactly what cases are causing the double-locking thing to
>>> deadlock.
>> 
>> This works great for single access functions, but falls down for complex
>> actions (like backtracking all threads).
> 
> I don't think external locking is the only way to accomplish this. What's
> wrong with a Process::BacktraceAllThreads() (or even
> Process::BacktraceSomeThreads(ThreadIDs)) function that internally
> acquires the run lock and does its business without fear of somebody
> continuing the process in the meantime? In BacktraceAllThreads(), if the
> process is in transition, that function should just fail. If the process
> is stopped, any other non-const calls (to Resume() or
> expression-evaluation) should block until BacktraceAllThreads() returns,
> or error out. Glancing at the code that calls GetRunLock() in the SB
> layer, it seems like it all belongs in the Process class. I understand the
> hesitation to add more stuff to Process, since the class is already
> getting kinda large, but I think we should worry about splitting it up
> into more more bite-sized components after this deadlock business is
> sorted out.
> 
> 
>> We are open to any ideas you have for robust locking. We have tossed
>> around a few ideas for the process control, but haven't come up with one
>> that works better than what we have. Having a multi-threaded API sure
>> does complicate things and is something other debuggers don't have to
>> worry about.
> 
> Definitely does complicate things just a smidge ;) I suppose it's too late
> to require that Xcode uses lldb in a more disciplined single-threaded
> mode? ;) j/k

Feel free to make a patch with your suggested changes and send it out and we can try it out on our stuff over here and maybe this will help us evaluate it better. 

You might then try to compile your new LLDB.framework on MacOSX and then run Xcode with it:

DYLD_FRAMEWORK_PATH=/path/to/build/Debug /Applications/Xcode.app/Contents/MacOS/Xcode

This will run Xcode with your new LLDB and allow you to test it out to make sure it all works.