[lldb-dev] Locking issues

Thu Apr 18 11:05:10 PDT 2013

Hi Greg, thanks for the explanation.

One part I am not sure of is where OperatingSystemPython uses the SB
APIs.. You're referring to user-code, right? I am not seeing anything in
the OperatingSystemPython.[cpp|h] (even the update thread list function)
that uses SB* stuff. To confirm my understanding, it seems that "regular"
python code (like the test cases that drive the debugger) which uses SB*
ends up locking the "public" locks whereas "callback" code that I imagine
gets invoked via OSPython ends up locking the "private" locks -- is this
really true? Are these two use-cases really running on different (main vs.
non-main) threads?

If so, once the locks are acquired, don't both of these use-cases end up
using the *same* shared Process data? Is there any coupling between the
data that the locks are protecting and the locks themselves? Based on the
Process::GetRunLock() it seems like an arbitrary lock is chosen to protect
all of Process' private data..

If the public/public run locks *are* coupled to the public/private states
of the process respectively, why is it that the both write locks need to
be kept for long periods of time? I can understand the approach of having
an internal (private) "run" write lock being held while the process is
running, but I'm having trouble conceptualizing why the public run lock
needs to be held for similar durations. It seems any synchronization we do
for the public state should be very short-held "critical section" type
locks while functions query the last-known process state, or just return
an error because the process is running. Any SB function that results in a
public lock being held for longer than the duration of that function seems
like a bug in the SB code -- does that make sense or am I way off?

Another question that arises is why does python code need to run on the
same main internal thread? I would see SB* callers as users of the Process
class and as such they should only be allowed to touch the "public" state
and never run on the internal thread. I would imagine that LLDB would
spawn new threads for callbacks instead of reusing internal monitoring
threads. Is it not possible to have both "driving" code and "callback"
code go through exactly the same code path the SB* layer as well as the
Process class? My thinking is that both of those use cases should be
treated the same and interact with only the "public" Process state, while
an internal thread does the lower level Process work of launching things
and responding to events from the process, etc.

Another part that complicates reviewing this code is the external locking
-- wouldn't it be way better to encapsulate all the locking/unlocking
inside the Process class? I like code that keeps the lock code close to
the data they protect; what's the benefit of having SB* API classes
acquire and maintain internal Process locks?

I agree completely about avoiding Boost-bloat; the concern I had there was
that there's no tests for any of the R/W locking support classes.. I
concede that on further studies of the Boost APIs, they won't really solve
the double-locking or double-unlocking problem you're trying to solve --
although I'd prefer to see this whole Process locking thing redesigned
with internal rather than external locking. I think this will help nail
down exactly what cases are causing the double-locking thing to deadlock.

Thanks for trying to get a Linux machine up and running; we definitely
appreciate it!!

I have never tried getting a build working in Parallels though, so you
might have a better time with Bootcamp. The segfault during linking
definitely sounds like an out-of-memory situation. As for Ubuntu, 12.04 is
a little out-dated now; I recommend using 12.10 as that has GCC 4.7 which
is compatible with both configure and cmake-based builds. If you do need
to use 12.04 and you're stuck on GCC 4.6, I recommend using cmake instead
of configure as that system is smart enough to pass the right C++11 flag
(-std=c++0x) to older versions of GCC as opposed to ./configure
--enable-cxx11 which blindly passes in (-std=c++11). Also, cmake+ninja
builds LLDB much faster (by about 1min) than the configure based system.

As Ashok mentioned, if you'd like to use clang, llvm.org/apt is the place
to get bleeding edge packages for Ubuntu and Debian. Personally, I use
both GCC and Clang to keep things sane across the board. Both buildbots
seem to build the project right now.a

Cheers,
Dan

On 2013-04-17 8:25 PM, "Greg Clayton" <gclayton at apple.com> wrote:

>We currently need to avoid doing things while the process is running.
>There are two cases we care about:
>- the public state tracking when we are running
>- the private state tracking when we are running
>
>The main reason we need this is the private process state thread handles
>some complex things for us when it is handling the process. One example
>is the OperatingSystemPlugins (like OperatingSystemPython) where it may
>get called from the private process state thread to update the thread
>list. A common thing to do in the OperatingSystemPython is to read a
>global list in the kernel that contains the thread list and follow a
>linked list. If we run and need to determine if we should stop, we often
>need to update our thread list. This update will happen on the private
>process thread. So the flow goes like this:
>
>The old problem was:
>
>1 - (main thread) user says "step over"
>2 - (main thread) initiates the process control and the public process
>write lock is taken
>3 - (private process thread) run and stop after each "trace" while doing
>the single step
>4 - (private process thread) updates the thread list which calls into the
>OperatingSystemPython which wants to use the public LLDB API
>5 - (private process thread) goto 3 until step is done
>
>The problem is step 4 fails because the OperatingSystemPython used
>lldb::SB API's that require the public process write lock in order to
>evaluate expressions and use anything that requires that the process is
>stopped.
>
>To get around this we introduced the private read/write process lock to
>track when the process state thread is stopped so we can actually use the
>public APIs. So the flow is now:
>
>1 - (main thread) user says "step over"
>2 - (main thread) initiates the process control and the public process
>write lock is taken
>3 - (private process thread) lock private process write lock
>4 - (private process thread) run and stop after each "trace" while doing
>the single step
>5 - (private process thread) unlock private process write lock
>6 - (private process thread) updates the thread list which calls into the
>OperatingSystemPython which wants to use the public LLDB API
>7 - (private process thread) goto 3 until the step is done
>
>This lets us use the public APIs by allowing the private process state
>thread to lock a different lock and manage when the private state thread
>is locked.
>
>This is a problem for other things that use python during the lifetime of
>the process. For instance, we want to eventually have some python code
>that gets called when a process is about the resume, or just after it
>stops. We would like to simplify the code for breakpoints that have
>commands that get run when the breakpoint is hit (right now we defer any
>actions until the user consumes the public stop event).
>
>
>> You mentioned that you'll improve the R/W (scoped?) locking classes..
>>Any
>> reason to not use boost (or some other C++11 library) for this? If we do
>> have to roll our own in LLDB, the lack of tests is worrisome.
>
>I am not a big fan of boost as it bloats the C++ program debug info to be
>so large that it often makes debugging the boost programs very difficult
>due to the shear size of the debug info. Most of what we cared about from
>boost is now in C++11. Even if we did use boost, would it actually check
>to see if the lock was taken prior to trying to release it? The APIs on
>read/write locks are dead simple, so I don't see this is a reason to use
>boost.
>
>> If the improvements to the R/W locker classes you've got in progress
>>don't
>> allow the test suite to run to completion, could you please revert
>>179329
>> until we have something that allows us to run the tests? Lots of patches
>> are backed up atm due to the LLVM practice of not committing on top of a
>> broken trunk.
>
>Yes, I am trying to get us access to a linux machine that we can all use
>here at Apple so we can debug and fix the things we break.
>
>I spent a large part of the weekend trying to get Ubuntu 12.04 (using
>Parallels Desktop (virtualization software)) building llvm/clang/lldb so
>that I can fix these issues. I wasn't able to get clang to build as the
>link stage would always get killed with a signal 9. Not sure why, maybe
>the virtualization software was running out of RAM or resources. The
>build instructions up on the web for Linux don't actually work on a fresh
>install of Ubuntu. I needed to install new packages for tools essentials
>and also install gcc-4.7 and try to figure out how to get LLVM to use
>these compilers to get things to build with C++11, otherwise the build
>wouldn't even configure with gcc-4.6 due to the --enable-libcpp quickly
>stating of of the options wasn't supported by the compiler.
>
>So the linux builds are frustrating to try and get working, but I do want
>everyone to know that I am trying.
>
>What compiler do you build with on linux? Are there packages to install
>for a suitable version of clang? I finally gave up after many many hours
>of trying to get lldb to build.
>
>Greg
>
>> 
>> 
>> Dan
>> 
>> PS. The hanging buildbots to watch are:
>> 
>> http://lab.llvm.org:8011/builders/lldb-x86_64-darwin11/builds/1890
>> http://lab.llvm.org:8011/builders/lldb-x86_64-debian-clang
>> 
>> http://lab.llvm.org:8011/builders/lldb-x86_64-linux
>> 
>> 
>> On 2013-04-17 12:47 PM, "Greg Clayton" <gclayton at apple.com> wrote:
>> 
>>> 
>>> On Apr 17, 2013, at 1:27 AM, Carlo Kok <ck at remobjects.com> wrote:
>>> 
>>>> I'm trying to update the Windows branch to the latest and greatest and
>>>> found these locking issues (not sure if they're relevant for posix
>>>>too):
>>>> 
>>>> When I attach a process (I only use the gdb remote) the first even I
>>>> get is "stopped" which tries to unlock m_private_run_lock, however
>>>>this
>>>> one is never locked in the first place. Windows' writelock doesn't
>>>> appreciate that; as a workaround I added a
>>>> m_private_run_lock.WriteLock(); in Process' constructor, which seems
>>>>to
>>>> fix that.
>>> 
>>> We need to fix this better by locking the private run lock when
>>>attaching
>>> if all goes well.
>>> 
>>>> 
>>>> The second issue occurs when when trying to cause a "Stop" when it's
>>>> already paused on internal breakpoints; for me this is during slow
>>>> symbol load. When happens is that the loading (which happens from
>>>>within
>>>> Process::ShouldBroadcastEvent) resumes it, then the process exits
>>>> properly (triggers the ShouldBroadcastEvent again) however:
>>>> 
>>>> ProcessEventData::DoOnRemoval(lldb_private::Event * event_ptr)
>>>> called by Listener::FindNextEventInternal.
>>>> 
>>>> The resume call is in this condition:
>>>> if (state != eStateRunning)
>>> 
>>> Where is the above "if (state != eStateRunning)"?
>>> 
>>>> Changing that to:
>>>> lldb::StateType state = m_process_sp->GetPrivateState();
>>>> if (state != eStateRunning && state != eStateCrashed && state !=
>>>> eStateDetached && state != eStateExited)
>>> 
>>> There are functions that indicate if the function is stopped or
>>>running.
>>> We should use those functions. (search for "StateIsStopped").
>>> 
>>>> 
>>>> Seems to fix it, as there's no reason to try & resume a process that's
>>>> not running in the first place (and since exiting doesn't unlock a
>>>> process this causes a deadlock)
>>>> 
>>>> The last issue is this:
>>>> void * Process::RunPrivateStateThread ()
>>>> does : m_public_run_lock.WriteUnlock(); when it's done. The Finalize
>>>> also unlocks that same lock, which Windows crashes on.
>>>> commenting that out and it seems to work stable.
>>> 
>>> We need to build in some smarts into our Read/Write locking class to
>>>know
>>> if the read/write lock is taken and only unlock if the corresponding
>>> read/write lock is locked. I will make this change today.
>>> 
>>>> 
>>>> 
>>>> Anyone see any issues in all of this? (might make sense to apply this
>>>> to trunk too; it's never good to have unbalanced lock/unlocks)
>>>> _______________________________________________
>>>> lldb-dev mailing list
>>>> lldb-dev at cs.uiuc.edu
>>>> http://lists.cs.uiuc.edu/mailman/listinfo/lldb-dev
>>> 
>>> _______________________________________________
>>> lldb-dev mailing list
>>> lldb-dev at cs.uiuc.edu
>>> http://lists.cs.uiuc.edu/mailman/listinfo/lldb-dev
>> 
>