[lldb-dev] Lldb-server spins forever in ptrace with 100% CPU on Linux Ubuntu 16.04

Pavel Labath via lldb-dev lldb-dev at lists.llvm.org
Mon Jan 9 02:53:45 PST 2017


Nice catch. I think this backtrace should be enough to bring the problem to
the attention of the kernel developers. Have you tried contacting them
already? I can help you with that if you want.

On 6 January 2017 at 23:34, Eugene Birukov <eugenebi at hotmail.com> wrote:

> Hi,
>
>
> It seems the problem is related to transparent huge pages. Disabling them
> helps.
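>
> For the record, the standard sysfs knob to turn THP off system-wide is
> (as root; the path can differ on patched kernels):
>
> # echo never > /sys/kernel/mm/transparent_hugepage/enabled
>
> Writing "madvise" instead of "never" restricts THP to regions that opt
> in via madvise(MADV_HUGEPAGE).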
>
>
> lldb-server spins in this kernel code path:
>
>
> [11428.634820] Call Trace:
> [11428.634821]  [<ffffffff811ec7de>] do_huge_pmd_wp_page+0x62e/0xb80
> [11428.634822]  [<ffffffff811b0f75>] handle_mm_fault+0x705/0xfe0
> [11428.634823]  [<ffffffff811aa34d>] ? follow_page_mask+0x37d/0x7a0
> [11428.634824]  [<ffffffff811aa90b>] __get_user_pages+0x19b/0x660
> [11428.634825]  [<ffffffff811ab222>] get_user_pages+0x52/0x60
> [11428.634825]  [<ffffffff811ac00b>] __access_remote_vm+0xcb/0x1f0
> [11428.634826]  [<ffffffff811b2360>] access_process_vm+0x50/0x70
> [11428.634827]  [<ffffffff8109500a>] ptrace_request+0x2da/0x620
> [11428.634828]  [<ffffffff810c24dc>] ? wait_task_inactive+0x16c/0x1f0
> [11428.634829]  [<ffffffff81038cab>] arch_ptrace+0x29b/0x300
> [11428.634830]  [<ffffffff81094c6e>] SyS_ptrace+0xbe/0x110
> [11428.634831]  [<ffffffff816967c9>] system_call_fastpath+0x16/0x1b
>
> Eugene
>
>
>
> ------------------------------
> *From:* Pavel Labath <labath at google.com>
> *Sent:* Tuesday, December 13, 2016 1:39 AM
>
> *To:* Eugene Birukov
> *Subject:* Re: [lldb-dev] Lldb-server spins forever in ptrace with 100%
> CPU on Linux Ubuntu 16.04
>
> Yeah, we support a couple of communication transports. For example, you
> can switch to using unix sockets instead. It's easiest if you just connect
> the server and client manually (for automatically starting up the server
> you would have to edit the source, I think).
>
> $ bin/lldb-server gdbserver unix-abstract:///tmp/X -- /bin/ls
>
> $ bin/lldb
> (lldb) process connect unix-abstract-connect:///tmp/X
> Process 21731 stopped
> * thread #1, name = 'ls', stop reason = signal SIGSTOP
>     frame #0: 0x00007f4b341bb2d0
> ->  0x7f4b341bb2d0: movq   %rsp, %rdi
>     0x7f4b341bb2d3: callq  0x7f4b341bea40
>     0x7f4b341bb2d8: movq   %rax, %r12
>     0x7f4b341bb2db: movl   0x221b17(%rip), %eax
>
>
> Good luck. Let me know what you find out.
> pl
>
>
> On 12 December 2016 at 18:11, Eugene Birukov <eugenebi at hotmail.com> wrote:
>
>> OK, thanks!
>>
>> I'll work on getting reliable repro on something I can attach kgdb to.
>>
>>
>> Meanwhile - is there any options in lldb-server communication protocol I
>> could play with? Say, using pipe instead of tcp/ip or vice versa?
>>
>>
>>
>> ------------------------------
>> *From:* Pavel Labath <labath at google.com>
>> *Sent:* Friday, December 9, 2016 2:56 AM
>>
>> *To:* Eugene Birukov
>> *Subject:* Re: [lldb-dev] Lldb-server spins forever in ptrace with 100%
>> CPU on Linux Ubuntu 16.04
>>
>> It looks like debugging this remotely will not lead anywhere. I don't
>> think I have enough knowledge about the kernel to help you with that. If I
>> were able to reproduce it myself, maybe I'd get somewhere by random
>> experiments, but doing it over email is too slow. :)
>>
>> I suggest you ask some of the kernel developers about it. They should be
>> able to direct you where to look further, and may be aware of kernel
>> changes between the two versions that would affect this.
>>
>> cheers,
>> pl
>>
>> On 8 December 2016 at 17:12, Eugene Birukov <eugenebi at hotmail.com> wrote:
>>
>>> *> **From your description it sounds like the unpredictability comes
>>> from the tracee side*
>>>
>>>
>>> Not exactly. If the problem is there, it reproduces 100% of the time.
>>> Things that affect the repro - i.e. make it go away or shift it to a
>>> different address:
>>>
>>>
>>>
>>>    1. Delays in the client. If I put ::sleep(0) right after sending the
>>>    write-memory request (not sure - are we writing into a pipe or a
>>>    socket?) - it can kill the repro.
>>>    2. Changes in the VM. Copying all the bits over to a would-be
>>>    identical VM hurts the repro.
>>>    3. Changes in the target application. I assume what matters here is
>>>    mainly the layout of the various VMAs, which shifts after a
>>>    significant code change; the target is fully stopped, and it is hard
>>>    to believe that the contents of the memory matter... Just in case, I
>>>    tried writing different values and changing the write size to 8 (so
>>>    as not to trigger the recursion in lldb-server).
>>>
>>> Things that do not matter:
>>>
>>>
>>>    - Delays in the server. Once we are about to call ptrace, no amount
>>>    of delay caused by stepping, etc. matters - the damage is already
>>>    done.
>>>
>>> So, it seems that the problem is really not in ptrace but in the
>>> communication channel between the client and the server. Ptrace looks
>>> like an innocent bystander caught in the crossfire.
>>>
>>>
>>> BTW, I am on short vacation, will be back on Monday 12/12.
>>>
>>>
>>>
>>> ------------------------------
>>> *From:* Pavel Labath <labath at google.com>
>>> *Sent:* Thursday, December 8, 2016 3:19 AM
>>> *To:* Eugene Birukov
>>>
>>> *Subject:* Re: [lldb-dev] Lldb-server spins forever in ptrace with 100%
>>> CPU on Linux Ubuntu 16.04
>>>
>>> Thanks for the info. I agree with your assessment, it definitely looks
>>> like a kernel bug.
>>>
>>> I don't know how to trigger a kernel core dump. I have done a bit of
>>> kernel debugging/development, but it mainly consisted of adding a bunch
>>> of printk's and watching the dmesg output. I think we should contact the
>>> Linux kernel mailing list (LKML.org) with this problem. I've had pretty
>>> good success with this in the past. The one contact I have there is Oleg
>>> Nesterov <oleg at redhat.com>. He's quite a ptrace expert, and was able to
>>> fix my bug in a day when I contacted him with a not-too-dissimilar
>>> problem to yours (although this one does sound more complicated - I
>>> suspect some interaction between ptrace and the virtual memory manager).
>>>
>>> Before we do that, we may want to reduce the scope of the problem a bit.
>>> From your description it sounds like the unpredictability comes from the
>>> tracee side, and once it gets into a bad state, all tracer writes to the
>>> affected area will hang. In that case, it should be possible to take lldb
>>> out of the equation and instead write a trivial app which takes a pid and
>>> an address, and does a PTRACE_ATTACH plus a bunch of PTRACE_POKEDATAs to
>>> the given address. I think this would be enough to get the kernel folks
>>> interested, as it clearly demonstrates that it's not lldb being funny, and
>>> they may direct us how to debug this further. I have adapted one of my
>>> test apps to do this (attached). Could you try it out on your system?
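>>>
>>> Roughly, the shape of such a test app (a sketch only - the actual
>>> attachment is not preserved in the archive, and the argument handling
>>> and poke count here are illustrative choices):
>>>
>>> /* poke.c: attach to <pid> and hammer <hex-address> with
>>>  * PTRACE_POKEDATA, rewriting the word that is already there. */
>>> #include <errno.h>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> #include <sys/ptrace.h>
>>> #include <sys/types.h>
>>> #include <sys/wait.h>
>>>
>>> int main(int argc, char **argv) {
>>>     if (argc != 3) {
>>>         fprintf(stderr, "usage: %s <pid> <hex-address>\n", argv[0]);
>>>         return 1;
>>>     }
>>>     pid_t pid = (pid_t)strtol(argv[1], NULL, 10);
>>>     unsigned long addr = strtoul(argv[2], NULL, 16);
>>>
>>>     if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) < 0) {
>>>         perror("PTRACE_ATTACH");
>>>         return 1;
>>>     }
>>>     waitpid(pid, NULL, 0);            /* wait for the attach-stop */
>>>
>>>     for (int i = 0; i < 1000; ++i) {  /* "a bunch of" pokes */
>>>         errno = 0;
>>>         long word = ptrace(PTRACE_PEEKDATA, pid, (void *)addr, NULL);
>>>         if (errno != 0) { perror("PTRACE_PEEKDATA"); return 1; }
>>>         if (ptrace(PTRACE_POKEDATA, pid, (void *)addr,
>>>                    (void *)word) < 0) {
>>>             perror("PTRACE_POKEDATA");
>>>             return 1;
>>>         }
>>>     }
>>>     ptrace(PTRACE_DETACH, pid, NULL, NULL);
>>>     puts("done");
>>>     return 0;
>>> }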
>>>
>>> pl
>>>
>>>
>>>
>>>
>>> On 8 December 2016 at 00:57, Eugene Birukov <eugenebi at hotmail.com>
>>> wrote:
>>>
>>>> Some additional info.
>>>>
>>>>
>>>> I stepped through the individual instructions… It hangs in the syscall
>>>> instruction itself (note the mov $0x65,%eax below: 0x65 = 101 =
>>>> __NR_ptrace on x86-64):
>>>>
>>>>
>>>>
>>>> 37      in ../sysdeps/unix/sysv/linux/ptrace.c
>>>>
>>>> => 0x00007fcac11ee21b <ptrace+43>:      48 63 70 08     movslq
>>>> 0x8(%rax),%rsi
>>>>
>>>> (gdb)
>>>>
>>>> 38      in ../sysdeps/unix/sysv/linux/ptrace.c
>>>>
>>>> => 0x00007fcac11ee21f <ptrace+47>:      48 8b 50 10     mov
>>>> 0x10(%rax),%rdx
>>>>
>>>> (gdb)
>>>>
>>>> 45      in ../sysdeps/unix/sysv/linux/ptrace.c
>>>>
>>>> => 0x00007fcac11ee223 <ptrace+51>:      89 ff   mov    %edi,%edi
>>>>
>>>> (gdb)
>>>>
>>>> 36      in ../sysdeps/unix/sysv/linux/ptrace.c
>>>>
>>>> => 0x00007fcac11ee225 <ptrace+53>:      48 89 44 24 c8  mov
>>>> %rax,-0x38(%rsp)
>>>>
>>>> (gdb)
>>>>
>>>> 45      in ../sysdeps/unix/sysv/linux/ptrace.c
>>>>
>>>> => 0x00007fcac11ee22a <ptrace+58>:      4c 0f 47 50 18  cmova
>>>> 0x18(%rax),%r10
>>>>
>>>>    0x00007fcac11ee22f <ptrace+63>:      b8 65 00 00 00  mov
>>>> $0x65,%eax
>>>>
>>>> (gdb)
>>>>
>>>> 0x00007fcac11ee22f      45      in ../sysdeps/unix/sysv/linux/ptrace.c
>>>>
>>>>    0x00007fcac11ee22a <ptrace+58>:      4c 0f 47 50 18  cmova
>>>> 0x18(%rax),%r10
>>>>
>>>> => 0x00007fcac11ee22f <ptrace+63>:      b8 65 00 00 00  mov
>>>> $0x65,%eax
>>>>
>>>> (gdb)
>>>>
>>>> 38      in ../sysdeps/unix/sysv/linux/ptrace.c
>>>>
>>>> => 0x00007fcac11ee234 <ptrace+68>:      c7 44 24 b8 18 00 00 00 movl
>>>> $0x18,-0x48(%rsp)
>>>>
>>>> (gdb)
>>>>
>>>> 45      in ../sysdeps/unix/sysv/linux/ptrace.c
>>>>
>>>> => 0x00007fcac11ee23c <ptrace+76>:      0f 05   syscall
>>>>
>>>>    0x00007fcac11ee23e <ptrace+78>:      48 3d 00 f0 ff ff       cmp
>>>> $0xfffffffffffff000,%rax
>>>>
>>>>    0x00007fcac11ee244 <ptrace+84>:      77 2a   ja     0x7fcac11ee270
>>>> <ptrace+128>
>>>>
>>>> (gdb)
>>>>
>>>> Also, user vs. kernel CPU time is about a 1:1000 ratio according to
>>>> /proc/<pid>/stat (fields 14 and 15 below: utime = 18 vs. stime = 18398
>>>> ticks):
>>>>
>>>>
>>>> # cat /proc/21809/stat
>>>>
>>>> 21809 (lldb-server) R 21780 21809 21809 0 -1 4194304 1175 1709 0 0 18
>>>> 18398 4 0 20 0 1 0 1602761 48611328 2102 18446744073709551615 4194304
>>>> 11143636 140733729492144 140733729487496 140187105837630 262144 65537 4096
>>>> 65537 0 0 0 17 1 0 0 0 0 0 13244688 13945048 29155328 140733729499314
>>>> 140733729499429 140733729499429 140733729501128 0
>>>>
>>>> Eugene
>>>>
>>>>
>>>>
>>>>
>>>> ------------------------------
>>>> *From:* Eugene Birukov <eugenebi at hotmail.com>
>>>> *Sent:* Wednesday, December 7, 2016 1:15 PM
>>>> *To:* Pavel Labath; Eugene Birukov
>>>>
>>>> *Subject:* Re: [lldb-dev] Lldb-server spins forever in ptrace with
>>>> 100% CPU on Linux Ubuntu 16.04
>>>>
>>>>
>>>> And yes - I suspect a kernel bug because we did not see it on Ubuntu
>>>> 15.10, but on 16.04 I have several people reporting it. In one case it
>>>> reproduced on 3 out of 4 supposedly "identical" machines.
>>>>
>>>>
>>>> I do not have a simple repro, but I can find a victim VM that we do not
>>>> care much about. Is there a way to trigger a Linux kernel core dump
>>>> there?
>>>>
>>>>
>>>>
>>>> ------------------------------
>>>> *From:* lldb-dev <lldb-dev-bounces at lists.llvm.org> on behalf of Eugene
>>>> Birukov via lldb-dev <lldb-dev at lists.llvm.org>
>>>> *Sent:* Wednesday, December 7, 2016 11:36 AM
>>>> *To:* Pavel Labath
>>>> *Cc:* LLDB
>>>> *Subject:* Re: [lldb-dev] Lldb-server spins forever in ptrace with
>>>> 100% CPU on Linux Ubuntu 16.04
>>>>
>>>>
>>>> *> could you get a backtrace of lldb-server when it is in the "stuck"*
>>>>
>>>> *state (just attach with lldb/gdb after it hangs and do "bt")?*
>>>>
>>>>
>>>> You wish ☹ lldb-server does not react to any signals, including
>>>> SIGSTOP, so gdb just hangs forever.
>>>>
>>>>
>>>> *> If you can get me reasonably detailed repro steps, I can try to *
>>>> *investigate*
>>>>
>>>>
>>>> Unfortunately I do not have a repro myself. It happens semi-randomly on
>>>> some machines, and I need to borrow a machine with the problem. Here are
>>>> some details from my records:
>>>>
>>>>
>>>>    - It is a pretty big piece of RX memory; /proc/<pid>/maps shows this:
>>>>    409701000-40b49c000 r-xp 00000000 00:00 0
>>>>    - Writing into some locations within that VMA works.
>>>>    - When it repros, it is pretty consistent, but changes in the
>>>>    target may shift it - i.e. make it not repro, or fail at a different
>>>>    address.
>>>>
>>>> *> **are you able to still reproduce the bug with logging enabled?*
>>>>
>>>>
>>>> Yes. Here are the last few lines from the log:
>>>>
>>>> 1481139040.768961000 0x7fff253c9780 Communication::Write (src =
>>>> 0x7fff253c8f48, src_len = 7) connection = 0x24a6bd0
>>>> 1481139040.768963000 0x24a6bd0 ConnectionFileDescriptor::Write (src =
>>>> 0x7fff253c8f48, src_len = 7)
>>>> 1481139040.768973000 0x24a6cc0 Socket::Write() (socket = 6, src =
>>>> 0x7fff253c8f48, src_len = 7, flags = 0) => 7 (error = (null))
>>>> 1481139040.768976000 0x24a6bd0 ConnectionFileDescriptor::Write(fd = 6,
>>>> src = 0x7fff253c8f48, src_len = 7) => 7 (error = (null))
>>>> 1481139040.768979000 0x7fff253c9780 Communication::Read (dst =
>>>> 0x7fff253c7140, dst_len = 8192, timeout = 0 usec) connection = 0x24a6bd0
>>>> 1481139040.768982000 0x24a6bd0 ConnectionFileDescriptor::BytesAvailable
>>>> (timeout_usec = 0)
>>>> 1481139040.768984000 0x24a6bd0 ConnectionFileDescriptor::BytesAvailable()
>>>>  ::select (nfds=7, fds={6, 4}, NULL, NULL, timeout=0x7fff253c6d80)...
>>>> 1481139040.768986000 0x24a6bd0 ConnectionFileDescriptor::BytesAvailable()
>>>>  ::select (nfds=7, fds={6, 4}, NULL, NULL, timeout=0x7fff253c6d80) => 0,
>>>> error = (null)
>>>> 1481139090.788317000 0x7fff253c9780 Communication::Read (dst =
>>>> 0x7fff253c7140, dst_len = 8192, timeout = 0 usec) connection = 0x24a6bd0
>>>> 1481139090.788356000 0x24a6bd0 ConnectionFileDescriptor::BytesAvailable
>>>> (timeout_usec = 0)
>>>> 1481139090.788364000 0x24a6bd0 ConnectionFileDescriptor::BytesAvailable()
>>>>  ::select (nfds=7, fds={6, 4}, NULL, NULL, timeout=0x7fff253c6d80)...
>>>> 1481139090.788368000 0x24a6bd0 ConnectionFileDescriptor::BytesAvailable()
>>>>  ::select (nfds=7, fds={6, 4}, NULL, NULL, timeout=0x7fff253c6d80) => 1,
>>>> error = (null)
>>>> 1481139090.788378000 0x24a6cc0 Socket::Read() (socket = 6, src =
>>>> 0x7fff253c7140, src_len = 25, flags = 0) => 25 (error = (null))
>>>> 1481139090.788382000 0x24a6bd0 ConnectionFileDescriptor::Read()  fd =
>>>> 6, dst = 0x7fff253c7140, dst_len = 8192) => 25, error = (null)
>>>> 1481139090.788395000 NativeProcessLinux::WriteMemory(0x409d5a7d0,
>>>> 0x25271d0, 4)
>>>> 1481139090.788409000 NativeProcessLinux::ReadMemory using
>>>> process_vm_readv to read 8 bytes from inferior address 0x409d5a7d0: Success
>>>> 1481139090.788414000 PTRACE_POKEDATA [1][0][0][0][57][41][54][41]
>>>>
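>>>> (For context: the 8-byte read right before the 4-byte poke is the
>>>> usual word-splicing pattern for sub-word writes. Sketched from the
>>>> log, not from the actual NativeProcessLinux source, it amounts to
>>>> roughly this:)
>>>>
>>>> #define _GNU_SOURCE
>>>> #include <string.h>
>>>> #include <sys/ptrace.h>
>>>> #include <sys/types.h>
>>>> #include <sys/uio.h>
>>>>
>>>> /* Write 4 bytes into the tracee: read the containing 8-byte word,
>>>>  * splice the new bytes in, then poke the whole word back. The
>>>>  * PTRACE_POKEDATA at the end is the call that never returns here. */
>>>> static int write4(pid_t pid, unsigned long addr, const void *src) {
>>>>     long word;
>>>>     struct iovec local = { .iov_base = &word, .iov_len = sizeof word };
>>>>     struct iovec remote = { .iov_base = (void *)addr,
>>>>                             .iov_len = sizeof word };
>>>>     /* the "read 8 bytes ... process_vm_readv" line in the log */
>>>>     if (process_vm_readv(pid, &local, 1, &remote, 1, 0) != sizeof word)
>>>>         return -1;
>>>>     memcpy(&word, src, 4);            /* splice in the 4 new bytes */
>>>>     return ptrace(PTRACE_POKEDATA, pid, (void *)addr, (void *)word);
>>>> }
>>>>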
>>>> Thanks,
>>>>
>>>> Eugene
>>>>
>>>>
>>>>
>>>> ------------------------------
>>>> *From:* Pavel Labath <labath at google.com>
>>>> *Sent:* Wednesday, December 7, 2016 2:34 AM
>>>> *To:* Eugene Birukov
>>>> *Cc:* LLDB
>>>> *Subject:* Re: [lldb-dev] Lldb-server spins forever in ptrace with
>>>> 100% CPU on Linux Ubuntu 16.04
>>>>
>>>> Hello Eugene,
>>>>
>>>> this sounds troubling, and I'd like to get to the bottom of this. If
>>>> you can get me a bit more information about this, I believe we can
>>>> figure it out:
>>>>
>>>> - could you get a backtrace of lldb-server when it is in the "stuck"
>>>> state (just attach with lldb/gdb after it hangs and do "bt")? I want
>>>> to see the where is it spinning, as I don't see any obvious infinite
>>>> loop there.
>>>>
>>>> - are you able to still reproduce the bug with logging enabled? If so,
>>>> I'd like to see the log file to understand this better. (You can
>>>> enable logging by starting lldb-server with: --log-file XXX.log
>>>> --log-channels "lldb all:linux all". If you're starting it via lldb
>>>> client you can set the LLDB_DEBUGSERVER_LOG_FILE and
>>>> LLDB_SERVER_LOG_CHANNELS environment vars to achieve this; see the
>>>> example invocation after this list)
>>>>
>>>> - If you can get me reasonably detailed repro steps, I can try to
>>>> investigate (I am fine with the first step being "install ubuntu 16.04
>>>> in virtualbox")
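>>>>
>>>> An example invocation with logging enabled (the flags are the ones
>>>> above; the port and target program are placeholders):
>>>>
>>>> $ bin/lldb-server gdbserver --log-file /tmp/lldb-server.log \
>>>>     --log-channels "lldb all:linux all" :1234 -- /bin/ls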
>>>>
>>>> On 6 December 2016 at 23:41, Eugene Birukov via lldb-dev
>>>> <lldb-dev at lists.llvm.org> wrote:
>>>> > Hi,
>>>> > 1. I believe that lldb-server spins inside ptrace. I put a
>>>> > breakpoint on the highlighted line, and it does not hit. If I put a
>>>> > breakpoint on the line before, it hits, but lldb-server hangs.
>>>>
>>>> Do you mean actually inside the ptrace(2) syscall? Your description
>>>> would certainly fit that, but that sounds scary, as it would mean a
>>>> kernel bug. If that's the case, then we have to start looking in the
>>>> kernel. I have some experience with that, but if we can boil this down
>>>> to a simple use case, we can also ask the kernel ptrace folks for
>>>> help.
>>>>
>>>>
>>>> > 2. It seems that the hang is caused by the client trying to read
>>>> > the response too fast. I mean, if I step through the client code it
>>>> > works - i.e. there is a significant delay between the client writing
>>>> > into the pipe and issuing ::select to wait for the response.
>>>>
>>>> I am not sure how this fits in with the item above. I find it hard to
>>>> believe that the presence of select(2) in one process would affect the
>>>> outcome of ptrace() in another. Unless we are actually encountering a
>>>> kernel scheduler bug, which I find unlikely. Hopefully we can get more
>>>> insight here with additional logging information.
>>>>
>>>>
>>>> > Any advice on how to deal with the situation, other than putting
>>>> > random sleeps in random places?
>>>> Inserting sleeps in various places is a valid (albeit very slow)
>>>> strategy for debugging races. If you can push the sleep down, you will
>>>> eventually reach the place where it will be obvious what is racing
>>>> (or, at least, which component is to blame). Hopefully we can do
>>>> something smarter though.
>>>>
>>>> If you are suspecting a kernel bug, I'd recommend recreating it in a
>>>> simple standalone application (fork, trace the child, write its
>>>> memory), as then it is easy to ask for help.
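>>>>
>>>> A minimal shape for such a standalone app might look like this (a
>>>> sketch; the poked value and the use of a global as the target address
>>>> are arbitrary choices):
>>>>
>>>> #include <signal.h>
>>>> #include <stdio.h>
>>>> #include <sys/ptrace.h>
>>>> #include <sys/types.h>
>>>> #include <sys/wait.h>
>>>> #include <unistd.h>
>>>>
>>>> static long target;           /* same address in the child after fork */
>>>>
>>>> int main(void) {
>>>>     pid_t pid = fork();
>>>>     if (pid == 0) {           /* child: request tracing, then stop */
>>>>         ptrace(PTRACE_TRACEME, 0, NULL, NULL);
>>>>         raise(SIGSTOP);
>>>>         for (;;) pause();
>>>>     }
>>>>     waitpid(pid, NULL, 0);    /* parent: wait for the SIGSTOP */
>>>>     if (ptrace(PTRACE_POKEDATA, pid, (void *)&target,
>>>>                (void *)0x1234L) < 0)
>>>>         perror("PTRACE_POKEDATA");
>>>>     kill(pid, SIGKILL);
>>>>     return 0;
>>>> }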
>>>>
>>>> pl
>>>>
>>>
>>>
>>
>