[Lldb-commits] [lldb] Fix a bug with cancelling "attach -w" after you have run a process previously (PR #65822)

Wed Sep 20 13:41:06 PDT 2023

jimingham wrote:

> On Sep 20, 2023, at 1:39 PM, Jim Ingham ***@***.***> wrote:
> 
> Interesting...  I split the end stage of the test up into:
> 
> (a) wait for up to 20 seconds the state to switch to eStateExited
> (b) wait another second, then check the command output file
> 
> On macOS, as I said, that seemed to make the test quite robust, but when I checked the test in I got a bot failure at stage (a).

That is, a Linux bot failure...

Jim

>  That really sounds like the underlying feature of interrupting a process in this state doesn't work reliably on those systems.  The patch this is testing just changed a bug in generic code that prevented us from TRYING to do this interrupt, but didn't change any of the underlying process interrupt machinery.  So it looks like the test is uncovering underlying flakiness.
> 
> I changed the test to be macOS only for now.  We should go back and make the feature work reliably on the other platforms when we get a change.
> 
> Jim
> 
> 
>> On Sep 20, 2023, at 1:14 PM, Jim Ingham ***@***.***> wrote:
>> 
>> You are right, there is a racy bit here - I was also getting ~one fails per 50 runs.  
>> 
>> The raciness is due to the time it takes between when we send the interrupt and when we finish up the command and produce the results.  So we can end up reading from the command output before the command has written its result.
>> 
>> I can make this more deterministic by waiting for the process state to shift to eStateExited.   But even with that it was still a little flakey because there's still some time between when the interrupt is noticed and causes us to switch the process state and when lldb makes the command return and prints that to stdout.
>> 
>> Unfortunately this is a test of command-line behavior, you don't get the same problem with the SB API's directly.  So I have to do it as a "run a command-line command" test, and that makes it harder to get direct signals about when to look at the result.
>> 
>> If we sent "command completed" events I could wait for that, then the test would be deterministic.  That seems like a feature we should design, not jam in to fix a test problem..  
>> 
>> However, if I just insert a time.sleep(1) between seeing the state flip to eStateExited and looking at the results, I can run this #200 and see no failures.  That's just the time to back out of the command execution stack and write the result to stdout, so that shouldn't be as variable.  If this still ends up failing intermittently, then I'll have to cook up some kind of "command completed" event that the test can wait on.
>> 
>> Jim
>>  
>> 
>> 
>>> On Sep 20, 2023, at 11:05 AM, Jim Ingham ***@***.***> wrote:
>>> 
>>> The way the test works, we run one real process and let it exit.  Then we do "attach -w -n noone_would_use_this_name" because we don't want the second attach attempt to be able to succeed.  lldb should just stay stuck till the interrupt succeeds, there shouldn't be anything other way to get out of this.  It sounds like something else is kicking us out of the attach wait loop on these systems?
>>> 
>>> Jim
>>> 
>>> 
>>> 
>>>> On Sep 20, 2023, at 2:56 AM, David Spickett ***@***.***> wrote:
>>>> 
>>>> 
>>>> @DavidSpickett commented on this pull request.
>>>> 
>>>> In lldb/test/API/commands/process/attach/TestProcessAttach.py <https://github.com/llvm/llvm-project/pull/65822#discussion_r1331380724>:
>>>> 
>>>> > +            time.sleep(1)
>>>> +            if target.process.state == lldb.eStateAttaching:
>>>> +                break
>>>> +
>>>> +        self.dbg.DispatchInputInterrupt()
>>>> +        self.dbg.DispatchInputInterrupt()
>>>> +
>>>> +        self.out_filehandle.flush()
>>>> +        reader = open(self.stdout_path, "r")
>>>> +        results = reader.readlines()
>>>> +        found_result = False
>>>> +        for line in results:
>>>> +            if "Cancelled async attach" in line:
>>>> +                found_result = True
>>>> +                break
>>>> +        self.assertTrue(found_result, "Found async error in results")
>>>> This fails if results is empty.
>>>> 
>>>> ********************************
>>>> []
>>>> ********************************
>>>> FAIL: LLDB (/home/david.spickett/build-llvm-aarch64/bin/clang-aarch64) :: test_run_then_attach_wait_interrupt (TestProcessAttach.ProcessAttachTestCase)
>>>> <bound method SBProcess.Kill of SBProcess: pid = 0, state = exited, threads = 0, executable = ProcessAttach>: success
>>>> 
>>>> Restore dir to: /home/david.spickett/build-llvm-aarch64
>>>> ======================================================================
>>>> FAIL: test_run_then_attach_wait_interrupt (TestProcessAttach.ProcessAttachTestCase)
>>>> ----------------------------------------------------------------------
>>>> Traceback (most recent call last):
>>>>   File "/home/david.spickett/llvm-project/lldb/test/API/commands/process/attach/TestProcessAttach.py", line 195, in test_run_then_attach_wait_interrupt
>>>>     self.assertTrue(found_result, "Found async error in results")
>>>> AssertionError: False is not True : Found async error in results
>>>> Config=aarch64-/home/david.spickett/build-llvm-aarch64/bin/clang
>>>> ----------------------------------------------------------------------
>>>> When the test works the results contain:
>>>> 
>>>> ********************************
>>>> ['error: attach failed: Cancelled async attach.\n', '\n', '... Interrupted.\n']
>>>> ********************************
>>>> Running it in a loop it took ~40 runs to get a failing one.
>>>> 
>>>> I wonder if that is because the attach happens to finish a lot faster sometimes, so there's no time to cancel it? If that's down to the OS and machine load, I'm not sure how we'd make this predictable.
>>>> 
>>>> The ugly option is to say if the results are empty, pass the test and assume the other 39 runs will check for the bug.
>>>> 
>>>> —
>>>> Reply to this email directly, view it on GitHub <https://github.com/llvm/llvm-project/pull/65822#pullrequestreview-1635242071>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ADUPVW225PO46OIV2JRC2I3X3K4WPANCNFSM6AAAAAA4RAWCQY>.
>>>> You are receiving this because you modified the open/close state.
>>>> 
>>> 
>> 
> 

https://github.com/llvm/llvm-project/pull/65822