[Lldb-commits] [PATCH] D13124: test runner: switch to pure-Python timeout mechanism
Todd Fiala via lldb-commits
lldb-commits at lists.llvm.org
Thu Sep 24 08:35:54 PDT 2015
tfiala added a comment.
In http://reviews.llvm.org/D13124#252564, @labath wrote:
> I don't want to stand in the way of progress (and I do think that getting rid of the timeout dependency is progress), but this implementation regresses in a couple of features compared to using timeout:
> - timeout tries (with moderate success) to cleanup children spawned by the main process. your implementation will (afaik) kill only the main process. This is especially important for build bots, since leaking processes will starve the bot's resources after a while (and we have had problems with this on our darwin build bot).
Fair enough. Either yesterday or the day before, I put up a change that ensures the dotest inferiors are in a separate process group. I could (on non-Windows) send that process group a SIGQUIT. The reason I prefer not to use SIGQUIT is that it is catchable, and thus can be ignored. Once something can be ignored, the runner needs to be able to handle the timed out child really not ending with a SIGQUIT. That means either more logic around timing out on the "wait for death", or possibly not reaping. In the case of a test runner, I'd probably put the focus on making sure the test runner makes it through ahead of getting the coredump from a process that times out.
That said, my initial implementation of this did do the following:
- Start with a SIGQUIT (via subprocess.Process.terminate(), so does something reasonable on Windows). Wait for 5 seconds to see if it wraps up (by way of waiting for the communicate() thread to die).
- If still not dead, send a SIGKILL (via subprocess.Process.kill()).
I thought the added complication wasn't worth it, but it's not that complicated. And would not regress behavior on the Linux side. The only diff would be, on non-Windows, send a kill to the process group (which has already been created for the inferior dotest), and just use the terminate() call on Windows.
I can give that a whirl. It's a bit more code but is not going to be too bad.
> - we intentionally had timeout terminate the child processes with SIGQUIT, because this produces core files of the terminated processes. I have found this very useful when diagnosing the sources of hanging tests. I'd like to have the possibility to do this, even if it will not be enabled by default.
> Do you think that the benefits of this change outweigh these issues? If so, and they don't cause our bots to break, then we can put it in (and i'll probably implement the code dumping feature when I find some time), but I though you should be aware of these downsides.
More information about the lldb-commits