[llvm-dev] [lit] RFC: Per test timeout

Sun Nov 15 08:09:22 PST 2015

On 11/14/15 9:37 AM, Jonathan Roelofs via llvm-dev wrote:
>
>
> On 11/14/15 3:10 AM, Dan Liew via llvm-dev wrote:
>> Hi,
>>
>> A feature I've wanted in lit for a while is a having a timeout per
>> test. Attached
>> are patches that implement this idea.
>
> Cool, I hope this succeeds. I tried implementing per-test timeouts
> before, and couldn't get it to work in all cases. The review eventually
> fizzled out, and I abandoned it.
>
> Here's that old review: http://reviews.llvm.org/D6584 Perhaps you can
> cannibalize testcases from it.
>
>>
>> I'm e-mailing llvm-dev rather than llvm-commits
>> because I want to gather more feedback on my initial implementation and
>> hopefully some answers to some unresolved issues with my implementation.
>>
>> Currently in lit you can set a global timeout for
>> all of the tests but not for each individual test.
>>
>> The attached patches add
>>
>> * Support for a new ResultCode called TIMEOUT
>
> When I implemented mine, one of the first revisions had this, but then
> @ddunbar said he'd prefer I didn't add a new ResultCode.
>
>> * A new command line option --max-individual-test-time
>
> I think you should call it `--timeout=`, and then say in the description
> that it's a per-test timeout.
>
>> * Support for running external and internal ShTests with a per test
>> timeout
>> * Support for running GTests with a per test timeout
>>
>> I wanted to get some initial feedback on the implementation.
>>
>> * If a timeout is requested the Python psutil module is required.
>>    This module does not ship with Python but is availble via pip
>>    (or on Linux your distribution's package manager). How do people feel
>>    about this? I don't like adding extra dependencies but this module
>>    makes it really easy to kill a process and all its children
>> recursively
>>    in a platform neutral way. Note that if a per test timeout is not
>> requested
>>    then the psutil module is not imported and lit acts just like it did
>> before my patches.
>
> This must be the missing piece... I couldn't get my implementation to
> work without resorting to Python 3.x features (which is incompatible
> with a 2.x minimum version).
>
>>
>> * If the platform running lit doesn't have psutil installed and a
>> timeout is requested
>>    an exception will be thrown. Should we provide a more friendly
>> error message?
>
> Yes.
>
>>    If so where should this go in lit's code?
>
> Not sure off the top of my head, but probably somewhere near the rest of
> the argument parsing stuff. I'll have a look later.
>
>>
>> * I've not tested these patches on OSX or Windows. Is anyone on those
>> platforms willing
>>    to give them a try?
>
> Yes, I'll give it a whirl on Darwin early next week.

I implemented the fixes I suggested, and my testcases from before. Works 
for me on Darwin.  See attached.


Cheers,

Jon

>
>
> Mind squashing your two patches, throwing them on Phabricator, and
> cc-ing llvm-commits?
>
>
> Thanks!
>
> Jon
>
>>
>> * The behaviour of --max-individual-test-time is a little at odds with
>> the behaviour of --max-time. --max-time will mark unexecuted tests as
>> UNRESOLVED whereas --max-individual-test-time will mark a test that
>> ran out of time as TIMEOUT. Is this okay?
>>
>> * @Chris Matthews. Does the code that emits xunit xml files need to
>> change in anyway?
>>
>> * @Daniel Dunbar. If these changes (in some form) end up being
>> committed would you be happy
>>    to push a new release of lit to PyPy?
>>
>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>

-- 
Jon Roelofs
jonathan at codesourcery.com
CodeSourcery / Mentor Embedded
-------------- next part --------------
Index: utils/lit/lit/LitConfig.py
===================================================================

--- utils/lit/lit/LitConfig.py	(revision 253080)
+++ utils/lit/lit/LitConfig.py	(working copy)
@@ -21,7 +21,8 @@
     def __init__(self, progname, path, quiet,
                  useValgrind, valgrindLeakCheck, valgrindArgs,
                  noExecute, debug, isWindows,
-                 params, config_prefix = None):
+                 params, config_prefix = None,
+                 timeout = 0):
         # The name of the test runner.
         self.progname = progname
         # The items to add to the PATH environment variable.
@@ -57,6 +58,7 @@
                 self.valgrindArgs.append('--leak-check=no')
             self.valgrindArgs.extend(self.valgrindUserArgs)
 
+        self.timeout = timeout
 
     def load_config(self, config, path):
         """load_config(config, path) - Load a config object from an alternate
Index: utils/lit/lit/Test.py
===================================================================
--- utils/lit/lit/Test.py	(revision 253080)
+++ utils/lit/lit/Test.py	(working copy)
@@ -33,6 +33,7 @@
 XPASS       = ResultCode('XPASS', True)
 UNRESOLVED  = ResultCode('UNRESOLVED', True)
 UNSUPPORTED = ResultCode('UNSUPPORTED', False)
+TIMEOUT     = ResultCode('TIMEOUT', True)
 
 # Test metric values.
 
Index: utils/lit/lit/TestRunner.py
===================================================================
--- utils/lit/lit/TestRunner.py	(revision 253080)
+++ utils/lit/lit/TestRunner.py	(working copy)
@@ -3,6 +3,7 @@
 import re
 import platform
 import tempfile
+import threading
 
 import lit.ShUtil as ShUtil
 import lit.Test as Test
@@ -14,6 +15,11 @@
         self.command = command
         self.message = message
 
+class TestTimeoutError(Exception):
+    def __init__(self, command, message):
+        self.command = command
+        self.message = message
+
 kIsWindows = platform.system() == 'Windows'
 
 # Don't use close_fds on Windows.
@@ -33,28 +39,144 @@
         self.cwd = cwd
         self.env = dict(env)
 
-def executeShCmd(cmd, shenv, results):
+class TimeoutHelper(object):
+    """
+        Object used to helper manage enforcing a timeout in
+        _executeShCmd(). It is passed through recursive calls
+        to collect processes that have been executed so that when
+        the timeout happens they can be killed.
+    """
+    def __init__(self, timeout):
+        self.timeout = timeout
+        self._procs = []
+        self._timeoutReached = False
+        self._doneKillPass = False
+        # This lock will be used to protect concurrent access
+        # to _procs and _doneKillPass
+        self._lock = None
+        self._timer = None
+
+    def cancel(self):
+        if not self.active():
+            return
+        self._timer.cancel()
+
+    def active(self):
+        return self.timeout > 0
+
+    def addProcess(self, proc):
+        if not self.active():
+            return
+        needToRunKill = False
+        try:
+            self._lock.acquire()
+            self._procs.append(proc)
+            needToRunKill = self._doneKillPass
+        finally:
+            self._lock.release()
+
+        # The initial call to _kill() from the timer thread
+        # already happened so we need to call it again from
+        # this thread, otherwise this process will be left to
+        # run even though the timeout was already hit
+        if needToRunKill:
+            assert self.timeoutReached()
+            self._kill()
+
+    def startTimer(self):
+        if not self.active():
+            return
+
+        # Do some late initialisation that's only needed
+        # if there is a timeout set
+        import psutil
+        self._lock = threading.Lock()
+        self._timer = threading.Timer(self.timeout, self._handleTimeoutReached)
+        self._timer.start()
+
+    def _handleTimeoutReached(self):
+        self._timeoutReached = True
+        self._kill()
+
+    def timeoutReached(self):
+        return self._timeoutReached
+
+
+    def _kill(self):
+        """
+            This method may be called multiple times as we might get unlucky
+            and be in the middle of creating a new process in _executeShCmd()
+            which won't yet be in ``self._procs``. By locking here and in
+            addProcess() we should be able to kill processes launched after
+            the initial call to _kill()
+        """
+        # Replace with with statement?
+        try:
+            self._lock.acquire()
+            import psutil
+            for p in self._procs:
+                try:
+                    psutilProc = psutil.Process(p.pid)
+                    for child in psutilProc.children(recursive=True):
+                        try:
+                            child.kill()
+                        except psutil.NoSuchProcess:
+                            pass
+                    psutilProc.kill()
+                except psutil.NoSuchProcess:
+                    pass
+
+                # Empty the list and note that we've done a pass over the list
+                self._procs = [] # Python2 doesn't have list.clear()
+                self._doneKillPass = True
+        finally:
+            self._lock.release()
+
+def executeShCmd(cmd, shenv, results, timeout=0):
+    """
+        Wrapper around _executeShCmd that handles
+        timeout
+    """
+    # Use the helper even when no timeout is required to make
+    # other code simpler. ``timeoutHelper.timeoutReached()``
+    # will always return False
+    timeoutHelper = TimeoutHelper(timeout)
+    if timeout > 0:
+        timeoutHelper.startTimer()
+    finalExitCode = _executeShCmd(cmd, shenv, results, timeoutHelper)
+    timeoutHelper.cancel()
+    if timeoutHelper.timeoutReached():
+        raise TestTimeoutError(cmd, 'Reached test timeout of {} seconds'.format(timeout))
+    return finalExitCode
+
+
+def _executeShCmd(cmd, shenv, results, timeoutHelper):
+    if timeoutHelper.timeoutReached():
+        # Prevent further recursion if the timeout has been hit
+        # as we should try avoid launching more processes.
+        return None
+
     if isinstance(cmd, ShUtil.Seq):
         if cmd.op == ';':
-            res = executeShCmd(cmd.lhs, shenv, results)
-            return executeShCmd(cmd.rhs, shenv, results)
+            res = _executeShCmd(cmd.lhs, shenv, results, timeoutHelper)
+            return _executeShCmd(cmd.rhs, shenv, results, timeoutHelper)
 
         if cmd.op == '&':
             raise InternalShellError(cmd,"unsupported shell operator: '&'")
 
         if cmd.op == '||':
-            res = executeShCmd(cmd.lhs, shenv, results)
+            res = _executeShCmd(cmd.lhs, shenv, results, timeoutHelper)
             if res != 0:
-                res = executeShCmd(cmd.rhs, shenv, results)
+                res = _executeShCmd(cmd.rhs, shenv, results, timeoutHelper)
             return res
 
         if cmd.op == '&&':
-            res = executeShCmd(cmd.lhs, shenv, results)
+            res = _executeShCmd(cmd.lhs, shenv, results, timeoutHelper)
             if res is None:
                 return res
 
             if res == 0:
-                res = executeShCmd(cmd.rhs, shenv, results)
+                res = _executeShCmd(cmd.rhs, shenv, results, timeoutHelper)
             return res
 
         raise ValueError('Unknown shell command: %r' % cmd.op)
@@ -206,6 +328,8 @@
                                           stderr = stderr,
                                           env = cmd_shenv.env,
                                           close_fds = kUseCloseFDs))
+            # Let the helper know about this process
+            timeoutHelper.addProcess(procs[-1])
         except OSError as e:
             raise InternalShellError(j, 'Could not create process ({}) due to {}'.format(executable, e))
 
@@ -311,7 +435,7 @@
     results = []
     try:
         shenv = ShellEnvironment(cwd, test.config.environment)
-        exitCode = executeShCmd(cmd, shenv, results)
+        exitCode = executeShCmd(cmd, shenv, results, timeout=litConfig.timeout)
     except InternalShellError:
         e = sys.exc_info()[1]
         exitCode = 127
@@ -359,8 +483,12 @@
             # run on clang with no real loss.
             command = litConfig.valgrindArgs + command
 
-    return lit.util.executeCommand(command, cwd=cwd,
-                                   env=test.config.environment)
+    try:
+        return lit.util.executeCommand(command, cwd=cwd,
+                                       env=test.config.environment,
+                                       timeout=litConfig.timeout)
+    except lit.util.ExecuteCommandTimeoutException:
+        raise TestTimeoutError(command, 'Hit timeout of {} seconds'.format(litConfig.timeout))
 
 def parseIntegratedTestScriptCommands(source_path, keywords):
     """
@@ -564,12 +692,15 @@
     lit.util.mkdir_p(os.path.dirname(tmpBase))
 
     execdir = os.path.dirname(test.getExecPath())
-    if useExternalSh:
-        res = executeScript(test, litConfig, tmpBase, script, execdir)
-    else:
-        res = executeScriptInternal(test, litConfig, tmpBase, script, execdir)
-    if isinstance(res, lit.Test.Result):
-        return res
+    try:
+        if useExternalSh:
+            res = executeScript(test, litConfig, tmpBase, script, execdir)
+        else:
+            res = executeScriptInternal(test, litConfig, tmpBase, script, execdir)
+        if isinstance(res, lit.Test.Result):
+            return res
+    except TestTimeoutError as e:
+        return lit.Test.Result(Test.TIMEOUT, e.message)
 
     out,err,exitCode = res
     if exitCode == 0:
Index: utils/lit/lit/formats/googletest.py
===================================================================
--- utils/lit/lit/formats/googletest.py	(revision 253080)
+++ utils/lit/lit/formats/googletest.py	(working copy)
@@ -109,8 +109,15 @@
         if litConfig.noExecute:
             return lit.Test.PASS, ''
 
-        out, err, exitCode = lit.util.executeCommand(
-            cmd, env=test.config.environment)
+        try:
+            out, err, exitCode = lit.util.executeCommand(
+                cmd, env=test.config.environment,
+                timeout=litConfig.timeout)
+        except lit.util.ExecuteCommandTimeoutException:
+            return (lit.Test.TIMEOUT,
+                    'Reached timeout of {} seconds'.format(
+                        litConfig.timeout)
+                   )
 
         if exitCode:
             return lit.Test.FAIL, out + err
Index: utils/lit/lit/main.py
===================================================================
--- utils/lit/lit/main.py	(revision 253080)
+++ utils/lit/lit/main.py	(working copy)
@@ -205,6 +205,10 @@
     group.add_option("", "--xunit-xml-output", dest="xunit_output_file",
                       help=("Write XUnit-compatible XML test reports to the"
                             " specified file"), default=None)
+    group.add_option("", "--timeout", dest="timeout",
+                     help="Maximum time to spend running a single test (in seconds)."
+                     " 0 means no time limit. [Default: %default]",
+                    type=int, default=0)
     parser.add_option_group(group)
 
     group = OptionGroup(parser, "Test Selection")
@@ -275,6 +279,12 @@
             name,val = entry.split('=', 1)
         userParams[name] = val
 
+    if opts.timeout != 0:
+        try:
+            import psutil
+        except ImportError as e:
+            parser.error('psutil must be installed to use --timeout=N for N > 0')
+
     # Create the global config object.
     litConfig = lit.LitConfig.LitConfig(
         progname = os.path.basename(sys.argv[0]),
@@ -287,7 +297,8 @@
         debug = opts.debug,
         isWindows = isWindows,
         params = userParams,
-        config_prefix = opts.configPrefix)
+        config_prefix = opts.configPrefix,
+        timeout = opts.timeout)
 
     # Perform test discovery.
     run = lit.run.Run(litConfig,
@@ -377,7 +388,7 @@
         extra = ' of %d' % numTotalTests
     header = '-- Testing: %d%s tests, %d threads --'%(len(run.tests), extra,
                                                       opts.numThreads)
-
+    print('Using individual test timeout of {} seconds'.format(opts.timeout))
     progressBar = None
     if not opts.quiet:
         if opts.succinct and opts.useProgressBar:
@@ -447,7 +458,8 @@
                       ('Unsupported Tests  ', lit.Test.UNSUPPORTED),
                       ('Unresolved Tests   ', lit.Test.UNRESOLVED),
                       ('Unexpected Passes  ', lit.Test.XPASS),
-                      ('Unexpected Failures', lit.Test.FAIL)):
+                      ('Unexpected Failures', lit.Test.FAIL),
+                      ('Individual Timeouts', lit.Test.TIMEOUT)):
         if opts.quiet and not code.isFailure:
             continue
         N = len(byCode.get(code,[]))
Index: utils/lit/lit/util.py
===================================================================
--- utils/lit/lit/util.py	(revision 253080)
+++ utils/lit/lit/util.py	(working copy)
@@ -6,6 +6,7 @@
 import signal
 import subprocess
 import sys
+import threading
 
 def to_bytes(str):
     # Encode to UTF-8 to get binary data.
@@ -155,18 +156,60 @@
             pDigits, pfDigits, i*barH, pDigits, pfDigits, (i+1)*barH,
             '*'*w, ' '*(barW-w), cDigits, len(row), cDigits, len(items)))
 
+class ExecuteCommandTimeoutException(Exception):
+    pass
 # Close extra file handles on UNIX (on Windows this cannot be done while
 # also redirecting input).
 kUseCloseFDs = not (platform.system() == 'Windows')
-def executeCommand(command, cwd=None, env=None, input=None):
+def executeCommand(command, cwd=None, env=None, input=None, timeout=0):
+    if timeout > 0:
+        # We need psutil to help us kill child processes in
+        # a clean manner. Psutil has a wait() with a timeout but
+        # it can't be used because we block with communicate() which
+        # doesn't take a timeout. To use it we would need to replace
+        # the pipes with files (to avoid deadlock) and just use wait()
+        # and read the files afterwards
+        import psutil
     p = subprocess.Popen(command, cwd=cwd,
                          stdin=subprocess.PIPE,
                          stdout=subprocess.PIPE,
                          stderr=subprocess.PIPE,
                          env=env, close_fds=kUseCloseFDs)
-    out,err = p.communicate(input=input)
-    exitCode = p.wait()
+    timerObject = None
+    # FIXME: Because of the way nested function scopes work in Python 2.x we
+    # need to use a reference to a mutable object rather than a plain
+    # bool. In Python 3 we could use the "nonlocal" keyword but we need
+    # to support Python 2 as well.
+    hitTimeOut = [False]
+    try:
+        if timeout > 0:
+            def killProcess():
+                # We may be invoking a shell so we need to kill the
+                # process and all its children. Using psutil seems
+                # like the easiest way to do this.
+                hitTimeOut[0] = True
+                try:
+                    psutilProc = psutil.Process(p.pid)
+                    for child in psutilProc.children(recursive=True):
+                        try:
+                            child.kill()
+                        except psutil.NoSuchProcess:
+                            pass
+                    psutilProc.kill()
+                except psutil.NoSuchProcess:
+                    pass
+            timerObject = threading.Timer(timeout, killProcess);
+            timerObject.start()
 
+        out,err = p.communicate(input=input)
+        exitCode = p.wait()
+    finally:
+        if timerObject != None:
+            timerObject.cancel()
+
+    if hitTimeOut[0]:
+        raise ExecuteCommandTimeoutException()
+
     # Detect Ctrl-C in subprocess.
     if exitCode == -signal.SIGINT:
         raise KeyboardInterrupt
Index: utils/lit/tests/Inputs/timeout/infloop.py
===================================================================
--- utils/lit/tests/Inputs/timeout/infloop.py	(revision 0)
+++ utils/lit/tests/Inputs/timeout/infloop.py	(working copy)
@@ -0,0 +1,8 @@
+# RUN: %{python} %s
+# XFAIL: *
+
+import sys
+
+print "infinite loop"
+while True:
+    pass
Index: utils/lit/tests/Inputs/timeout/lit.cfg
===================================================================
--- utils/lit/tests/Inputs/timeout/lit.cfg	(revision 0)
+++ utils/lit/tests/Inputs/timeout/lit.cfg	(working copy)
@@ -0,0 +1,16 @@
+# -*- Python -*-
+
+import os
+import sys
+
+import lit.formats
+
+config.name = 'timeout'
+config.test_format = lit.formats.ShTest(execute_external=False)
+config.suffixes = ['.py']
+config.test_source_root = os.path.dirname(__file__)
+config.test_exec_root = config.test_source_root
+config.target_triple = '(unused)'
+src_root = os.path.join(config.test_source_root, '..')
+config.environment['PYTHONPATH'] = src_root
+config.substitutions.append(('%{python}', sys.executable))
Index: utils/lit/tests/Inputs/timeout/short.py
===================================================================
--- utils/lit/tests/Inputs/timeout/short.py	(revision 0)
+++ utils/lit/tests/Inputs/timeout/short.py	(working copy)
@@ -0,0 +1,5 @@
+# RUN: %{python} %s
+
+import sys
+
+print "short program"
Index: utils/lit/tests/timeout.py
===================================================================
--- utils/lit/tests/timeout.py	(revision 0)
+++ utils/lit/tests/timeout.py	(working copy)
@@ -0,0 +1,11 @@
+# RUN: not %{lit} \
+# RUN: %{inputs}/timeout/infloop.py \
+# RUN: %{inputs}/timeout/short.py \
+# RUN:   -j 1 -v --debug --timeout 1 > %t.out 2> %t.err
+# RUN: FileCheck --check-prefix=CHECK-OUT < %t.out %s
+#
+
+# CHECK-OUT: TIMEOUT: timeout :: infloop.py
+# CHECK-OUT: PASS: timeout :: short.py
+# CHECK-OUT: Expected Passes{{ *}}: 1
+# CHECK-OUT: Individual Timeouts{{ *}}: 1