<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/78419>78419</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            [BOLT] bolt_rt -instrumentation-sleep-time hangs when run under lit
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            BOLT
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          peterwaller-arm
      </td>
    </tr>
</table>

<pre>
If using `llvm-bolt -instrument -instrumentation-sleep-time=N` for BOLT instrumentation, `llvm-lit` hangs during profile collection.

The reason is that lit runs the command with `subprocess.Popen.communicate`, which blocks until every copy of the output pipes it reads from has been closed. bolt_rt forks `watchProcess` without closing the inherited file descriptors, so the forked watcher keeps those pipes open. `watchProcess()` then waits for the PID of its parent process to become invalid, which would indicate that the workload has finished and been waited on. That never happens: the exited workload remains a zombie until its parent calls `wait()` on it, and lit would ordinarily do that only after `Popen.communicate` returns. Each side is waiting for the other, so the result is a deadlock.
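
The zombie part of this can be checked in isolation. A minimal Python sketch, assuming the watcher's liveness check behaves like `kill(pid, 0)` (an assumption about `watchProcess()`, not something taken from its source); the function names are hypothetical. It forks a child that exits immediately and probes its PID before and after reaping it:

```python
import os
import time


def pid_is_alive(pid: int) -> bool:
    """Liveness probe via signal 0: the kernel runs its existence and
    permission checks but delivers no signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    return True


def zombie_probe_demo() -> tuple[bool, bool]:
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit at once; it stays a zombie until reaped
    time.sleep(0.2)  # give the child time to exit
    before_reap = pid_is_alive(pid)  # a zombie still owns its PID
    os.waitpid(pid, 0)               # reap it, as lit would after communicate()
    after_reap = pid_is_alive(pid)   # now fails with ESRCH (barring PID reuse)
    return before_reap, after_reap
```

On Linux, `zombie_probe_demo()` should return `(True, False)`: the PID stays valid for as long as nobody reaps the exited process, which is why polling the parent's PID cannot detect the exit while lit is still blocked in `communicate()`.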

Quick and dirty reproducer to show the issue:

```
clang -x c - <<<$'#include <unistd.h>\nint main(int argc, char *argv[]) { int x = fork(); if (x == 0) { /* close(1); */ sleep(2); } return 0; }'
time ./a.out
# returns instantly
time python3 -c 'from subprocess import Popen, PIPE; import sys; p = Popen(sys.argv[1], stdout=PIPE); p.communicate()' ./a.out
# blocks for ~2 seconds because the forked child still holds stdout open; uncommenting close(1) makes it return immediately.
```

Relevant code fragments:

#### Forking and leaving stdio file descriptors open
https://github.com/llvm/llvm-project/blob/94da2b21ee3f2baed729333ec8bbf96f92c1fa84/bolt/runtime/instr.cpp#L1655-L1657

#### lit using communicate()
https://github.com/llvm/llvm-project/blob/94da2b21ee3f2baed729333ec8bbf96f92c1fa84/llvm/utils/lit/lit/TestRunner.py#L917-L919

#### watchProcess() querying the parent process
https://github.com/llvm/llvm-project/blob/94da2b21ee3f2baed729333ec8bbf96f92c1fa84/bolt/runtime/instr.cpp#L1589

The quick fix to prevent the deadlock would be to close file descriptors 0, 1 and 2 in the forked process. However, this raises a second issue: nothing synchronizes the outer program with the completion of profile writing. A build system that invokes lit to gather the profile and then use it could therefore go on to consume the data before it has been written out. Closing the file descriptors would replace a hang with a race condition. Not ideal.
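
A small Python model of both behaviours, with hypothetical names throughout: the workload forks a stand-in watcher that sleeps for one second (in place of polling its parent and writing the profile), optionally after pointing fds 0, 1 and 2 at `/dev/null`. Because the stand-in sleeps for a fixed time instead of polling its parent, the unfixed case shows up as a one-second delay in `communicate()` rather than a deadlock.

```python
import subprocess
import sys
import time

# Hypothetical stand-in for an instrumented workload. It forks a "watcher"
# that outlives it, as bolt_rt does with watchProcess(). argv[1] == "1"
# applies the quick fix of detaching the watcher's stdio.
WORKLOAD = r"""
import os, sys, time
if os.fork() == 0:
    if sys.argv[1] == "1":
        # Quick fix: redirect fds 0, 1 and 2 to /dev/null so the watcher
        # stops holding the pipe the caller is reading from.
        devnull = os.open(os.devnull, os.O_RDWR)
        for fd in (0, 1, 2):
            os.dup2(devnull, fd)
        if devnull > 2:
            os.close(devnull)
    time.sleep(1.0)  # stand-in for the watcher's work, e.g. the final dump
    os._exit(0)
print("workload done")
"""


def run(close_stdio: bool) -> tuple[float, bytes]:
    """Run the workload the way lit does (Popen + communicate) and time it."""
    start = time.monotonic()
    proc = subprocess.Popen(
        [sys.executable, "-c", WORKLOAD, "1" if close_stdio else "0"],
        stdout=subprocess.PIPE,
    )
    out, _ = proc.communicate()
    return time.monotonic() - start, out
```

`run(False)` should take at least a second because `communicate()` waits for the watcher to release stdout. `run(True)` should return almost immediately, while the watcher keeps running for roughly another second with nothing waiting on it; that unsynchronized tail is where the race with a consumer of the profile comes from.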

The way to avoid this would be to make `watchProcess()` the parent process, which waits for its forked child (the workload) to complete before writing the final profile and exiting itself. However, the documentation of `instrumentation-sleep-time` suggests it is intended to allow the profile to be written out even if the parent process is killed (in which case the instrumentation cannot react to a signal), so the current structure may be intentional. If the workload is killed indirectly, for example by the OOM killer, this would still work with `watchProcess` as the parent; but if it is killed because it is the named process, the signal would hit the watcher and the profile would be lost.
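
A rough Python sketch of that alternative, with a hypothetical function name and a plain file standing in for the profile: the launched process forks, runs the workload in the child, and writes the final profile only after `waitpid()` has reaped the child. Anything that waits on the launched process therefore cannot see it exit before the profile has been written, though, as noted, a signal aimed at that process would now land on the watcher.

```python
import os


def run_under_watcher(profile_path: str) -> int:
    """Hypothetical sketch of a watcher that stays the parent.

    The child runs the (stand-in) workload; the parent reaps it and only then
    writes the final profile, so the caller observes completion strictly
    after the profile has been written.
    """
    pid = os.fork()
    if pid == 0:
        # Child: stand-in for the instrumented workload; exits with status 3.
        os._exit(3)
    # Parent: plays the role of watchProcess(). Reaping the child here also
    # keeps it from lingering as a zombie.
    _, status = os.waitpid(pid, 0)
    with open(profile_path, "w") as f:
        f.write("final profile\n")  # stand-in for the last profile dump
    # Hand back the workload's exit status so it can be propagated, e.g. via
    # sys.exit(), to whoever launched this process.
    return os.waitstatus_to_exitcode(status)
```

The design choice here is ordering: the profile write sits between reaping the workload and exiting, which is the synchronization point the fork-and-forget structure lacks.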

As an aside, I'm using `instrumentation-sleep-time` because the binary in question is statically linked, so as far as I know it cannot use the usual mechanism for writing out the profile on exit (via `DT_FINI`? I'm hazy on the details here). If I've missed something and there is a way to make that work with static binaries, I can use it instead and avoid `instrumentation-sleep-time` altogether.

/cc @aaupov and potential authors of watchProcess() @rafaelauler @yota9 @maksfb 

</pre>