<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/55929>55929</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            [OpenMP] Nondeterministic hang on Windows thread start-up
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          branh
      </td>
    </tr>
</table>

<pre>
    We've noticed a non-deterministic hang in __kmp_suspend_initialize_thread in z_Windows_NT_util.cpp on Windows. It occurs most frequently on our test machines when some threads have very little to do, such as in our tests where the entire content of a parallel region is a task executed by a single thread.

[hang64.zip](https://github.com/llvm/llvm-project/files/8856475/hang64.zip)
 
The call stack is 
01 000000ed`766ff770 00007ffa`fa029bc2     libomp140_x86_64!__kmp_suspend_initialize_thread(union kmp_info * th = 0x0000026c`1feef380)+0xf2 [D:\a01\_work\8\s\src\vctools\asan\llvm\openmp\runtime\src\z_Windows_NT_util.cpp @ 321] 

02 000000ed`766ff7a0 00007ffa`fa029a30     libomp140_x86_64!__kmp_free_thread(union kmp_info * this_th = 0x0000026c`1feef380)+0x152 [D:\a01\_work\8\s\src\vctools\asan\llvm\openmp\runtime\src\kmp_runtime.cpp @ 5631] 

03 000000ed`766ff7d0 00007ffa`fa026d08     libomp140_x86_64!__kmp_free_team(union kmp_root * root = <Value unavailable error>, union kmp_team * team = 0x0000026c`1fecb9c0, union kmp_info * master = 0x00000000`00000000)+0x2a0 [D:\a01\_work\8\s\src\vctools\asan\llvm\openmp\runtime\src\kmp_runtime.cpp @ 5455] 

04 000000ed`766ff850 00007ffa`fa026eb2     libomp140_x86_64!__kmp_reset_root(int gtid = 0n535586624, union kmp_root * root = 0x0000026c`1fec6740)+0x1c8 [D:\a01\_work\8\s\src\vctools\asan\llvm\openmp\runtime\src\kmp_runtime.cpp @ 3885] 

05 000000ed`766ff8c0 00007ffa`fa02a4d9     libomp140_x86_64!__kmp_unregister_root_current_thread(int gtid = 0n0)+0x112 [D:\a01\_work\8\s\src\vctools\asan\llvm\openmp\runtime\src\kmp_runtime.cpp @ 3977] 

06 000000ed`766ff8f0 00007ffa`fa0270be     libomp140_x86_64!__kmp_internal_end_library(int gtid_req = <Value unavailable error>)+0x99 [D:\a01\_work\8\s\src\vctools\asan\llvm\openmp\runtime\src\kmp_runtime.cpp @ 6149] 

Looking at the relevant code, the hang is in the while loop in __kmp_suspend_initialize_thread (which is unchanged between the version of libomp we are using and the latest in LLVM):
```
void __kmp_suspend_initialize_thread(kmp_info_t *th) {
  int old_value = KMP_ATOMIC_LD_RLX(&th->th.th_suspend_init);
  int new_value = TRUE;                                     // Defined to be 1.
  // Return if already initialized
  if (old_value == new_value)
    return;
  // Wait, then return if being initialized
  if (old_value == -1 ||
      !__kmp_atomic_compare_store(&th->th.th_suspend_init, old_value, -1)) {
    while (KMP_ATOMIC_LD_ACQ(&th->th.th_suspend_init) != new_value) {
      KMP_CPU_PAUSE();
    }
  } else {
    // Claim to be the initializer and do initializations
    __kmp_win32_cond_init(&th->th.th_suspend_cv);
    __kmp_win32_mutex_init(&th->th.th_suspend_mx);
    KMP_ATOMIC_ST_REL(&th->th.th_suspend_init, new_value);
  }
}
```
To summarize, the first thread that attempts to execute this logic sets th->th.th_suspend_init to -1, performs the initialization, and finally sets th->th.th_suspend_init to 1. Subsequent threads either return immediately if th->th.th_suspend_init is 1, or, if it is -1, recognize that another thread is in the process of initializing and spin until that thread sets th->th.th_suspend_init to 1 before continuing.

The problem is that, during a single iteration of the spin loop in __kmp_suspend_initialize_thread, the thread that is doing the initializing could not only finish initializing, but also call __kmp_suspend_uninitialize_thread:
```
void __kmp_suspend_uninitialize_thread(kmp_info_t *th) {
  if (KMP_ATOMIC_LD_ACQ(&th->th.th_suspend_init)) {
    /* this means we have initialize the suspension pthread objects for this
       thread in this instance of the process */
    __kmp_win32_cond_destroy(&th->th.th_suspend_cv);
    __kmp_win32_mutex_destroy(&th->th.th_suspend_mx);
    KMP_ATOMIC_ST_REL(&th->th.th_suspend_init, FALSE);   // FALSE is defined to be 0.
  }
}
```

If this happens, th->th.th_suspend_init is set to 0 and the thread executing __kmp_suspend_initialize_thread makes no further progress.
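
To make the interleaving concrete, here is a minimal single-threaded replay of it. This is only a sketch: SuspendState, wait_for_init_bounded and replay_missed_initialized_state are hypothetical names that do not exist in libomp, std::atomic stands in for the KMP_ATOMIC_* macros, and the spin loop is bounded so that the demonstration terminates instead of hanging.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for th->th.th_suspend_init:
// 0 = uninitialized, -1 = initialization in progress, 1 = initialized.
struct SuspendState {
  std::atomic<int> init{0};
};

// Mirrors the wait branch of __kmp_suspend_initialize_thread, but gives up
// after max_spins loads so that the demonstration terminates.
// Returns true if the waiter observed 1, false if it would still be spinning.
bool wait_for_init_bounded(SuspendState &s, int max_spins) {
  for (int i = 0; i < max_spins; ++i) {
    if (s.init.load(std::memory_order_acquire) == 1)
      return true;
  }
  return false;
}

// Replays the problematic interleaving of threads A and B, step by step.
bool replay_missed_initialized_state() {
  SuspendState s;

  // Thread A: claims the initializer role (0 -> -1).
  int expected = 0;
  bool claimed = s.init.compare_exchange_strong(expected, -1);
  assert(claimed);
  (void)claimed;

  // Thread B: reads -1, so it takes the wait branch.
  int seen_by_b = s.init.load(std::memory_order_relaxed);
  assert(seen_by_b == -1);
  (void)seen_by_b;

  // Thread A: finishes initialization (-1 -> 1) ...
  s.init.store(1, std::memory_order_release);
  // ... and then uninitializes (1 -> 0) before B performs its next load.
  s.init.store(0, std::memory_order_release);

  // Thread B: starts spinning, but the value 1 is already gone.
  return wait_for_init_bounded(s, 1000);
}
```

In this replay, replay_missed_initialized_state() returns false: the waiter never observes 1, which corresponds to the unbounded spin in the real code.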

I think I've identified the cause of the hang, but I'm not sure of the correct fix. The very next call from the caller __kmp_free_thread is a call to __kmp_lock_suspend_mx(this_th), which relies on th->th.th_suspend_mx being in a valid state. The call to __kmp_suspend_uninitialize_thread that causes the hang will also have destroyed this mutex.
</pre>