<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/55929>55929</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
[OpenMP] Nondeterministic hang on Windows thread start-up
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
branh
</td>
</tr>
</table>
<pre>
We've noticed a non-deterministic hang in __kmp_suspend_initialize_thread in z_Windows_NT_util.cpp on Windows. It occurs most frequently on our test machines when some threads have very little to do, such as in our tests where the entire content of a parallel region is a task executed by a single thread.
[hang64.zip](https://github.com/llvm/llvm-project/files/8856475/hang64.zip)
The call stack is:
01 000000ed`766ff770 00007ffa`fa029bc2 libomp140_x86_64!__kmp_suspend_initialize_thread(union kmp_info * th = 0x0000026c`1feef380)+0xf2 [D:\a01\_work\8\s\src\vctools\asan\llvm\openmp\runtime\src\z_Windows_NT_util.cpp @ 321]
02 000000ed`766ff7a0 00007ffa`fa029a30 libomp140_x86_64!__kmp_free_thread(union kmp_info * this_th = 0x0000026c`1feef380)+0x152 [D:\a01\_work\8\s\src\vctools\asan\llvm\openmp\runtime\src\kmp_runtime.cpp @ 5631]
03 000000ed`766ff7d0 00007ffa`fa026d08 libomp140_x86_64!__kmp_free_team(union kmp_root * root = <Value unavailable error>, union kmp_team * team = 0x0000026c`1fecb9c0, union kmp_info * master = 0x00000000`00000000)+0x2a0 [D:\a01\_work\8\s\src\vctools\asan\llvm\openmp\runtime\src\kmp_runtime.cpp @ 5455]
04 000000ed`766ff850 00007ffa`fa026eb2 libomp140_x86_64!__kmp_reset_root(int gtid = 0n535586624, union kmp_root * root = 0x0000026c`1fec6740)+0x1c8 [D:\a01\_work\8\s\src\vctools\asan\llvm\openmp\runtime\src\kmp_runtime.cpp @ 3885]
05 000000ed`766ff8c0 00007ffa`fa02a4d9 libomp140_x86_64!__kmp_unregister_root_current_thread(int gtid = 0n0)+0x112 [D:\a01\_work\8\s\src\vctools\asan\llvm\openmp\runtime\src\kmp_runtime.cpp @ 3977]
06 000000ed`766ff8f0 00007ffa`fa0270be libomp140_x86_64!__kmp_internal_end_library(int gtid_req = <Value unavailable error>)+0x99 [D:\a01\_work\8\s\src\vctools\asan\llvm\openmp\runtime\src\kmp_runtime.cpp @ 6149]
Looking at the relevant code, the hang is in the while loop in __kmp_suspend_initialize_thread (which is unchanged between the version of libomp we are using and the latest in LLVM):
```
void __kmp_suspend_initialize_thread(kmp_info_t *th) {
  int old_value = KMP_ATOMIC_LD_RLX(&th->th.th_suspend_init);
  int new_value = TRUE; // Defined to be 1.
  // Return if already initialized
  if (old_value == new_value)
    return;
  // Wait, then return if being initialized
  if (old_value == -1 ||
      !__kmp_atomic_compare_store(&th->th.th_suspend_init, old_value, -1)) {
    while (KMP_ATOMIC_LD_ACQ(&th->th.th_suspend_init) != new_value) {
      KMP_CPU_PAUSE();
    }
  } else {
    // Claim to be the initializer and do initializations
    __kmp_win32_cond_init(&th->th.th_suspend_cv);
    __kmp_win32_mutex_init(&th->th.th_suspend_mx);
    KMP_ATOMIC_ST_REL(&th->th.th_suspend_init, new_value);
  }
}
```
To summarize, the first thread that attempts to execute this logic sets th->th.th_suspend_init to -1, performs the initialization, and finally sets th->th.th_suspend_init to 1. Subsequent threads either return immediately if th->th.th_suspend_init is already 1, or, if it is -1, recognize that another thread is in the process of initializing and spin until that thread sets th->th.th_suspend_init to 1 before continuing.
The problem is that, during a single iteration of the spin loop in __kmp_suspend_initialize_thread, the thread that is doing the initializing could not only finish initializing, but also call __kmp_suspend_uninitialize_thread:
```
void __kmp_suspend_uninitialize_thread(kmp_info_t *th) {
  if (KMP_ATOMIC_LD_ACQ(&th->th.th_suspend_init)) {
    /* this means we have initialize the suspension pthread objects for this
       thread in this instance of the process */
    __kmp_win32_cond_destroy(&th->th.th_suspend_cv);
    __kmp_win32_mutex_destroy(&th->th.th_suspend_mx);
    KMP_ATOMIC_ST_REL(&th->th.th_suspend_init, FALSE); // FALSE is defined to be 0.
  }
}
```
If this happens, th->th.th_suspend_init is set back to 0. The spin loop only exits when it observes the value 1, so the thread executing __kmp_suspend_initialize_thread never sees the intermediate 1 and makes no further progress.
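The interleaving described above can be replayed deterministically on a single thread. The following is only a minimal sketch, not libomp code: SuspendInit, try_claim, spin_until_initialized and replay_interleaving are hypothetical names, std::atomic stands in for the KMP_ATOMIC_* macros, and the spin loop is bounded so that the demonstration terminates instead of hanging.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for th->th.th_suspend_init:
// 0 = uninitialized, -1 = initialization in progress, 1 = initialized.
using SuspendInit = std::atomic<int>;

// Models the claim step (the compare-and-store from old_value to -1).
// Assumes the caller has already handled old_value == 1 by returning early.
// Returns true if the caller became the initializer.
bool try_claim(SuspendInit &flag, int old_value) {
  return old_value != -1 && flag.compare_exchange_strong(old_value, -1);
}

// Models the waiter's spin loop, bounded so the demo terminates.
// Returns true only if the value 1 was observed.
bool spin_until_initialized(const SuspendInit &flag, int max_spins) {
  for (int i = 0; i < max_spins; ++i) {
    if (flag.load(std::memory_order_acquire) == 1)
      return true;
  }
  return false; // the real loop has no bound and would spin forever
}

// Replays the problematic interleaving of two logical threads, A and B.
// Returns whether B ever observes the value 1.
bool replay_interleaving() {
  SuspendInit flag{0};

  // Thread A reads 0 and claims the initializer role (0 -> -1).
  int a_old = flag.load(std::memory_order_relaxed);
  bool a_claimed = try_claim(flag, a_old);
  assert(a_claimed);
  (void)a_claimed;

  // Thread B reads -1, so it takes the wait branch. Assume it is then
  // descheduled before its next load of the flag.
  int b_old = flag.load(std::memory_order_relaxed);
  assert(b_old == -1);
  (void)b_old;

  // Thread A finishes initialization (-1 -> 1) ...
  flag.store(1, std::memory_order_release);
  // ... and then runs the uninitialize path (1 -> 0) before B resumes.
  if (flag.load(std::memory_order_acquire))
    flag.store(0, std::memory_order_release);

  // Thread B resumes spinning, but the flag is 0 and no one will set it to 1.
  return spin_until_initialized(flag, 1000);
}
```

In this model replay_interleaving() returns false: the waiter never sees 1, because the flag passed through 1 and back to 0 while the waiter was between loads.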
I think I've identified the cause of the hang, but I'm not sure of the correct fix. The very next call from the caller __kmp_free_thread is a call to __kmp_lock_suspend_mx(this_th), which relies on th->th.th_suspend_mx being in a valid state. The call to __kmp_suspend_uninitialize_thread that causes the hang will also have destroyed this mutex.
</pre>