[Openmp-commits] [PATCH] D65112: [OPENMP][NVPTX]Make the test compatible with CUDA9+, NFC.
Alexey Bataev via Phabricator via Openmp-commits
openmp-commits at lists.llvm.org
Tue Jul 23 09:53:24 PDT 2019
ABataev added a comment.
In D65112#1597534 <https://reviews.llvm.org/D65112#1597534>, @Hahnfeld wrote:
> I can reproduce that this test hangs on our Volta GPUs and I debugged it briefly: The problem seems to be how Clang emits critical regions, more precisely that the generated code introduces a barrier. Effectively, this assumes that all threads pass via all critical regions the same number of times, making it a worksharing construct. Obviously, this is not true for the current code, only 10 threads are needed to execute the `J` loop and all other threads wait at the end of the kernel. If I manually remove the barrier(s) from the generated code, the executable finishes and prints the correct results.
Yep, this is what I'm going to fix in my next patches.
But simple removal does not help, actually. It still might produce incorrect results. When you removed the barrier, you introduced implicit threads divergence. Since cuda9+ threads are not executed in lock-step manner anymore (see https://devblogs.nvidia.com/using-cuda-warp-level-primitives/). It leads to the fact that the result you get is not stable and not guaranteed to be reproduced on other platforms by other users.
The runtime relies on the warp-synchronous model and threads in the warp after the critical region must be synchronized. It means, that we should not use barriers here but still need to synchronize threads within the warp. To synchronize the threads we must use `__syncwarp(mask)` function instead.
Currently, it is pure coincidence that the test passes. It happens just because the second parallel region requires much time to start and execute its body and the threads in the else branch have time to execute their code. But it is not guaranteed in Cuda9+.
To reveal this problem, just enclose the code in `else` branch (`Count += omp_get_level(); // 6 * 1 = 6`) under control of another `#pragma omp critical`.
#pragma omp critical
Count += omp_get_level(); // 6 * 1 = 6
It must produce the same result as before but it won't, most probably.
> (I have the following theory why the updated test case works, even though still not all threads execute the critical regions (only 64 threads for the first one, only 32 for the second one): Because these are now at warp boundary, the other warps can already finish the kernel after they went through their barriers. Afterwards, I think they don't need to participate in excess barriers and kernel terminates. I guess this does not work if the number of barriers doesn't match for threads inside the same warp.)
Even now it does not work, it returns an incorrect result but does not hang at least.
> Can we implement critical regions without barriers? If yes, I think we should do this (even if it's slower), because the current scheme may lead to deadlocks for conformant OpenMP programs.
I'm not saying that we must keep the barrier, no. The barrier must be removed for sure. Bu the threads in the warp still must be synchronized. Otherwise, the result is not guaranteed. But in this case, we can operate only by the full warp, unfortunately.
Another possible solution could be just a spinlock in atomic operations to implement a lockless critical barrier. All the threads in the critical region must do this unconditionally. This is going to be much-much slower + I'm not sure that it will work. But if everybody is fine with the slower solution I can try to implement it.
CHANGES SINCE LAST ACTION
More information about the Openmp-commits