[Openmp-commits] [PATCH] D65112: [OPENMP][NVPTX]Make the test compatible with CUDA9+, NFC.
Johannes Doerfert via Phabricator via Openmp-commits
openmp-commits at lists.llvm.org
Tue Jul 23 08:18:50 PDT 2019
jdoerfert added a comment.
In D65112#1597534 <https://reviews.llvm.org/D65112#1597534>, @Hahnfeld wrote:
> I can reproduce that this test hangs on our Volta GPUs and I debugged it briefly: The problem seems to be how Clang emits critical regions, more precisely that the generated code introduces a barrier. Effectively, this assumes that all threads pass via all critical regions the same number of times, making it a worksharing construct. Obviously, this is not true for the current code, only 10 threads are needed to execute the `J` loop and all other threads wait at the end of the kernel. If I manually remove the barrier(s) from the generated code, the executable finishes and prints the correct results.
> (I have the following theory why the updated test case works, even though still not all threads execute the critical regions (only 64 threads for the first one, only 32 for the second one): Because these are now at warp boundary, the other warps can already finish the kernel after they went through their barriers. Afterwards, I think they don't need to participate in excess barriers and the kernel terminates. I guess this does not work if the number of barriers doesn't match for threads inside the same warp.)
> Can we implement critical regions without barriers? If yes, I think we should do this (even if it's slower), because the current scheme may lead to deadlocks for conformant OpenMP programs.
Thank you for this in-depth analysis. This makes way more sense now. I agree, we cannot deadlock/miscompile/forbid a conformant OpenMP program but instead need to implement the semantics correctly.
Btw., we checked XL and it does work as expected on this test case.
CHANGES SINCE LAST ACTION
More information about the Openmp-commits