[Openmp-commits] [PATCH] D65112: [OPENMP][NVPTX]Make the test compatible with CUDA9+, NFC.
Jonas Hahnfeld via Phabricator via Openmp-commits
openmp-commits at lists.llvm.org
Tue Jul 23 11:11:46 PDT 2019
Hahnfeld added a comment.
In D65112#1597722 <https://reviews.llvm.org/D65112#1597722>, @ABataev wrote:
> In D65112#1597706 <https://reviews.llvm.org/D65112#1597706>, @Hahnfeld wrote:
> > In D65112#1597673 <https://reviews.llvm.org/D65112#1597673>, @ABataev wrote:
> > > In D65112#1597534 <https://reviews.llvm.org/D65112#1597534>, @Hahnfeld wrote:
> > >
> > > > I can reproduce that this test hangs on our Volta GPUs and I debugged it briefly: The problem seems to be how Clang emits critical regions, more precisely that the generated code introduces a barrier. Effectively, this assumes that all threads pass via all critical regions the same number of times, making it a worksharing construct. Obviously, this is not true for the current code, only 10 threads are needed to execute the `J` loop and all other threads wait at the end of the kernel. If I manually remove the barrier(s) from the generated code, the executable finishes and prints the correct results.
> > >
> > >
> > > Yep, this is what I'm going to fix in my next patches.
> > I think we should fix this first instead of relaxing a test that fails for something that is easy to fix.
> > > But simple removal does not help, actually. It still might produce incorrect results. When you removed the barrier, you introduced implicit threads divergence. Since cuda9+ threads are not executed in lock-step manner anymore (see https://devblogs.nvidia.com/using-cuda-warp-level-primitives/). It leads to the fact that the result you get is not stable and not guaranteed to be reproduced on other platforms by other users.
> > > The runtime relies on the warp-synchronous model and threads in the warp after the critical region must be synchronized. It means, that we should not use barriers here but still need to synchronize threads within the warp. To synchronize the threads we must use `__syncwarp(mask)` function instead.
> > > Currently, it is pure coincidence that the test passes. It happens just because the second parallel region requires much time to start and execute its body and the threads in the else branch have time to execute their code. But it is not guaranteed in Cuda9+.
> > > To reveal this problem, just enclose the code in `else` branch (`Count += omp_get_level(); // 6 * 1 = 6`) under control of another `#pragma omp critical`.
> > >
> > > #pragma omp critical
> > > Count += omp_get_level(); // 6 * 1 = 6
> > >
> > >
> > > It must produce the same result as before but it won't, most probably.
> > I still get the correct results. Do you have a test that you know to fail?
> I get `Expected count = 67` with the critical section in this test. It is on Power9 with Cuda9. Did you try to compile it at O3 <https://reviews.llvm.org/owners/package/3/>?
At first I didn't, but now the original test case with added `critical` in the `else` branch works with full optimization iff I completely remove the specialization `CGOpenMPRuntimeNVPTX::emitCriticalRegion`.
CHANGES SINCE LAST ACTION
More information about the Openmp-commits