[Openmp-commits] [PATCH] D65112: [OPENMP][NVPTX]Make the test compatible with CUDA9+, NFC.

Jonas Hahnfeld via Phabricator via Openmp-commits openmp-commits at lists.llvm.org
Tue Jul 23 12:29:27 PDT 2019


Hahnfeld added a comment.

In D65112#1597805 <https://reviews.llvm.org/D65112#1597805>, @ABataev wrote:

> In D65112#1597804 <https://reviews.llvm.org/D65112#1597804>, @Hahnfeld wrote:
>
> > In D65112#1597722 <https://reviews.llvm.org/D65112#1597722>, @ABataev wrote:
> >
> > > In D65112#1597706 <https://reviews.llvm.org/D65112#1597706>, @Hahnfeld wrote:
> > >
> > > > In D65112#1597673 <https://reviews.llvm.org/D65112#1597673>, @ABataev wrote:
> > > >
> > > > > But simply removing the barrier does not actually help; it can still produce incorrect results. Removing the barrier introduces implicit thread divergence: since CUDA 9, threads are no longer executed in lock-step (see https://devblogs.nvidia.com/using-cuda-warp-level-primitives/). As a result, the value you get is not stable and is not guaranteed to be reproducible on other platforms or by other users.
> > > > >  The runtime relies on the warp-synchronous model, and the threads in a warp must be synchronized after the critical region. This means we should not use full barriers here, but we still need to synchronize the threads within the warp, using the `__syncwarp(mask)` function instead.
> > > > >  Currently, it is pure coincidence that the test passes: the second parallel region takes long enough to start and execute its body that the threads in the `else` branch have time to execute their code. But that is not guaranteed on CUDA 9+.
> > > > >  To expose the problem, just put the code in the `else` branch (`Count += omp_get_level(); // 6 * 1 = 6`) under another `#pragma omp critical`:
> > > > >
> > > > >   #pragma omp critical
> > > > >   Count += omp_get_level(); // 6 * 1 = 6 
> > > > >   
> > > > >
> > > > > It should produce the same result as before, but most probably it won't.
> > > >
> > > >
> > > > I still get the correct results. Do you have a test that you know to fail?
> > >
> > >
> > > I get `Expected count = 67` with the critical section in this test. This is on Power9 with CUDA 9. Did you try compiling it at `-O3`?
> >
> >
> > At first I didn't, but now the original test case with added `critical` in the `else` branch works with full optimization iff I completely remove the specialization `CGOpenMPRuntimeNVPTX::emitCriticalRegion`.
>
>
> But again, that just masks the real problem rather than solving it. It is once more pure coincidence that it returns the expected result.
>
>   for (int I = 0; I < 32; ++I) {
>     if (omp_get_thread_num() == I) {
>       #pragma omp critical
>       Count += omp_get_level(); // 6 * 1 = 6
>     }
>   }
>
>
> Again, it will fail even though it must return the correct result.


No, this loop also works with full optimization if surrounded by `target parallel for map(tofrom: Count)` and correctly returns the value 32. If you had different directives in mind, I'd kindly ask you to post a full example that you tested on your end.

I assume you are using the latest CUDA 9.2 with all fixes to the nasty synchronization issues that prior versions of CUDA 9 had?


Repository:
  rOMP OpenMP

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D65112/new/

https://reviews.llvm.org/D65112





More information about the Openmp-commits mailing list